From YouTube: SIG - Performance and scale 2022-07-07
Description
Meeting Notes:
https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.tybh
A
Okay, all right, welcome to SIG Scale, July 7th, 2022. The notes are in the chat. If you want to open them up and add topics, please feel free to, and add yourself as an attendee as well, okay. Today, let's start like we usually do: let's take a look at the periodic job.
A
I took a quick look at this earlier. This is all really good to see: we're starting to get back to consistently green, which is really good. So I guess we've gotten past those memory issues, we're getting back to what we'd like to see, and we're not going over any of the thresholds, which is good.
A
Yeah, this is all really good. It looks good. So, our threshold is 45 for our p50 and we're at 20; it's been about the same. Now, the p95 is 29. This is also looking really good on the p90, yeah. This is a really good test: 29 was the slowest, worst case, which is really good; usually it's in the high 30s to low 40s. Our patch counts are also well within our thresholds, and the update counts all look fine as well. So it all looks really good.
A
Okay, let's go to the... let's see, where's... our presubmit job, which should also look just the same.
A
Okay, better than before. A few failures, which probably don't have anything to do with us, is my guess. This is probably something outside our control, but let's take a look. It looks like the problem is right here, okay. And then, let's see, let's go to the dedicated cluster.
A
We don't have thresholds to use here yet; I'll have to add that at some point. And then, okay, here we go. Yeah, okay, so there's still... yeah, it's just the make-clean function that's erroring this out, so once we have that fixed, I think it should go away.
A
Okay, and then our 100-density test. Okay, looks great. This is, I think, what we expected: we don't see that cleanup issue on the 100-density test, and everything looks good here. Okay, good.
A
Okay, that looks good. Okay, one topic that I want to bring up for today, just to make everyone aware: there's an increase in memory overhead that has just gone into master.
A
This is something I need to investigate, create an issue around, and look to fix. So, the issue we've seen previously... this isn't that issue. A memory increase is something that has been done previously; I don't know exactly how much it was increased by, but whatever it was, it affected our jobs and we had to increase the memory in all our jobs.
A
What this PR is saying is that the previous increase was based on some tests and some estimates, and after some more testing over time, the author noticed that the amount of memory they had originally allocated was not enough, and that over time the launcher actually takes a variable amount of memory, sometimes more, and it looks like sometimes even less. But the key thing is that we have to account for more, because we don't want virt-launcher to get killed because it's, you know, over...
A
...the amount of allocated memory. If it goes over, we don't want the VM to get killed, so we need to increase the amount of memory for the launcher just to make sure we don't run into this problem. But this is an interesting experiment, because, you know, we don't know why this is the case, and yeah.
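For reference, here is a minimal sketch of the kind of fixed launcher overhead being described: a buffer added on top of the guest memory request so the pod has headroom and virt-launcher is not OOM-killed. The values and names below are illustrative assumptions, not KubeVirt's actual overhead calculation.

```go
// Illustrative only: add a hypothetical per-launcher overhead on top of the
// guest memory request so the pod's request leaves headroom for virt-launcher.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	guest := resource.MustParse("1Gi")      // example guest memory request
	overhead := resource.MustParse("225Mi") // hypothetical overhead, e.g. an old value plus a 25Mi bump

	podRequest := guest.DeepCopy() // start from the guest request
	podRequest.Add(overhead)       // add headroom so the launcher is not OOM-killed

	fmt.Printf("guest=%s overhead=%s pod memory request=%s\n",
		guest.String(), overhead.String(), podRequest.String())
}
```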
A
This is going to take some digging, so I'm not really sure, but this is just something that I want to bring up for everyone to be aware of: going into master there's this change that increases memory. I think it's... let me see if he's got it in here, how much it increases by. Okay, so we're increasing by another 25 megs.
B
Yeah, yeah. I am, though, yeah.
B
This PR was born because we saw that the RSS of virt-launcher initially seemed to increase constantly, and we thought that we had a memory leak. But after a deeper test, a long-running test, we saw that there is no memory leak, also because we tried with pprof to see if we have a memory leak, and it seems that there is none. Okay, and yeah, we are still running this test.
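For reference, this kind of leak check can be done with Go's standard pprof tooling. A minimal sketch, assuming the process exposes the default pprof HTTP endpoint; the port and setup below are assumptions for illustration, not the exact configuration used in that test.

```go
// Minimal sketch: expose Go's built-in profiler so heap growth can be inspected.
// Comparing two heap snapshots taken hours apart, e.g. with
// `go tool pprof http://localhost:6060/debug/pprof/heap`,
// shows whether live Go heap is actually growing.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```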
B
Okay, and I pasted the spreadsheet where we collect the data and the graphs.
B
Yeah, at the bottom there are the graphs of the RSS file, the RSS anon, and the total RSS. And yeah, this file is not automatically updated, but it gets updated by me roughly every morning, so this data is from about six hours ago. What's interesting is that the RSS file, which is the shared-memory part...
B
...is not constant; it has some spikes, but over the long range it's basically constant. I don't know if you want to go to the... the graph-rss-file tab, yeah, that one.
B
Yeah, as you can see, it's been pretty stable now for about three days, but in my opinion it's interesting to investigate what affects these allocations, this amount of memory that virt-launcher requires.
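For context, the RSS components behind those graphs (RssAnon, RssFile, RssShmem) are what the kernel reports per process in /proc/<pid>/status. A minimal sketch of reading them follows; the PID is a placeholder, and on a real node it would be the virt-launcher or qemu process.

```go
// Minimal sketch: print the RSS breakdown the kernel reports for one process.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/12345/status") // hypothetical PID
	if err != nil {
		panic(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		// VmRSS is the total; RssAnon + RssFile + RssShmem are its components.
		if strings.HasPrefix(line, "VmRSS:") ||
			strings.HasPrefix(line, "RssAnon:") ||
			strings.HasPrefix(line, "RssFile:") ||
			strings.HasPrefix(line, "RssShmem:") {
			fmt.Println(line) // e.g. "RssAnon:   123456 kB"
		}
	}
}
```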
C
B
Yeah, yes, we can, but in my experience, because we have run many, many kinds of VMs, the kind of VM does not affect that part of the memory.
A
I appreciate that. So, Federico, can you walk me through the timeline here? Is this orange line one VM over time, the RSS file, the smaller one? Okay.
B
Yes. Consider that when it goes down, it was migrated, because there was a problem on one node and it was migrated to another one. And when you see the drops, the first drops of the orange line, for example...
A
So it could just have to do with the node instead of, you know, something to do with the RSS of virt-launcher.
B
Oh, okay. So basically what happens is that this is a cluster, a bare-metal cluster, in which there are three worker nodes, and on each node there are two VMs. Two nodes in this experiment went down and the VMs migrated.
B
But then the nodes came back up and the VMs were migrated again. I'm not sure if something has changed right now, because I'm not completely sure, but I think that currently there are two VMs per node.
A
Well, I mean, where I'm going with the question is that I'm just trying to understand this issue a little bit more. So is it... what you're saying is that you're running virt-launcher right now, and part of your PR is that we need to increase it by 200 megabytes? And I'm trying to narrow this down a little bit, to where our search is going to be, because, I mean, I don't...
A
I just don't understand what you think the problem is that's causing us to need to increase it.
B
7500. In my opinion, I think it could be, because I'm not sure, it could be the garbage collector, because for us the garbage collector is a black box. Okay, it can run every five minutes or whenever it wants. Probably, if the node is not overloaded, the garbage collector will not run as often as when the node is under pressure.
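Since the Go garbage collector comes up here as a suspect, the following is a minimal sketch of how its behavior can be made observable for a process like virt-launcher. These are standard Go runtime hooks shown for illustration, not something the PR itself adds.

```go
// Minimal sketch: inspect GC and heap state from inside a Go process.
// Running the process with GODEBUG=gctrace=1 also prints one line per GC cycle.
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("HeapAlloc=%d HeapIdle=%d HeapReleased=%d NumGC=%d\n",
		m.HeapAlloc, m.HeapIdle, m.HeapReleased, m.NumGC)

	// Force a collection and return free pages to the OS; if RSS stays high
	// afterwards, the growth is not live Go heap.
	runtime.GC()
	debug.FreeOSMemory()
}
```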
B
Yeah, also because... so there are two parts that could increase the total RSS. So the RSS file, the RSS... sorry, there are two parts to the total RSS requested by a VM.
B
So if you see the other graph, the graph of the RSS anon, you see that right now it seems...
B
...like it will grow infinitely, but if you look at the total RSS graph, you will not see that increase. So it seems like the two parts, RSS anon and RSS file, are complementary: if one increases, the other one will decrease, because... yeah, I don't know, because if...
A
Yeah, I think so. Well, I mean, do people have any ideas? Because, for me, I think... I don't know, I need to do some research into how we can do some analysis on the node to really narrow this down to what we should expect.
A
Yeah, I don't know, I mean, yeah. I think we'll need to do some analysis. I mean, if people don't have any ideas, then I think we should keep this topic around. Federico, hopefully you can join us for a few minutes every week. Let's write up an issue in KubeVirt.
A
Let's start... I can help populate some ideas, and maybe you can too, Federico, as to how we could do this, and let's just start with that. Because I think these graphs are really good; let's share them on the issue.
A
Let's get all our information out and just start doing some research and getting some ideas for how we can try and tackle this. Because, I mean, I think right now it's just a little too broad; we just need to get some more ideas on the table before we can say what we need to look at, absolutely.
B
Yeah, consider that this test will continue to run.
B
I don't know for how long, but I think that if we can leave it...
B
...running approximately forever, that's really good. So yeah, one thing that is really interesting: if you go... yeah, it's the first drop of the yellow and the red line, the first drop at the beginning.
B
...to put all the VMs in that condition, so that the RSS of all the VMs goes down. Okay, because it seems that...
B
...the first launcher can use less memory, okay, because it continues working, but five minutes previously it was using 10 megabytes more.
A
So... yeah, it's really interesting. The other thing that I still find curious about this is, I kind of separate this graph; like you said, there's live migration in here. Before 7/1 it's like, here's node one and here's node two, and node two performs fairly well; the launcher does pretty well here.
A
It really does not do well over here, except for, you know, this one area where the red and the yellow, like you're saying, are pretty much in line with what node two has. So it's also kind of interesting, I think. So what I think would be valuable, like I was saying with the issue: let's take your graphs, let's create an issue. You know, we could totally, you know, investigate...
A
...you know, why virt-launcher is taking, you know, an unusual amount of memory, or an unexpected amount of memory, whatever, and then, you know, we can explain why. And I think having this graph here and explaining what's unusual up here, and what's unusual in this drop, and what we probably want to see...
A
...you know, on the other side, and I think that could be our starting point. And we need to do some investigation as to, okay, let's, you know, start narrowing some things down, and we can start doing some different tests, and let's just use the issue as, you know, the place where we can track how we're going to investigate this. Then we'll bring it up in this call and we'll see how we progress.
A
Okay, good, yeah. Okay, well, so, Federico, I'll send that one to you. Can you please create the issue here, under this? There we go.
A
Okay, so, Federico, yeah, if you can, when you have that, you know, please tag it here and we'll start tracking it during our meetings. And I'll put myself on this investigation as well and see what I can do; soon I can answer on the discussion. Okay, thanks for the content.
A
Okay, next, let's go to this topic here: SIG Scale, approved VM... Christian, it's up to you, yeah.
C
This is an email from Marcelo from IBM on the group. Can you open it, and we'll talk a little bit?
A
Basically, he found a bottleneck, and he found it, I think, because of the density at which he was deploying his virtual machines and the way that he was doing it. It was applying a lot of pressure, and this led to a lot of timeouts, and the default 20/30 QPS and burst were way too low; he explains why they're way too low. So he raised it up to 200/400, and for that case he gets a massive improvement, which you can see here: his latency was 20 to 22 minutes at the default QPS and burst, and I think we went down to...
A
I think we went to this third line, I think, so 200, and now he's... I think the third graph, I'm not sure what it is, but it's maybe in seconds or milliseconds. So the improvement that he was able to see by doing this is pretty significant. But again, this is maybe different for your use case; like, I don't know how fast you create VMs, and, you know, they vary.
A
Yeah, but your rate might be different, your density might be different, your Kubernetes cluster might be different, and so you may not have hit this level of pressure that he was able to hit here. But the point is that if you are able to generate as much pressure as he was, then you will still be able to achieve the same level of performance now, with this change.
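For context, the QPS and burst values being discussed are client-side rate-limiter settings on the Kubernetes REST client. The following is a minimal sketch of what raising them looks like with plain client-go; the kubeconfig path and values are illustrative, not the exact mechanism that was changed.

```go
// Minimal sketch: raise the client-go rate limiter so a heavily loaded
// controller is not throttled on its own API calls.
package main

import (
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		log.Fatal(err)
	}

	// A plain client-go config defaults to roughly 5 QPS / 10 burst; values like
	// the 200/400 discussed above allow far more concurrent API requests.
	cfg.QPS = 200
	cfg.Burst = 400

	if _, err := kubernetes.NewForConfig(cfg); err != nil {
		log.Fatal(err)
	}
}
```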
A
Right. And so in his case, we'll just go to your example: you're creating 100,000 VMs, he's creating a thousand. He was seeing that it was taking 22 minutes, in some cases, for the VMIs to come up. Now, I mean, are you seeing that? Is it taking you that long for your hundred thousand VMs to, you know...
C
Just to understand: I mentioned this on my last... on the last call also, yes.
A
I don't think he... he may not even be using PVCs. Like I would have said, it's sort of apples to oranges. It's not... your case and his case are different, and you're talking about PVCs; I don't think that's what's limiting him, because if you see this QPS and burst, this is on the KubeVirt side. This has nothing to do with PVCs; this is all Kubernetes, like, this is...
A
This is all KubeVirt's pod-creation latency and VM-ready latency; this is all KubeVirt code here, it's nothing to do with the PVCs, and he's running into problems. So basically, what I'm saying is that it's totally orthogonal. The PVCs... what he's seeing in his experiment, he's just seeing with the VMs.
A
Yeah, well, I mean, all I wanted to mention with this one is that, with this experiment, to run into this problem you'd have to get to an amount of pressure equal to what he generated. But again, the pressure that he generated has to do with the type of cluster he's running, the type of hardware...
A
...he has, the density, the rate at which he creates things, and even just the specs, the VMI specs. All those things go into it. And he was able to increase the QPS because he was able to hit a bottleneck; he was able to notice a bottleneck, and he increased the QPS and burst and was able to fix it.
A
So I guess the point is: if you ever run into this, if you're using the default QPS and burst, which right now are 20 and 40, and you generate enough pressure, you need to increase it, and this will go away.
A
That's the takeaway. It sounds like you're not hitting it just yet, which is fine, but if you do increase your scale, or...
A
Yeah, yeah. So, I mean, it's different: you have a different Kubernetes cluster, different hardware, you've got a whole different setup, so the amount of pressure you're generating isn't quite equal to his. But if you do get there, your QPS and burst will need to be increased. So, I mean, when you take the new version you'll be fine; he's already taken care of it. In the new KubeVirt version he's got it increased, so that'll take care of it, so you won't hit this particular bottleneck.
A
Wonderful, wonderful, thank you so much. Sure. And then, let's see, let me read Roman's... so, memory usage.
A
And then, yeah, this one's cool. I think this is a really good... we've talked about this previously: this is a really good area that we can improve on, like fewer update and patch calls. Just because, in his experiments, I think per VM it was, I don't know, 50 or something patch calls.
A
I think he's got it right here somewhere. If we just decreased it a little bit, we would probably see some really nice gains.
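As an illustration of the fewer-patch-calls idea, several field changes can be folded into a single merge patch instead of one API call per change. The object name and fields below are placeholders, not KubeVirt's actual controller code.

```go
// Minimal sketch: one round trip that updates two labels at once, rather than
// issuing two separate patch calls.
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Combine multiple label updates into a single merge patch.
	patch := []byte(`{"metadata":{"labels":{"phase":"Running","ready":"true"}}}`)
	if _, err := client.CoreV1().Pods("default").Patch(
		context.TODO(), "virt-launcher-example", types.MergePatchType,
		patch, metav1.PatchOptions{}); err != nil {
		log.Fatal(err)
	}
}
```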
A
And in that regard, we'd be able to reduce our QPS and burst so that we wouldn't need to leave it at the level it's at now. And there's also a third issue, which is... what was the third issue? One of the graphs shows that one of the controllers, I think, was slow. I don't remember what it was.
A
Maybe he links to it at the bottom or something; I don't see it, but there's a third issue. Oh, here it is, the virt-controller node working one. This is one that we need to investigate; we need to do some profiling, probably, and do a deep dive into figuring out exactly why this is. Marcelo did some tracing, but it wasn't really conclusive as to what the problem was, so yeah, this is another one.
A
Okay, all right, I think... I think that's...