From YouTube: SIG - Performance and scale 2023-09-21
Description
Meeting Notes:
https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.tybh
A: So, the first item — these items have actually been carried over for a little while, since this is our first meeting in, I think, three weeks. A little while ago I came across this PR, and essentially what it is: it has to do with adding a status on the VMI spec that will assist with live migration. I commented on here that there's a... I can't see it here.
A: Basically, it's a field that we'd update pretty frequently on the VMIs, and so there's been some discussion about this. I left a comment on it and talked with them about it, and they're aware that there are some challenges here: if we took this PR, it would really increase the number of update calls, or the number of patch calls, to the VMI.
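For context on why frequent status writes are expensive, here is a minimal, hypothetical sketch (not the PR's actual code) of what a periodic status patch to a VMI would look like with the dynamic client; the field name is made up for illustration. Every VMI doing this on a short interval multiplies the PATCH load on the API server:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	vmiGVR := schema.GroupVersionResource{
		Group: "kubevirt.io", Version: "v1", Resource: "virtualmachineinstances",
	}

	// Hypothetical field name; the real PR's field may differ.
	patch := []byte(`[{"op": "add", "path": "/status/migrationDirtyRate", "value": 42}]`)

	// Refreshing this for every VMI every few seconds means one PATCH per VMI per tick,
	// plus one watch event delivered to every controller watching VMIs.
	for range time.Tick(5 * time.Second) {
		_, err := dyn.Resource(vmiGVR).Namespace("default").Patch(
			context.TODO(), "my-vmi", types.JSONPatchType, patch,
			metav1.PatchOptions{}, "status")
		if err != nil {
			fmt.Println("patch failed:", err)
		}
	}
}
```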
A: So there is an alternate approach that Roman proposed. Instead of going this route, what we can do — the way it was described to me — is this: the way metrics are reported right now in Kubernetes, pretty much everything goes through Prometheus, but I guess there's some proposal out there that would allow the user to change how metrics are reported, so that they don't necessarily go through Prometheus and we could send them to some other places.
A: In other words, we can have some sort of signal handling in the form of metrics, which is essentially what this is, and have something listening off the end of it. So instead of just Prometheus — we could send this to Prometheus and read it back out of Prometheus, but that's not what we want. We'd rather have these metrics sent out and have something else that we write listening for these formatted metrics, and then...
A: ...we can have some tool that does something with it. That was basically the suggestion Roman had in here: instead of having a bunch of objects, or a bunch of fields, that we update on the spec. So I don't think this is going to be implemented the way the PR currently does it, but we'll have to keep an eye on it.
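A minimal sketch of the metrics-style alternative, assuming a hypothetical per-VMI gauge (the metric name and value here are made up): instead of patching the VMI status, the handler exposes the value as a metric, and anything — Prometheus or a custom listener — can scrape it without generating API-server writes or watch events.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical gauge; the real proposal may use a different name and labels.
var migrationDirtyRate = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "kubevirt_vmi_migration_dirty_rate_bytes_per_second",
		Help: "Estimated rate at which the guest dirties memory (illustrative only).",
	},
	[]string{"namespace", "name"},
)

func main() {
	prometheus.MustRegister(migrationDirtyRate)

	// The process updates the in-memory gauge as often as it likes;
	// nothing hits the Kubernetes API server until a consumer scrapes /metrics.
	migrationDirtyRate.WithLabelValues("default", "my-vmi").Set(42e6)

	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":8080", nil)
}
```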
A: The use case here is this: live migration works in a lot of cases, but there are some cases where it doesn't work — or where it doesn't even make sense — and the reason is that there is a lot of activity on the VM.
A: It's writing a lot to disk, or to memory, or something, to the point that doing the live migration would be extremely challenging. Because if you think about it, we have to take the data across — the new VM and the old VM have to end up with the same state — and the migration has to pick up all the writes to disk and things like that and then start executing on the new VMI.
A: But the challenge is, if we're doing so much work on the old VMI — if it's writing a lot to disk or something — then it's really difficult to do the live migration. So the idea is that this is a metric to say: okay, this VMI is going to have a really hard time doing a live migration, because we've got a high dirty rate or whatever, while this other one will have an easy time, because it isn't actually doing a whole lot — it's not writing a lot to disk.
A: It won't have a hard time live migrating and moving over to the new VMI. So that's the background for this. The use case would be: if I wanted to migrate all of the VMs in my zone — I think it was mentioned this was for upgrades, for upgrading things — and I wanted to migrate them all, we'd want to increase the chance of success, and so with this metric...
A: ...we can figure out all the ones that we'll be successful with. Then I'm not sure how we'd handle the rest — I think maybe we'd need user intervention or something when we have VMs that are really hot and we don't want to migrate them right now. I'm not sure what we'd do there, but the idea is that we'd identify them.
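As a rough illustration of that selection step (the names and threshold are hypothetical, not from the PR), a bulk-migration tool could read the dirty-rate signal for each VMI and only schedule migrations for the ones below some threshold, deferring the "hot" ones for manual intervention:

```go
package migrate

// Candidate pairs a VMI with the dirty-rate signal discussed above.
// Where the number comes from (Prometheus query, custom agent, etc.) is
// intentionally left open; this only shows the selection logic.
type Candidate struct {
	Namespace, Name string
	DirtyRateBps    float64 // bytes/second the guest is dirtying memory
}

// Split separates VMIs that are likely to migrate quickly from "hot" VMIs
// that should be deferred or handed to the user. maxDirtyRateBps is a
// made-up knob, not an existing KubeVirt setting.
func Split(vmis []Candidate, maxDirtyRateBps float64) (migrateNow, deferLater []Candidate) {
	for _, c := range vmis {
		if c.DirtyRateBps <= maxDirtyRateBps {
			migrateNow = append(migrateNow, c)
		} else {
			deferLater = append(deferLater, c)
		}
	}
	return migrateNow, deferLater
}
```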
B: So I think the problem I see with this is that it can cause a watch storm. Because this is an update in the status field, any controller that is watching the VMI will always get an update event and trigger a reconcile for it. So yeah, I understand this is going to be very hard to scale.
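To make the watch-storm concern concrete: every status patch produces a new resourceVersion, so every informer watching VMIs receives an update event. A common mitigation in controller-runtime-style controllers — shown here purely as an illustration, since KubeVirt's own controllers are structured differently — is to drop events where only the status changed, for example by comparing generations:

```go
package controller

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	kubevirtv1 "kubevirt.io/api/core/v1"
)

// SetupWithManager wires a reconciler so that status-only updates (which do
// not bump metadata.generation) are filtered out instead of triggering a
// reconcile for every frequently refreshed status field.
func SetupWithManager(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&kubevirtv1.VirtualMachineInstance{},
			builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Complete(r)
}
```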
A: Yep, so hopefully the solution I mentioned is going to be the path forward, and not something on the status field. Unfortunately, I don't fully understand it, because I haven't read anything about it yet, but Roman's comment is somewhere in here — I don't know which one, but somewhere in here he made that recommendation.
A: I don't see it, but I think that's the path this is going to take: they're going to try to find some sort of other signaling outside of this — some sort of API outside of this, I guess you could say, but not something like a CRD. Since this is essentially a metric...
A: ...you use the metrics system — not Prometheus necessarily, but some other way of emitting a signal like a metric — and then have something scrape it, some sort of agent or something. So I don't know, but as far as I'm concerned, as long as we don't grow the spec, I don't think there will be any concerns for us on the scale side.
A: Okay, cool, all right. So nothing more on this one. I think we'll skip this item and do it later; let's go to Jed's question: he's currently trying to get a better understanding of the rate-limited work queues and the metrics exposed by the virt-controller controllers — like the VMI and migration queues — looking at Prometheus graphs of the KubeVirt workqueue depth, and expecting the values to go up when creating lots of VMs at the same time, but they rarely do. Anyway, does someone know? Okay, I think so.
A: I guess the workqueue depth is only going to go up when a reconcile takes too much time. I mean, in theory the number of queued items can go up if we run into a situation where we're doing a lot of work, but it's not only dependent on the amount of work we do — it's much bigger than that, because we're calling Kubernetes and we're calling lots of other things in KubeVirt.
A: It would take something in that chain of events to be slow, and then some retries to happen, and then for us to build up a work queue. So I think a bunch of things would need to happen, not just tons of concurrent VMs and migrations.
B: I think there is one more thing around this: the workqueue metrics are exposed by the underlying client-go as well, and those are not prefixed with kubevirt_, so it might be worth checking those too.
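For reference, these are the standard client-go workqueue metric names to check alongside the kubevirt_-prefixed ones; each series carries a name label identifying the specific queue (listed as Go constants purely for convenience):

```go
package metrics

// Standard client-go workqueue metrics (no kubevirt_ prefix); each series has
// a "name" label identifying the specific work queue.
const (
	WorkqueueDepth          = "workqueue_depth"                             // current number of items waiting in the queue
	WorkqueueAdds           = "workqueue_adds_total"                        // total items added to the queue
	WorkqueueQueueDuration  = "workqueue_queue_duration_seconds"            // time items spend waiting before being picked up
	WorkqueueWorkDuration   = "workqueue_work_duration_seconds"             // time spent processing an item
	WorkqueueUnfinishedWork = "workqueue_unfinished_work_seconds"           // in-progress work not yet observed by work_duration
	WorkqueueLongestRunning = "workqueue_longest_running_processor_seconds" // longest currently running processor
	WorkqueueRetries        = "workqueue_retries_total"                     // total retries handled by the queue
)
```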
A: Then, on testing bulk migrations: "Scheduled 100 VM migrations, waited for completion, and then scheduled another 100 migrations. The pace was slowly degrading with every bulk, starting at 20 seconds and reaching 1,570 seconds in the last bulks. In order to debug this I scheduled 800 VM migrations" — and so this is about tracking down the root cause.
A: If this showed seconds per VM and per bulk, it would be easier to, you know... Is there any sort of — I don't know if there's any sort of — we would see this, wouldn't we? I think so. So he's saying — whereas Jed was saying he didn't see it in the KubeVirt workqueue depth — but maybe it's not there. This is what you were saying: it might not be there, it might just be the workqueue depth from client-go that shows this.
A
Okay,
so
I
guess
there's
an
open
question.
I,
don't
also
understand
this
Behavior
like
five,
so
it
also
I
think
that's
another
thing.
It's
like
the
behavior
doesn't
seem
right
and
that,
like
we
should
be,
we
should
be
taking
things
whenever
there's
one
of
these
one
spot
free
in
the
parallel
migrations
per
cluster,
it
should
be
freed
up.
A
We
should
take
another
one,
I
guess.
Maybe
that's
not
the
that's
not!
The
point
is
that
maybe
what's
happening
is
so
parallel,
migrations
per
cluster.
So
it's
doing
five
migrations
and
at
the
same
time
it's
got
a.
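For reference, the limit being discussed lives in the KubeVirt CR's migration configuration (spec.configuration.migrations). A minimal Go sketch of the relevant knobs, assuming the field names in kubevirt.io/api/core/v1 (worth double-checking against the current API), with the commonly cited defaults of 5 migrations per cluster and 2 outbound per node:

```go
package config

import (
	kubevirtv1 "kubevirt.io/api/core/v1"
)

func uint32Ptr(v uint32) *uint32 { return &v }

// migrationTuning returns the migration limits discussed above. KubeVirt's
// defaults are believed to be 5 parallel migrations per cluster and 2
// outbound migrations per node; changing them changes how many of the
// queued migrations can be in flight at once.
func migrationTuning() kubevirtv1.MigrationConfiguration {
	return kubevirtv1.MigrationConfiguration{
		ParallelMigrationsPerCluster:      uint32Ptr(5),
		ParallelOutboundMigrationsPerNode: uint32Ptr(2),
	}
}
```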
B: Yeah, I think what I'm understanding is — and I don't know if this is possible — that there will be 95 of them sitting in the queue, and for some reason they are exhausting the queue with constant updates, so none of them gets processed. I don't know if that's what is happening, but while the five are being migrated, the 95 could, you know, continue to re-queue and cause watch-queue exhaustion. I don't know if that's happening here.
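A small sketch of the re-queue pattern being described here, using client-go's rate-limiting work queue (purely illustrative; the numbers and the "over the parallel-migration limit" condition are hypothetical): each deferred migration goes back onto the queue with backoff, and with a large enough backlog the queue can stay busy with items that cannot make progress yet.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Same kind of rate-limiting queue the controllers use.
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	// Pretend 100 migrations were created in one bulk.
	for i := 0; i < 100; i++ {
		q.Add(fmt.Sprintf("migration-%d", i))
	}

	inFlight := 0
	const parallelMigrationsPerCluster = 5 // hypothetical stand-in for the cluster-wide limit

	for q.Len() > 0 {
		key, _ := q.Get()
		if inFlight >= parallelMigrationsPerCluster {
			// Over the limit: the item is not processed, it just goes back on the
			// queue with backoff. With 95 such items, this is what keeps the queue busy.
			q.AddRateLimited(key)
			q.Done(key)
			continue
		}
		inFlight++ // start this migration (actual processing omitted)
		q.Done(key)
	}
}
```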
A: So I guess another thing we're missing is: why is it that — I guess the thing I wonder is, what would happen if you increased that setting, say to a hundred or a thousand? I don't know what would happen.
B: I'll try to take a look at the work queue and see whether what we are doing is similar to client-go or not.
A: I guess that would be the question. Okay, I'll just message Jed on Slack so we can correspond with him that way, and maybe he's seeing something on the virt-controller side — I don't know. I'll tag you on the thread as well and we'll just go about it that way.
A: Okay, all right. Is there anything else? Let me check our post-v1 tracking. Is everything merged here — is everything all done, or did we... Okay, still some open items.
A: Oh, this is a PR review. Wait, where's the... this one.
A: Okay, these are all still open. Do we need to have someone like Daniel take a look at some of these?
B: He was helping out with one — the thing that we need next, the updating of the graphs — but it has fallen off the back burner. So, okay.
B: Look at the work-in-progress one.
B: Okay, here — 2931.
B: Yeah, so that's the PR. It has been a while and it's a small one, so I'll try to ping Lubo again to see if he gets time for this. If not, maybe I can help out and push changes to that PR myself.
A: Okay, yeah, all right. If you could keep thinking about that, let's see if we can get this further. Yeah, I thought we were close, because I thought we had...
B: Yeah, so I think Lubo ran into one issue, which was that the PR here creates a post-submit job. Our current automation scrapes the metrics and puts them in the ci-benchmarks repository; then this job gets triggered, because it's a post-submit job, generates the HTML, and pushes it to GitHub again. So the issue was that, because this was a post-submit job, it would always recursively invoke itself, which creates a problem.
B: It's okay to do a pre-submit, right — just the day after?
A: Yeah, that's fine. We don't need these to be same-day — yeah, that works fine. Okay, cool, all right. Well, you can start there with him; let's see if we can get this going again. It sounds like we've already thought a lot about this and we're pretty close — we just need a little bit more to get it over the finish line. Yeah, cool. Okay, all right, thanks. I think that's all.
A: Okay, okay, all right. Anything else? We'll move this one — we'll just track it for whenever you have time, and we'll talk about it then.