KubeVirt SIG Performance and Scale, 11 May 2023

From YouTube: SIG - Performance and scale 2023-05-11

Description

Meeting Notes:
https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.tybh

A

Okay, SIG Scale. It's May 11th, 2023.

A

Let's see. Please add yourself to the attendee list.

A

All right, so we're going to start off with something I mentioned last week: today's the day we're going to talk with Kubernetes SIG Scale. If you want to find the invite, I think it's here. I think you just click one of these things that has the calendar link; I think it's this one. It's this afternoon Eastern time, later this morning PST. I talked last time about the slides that we have.

A

Let's copy them up here.

A

They're the same; I haven't made any changes, so the plan is going to be the same. I'll talk through the slides really quickly: our mission, a little bit of how we measure and what we do, then some interesting tidbits on how we can estimate scale using these metrics, and then we'll talk about the collaboration.

A

Hopefully this is where the main focus is going to be. We have these questions about how we can collaborate, and hopefully this leads into a bunch of discussion.

A

Okay, all right, let's look at the performance jobs. Elaine, did you get a chance to integrate the dedicated cluster? Let me just check it real quick and make sure it's running. It is, okay. The performance cluster, no?

C

I did not. I'm looking at the artifacts directory first. My plan is to knock those things out, since they're easier changes, and then look at the integration.

A

Okay. All right, was there anything in here? I don't know, let's just go through.

C

It was the same patterns as last time. The one we noted last time that was sporadic did not give as clear a signal. The node count is pretty okay. Let me see, it's the create VMI count that is bursty.

C

Ah, sorry, it's in the VM, not here.

C

Yeah, we have already noted this. We need to spend time looking into why this metric is like that. Apart from that, everything has been the same across last week and this week. Okay.

A

All right. So for this work, we've got the artifacts directory for the dedicated cluster. I'm kind of zooming out a little bit in terms of V1, and I want to see what we could try and tackle for V1. What's the repo we're posting to?

C

I mean, this is on my personal GitHub right now. We need a couple of things from the community. One is the repository; there's an issue in project-infra.

C

The other thing is that the PR for the latest bug fixes is out there. We need to get that merged as well.

A

Okay, do you have a link to these? We need to put them in here. So we need this one, and we need the artifacts directories for density jobs. What's this one? Oops.

C

Okay, I'm putting it here.

C

And then issue for.

C

Yeah, I don't know.

C

Yeah, I'm currently working on the last point. The first two need some attention. Okay.

A

So which one is creating the repo? We just have the existing issue, right? Okay, we need to ask Lubo.

A

All right, I can follow up with Lubo on this one and get the traction to get this going, and then this is

A

Okay, so this is the posting. The superb update, we're getting a spot to include all the graphs I want to index.

C

Yeah, we just need somebody to bless and approve that.

A

Oh, okay. Looks like everything seems okay. All right, I'll ask Brian and Daniel to take a look at this one, and let me get Lubo on this one so we can create the project.

C

Yeah, I was wondering: if we are going to iterate on this tool, would it make sense for us to have SIG Scale OWNERS files in there? It would be much easier to reduce latency that way, for a specific directory, which is our tool. If you go back to that PR, yeah, the perf report creator. That tool is specifically ours.

A

Yeah, all right, I can ask Daniel. That should give us that, and then we get the repo, you've got this bug fix, and then there's scraping the density data. Okay, so what else do we need? Because we want to get to a point where we're posting something.

C

Yeah. So after the repo is created, we will need a bunch of Prow config for that repository, and that Prow config will be very similar to the ci-health project.

C

Okay, yeah, that ci-health project already has automation to run these scripts on a weekly basis, so we can just model it on that.

A

Okay, so that gets us this data automatically, and it stores it inside this new repo. Okay. So then the last step is going to be that we need to post the data somewhere, now that we have it in this repo.

A

Let's see, so ci-health has

A

We can do this in a few different ways. I'm not particular. It's got some badges here; we could do that.

A

I guess I'm inclined to have this at least on our repo for a little bit before we post it, and maybe we post it in KubeVirt at some point. I don't know, we can start with something like this. We just create some badges in the README. Sure.

C

All right, yeah. I think we can take P95 creation-to-running as the metric that we will expose as a badge or chart on the README.
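As a rough illustration of the metric named here, a P95 over per-VMI creation-to-running latencies could be computed as in the sketch below. The nearest-rank method, the sample latencies, and the badge wording are all assumptions made for illustration; the transcript does not say how the group's tool computes the value.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical creation-to-running latencies in seconds, one per VMI.
latencies = [4.1, 3.8, 5.0, 4.4, 12.7, 4.0, 3.9, 4.6, 4.2, 4.3]

p95 = percentile(latencies, 95)
# Hypothetical badge text for a README.
badge = f"p95 creation-to-running: {p95:.1f}s"
print(badge)
```

A percentile is used rather than a mean so that a single slow VMI, like the 12.7 s outlier in the made-up data, is visible in the badge instead of being averaged away.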

C

Right, that sounds good, yeah.

A

Yeah, that's very good.

C

I expect that a lot of time will be spent creating that Prow config, so I really hope we can get the repository as soon as possible so that we don't miss the V1 timeline.

A

I think we need Lubo to do a bunch of these, because Lubo understands this really well. If he can help us on this one, let me talk to him and see if we can get some of his time to do these for us.

C

Yeah, okay. Then, for the tool itself, I think we need one bullet point for integration with all the jobs. Right now it only takes the performance job, which does not include the density job. Once we have that, I think we can round it up: the full set of bullet points needed to execute this.

A

Okay, so this gets us all the data. And this is the other piece that it needs, so those two together, then.

A

Okay, and these are all kind of related. Okay.

C

Yeah, okay. I wonder if it would be helpful to just create a tracking issue in, I don't know, KubeVirt or somewhere for SIG Scale, and integrate it into the V1 issue that you have.

A

Yeah, that's not a bad idea. What I kind of want to do is get this project created and then create the issue in there, and then we can just dump all of these in there and tag it over to the V1 issue. I like the idea of having it as part of V1, so we have a tracker there. Okay, Lubo should be able to do this. It should take him a second just to create this project. So I can ping him after we finish here, and we'll see if we can close this quickly.

C

One more thing I wanted to share: after the second bullet point, once that PR merges, I am hoping to create something. Currently there are two months of data in the SIG performance job. We will continue to do these eight-week charts, but just for now I was hoping to create a two-month analysis as a one-off and just store it for our

A

All right. I think once we have all the pieces in place, then we can get the OWNERS file. I think it's just going to be easy; you can fire away on these PRs. Okay, I'll take this after this meeting and we'll get an answer about this. Okay, this is a good picture. I think the plan makes sense.

A

Is there anything else? I think we got sidetracked; I don't know if we finished going through everything here. We were just talking about the create VMI counts. I don't know if there was anything else that came in that affected anything.

C

No, doesn't look like it. This one, again, we spotted as sporadic last time: the first chart you have.

C

So I mean, the same patterns are repeating. We did not see anything go up or down.

A

That's fine. Okay, sounds good. All right, I think we covered everything. Is there anything else that we want to talk about?

C

So there is one more thing I wanted to bring up. This is regarding the Kubernetes SIG Scale meeting later today. We found one issue with PVCs.

C

In Kubernetes there are the PV and PVC controllers, that is, the volumes and claims controllers, which reconcile PVCs and volumes.

C

The way it works is that those controllers enqueue all the PVCs to the controller work queue every second. What we observed is that if there are a bunch of pending PVCs, the work queue keeps growing so much that it will have a work queue latency of eight or nine seconds in order to process that queue. I've identified the problem as this one-second enqueuing, which causes all the PVCs to be on the queue.

C

The items that are on the queue are processed as follows: if the PVC is in the Bound state, it is skipped, but if the PVC is in the Pending state, one event is created stating that this PVC is pending. So if you have 100 or 200 PVCs in the Pending state, 200 events will be created every second. It's not much, but it causes the work queue latency to get really bad, at least the P95s. So I'm hoping at some point

C

we make enough progress in the discussions that we can raise this point. I did not get a chance to create an issue or see if there is already a tracking issue for this, but I just wanted to share and get your thoughts.
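The behavior described above can be pictured with a small toy model. This is only a sketch under loud assumptions: a single worker, a queue that does not hold the same key twice, and made-up per-item costs (`skip_cost` for a Bound claim, `event_cost` for a Pending claim that emits an event). It is not the Kubernetes controller code, and the numbers are chosen for illustration rather than measured.

```python
from collections import deque


def simulate(bound, pending, seconds, skip_cost=0.001, event_cost=0.04):
    """Return how long each processed item waited in a single-worker queue.

    Every claim is re-enqueued once per second unless it is already queued.
    The costs are hypothetical per-item processing times in seconds.
    """
    costs = {f"bound-{i}": skip_cost for i in range(bound)}
    costs.update({f"pending-{i}": event_cost for i in range(pending)})
    queue = deque()
    queued_at = {}  # key -> time it entered the queue
    waits = []
    clock = 0.0  # time at which the worker is next free
    for tick in range(seconds):
        # Periodic resync: enqueue every claim that is not already waiting.
        for key in costs:
            if key not in queued_at:
                queue.append(key)
                queued_at[key] = float(tick)
        # The worker processes items until the next resync fires.
        clock = max(clock, float(tick))
        while queue and clock < tick + 1:
            key = queue.popleft()
            waits.append(clock - queued_at.pop(key))
            clock += costs[key]
    return waits


quiet = simulate(bound=200, pending=0, seconds=30)
busy = simulate(bound=0, pending=200, seconds=30)
print(f"max wait, all Bound:   {max(quiet):.2f}s")
print(f"max wait, all Pending: {max(busy):.2f}s")
```

In this model the wait time tracks the cost of one full pass over the claims, so cheap skips keep the queue drained within each second while the more expensive Pending path pushes waits to several seconds. Changing `event_cost` changes the result directly, which is why the figure should be read as illustrative only.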

A

Yeah, I'm trying to think. This is good. This is a very specific example that we've noted, and we've already identified a very specific problem. And there is another one: the deletion cleanup controller for PVCs has problems. We've run into that one.

A

These are all really good. So here's what I'm thinking for this approach: my goal for this meeting is just to build a bridge for us, so that they can understand what we're trying to accomplish in our own little SIG here in KubeVirt, and to get to a point where we attend this meeting that they have on a regular basis.

A

And this isn't just KubeVirt. KubeVirt has its own specific use cases for these, right, but then we get into all the ones that we have internally at NVIDIA, like all of the controllers that we've run into. So the way I'm seeing it, we can start this off with this topic, where we talk about KubeVirt specifically, and then as we attend more meetings, we get to know them.

A

Let's take it as a good opportunity to get very specific about how we're doing things and the problems we're observing. We can talk about it in terms of KubeVirt as well if we want to, because we're using KubeVirt in this case, and we can talk about how it affects us. So I agree.

A

Talking about these problems specifically, and about ways that we can measure more, is going to help tremendously, and it may even start to make obvious the problem that you're hitting right now. Maybe we're able to find some states in PVs and PVCs, like Pending or attached, and we start to see that Pending is taking a really long time, because now we're measuring it. Then we start to dig into it, and we find that, okay, here's the controller, like you already dug into in the code. That's a way that we can collaborate.

A

So I guess what I'm saying is, if you want to create issues for these upstream in Kubernetes, feel free, but we could also use them in future calls. However you want to approach this is fine. But I'd say for this meeting, let's just get a few questions out there and see how they respond. I have no idea what we're going to get. So let's just see how they respond, and hopefully build a bridge to future collaboration.

C

I agree. Yeah, no, I agree with your approach. I just wanted to share it here because it might affect the PVC attachment time, and if we get good enough traction and discussion going, maybe we can have these specific issues as follow-up items that we can post on there.

A

Yeah, maybe what ends up happening here is that PVCs are one resource that we're looking to improve scaling on, and this is just measuring the PVC, right? There could be so many things that come out of this where we say we want to improve PVC attachment performance. Maybe that's the issue, or the KEP, or whatever you want to look at it as.

A

And the way we look to improve is we start with just a timestamp where we measure, and then we start to look at specific areas like we've already done, where you have a specific scenario where there's a problem and we've got a controller that runs into a work queue that gets overloaded. All these things can fit together. I could see just this PVC portion of it being its own initiative.
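To make "start with just a timestamp where we measure" concrete, the sketch below computes how long each claim spent before it was seen as Bound. The claim names, the timestamps, and the 10-second threshold are hypothetical, and it assumes that whatever tooling watches the claims records both the creation time and the time it first observed the claim as Bound.

```python
from datetime import datetime, timezone

# Timestamp layout assumed for the recorded observations (UTC, second precision).
FMT = "%Y-%m-%dT%H:%M:%SZ"


def parse(ts):
    """Parse a UTC timestamp string into an aware datetime."""
    return datetime.strptime(ts, FMT).replace(tzinfo=timezone.utc)


def pending_seconds(created, bound):
    """Seconds between a claim's creation and the first time it was seen as Bound."""
    return (parse(bound) - parse(created)).total_seconds()


# Hypothetical observations: claim name -> (created, first observed Bound).
observations = {
    "pvc-a": ("2023-05-11T14:00:00Z", "2023-05-11T14:00:03Z"),
    "pvc-b": ("2023-05-11T14:00:00Z", "2023-05-11T14:00:12Z"),
}

durations = {name: pending_seconds(c, b) for name, (c, b) in observations.items()}
# Flag claims that stayed unbound longer than a made-up threshold.
slow = [name for name, secs in durations.items() if secs > 10]
print(durations, slow)
```

Once durations like these are collected per run, outliers point at the specific scenarios worth digging into, which is the progression from measurement to root cause described in the discussion.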

A

I mean, there may be one for networking as well. There's a bunch of this stuff that I think we might find out of this.

C

Yeah, no, I agree with you 100%, and I think that's the main reason why I'm sharing: I wanted to make sure this is captured in one of our docs, and when the time comes for it, we can take it as a follow-up discussion point after this.

A

Yeah, that makes sense. Let's keep it in mind; we have it noted in our doc here. I kind of like this way of approaching it: maybe we can look to just enhance PVC performance, and we list off the things that we see. Like we were saying earlier, with the PVC cleanup, the work queue is not good. You noticed another one; there are probably more.

A

Maybe we just lump these together as PVC performance, and as part of that we have the measurements that we would need anyway, because we're going to need to actually measure these things in automation, in a gate or whatever. So I think maybe they're all the same initiative.

A

Makes sense.

C

So the controller work queue problem I explained is prevalent in all four controllers: PV, that is volumes; PVC, that is claims; PV protection, which is the cleanup; and PVC protection, which is the same. They all enqueue every second. The only caveat is that the Pending state affects only the creation timestamp, and the deleted state affects the other two, so they are divided like that in terms of when they get affected. But the underlying issue is the same, and you're right that this can all be one single PVC performance thing. Makes sense.

A

Yeah, okay, I like this. We'll hold this in the notes; this is something we're going to keep around. I think that's our goal, right? We want to build these into KEPs or initiatives or issues, whatever, and start putting code down for this. This makes sense. This is good. We've already got some more concrete stuff we want to build after we do some of this initial stuff. So that's good.

C

Okay, okay. Looking forward to that meeting, yeah.

A

Yeah, me too. Okay, anything else from folks?

A

Okay, I think we're done. All right, everyone, thank you. Talk to y'all later. Bye-bye.