KubeVirt SIG Performance and Scale, 6 Jul 2023

From YouTube: SIG - Performance and scale 2023-07-06

Description

Meeting Notes:
https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.tybh

A

Okay, this is SIG scale; it's July 6, 2023. Did you add yourself as an attendee? All right, let's start with V1. I think we're almost to the point where we can wrap this up, right? Let's take a look at what we have that's still open.

A

Oh, Lee wanted to add some other metrics, right?

B

Yeah, I'm going to comment soon, but the instance type tests are in the density test, and we decided not to upload the density benchmarks because we have not had consistent results. That job was failing for a long time and the graphs are not really meaningful. Another issue is that in our scraping we only scrape the first two jobs.

B

We leave the instance type out, so we need to add another small fix to also scrape the instance type. I don't think this is going to be possible for V1, but it's definitely possible going forward.

A

Let's just create an issue and let him know where we're tracking it, so he's okay.

A

All right and then Andrew.

A

Okay, I think I asked for [unclear] to take a look. [unclear] get approval on, so I don't know if we need that, but all right. So let's make some progress there.

A

Okay, next. That was just an issue from three years ago. Okay, anything to say about this?

B

No, we discussed this last time. I think we need to wait for that thread on Google Groups, the KubeVirt dev mailing list thread where they are discussing the changes. Once that is settled, we can create a page for SIG scale.

A

That's one, two open ones.

B

So, apart from the KubeVirt doc, everything else will be moved to a different tracking issue, right? Because essentially we are saying that's not required for V1.

A

Yeah, so it's just these three, and then... oh, no, you just sent me this, right? Yeah, where's this? Is it this?

B

No, I'll put it here.

A

Rendered data for V1, okay. Okay, wait. This is... okay, manually updated scraped data. Wait, so this covers it, right? We've got this. This has the index.html; that's what we wanted. Okay.

A

Okay, all right cool.

A

All right, there we go. Okay, so that'll be the last one. We can merge that one today, and then I'm just going to close these out [unclear] until next Tuesday. Okay, so as soon as you can, let's get those comments closed out, and then I can ping Roman again and see if we can get this merged in the next day or so.

A

Okay. Oh, I know, we're going to talk about this Friday, and then...

A

And then we came to... okay. Should we move this out, then? Because we're saying this is going to be post-V1, right? So I think this needs to be a nice-to-have.

B

Yeah I'll do that.

A

Okay, all right, close, very close. I'll look at this after we get it merged. All right, what else? The benchmarks PR: is this something we should actually look at?

B

We already looked at it. There it is.

A

Oh, this was the documentation. Okay, yeah got it. Okay, good.

A

All right, maybe this will be quick and easy. Do you want to talk about the... oh, is there anything else about V1 before we move on? Then we can talk about flow control.

B

No, I think that's all for V1. The two open major items are that PR and the blog post. I think we have plans for both of them, so that covers it.

A

Okay, there it goes. All right, let's just quickly talk about flow control. So I think this is the test that you did, right? You did a test like this and we should have some results, right?

B

Yeah yeah, so I need some more exhaustive testing.

B

Okay, so in that I had 25k PVCs.

B

100 list requests per second, and I ran this for two configurations: one for five minutes and one for 30 minutes.

A

Oh yeah, [unclear]. All right, so what have you got for results?

B

The restrictive policy was that it allowed eight list requests. So however many requests are fired, the API server will allow eight of them to go through. With this, I saw that the API server memory plateaus. Initially, the API server was using around nine gigabytes of memory, and after this it was close to 14. So there is a spike for the first two minutes, and for the rest of the time the API server memory usage plateaus, so it's a straight line.
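
A minimal sketch of the cap described above, assuming it behaves like a fixed limit on in-flight requests. This is plain Python with `asyncio`, not the API server's flow-control implementation; `run_burst` and its parameters are hypothetical names, and `asyncio.sleep(0)` is only a stand-in for the server handling one list call.

```python
import asyncio


async def run_burst(total_requests, max_in_flight):
    """Fire total_requests at once and admit at most max_in_flight at a time.

    Returns (peak_in_flight, completed). Hypothetical helper for illustration.
    """
    admission = asyncio.Semaphore(max_in_flight)
    state = {"in_flight": 0, "peak": 0, "completed": 0}

    async def one_list_call():
        # Requests beyond the cap wait here until a slot frees up.
        async with admission:
            state["in_flight"] += 1
            state["peak"] = max(state["peak"], state["in_flight"])
            await asyncio.sleep(0)  # stand-in for the server doing the list
            state["in_flight"] -= 1
            state["completed"] += 1

    await asyncio.gather(*(one_list_call() for _ in range(total_requests)))
    return state["peak"], state["completed"]


if __name__ == "__main__":
    peak, completed = asyncio.run(run_burst(total_requests=100, max_in_flight=8))
    print(f"peak in flight: {peak}, completed: {completed}")
```

However many requests are fired, the number being served at once never exceeds the cap; the rest wait their turn, which is consistent with the flat memory line reported in the test.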

B

Although the test fires requests for 30 minutes... oh, actually, I was saying that for the 30-minute run. Sorry, I'll continue. Although the client fires requests for 30 minutes, the API server has enqueued a lot of requests and the client is configured to retry upon failure, so the test ends up stretching for a long period of time, let's say an hour or so, before I hit a session timeout.

B

So that was my other observation: when the client is done, the test is not done. It continues to process those requests, and we see the plateaued line extend to more than an hour.
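
A rough back-of-the-envelope model of why the run outlasts the firing window. The 100 requests per second, the cap of eight, and the 5- and 30-minute windows come from the test described above; the per-request latency of 0.25 s is purely an assumed number for illustration, and the model lumps queued and retried requests into one backlog, which is a simplification. `drain_model` is a hypothetical helper name.

```python
def drain_model(rate_per_s, fire_s, max_in_flight, latency_s):
    """Estimate how long a capped server needs to work through a burst.

    latency_s is an ASSUMED average time to serve one list request.
    """
    throughput = max_in_flight / latency_s        # requests served per second
    fired = rate_per_s * fire_s                   # total requests the client sends
    served_while_firing = min(fired, throughput * fire_s)
    backlog = fired - served_while_firing         # still waiting when firing stops
    drain_s = backlog / throughput                # extra time to clear the backlog
    return {"fired": fired, "backlog": backlog, "total_s": fire_s + drain_s}


if __name__ == "__main__":
    for minutes in (5, 30):
        result = drain_model(rate_per_s=100, fire_s=minutes * 60,
                             max_in_flight=8, latency_s=0.25)
        print(minutes, "min window ->", result)
```

Under these assumed numbers the server keeps working long after the client stops firing, and the longer window leaves a proportionally longer tail. The real durations depend on the actual request latency and retry behavior, which this sketch does not measure.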

A

So the difference between the two here is this memory plateau: it's just bigger when we hit 30 minutes?

B

The spike is not bigger, but this...

A

Is longer. Sorry, yeah.

A

Yeah, okay.

B

Because the client that I have used doesn't set a context timeout. So let's say client goroutine one has started a list call, and that list call, let's say, timed out or returned an unknown error. Then the same goroutine, the same client, will internally retry and continue to retry until it passes. So eventually that list call will give us an end-to-end time of, let's say, 15 minutes: it took 15 minutes for that list call to complete.

B

So if you have similar list calls queued up, you can imagine that even though the initial burst was 30 seconds, internally the library is spending more time, and that's why our test is stretching out. Because the test is stretching out, the load on the API server is stretching out, so we can see the plateau last longer than just 30 minutes.
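
The retry behavior described here can be sketched as follows. This is an illustrative Python model, not the client library used in the test: `list_with_retry`, `FakeClock`, and `flaky` are hypothetical names, the one-second backoff is an assumption, and a fake clock replaces real sleeping so the example runs instantly.

```python
class FakeClock:
    """Simulated clock so the sketch does not actually sleep."""

    def __init__(self):
        self.now = 0.0

    def sleep(self, seconds):
        self.now += seconds


def list_with_retry(do_list, clock, backoff_s=1.0, deadline_s=None):
    """Retry do_list() until it succeeds, or until deadline_s would be exceeded.

    With deadline_s=None the loop never gives up, mirroring a client
    with no context timeout. Returns (ok, attempts, elapsed_s).
    """
    start = clock.now
    attempts = 0
    while True:
        attempts += 1
        if do_list():
            return True, attempts, clock.now - start
        elapsed = clock.now - start
        if deadline_s is not None and elapsed + backoff_s > deadline_s:
            return False, attempts, elapsed
        clock.sleep(backoff_s)


def flaky(fail_first):
    """Return a fake list call that fails fail_first times, then succeeds."""
    remaining = [fail_first]

    def call():
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return call


if __name__ == "__main__":
    # No deadline: one logical call keeps retrying for 900 simulated seconds.
    print(list_with_retry(flaky(900), FakeClock()))
    # With a 30-second deadline the same call gives up early.
    print(list_with_retry(flaky(900), FakeClock(), deadline_s=30.0))
```

Without a deadline, a single logical list call can occupy its worker for many minutes, so the load on the server continues well past the point where the client stopped issuing new calls. Bounding each call with a deadline trades those late successes for a test that ends when expected.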

A

So that's a pass for both. I mean, we didn't OOM. So that's good. Yeah, okay.

B

We did not. I think my next step is to take one API server down and see if this test continues to pass with two API servers.

A

Okay, yeah, let's see. That's an interesting one. [unclear]

A

Let's see what we come up with. Okay.

A

There we go yeah.

A

Okay, good, that's cool. All right, let's see what we find from that; that'll be a good start. I think that's all we had from last time that I wanted to follow up on. I don't know, is there anything else we want to go through? I think it was flow control and V1, yeah.

B

I think at some point it would be good to do a post-V1 triage, but not yet. We'll have to create a new tracking issue for all the SIG perf and scale related items that are open, a new post-V1 issue for them, and we can triage that in the next one.

A

Yeah, okay, all right, we'll talk about it when the time comes. All right, I think we're good. Let's wrap up. Yep, sounds good. All right, thanks! Thanks, man. Talk to you later, bye-bye.