From YouTube: SIG - Performance and scale 2022-06-02
Description
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.tybh
A: All right, welcome to SIG Scale, everybody. I'm going to put the link to the notes in the chat. Please add yourself to the notes as an attendee.
A: Let's take a look; we're still getting some failures. We're going to look at both of these.
A: Okay, so for those of you who don't know, this performance job is something we run periodically. It'll go through, create 100 VMIs, and grab a bunch of metrics.
A: We have an audit tool, a little script that goes through and grabs metrics, and then we have a bunch of thresholds that we compare against. The way it works is, at the end of the test,
there's a summary. The way to read this is: like I said, there are a hundred VMIs that we create, but you'll see this isn't exactly 100, and that's actually expected. We've done an extensive amount of work to try and figure this out, because it's really tricky to get an exact value, given the way that Prometheus, as a time-series database, does its measurements.
A: What we actually need to measure is rates of change, and Prometheus does this by doing estimations over periods of time. So the create-pods count is an estimation of what we would see over a certain amount of time. Say our test ran for three or four minutes.
A: Whatever it is, Prometheus does some sort of extrapolation to get us a value, so it'll never be exact, but it should be roughly close to the exact amount that we see. So we get 105 here, and "create" is the create request that we make to Kubernetes. We're actually grabbing those metrics, and we're using this kind of as our anchor point to say: okay, this is what we expected in the test.
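A minimal sketch of the extrapolation idea being described, assuming made-up scrape timestamps and counter values (Prometheus' real increase() is more careful, e.g. it caps extrapolation near the window edges):

```go
package main

import "fmt"

// sample is one scraped observation of a cumulative counter.
type sample struct {
	t float64 // timestamp in seconds since test start
	v float64 // counter value, e.g. total create requests seen so far
}

// extrapolatedIncrease loosely mimics the idea behind Prometheus' increase():
// take the raw counter delta between the first and last sample in the window
// and scale it up to cover the whole window. This is why the reported create
// count is close to, but rarely exactly, the number of requests actually made.
func extrapolatedIncrease(samples []sample, windowSeconds float64) float64 {
	if len(samples) < 2 {
		return 0
	}
	first, last := samples[0], samples[len(samples)-1]
	sampledInterval := last.t - first.t
	if sampledInterval <= 0 {
		return 0
	}
	return (last.v - first.v) * windowSeconds / sampledInterval
}

func main() {
	// 100 creates actually happened during a 180-second test, but the
	// scrapes only bracket 170s of it, so the estimate lands near 103.
	scrapes := []sample{{t: 5, v: 3}, {t: 90, v: 55}, {t: 175, v: 100}}
	fmt.Printf("estimated creates: %.0f\n", extrapolatedIncrease(scrapes, 180))
}
```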
A: So what we actually do is take this metric and use it to compare against some of the other ones. In other words, when we create 100 VMIs, here's how many API calls we expect of each type. We expect a certain amount of each of these, and the ones we're very confident are stable are the ones we created thresholds for; those are right here. The way to read this is:
A: We take the relationship between the number of update requests and the number of create-pod counts, and we have a certain threshold, I think it's ten to one, that we allow. As long as it's within that threshold, we're happy; we haven't regressed in the number of update calls. Same with patch.
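A small sketch of the kind of ten-to-one ratio check being described; the metric names and the helper here are illustrative placeholders, not the audit tool's actual configuration format:

```go
package main

import "fmt"

// ratioThreshold expresses "no more than maxRatio of numerator per
// denominator", the kind of relationship checked at the end of a run.
type ratioThreshold struct {
	numerator   string  // e.g. "update-vmi-count" (illustrative name)
	denominator string  // e.g. "create-pods-count" (illustrative name)
	maxRatio    float64 // 10.0 means "at most ten to one"
}

func (t ratioThreshold) check(metrics map[string]float64) error {
	num, den := metrics[t.numerator], metrics[t.denominator]
	if den == 0 {
		return fmt.Errorf("denominator %q is zero", t.denominator)
	}
	if ratio := num / den; ratio > t.maxRatio {
		return fmt.Errorf("%s/%s = %.1f exceeds %.1f: possible regression",
			t.numerator, t.denominator, ratio, t.maxRatio)
	}
	return nil
}

func main() {
	metrics := map[string]float64{
		"create-pods-count": 105, // the extrapolated anchor metric
		"update-vmi-count":  412, // well under 10x the anchor, so this passes
	}
	t := ratioThreshold{"update-vmi-count", "create-pods-count", 10}
	if err := t.check(metrics); err != nil {
		fmt.Println("FAIL:", err)
	} else {
		fmt.Println("PASS")
	}
}
```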
A: I think those are the main two that we have thresholds for. For this run, it all looks good; this job obviously passed. You can see the running phase: we had 100 in the running phase. That one is an exact metric, because it's just a count at a specific point in time. And then we have the amount of time it took for each of the VMIs to go through their phases.
A: So from create to running, we saw 90 percent within our threshold, which we say is 45 seconds, and almost none took more than 25; most VMIs took 25 seconds. The p95 is 38 seconds, and we expect it to be less than 60. We don't have a threshold for p99, because that can vary widely: it's sometimes as high as 60 or 70 seconds, and sometimes as low as what we see here, 39. So we don't even count it; it's kind of a statistical outlier, but it's interesting to see.
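A sketch of how a percentile threshold like the p95-under-60-seconds check could be evaluated; the nearest-rank method and the sample durations are assumptions for illustration:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0-100) of the given durations,
// using the nearest-rank method on a sorted copy.
func percentile(durations []time.Duration, p float64) time.Duration {
	sorted := append([]time.Duration(nil), durations...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(p/100*float64(len(sorted))+0.5) - 1
	if rank < 0 {
		rank = 0
	}
	if rank >= len(sorted) {
		rank = len(sorted) - 1
	}
	return sorted[rank]
}

func main() {
	// Made-up create-to-running transition times for a handful of VMIs.
	times := []time.Duration{
		21 * time.Second, 23 * time.Second, 25 * time.Second,
		24 * time.Second, 38 * time.Second, 26 * time.Second,
	}
	p95 := percentile(times, 95)
	const limit = 60 * time.Second // the threshold discussed above
	fmt.Printf("p95=%v, within threshold: %v\n", p95, p95 < limit)
}
```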
A: So this test looked good. This runs periodically off of the master branch, and, like I said, it's just a tool that we developed; you can actually run it locally after any test that you do in your cluster.
A: Okay, we might not be able to get to it here; we might just let it load in the background, and maybe we'll get an answer in a few minutes. Okay, so this is one of our tests. Let me go to a few other ones. So this one is periodic; we have a number of periodics, and we have two more periodics to go through.
A: Let's see if it's the same. I'm going to open up the last one, which is the pre-submit job. This is an optional job; it's the same thing as the periodic, except that we allow people who are doing pull requests to optionally run the performance tests.
A: Oh, there's, excuse me: presets.
A: And then, let's see the thresholds. Actually, you can see here that sometimes we get a lot of other metrics that we scoop up during this time, since we grab quite a bit of the API requests; sometimes there are other ones, like list KubeVirts, for example. That's why we don't have thresholds for some of these: they're inconsistent. They show up sometimes and sometimes they don't, so we ignore them. But they're still important to keep an eye on when they do pop up, because if we're doing 100 VMs, we would never want this value to have any correlation with the create-pod count, or we're going to be in trouble, because this call is expensive. So it's good that these are all low; it's all expected.
B: Yeah, anyway, I think this one has similar failure characteristics to the other one. The similarity that I observed is at line 2091: one VM is not running, and then at the end of the test it says the phase of one VM is not running. I'm not sure if that's a red herring, but that's the similarity I noticed.
A: Okay, yeah, I just saw this: that one is in Scheduling.
B: Yeah, you see line 2349: the phase is not running.
A: Okay, let's see. I wonder if this might be a case where we need to increase the memory. Let me see if I can get the artifacts here and tell us why.
A: Here it is. Okay, so it's insufficient memory; that's what I thought it was. The memory is not high enough on these nodes. All right, this is an issue we've seen in the past. We just raised it a few weeks ago, and it looks like we need to raise it again; it's just a little bit too tight.
B: Just so I understand, we need to increase the memory for the provisioned clusters that are...
A: Yep, all right, good. That should fix that one, and that's probably what's going on; especially since you saw it in both cases, that's likely what's going on in both cases. Okay, good, that should fix that. And then we have the other work in progress, for the dedicated cluster performance job.
A: Okay, all right. I don't think Marcelo is here today. This is Marcelo's fix that he's working on for the load generator; this is what will fix the dedicated cluster performance job. I haven't talked a ton about that job. It's over...
A: It's this one; these two, actually. These are run on a dedicated cluster, which is better for us to do scale testing on. Right now, the target for this work is to run what we call burst tests, and burst tests we're defining as: create a bunch of VMs
at whatever variable rate, then wait till they're running, and then delete them. There's a lot of variation to that: we could create them at a rate, we could create them and wait; there's a lot of things we can do. That's one of the two types of tests that we're going to do, and it's the one we're starting with first.
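A rough sketch of that create/wait/delete burst cycle; createVMI, deleteVMI, and waitAllRunning are hypothetical stand-ins for real KubeVirt client calls, and the count and rate are example knobs:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical cluster helpers standing in for real KubeVirt client calls.
func createVMI(name string) error         { fmt.Println("create", name); return nil }
func deleteVMI(name string) error         { fmt.Println("delete", name); return nil }
func waitAllRunning(names []string) error { fmt.Println("waiting on", len(names)); return nil }

// burst creates count VMIs paced at the given rate, waits until all are
// running, then deletes them all: the cycle described above.
func burst(count int, rate time.Duration) error {
	names := make([]string, 0, count)
	ticker := time.NewTicker(rate)
	defer ticker.Stop()
	for i := 0; i < count; i++ {
		<-ticker.C // pace the create requests at the configured rate
		name := fmt.Sprintf("burst-vmi-%d", i)
		if err := createVMI(name); err != nil {
			return err
		}
		names = append(names, name)
	}
	if err := waitAllRunning(names); err != nil {
		return err
	}
	for _, name := range names {
		if err := deleteVMI(name); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Example knobs: 100 VMIs, one create every 100 milliseconds.
	if err := burst(100, 100*time.Millisecond); err != nil {
		fmt.Println("burst failed:", err)
	}
}
```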
A: The second type is: if we expect 100, we create 100, we delete 10, and then the job should automatically recreate 10 more. That one has a lot of variation, because we can change how fast we delete and how fast we recreate, and it sort of captures how pressure affects the cluster based on the different rates, how fast you recreate, and so on.
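And a similarly rough sketch of that delete-and-recreate churn pattern, again with hypothetical helpers (deleteRandomVMIs, runningCount) in place of real cluster calls; in the real job, a controller would recreate the deleted VMs:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical helpers standing in for real cluster operations.
func deleteRandomVMIs(n int) { fmt.Println("deleting", n, "VMIs") }
func runningCount() int      { return 100 } // stub: already back at steady state

// churn repeatedly deletes a batch of VMIs and waits for the cluster to
// recreate them back up to target, applying steady delete/recreate pressure.
func churn(target, batch, rounds int, deleteInterval time.Duration) {
	for r := 0; r < rounds; r++ {
		deleteRandomVMIs(batch)
		// Poll until the controller has recreated the deleted VMIs.
		for runningCount() < target {
			time.Sleep(deleteInterval / 10)
		}
		fmt.Printf("round %d: back to %d running\n", r+1, target)
		time.Sleep(deleteInterval) // pacing knob: how soon we delete again
	}
}

func main() {
	// Example knobs: hold 100 VMIs, churn 10 at a time for 3 rounds.
	churn(100, 10, 3, 2*time.Second)
}
```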
A: Okay, lastly: KubeCon NA submissions. The call for proposals ends, I think it's tomorrow. Marcelo and I are going to give a talk. This is where we're tracking it; actually, this needs to be updated, but this is where we're actually collaborating, in the Google doc, and this is the outline we're looking to submit. Actually, here's what I'll do.
A: Let me just go to this one. We wanted to talk about how we've created the performance infrastructure for KubeVirt, and cater the talk to how other projects can do it: go through some of the steps that we did and talk about some of the things like the metrics, which I think are really important for any project.
A: Okay, all right, I don't have any more topics. Do you have anything else you want to talk about?
B: No, I was just listening in. Thank you, cool.