From YouTube: SIG - Performance and scale 2023-05-18
Description
Meeting Notes:
https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.tybh
A
All right, let's start. All right: it's May 18th, 2023, with the SIG Scale meeting. Please add yourselves to the attendees, please. Okay, let's start by looking at the performance job results. So we've got three... okay, so we've got the density one and the suite. Okay, anything to note on this one?
B
Yeah, the PR for the faulty density job is out. There were some changes needed to make sure that the density job works well. I've added it to the agenda; it would be good to get that merged. This particular graph is from March.
B
Actually, it's from February 2022, so more than a year's worth of data. I've created two PRs to add all of this historic data to the CI Benchmark repository that was newly created, and we can, you know, store this information in a very public way.
B
Let's see. The other thing I have observed is that there are a bunch of times when this job was not working, which is reflected in the blank stretches. But apart from that, the grouping has been very consistent.
B
On the same timeline... oh, and one more thing: I have added both p50 and p95, so we should be able to get an average as well as the worst case.
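As an aside for readers of these notes: the p50 and p95 discussed here are just percentiles over the raw latency samples a job produces. A minimal nearest-rank sketch in Python; the sample values below are invented for illustration, not taken from the job results:

```python
def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of
    the samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # ceil(n * p / 100) - 1, clamped to a valid index
    rank = max(0, -(-len(ordered) * p // 100) - 1)
    return ordered[rank]

# invented creation-to-running latencies, in seconds
latencies = [12, 15, 19, 22, 28, 31, 35, 40, 44, 49]
p50 = percentile(latencies, 50)  # typical case
p95 = percentile(latencies, 95)  # worst case tracked by the job
```

Tracking both numbers, as described above, separates "how is a typical run doing" (p50) from "how bad can it get" (p95).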
B
Yeah, so the worst-case numbers for these have been going down, but for the density clusters they have been going up. So there is some weird behavior. I'm not sure if it is related to the cluster version or the kubelet that runs this, but...
B
An observation: I have posted the PRs that went in around the time when this number started going down. So this was May 7th, May 8th.
B
So
from
from
these
I'm,
not
sure
yeah
that
one
I
doubt
that
could
be
the
one
that
that
is
saving
us
some.
So
my
understanding
is
that
a
it
that
PR
changes
the
way
network
is
set
up
for
VM.
It
uses
up
in
memory
cache
to
set
up
Network
for
VM,
so
that
could
be
yeah.
B
I was hoping somebody from the KubeVirt community would attend today, and we could ask questions like: does this affect the end-to-end tests, or things like that. But I think we can raise that question in the PR.
A
So yeah, like maybe we can dig into this. I want to see if... let's see... so this is... where is it? So this is... it's the periodic that you're getting this from, right?
A
That's appearing here, so I think it was 19 and a half, right there.
A
19.09, wow, which is our... oh no, that's not our low. Okay, so it looks like 15. So, whatever...
A
It's 15, yeah. Yeah, that's 15. Okay, so 15, yeah. So it is our low, and it's setting the lower bound for both of these. Okay, yeah.
B
You should be able to find a link that says "job history". Yeah, the first one.
B
So I wonder if we can find a job prior to this and see if it was lower or higher than 19.
A
And we should see if we can do this again, so if we can get two for two here.
A
All right, past them... okay, okay, 32, interesting. Okay, and so in your chart you've got... so it looks like... okay, so we're hitting a lot more of these 19s, so we're adding fewer of these. These look like 28s.
A
Interesting, okay. So maybe what we can do is, yeah, I mean, start a thread on the PR. You know, let's post our observation, and you can post this chart. I mean, I think it might be valuable just to see, like: hey, your PR was like right here. No, it's like May 8th... it was, yeah, it's like right here, and we're starting to see a better p95.
A
Do
you
think
that
it?
You
know?
Maybe
it's
associated
with
your
change.
Do
you
think
I
don't
get
their
thoughts,
see
what
alonas
says
I
mean
I,
don't
know.
Maybe
she
wasn't
thinking
about
performance
when
she
went
through
this,
but
it
would
be
good
to
identify
this
for
a
lot
of
reasons,
I
mean.
Maybe
we
have.
Maybe
we
can
understand
exactly
what
the
performance
changes
like.
Maybe
something
we
don't
know,
maybe
something
we
can
look
at
doing
in
other
parts
of
the
code
whatever
it
is
that
she
was
doing
here.
A
Cool, okay. Anything else interesting you noticed? Any other... well, so back to the change that we were seeing, or the increase in the... okay, I wonder if this is just kind of within the normal behavior. Like, I'm looking over here and we see it kind of goes up and down a little bit. Maybe this is... well, yeah, this is also a different version of Kubernetes, I mean.
A
Maybe we should still let this play out a little bit. It's sort of within our bounds, our lower and upper limit, where we're kind of less than 60, about, and greater than 50. Yeah, greater than 50, less than 60: we're still within there. Yeah, 56 and 49, we're still right around it. Let's see how this plays out, whether it kind of continues to go up, or maybe it comes back down.
B
Sure. Just for my understanding, what version of Kubernetes is running on the density cluster?
A
I forget, I think it's like... oh, it's OpenShift that does the installation, but I thought it was 1.24, I see, or 1.25, or something. Yeah, yeah.
B
And I know for the SIG performance periodic job, we install a new Kubernetes cluster and it runs on there. For the density cluster, it's just this standalone cluster, and we run VMs on it every day, right?
B
Yeah, I just wanted to understand the hardware, so we can better compare both sets of results. Now there we have it.
A
Yeah, so, well, I mean, the thing that I think I actually wanted to get to with these p95s... like, this is... so 49, and what do we have for... yeah? So it's higher, right? We're getting higher numbers, and that's like... we are creating more.
B
So I have a couple of suggestions from what I have observed. For the SIG performance test, we just run one test every day.
B
What I have observed is that... sorry, for the density job we run one test every day; for the performance job we run three tests every day. Running three tests gives a little bit better signal if there is a flake or something, and usually that helps me. So one of the suggestions I was thinking of was: can we change this density job to run three times a day? I think that will improve the signal we get from this cluster.
B
We should do it. And one more action item for me is to add a p50 creation-to-running metric for the density job as well.
A
Okay, sounds good. This is all really good progress. So now, we've got some other things that you've made progress on too. We've got the repo and everything; do you want to talk about that? So where are we with... I'm just not caught up, so, like, where are we? We've got the repo created that I think we need, and then... where are some of the other changes you've been working on? Or actually, no, we have it right here, right? Oh, so here we go.
B
I have put it in the agenda. If you click the first link, which is the issue for the project, everything is summarized there, actually. Great, yeah. So we have the repository, and we have a PR that's open that will scrape the results every week and dump them into that repository going forward. But before that PR merges, I have seeded the repository with historic data. So if you go... I don't have a link handy... oh yeah, it's in the agenda.
B
If you go to the repository, you will see two open PRs. I mean, all of that is generated data, so it will be hard to go through, but if you can just verify the directory layout, that should be enough for us to get started. Okay, we can merge that. So that will seed us with historic data, and then from there we can, you know, merge that project to start collecting data every week going forward, so we would not have to do this manually.
A
Okay, what are these... what are these numbers? Those...
B
The weekly results are in the weekly folder, and then they are separated by job name, so the performance name and the density name. Those should be two separate folders inside results and weekly. Within the daily results you will see a job ID; within the weekly results you will see it separated by metric.
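For readers following along, the layout being described might look roughly like the sketch below. The exact folder and file names are assumptions for illustration, not necessarily the repository's real structure:

```python
from pathlib import Path

def result_paths(root, job, job_id, week, metric):
    """Hypothetical layout: daily results keyed by job ID, weekly
    results keyed by metric (all names are illustrative assumptions)."""
    daily = Path(root) / "results" / job / "daily" / job_id / "results.json"
    weekly = Path(root) / "results" / job / "weekly" / week / f"{metric}.json"
    return daily, weekly

daily, weekly = result_paths("ci-benchmarks", "periodic-performance",
                             "1651234", "2023-05-15", "vmiCreationToRunning")
```

Keeping the two views side by side like this means a single daily run and a metric's week-over-week history can each be found without scanning the other.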
B
What I was saying is: within the daily results, it's separated by job ID. Then we aggregate all the jobs into weekly, and within a weekly it's aggregated by the metrics that we care about. So in this particular run you will see all the metrics that the job reports, so we have the data, but out of those, let's say the first five are the ones we care about; you'll see only those in the weekly aggregation. Okay, so we are not actually losing data.
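The daily-to-weekly roll-up described here could be sketched as follows. The metric names and the choice of a mean are assumptions for illustration, not the actual tool's behavior:

```python
from collections import defaultdict

def aggregate_weekly(daily_runs, tracked_metrics):
    """daily_runs: one {metric_name: value} dict per daily job ID.
    Keeps only tracked metrics in the weekly view; the raw daily
    data stays on disk, so nothing is actually lost."""
    buckets = defaultdict(list)
    for run in daily_runs:
        for name, value in run.items():
            if name in tracked_metrics:
                buckets[name].append(value)
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}

# invented example: two daily runs rolled into one weekly number
runs = [{"vmiCreationToRunningP95": 49, "untrackedMetric": 3},
        {"vmiCreationToRunningP95": 51, "untrackedMetric": 4}]
weekly = aggregate_weekly(runs, {"vmiCreationToRunningP95"})
```

As the speaker notes, untracked metrics dropped from the weekly view can always be re-aggregated later by re-running the automation over the daily data.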
B
And in the future, if you want to compare, let's say, the last one, which we have not been aggregating weekly, we can always go back and run the automation.
B
Good. And one more thing: the weekly directory will have an index.html.
B
That's the one you see me linking in our agenda. Yeah, that's the one. So one thing that we have not captured yet is:
B
how to make sure that that index.html is rendered as a web page, the way I do it for my personal repository. We should have something similar.
B
GitHub has an automatic job for it.
B
So
when
you
merge
a
PR
with
index.html
GitHub
will
run
a
deployment
job
to
automatically
change
the
GitHub
pages
to
reflect
that
PR
margin.
The
only
thing
that
we
might
have
to
do
is
that
GitHub
expects
GitHub
pages
to
be
in
a
specific
format
like
the
way
I
have.
It
is
fixed
scale,
Dot,
username,
dot,
IO
dot,
github.io
right.
B
So
we'll
have
to
set
up
that
repository
and
then
it
will
be
able
to
run
this
automatic
deployment.
B
Yeah, just to render that automatically.
B
So if you look at my... I think it's the fourth tab, right. It's myusername.github.io, slash...
A
Okay, so this one looks fine, and then the second one is... oh, okay, the rest of the... oh no, that was the one I was just gonna...
A
Do we need just an approval on this, or...?
B
No, no, the one which is...
B
Yeah, so once this merges, we can ask Daniel to add that other command to also scrape the density job, and that should set us up. You can also...
A
All right, let me close the loop on this one, and maybe we can have him do a quick review, and if there isn't anything he notices that's in a funny place, then maybe we can get this merged quickly. Okay, sounds good, sounds good. Okay, so we've got... you've merged this performance one, we've got the density job changes, and we also... okay, here's the artifact search. Okay, so this one I think was... okay, so also, that means I'm working through the testing.
B
Yeah, so this is setting up the OWNERS just for the perf report generator. I've added you and Lugo as the reviewers and approvers. It's a copy-paste from the KubeVirt one, so I don't anticipate any problems, but...
A
Yeah, yeah, we'll get... let me get Lugo on these, and I think today they all look like they're making good progress. Okay, cool. All right, thanks. All right, anything else about this? I think that's gonna do it; this is a lot of progress.
A
Then... so I think Louisville said it's only when they're doing it. Okay, I thought Louisville was gonna do it, so I guess maybe we can ask Daniel and see if he's...
B
Already, yeah. See that comment, the second-to-last comment. Yeah, Daniel had ideas on how to do this already. I think that sounds good to me.
A
The last thing, then, is just for the SIG Scale work for Kubernetes. So we already have a lot going on, but the only thing I wanted to do is just make a little bit of progress on this, and I was hoping... I'm gonna try to do some of this today. I want to... I think what we need to do is... well, this is what we agreed to do: we take an inventory of the existing performance-related metrics. So it sounds like we...
A
Some of the things aren't actually totally known. I mean, I don't know all of them... I actually don't know many of them, but it sounds like, within a larger group than that, not all of them are known. So let me just take a quick look at what's there, and let's have a description of them; that'll help us out and guide us as to how we want to improve or change them. Anyway, so that's the first step that I see.
A
It seems simplest. Like, we know there are... I know there are a few phases, like, just attaching... we know there are a few phases, so I think it makes sense. Let's see. I think the way we want to approach this is: what are the phases for PVCs, who sets them, what transitions between the phases? Those are our entry points, and I think it's as simple as that. And then the final thing is, okay: what will having these metrics do? How would they impact anything?
A
You know, whether it's people using them, or how it would impact the kubelet, or how it could impact your monitoring. We now have a lot of PVCs.
A
A lot of metrics. I think that's sort of the final step. So just those three, and I think that would be a good start, and for us to begin the work. I'm not going to be here next week, so I'm not going to bring this up then. You know, if I get far with this, I'm going to talk about it at the design session, so if anyone else wants to talk about it there, feel free to. Or if I've got some documents, I can bring it up on Slack; that's fine.
A
We can... we can go that route in SIG Scalability in Kubernetes; we can go that way as well. So, depending on how far we get, we can coordinate how we can... so.
B
Ryan, this sounds great. I think we did a bunch of work to identify these metrics in the kubelet; I have it in one of the documents from our downstream effort. I can link you, I think that will save you some work. Okay, so to put this together... yeah, okay. The only thing is, I think it is missing the PVC metrics and the network-related metrics right now. So the thing it has is end-to-end pod...
A
Yeah, I suspect... I'm pretty confident that's all we're going to get. I just... I think we kind of caught them... my impression was we kind of caught them by surprise, in that their approach has been, and it makes total sense, their approach has been to measure things and report them in the aggregate, right? It makes a lot of sense, because they want to measure and report and have, you know, SLIs and SLOs. It makes sense.
A
That's
what
the
approach
was
where
our
approach
is
slightly
different.
It's
that
it's
sort
of
two
things
like
one
of
them
is
that
we
don't
want
to
do
the
aggregation
Within,
the
cubelet.
We
want
to
do
it
at
the
the
Prometheus
or
the
dashboard
level,
so
we
we
want.
We
basically
we're
going
to
essentially
pay
the
price
of
doing
the
holding
on
to
more
data
and
then
and
doing
the
aggregation
later
and
then.
The
second
part
is
like
our
approach.
Is
we
want
to
look
a
little
bit
further
under
the
hood?
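The idea just described, keeping raw per-phase transition timings and aggregating later in Prometheus or a dashboard rather than pre-aggregating inside the kubelet, could be sketched like this. The phase names and structure are illustrative assumptions, not actual kubelet or KubeVirt code:

```python
from collections import defaultdict

class PhaseTracker:
    """Keeps raw per-transition timing samples so percentiles like
    p50/p95 can be computed downstream, not inside the component."""
    def __init__(self):
        # (from_phase, to_phase) -> list of durations in seconds
        self.samples = defaultdict(list)

    def record(self, from_phase, to_phase, seconds):
        self.samples[(from_phase, to_phase)].append(seconds)

tracker = PhaseTracker()
# hypothetical PVC phase transitions with invented durations
tracker.record("Pending", "Bound", 2.4)
tracker.record("Pending", "Bound", 3.1)
```

The trade-off stated above is explicit here: the component holds (or exports) more raw data, in exchange for being able to ask new aggregate questions later without code changes.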
B
Yeah, I agree with you. I think the other surprising bit for me was that they were using the events to collect when things happened, even though the data is available in the scheduler metrics or kubelet metrics. So I think that change in approach that you were talking about, first, it should happen in the test, the end-to-end results that they are using: skip collecting things from events and use the actual exported metrics from the scheduler and kubelet.
B
That can, you know, fill in the entire story's missing data. I wonder if you need a bullet point here to locate the metric collection that they have been doing and suggest changes to that as well.
A
Okay, yeah, makes sense. Okay, okay, it sounds like a plan. I think that's something I'll try and get to today, and see if we can come up with... yeah, we'll come up with a plan. I mean, I'll just start by Friday, depending on how far I get. You know, I will approach them with it whenever it's ready. Okay, this sounds good.
A
All right, I think that's all. All right, anything else? Anything else you've got?
A
Oh, did you... I don't know if you caught some of the discussion we had. We had a few things we wanted to ask you. Yeah, so this actually... why don't you cover it, since you have written the points there? So what... what did you want? What do we need here still?
B
What we would really need is a post-submit job on that repository, so setting up Prow, like Daniel was mentioning here.
D
Sorry for interrupting you. I think that 2773 is exactly that: Daniel is trying to create a periodic job for it.
B
No, so my impression was that this does an initial data gathering. This will just push data to the sig... sorry, to the CI Benchmark repository. After the data is pushed, we need a post-submit job on that repository to aggregate it into weekly folders and publish a chart for it. So if you look at line 818, it just does the initial result gathering. For this, we need the same command with the weekly-report and weekly-graph subcommands to actually do the graphing.
B
Daniel was suggesting that we could do it as a post-submit job. So once this job publishes data, a second job, which is a post-submit job, will, you know, aggregate it and publish it. That way, we can change the publishing logic without affecting the aggregation.
B
Okay, yeah. I think that CI repository, the Benchmark repository, needs the Prow setup and everything. I don't know if it is enabled right now; it's a bare-bones repository with just admin access. So yeah, I think that would be really helpful. I've already summarized this here, and we can get this merged. Okay.
B
I've taken a stab at creating a PR for OWNERS in the perf report creator tool. I have added Ryan and yourself as the reviewers and approvers; I think that will help us iterate faster on this too. So if you are okay with it, please take a look at this PR.
D
Sure, just a note: I don't have approval rights on this repository, so I will need to ask somebody else to approve.
A
Okay, so this post-submit job... so are we gonna wait? Is that what it sounds like? Yeah.
B
Okay, we need Daniel to create and set up Prow on the new repository, yeah.
A
All right, sounds good. All right, the rest of these PRs we can look at. Lou, Elizabeth, here we have another... oh, actually, sometimes you can't approve this one. You can, though, yeah. I think just this one.
D
Thanks, I will try to take a look, but...