From YouTube: SIG - Performance and scale 2023-05-04
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.tybh
B
A
But here's just one. There's just one output in this particular file. Sorry, in this job.
B
They just... yeah, so it's six, or six hundred, I don't know, a hundred, okay. So it's a hundred, yeah. So this is 100, and then let's see what else. I think it's just one. Yeah, it is, it's just one. But hold on, that's not all, so there's this one. This one is a little bit different. Let's look at this one. Do you have any questions about this? So it's just one, it's just one output.
A
No, what I was wondering is: in my tool right now, it scrapes for both VMI and VM together, and so the regex filtering for this one will be a little different. However, I think it will still work, because we have recently proposed to change it in the values, so it will find the correct line item. So I think that will work here.
A
The only difference is in the code path: if it is this job, it should not look for the second one, and it should fail gracefully. Okay. If I can handle that, then we should be able to crunch these jobs as well.
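The scrape-and-filter behaviour being discussed might look roughly like this minimal Python sketch. The metric-line format, the names, and the `scrape` helper are assumptions for illustration, not the tool's actual code; the point is only that a job emitting one kind of output fails gracefully instead of erroring.

```python
import re

# Hypothetical line format: "<kind>_<metric> <value>", e.g. "vmi_count 100".
LINE_RE = re.compile(r"^(?P<kind>vmi|vm)_(?P<metric>\w+)\s+(?P<value>\d+)$")

def scrape(lines):
    """Return {kind: {metric: value}}.

    Jobs that emit only VMI (or only VM) output simply produce fewer
    entries; a missing kind never raises.
    """
    results = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            kind = results.setdefault(m.group("kind"), {})
            kind[m.group("metric")] = int(m.group("value"))
    return results
```

With this shape, the caller can check `"vm" in results` instead of assuming a second output always exists.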
B
Okay, all right, let's look at the second one. Now, this one, I don't remember if it does two runs, but it should have more than 100. So this is, let's see, 400.
B
A
Yeah, and then the other question I had is: how would you propose to format these outputs? Right now it's very simple. All we have is a weekly VMI and a weekly VM folder. With these new files coming in, I think we should put them in something like a density-tests folder and then continue to segregate. Well, with this test, segregation at the VM resource level does not matter, right? Do you anticipate that in the future we are going to add VM and instance-type tests for the density test?
B
A
Okay, yeah. Then we can add a density-test folder, and then within that 100, 400, 600, and then within that the resources, VMI, VM, and instance types, and that is where we will host the charts for this.
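The proposed layout can be sketched as a small path enumeration. All folder names here are assumptions taken from the discussion, not an agreed convention.

```python
from itertools import product

# Hypothetical results layout: density-test / <VMI count> / <resource kind>.
counts = [100, 400, 600]
kinds = ["vmi", "vm", "instance-type"]
layout = [f"density-test/{c}/{k}" for c, k in product(counts, kinds)]
```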
B
A
No, I just wanted to give you a heads up that now the VMI results are actually VMI results, and I was hoping we can do one or two test runs to, you know, verify this is indeed as expected.
A
I wonder why there is such a big spread across the graph. Looking at most other charts, it's not as bad as this one.
C
A
No, actually the regex thing took more time than anticipated, so I don't know exactly, but this is something we'll have to check out. One thing I can think of is that if the patch call fails because of an older resource version, that could lead to more patch calls.
A
So, like, all the bottom ones are passing on the first try, and then the top ones are taking more than one try. But I don't know if that's a valid theory, because the ones at the bottom are, like, halfway. For example, if we are running 100 VMIs, then 40 or 50 patch calls don't make much sense. There is no trend with that.
B
Yeah, that's what I'm thinking too, the trend is bizarre. There's no relationship to the number of creations we did. Like, this one's got clearly two-to-one, with some failures, right? It's, like, never less than 200. Well, it is once, but it should be almost never less than 200, and that's two-to-one. But here, right, you have 35. I don't even understand that.
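The sanity check being described can be written down in a couple of lines. The two-to-one ratio is the expectation stated above; the helper name is an assumption.

```python
# With an expected two-to-one ratio of patch calls to creations, a run that
# created N objects should almost never report fewer than 2*N patches, so a
# value like 35 for 100 creations stands out as an anomaly.
def is_anomalous(patch_count, creations, ratio=2):
    return patch_count < ratio * creations
```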
B
C
A
B
A
I think this helps us find the upper bound, right? So there is nothing beyond one, no more than one call per VMI, so that's helpful. Beyond that, this graph is noisy.
A
This one I can understand, because it's a list; it might be a time-based thing or something. Sorry, it's list of nodes.
B
Yeah, this was the one where we were kind of hoping, if we figure out what it is, we actually could look, and here you can see what we get for the trend for a longer-running one. So let's say this is list node.
C
B
A
B
I don't know, these are kind of low up too. So, okay, I guess we'll see when we get the data; we'll see what the trend line shows, I guess. So our expectation is that we'd expect this to be higher than these, like 400 VMIs versus 100 we'd expect to be a little higher. Yeah, I guess... maybe it's constant.
A
Yeah, I mean, we don't truly know. I think we should not be listing nodes at all, honestly, because the list-and-watch call should only start at the beginning of whatever controller is watching for it, and then it should cache all the entries and work from that cache. So seeing it here in the first place is a little bit surprising to me.
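The list-and-watch behaviour described above can be sketched like this: one initial LIST seeds a local cache, then WATCH events keep it current, so a well-behaved controller needs no further LIST calls in steady state. The client interface here is an assumption for illustration, not client-go's actual API.

```python
class Informer:
    """Toy informer: LIST once at startup, then apply WATCH events."""

    def __init__(self, client):
        self.client = client
        self.cache = {}

    def start(self):
        for obj in self.client.list():       # single LIST at startup
            self.cache[obj["name"]] = obj

    def handle(self, event, obj):            # WATCH events update the cache
        if event == "DELETED":
            self.cache.pop(obj["name"], None)
        else:                                # ADDED / MODIFIED
            self.cache[obj["name"]] = obj
```

Repeated LIST calls showing up in steady state would therefore suggest cache invalidation or relist behaviour worth investigating.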
B
Yeah, let's see what the trend shows, because I wonder... so the dedicated cluster might be telling us something. It could be that this is just noise from the background, another job sharing the cluster. Maybe this is from, you know, the dedicated cluster.
B
C
A
Yeah, I see here the patch calls again. Yeah, I think this is coming from the underlying... oh, were you looking at the link by any chance? Sorry, the link in the top bar shows VMI; we should be looking at VM, right? We looked at the VMIs earlier.
B
It's gonna be okay. Okay, so here's the... yeah, the last one. What else?
A
B
A
Yeah, interestingly, the patch calls for VMIs are holding up nicely.
B
A
Oh, you know what, actually, sorry. What I was thinking is that it includes the patches from the VMI controller as well as the VM controller, and the variation you are seeing comes from the VMI controller, the older chart we were looking at. So, theoretically, if the patch-for-VMI count in the VM controller remains constant, then the only thing going up and down is the previous chart.
B
C
A
See, yeah, because if you look at some of the ones on the lower end of the graph, it's like 100 and...
C
B
A
I compared that update call with the VMI controller, and the VMI is at 800, pretty constant. So my understanding is that this call is split into eight calls by the VMI controller and two calls by the VM controller; that gets us to 10.
B
A
B
Yeah, yeah, I mean, that makes sense, because you think about the number of phase changes you have to do. Okay, so, like, the way we classify this in terms of scale: it's like we get two more. If we had a value for update calls, it would be whatever plus two times the number of VMs, and that would be our scale number, versus the number of VMIs, which would be whatever the base is, which is eight. Yeah.
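The model just described can be written as a one-line formula. The constants (eight calls per VMI from the VMI controller, two extra from the VM controller) are the numbers from the discussion, not measured code: 100 bare VMIs land around 800 patch calls, while 100 VM-managed VMIs land around 1000.

```python
# Per object: ~8 patch calls from the VMI controller, plus ~2 more from the
# VM controller when the VMI is managed by a VM. Constants are assumptions
# taken from the conversation above.
CALLS_PER_VMI = 8
EXTRA_CALLS_PER_VM = 2

def expected_patch_calls(n, managed_by_vm=True):
    per_object = CALLS_PER_VMI + (EXTRA_CALLS_PER_VM if managed_by_vm else 0)
    return per_object * n
```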
B
Okay, that's a good one for when we talk to the Kubernetes SIG Scalability folks: they have some equations to model a few things. That's where we're hoping we can get some punch out of the knowledge share. However they do their calculations, we can use some of it to do some modeling for what we have, and that should be helpful for us to do estimates.
B
A
C
B
So you need this... we haven't merged this guy yet, right? Yeah.
B
A
There, yeah. So my question here is: what configuration exactly enables that artifacts directory in the project?
A
So if we look at the other tab, the density cluster, that artifacts directory is missing. On the thread in kubevirt-dev I got a couple of data points that we can populate the artifacts directory, but that is just an environment variable for the test. So I still don't know if it is the one that enables this artifacts directory.
D
I have linked the documentation for Prow where the environment variables are explained. Basically, ARTIFACTS is the one you are looking for. It indicates to Prow that everything written to this path needs to be exported. I think we set it on the job, but I would need to check.
A
I think you are right; I think we set it up. It might be good to verify that, but my understanding is that this environment variable is just exposed to the jobs so that they can write the output to a particular directory, so that plumbing might already be in place. The question is: does this environment variable also enable that directory being populated, or is there some other configuration?
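The split of responsibilities discussed here is that Prow's ARTIFACTS environment variable only tells the test where to write; the test itself must produce the files, and Prow then uploads whatever is in that directory. A minimal sketch of a test resolving its output path (the helper and default value are assumptions):

```python
import os

# ARTIFACTS is the directory Prow uploads with the job; the variable does
# not create any output by itself. Falling back to a local default keeps
# the test runnable outside CI.
def artifacts_path(filename, default="_artifacts"):
    return os.path.join(os.environ.get("ARTIFACTS", default), filename)
```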
A
D
Okay, so is it all the time, or does it happen only from...
D
C
D
Are we sure that... so when we run perf-scale-test.sh, right, are we sure we are using the ARTIFACTS directory there?
A
D
Interesting. Oh yeah, no, no, because this is... yeah, but the path was not the same, right? There was a prefix of out_ on that.
D
Yeah, so that will be it, I mean, yeah. I'm not sure if we used to dockerize or containerize this script; probably not. Where I am looking at it, I don't see anything like that.
A
You know, I don't think so. So I think the place, the environment variable for this output file, is the audit one. Can you search for, in capitals, AUDIT? I think you'll find it.
D
Okay, nice, I can see it right now. Let me actually open it in the web browser so I can share it with you.
A
D
B
C
A
B
A
Yeah, I will try it out. Another thing that bugged me in trying this test is that I have not actually gotten any of the perf-scale tests to run in my environment; my hardware seems to be the limiting factor. Are you guys able to run, even with like five VMs or something, where the test runs end to end in the local environment?
B
With five VMs? I haven't tried in a while, but we originally did it. But you have to have really, really small VMs, like if you're using cluster-up, yeah. You had to get really tiny ones.
D
A
Yeah, I think my hardware is beefy enough. It's just that I am using the default configuration; I think I'll have to change that to get...
B
Cool, okay. Let's do it, all right? Let's go to this presentation. So this is what, when we go next week to the upstream SIG Scalability meeting, these are the slides I want to talk through. So, does everyone have an invite to that SIG Scalability calendar? It's on Thursday afternoons; it'll be next Thursday. It's every other week, so on Thursday afternoon, well, Eastern time; I don't know exactly what it might be.
B
It's like, I think it's... so, let me see. It would be like 2100 UTC or something; it's later than this. This is, I think, like four... it was like six hours after this time. So if you don't have it, just grab or locate the invite if you want to attend. I don't know if you're going to be able to do it, Lubo; I think it's going to be late for you, being in your time zone.
B
Yeah, I think it's gonna be... it's probably late for you. But anyway, I'll just give you the slides we want to talk about with them. So, basically, we just want to go over what we do in our group and some of the results we've seen. So this is our mission: we maintain our own API server, scheduler.
B
We have workloads that are independent of Kubernetes, and so we should have our own performance standards, tools, and best practices. Our focus with measurement is to leverage existing tooling as much as possible, so that means Prometheus. We focus on two main metrics that give us a lot of our data: we've got phase transition times, and we've got the client-go HTTP calls to the Kubernetes API server. And the two things we focus on are like...
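The second metric family mentioned above comes from counters that client-go exposes per request (for example rest_client_requests_total, labelled by method). Aggregating scraped samples by verb gives the per-verb call counts discussed earlier in the meeting; the sample data below is invented for illustration.

```python
from collections import Counter

# samples: iterable of (labels_dict, value) pairs as scraped from the
# metrics endpoint. Summing by the "method" label yields per-verb totals.
def calls_by_method(samples):
    totals = Counter()
    for labels, value in samples:
        totals[labels["method"]] += value
    return totals
```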
B
And what this does is allow us to make great stuff like this, which is valuable for seeing how virtual machines are performing in our clusters, and to create a bunch of expectations around those runtimes in those different phase transitions.
B
And then the same with the client-go HTTP calls. We should be able to catch PRs that increase the HTTP calls, and we should be able to measure it. So we want to filter for these PRs, catch them, and comment on them. In this case, we've got the ones that Olay had caught earlier, which were the two changes we had: the finalizer change to the VMIs, and the controller change which affected the patch counts that we have for VMs that are managing VMIs.
B
And so our future goals with that: we want to collaborate with the upstream SIG Scalability group. What we want is... we have a virtual machine object, and it's made up of a lot of things: we've got pods, PVCs, networks, and other things, and there are just too many variables for us to measure and isolate performance and scale for just the virtual machine object. It's too large; there are too many things. So how do we isolate these things? How do we isolate the virtual machine, you know?
B
Is there a way we can get more detailed phases for the different pieces that make up the virtual machine that Kubernetes is primarily responsible for? So, things like PVC attachment times, network attachment time, pod phase, beyond the creation timestamp. Basically, applying the idea that we have, the phase transition timestamps, to other objects in Kubernetes.
B
Maybe we don't need to post them on the actual objects themselves, but perhaps we can have a way to post this in Prometheus, so that this is actually measurable for us. So, more than just the creation timestamp: you know, it could be when the PVC is attached. We know when this is, the controllers know when this is; we just need a mark when it happens and then expose it. Same with any other things that could possibly make up a virtual machine: we want to expose those things.
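The idea above could be sketched like this: rather than storing extra timestamps on the objects themselves, record each transition externally (for example behind a Prometheus metric) so steps like "PVC attached" become measurable. The recorder class and its interface are assumptions for illustration only.

```python
import time

class PhaseRecorder:
    """Record (object, phase) timestamps and compute transition durations."""

    def __init__(self):
        self.timestamps = {}

    def mark(self, obj, phase, ts=None):
        # Controllers already know when each transition happens; they would
        # call mark() at that moment instead of mutating the object.
        self.timestamps[(obj, phase)] = time.time() if ts is None else ts

    def duration(self, obj, start, end):
        return self.timestamps[(obj, end)] - self.timestamps[(obj, start)]
```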
B
So that's the ask. I think that's the last one I've got, yeah. So that's what I want to cover with the group, and then hopefully we can dive into discussion on some of these things.
A
So, this looks great. I just wanted to give you a heads up: upstream Kubernetes has deprecated the use of phase in things like pod phases and such; they are leaning towards conditions. So I don't know if seeing phases here will trigger that discussion point in people, but if you want to avoid that chance, you can use conditions or something similar of that nature. Okay.
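The conditions alternative mentioned here already carries timing information: each Kubernetes condition has a lastTransitionTime, which can serve the same purpose as a phase-transition timestamp without adding new fields to the object. A minimal sketch of reading it (the helper name is an assumption):

```python
from datetime import datetime, timezone

def transition_time(conditions, cond_type):
    """Return the lastTransitionTime of the condition of the given type,
    parsed as a UTC datetime, or None if the condition is absent."""
    for c in conditions:
        if c["type"] == cond_type:
            return datetime.strptime(
                c["lastTransitionTime"], "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc)
    return None
```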
D
B
A
No, I think... so the way I see this happening is that upstream... let's take the example of a pod, right? A pod has a lot of different agents that are reconciling pod status, and some of the information is sent out in conditions, some is sent out in events. So, for example, whether the PVC is attached or not, it's an event; whether the network is attached or not, it's an event. So that is the low-level detail.
B
A
B
That's, yeah, that's what we're looking for. So if it's, yeah... What I also wanted to avoid is someone telling us, like, no, someone else is working on this, because then it's just a dead end; it's just going to be unhelpful. You know what I mean, because basically what they'd be doing is just passing this off.
B
So the hope is that, whatever this is, it's just to convince them that this is a very simple idea: it's the equivalent of a creation timestamp that we want to have for all of these objects. That's at least where I'm hoping we can sort of take the conversation: that we just want a bunch of these creation timestamps, but we want them to go beyond the creation timestamp; we want them to be more specific things.
A
B
Okay, yeah, maybe they're composed as sort of a question instead of just a statement. Well, I have a question here, but maybe I can make it a little bit more specific, because I think that's sort of the thing: our goal out of this would be basically to have, like, a KEP or something, or some sort of PRs, you know, that are generated at the end of this. That would be the goal, so yeah.
B
If we can get the conversation in that direction, I think that would be the ideal thing. So maybe we just need the right question here at the end. Okay, all right, sounds good, guys. I think this is... so yeah, this is what we'll use for the discussion, and hopefully this can get us some places.