KubeVirt SIG Performance and Scale, 29 Sep 2022

From YouTube: SIG - Performance and scale 2022-09-29

Description

Meeting Notes:
https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.tybh

A

All right, this is SIG Scale on September 29th. Let me share my screen. It's in the chat. There it is, okay.

A

Okay, so for today we'll do a review of the periodic results. I saw a few failures that occurred, so we'll just see if it was just a flake. Okay, the periodic's looking good.

A

Okay, a few failures. I believe I did look at one or two of these, and it seemed like there was just something in Bazel that didn't work. This is fine; it looks like it went away. Okay, that's good. I'll just quickly check one of these.

A

Yeah, all looking really good, so within our thresholds. Well, they're not thresholds, health benefits. Okay, that looks really good. And then let's check out the periodic here for the large cluster.

A

I suspect we'll be seeing the same, since I don't think we have anything that fails this. Yeah, okay, looks about the same. That looks good. And then, Olay, I saw you had a PR open. Do you want to go over that one real quick?

B

So there are a couple of things. I was digging into monitoring of this workflow, from a user issuing a delete request to a VMI all the way until the VMI is finalized. There are two steps in this process. One is when the delete request actually comes in. It comes in as an update, which sets the deletion timestamp.

B

Then the controller, looking at that deletion timestamp, does a bunch of logic to bring that VMI into a final state, so either Failed or Succeeded. Once it goes to the Failed or Succeeded state, the finalizer is removed. A bunch of work is done, like removing all the files on the node, and then the object is actually released from etcd.

B

What I noticed is that there are metrics available that export the histogram from the deletion timestamp all the way to the final state, but there is no metric for what happens to objects from the final state to actually being released from etcd. I think it is measurable: going from the final state to release from etcd would mean looking for that final state and then looking for the delete event in the informer. So there could be a couple of interesting metrics there.

B

One is the sheer number of objects that are lying in that state, where they are not being finalized for some reason. You remember we had a race bug regarding that, where some objects were staying in the Failed state, so that could be a good metric. The other metric would be the time it takes to go from the final state to releasing the object.

B

So those are the two things that I could think of. All of this came up when I was digging into exporting those metrics for audit logs; I found out that we don't actually have any metrics for the other half of the process.
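
A minimal sketch of the two intervals B describes, written in Python for illustration only. The `VMIRecord` type, its field names, and the `grace` cutoff are hypothetical and are not KubeVirt APIs; the sketch assumes each timestamp was already collected from watch events. The first helper corresponds to the interval that is already exported, and the other two correspond to the two metrics being proposed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Phases treated as "final" in the discussion above.
FINAL_PHASES = {"Succeeded", "Failed"}


@dataclass
class VMIRecord:
    """Hypothetical per-VMI timestamps gathered from watch events."""
    name: str
    deletion_timestamp: Optional[datetime] = None  # set by the delete request
    final_phase: Optional[str] = None              # "Succeeded" or "Failed"
    final_phase_time: Optional[datetime] = None    # when the final phase was seen
    removed_time: Optional[datetime] = None        # when the delete event was seen


def deletion_to_final(record: VMIRecord) -> Optional[timedelta]:
    """Interval that is already measured: deletion timestamp to final phase."""
    if record.deletion_timestamp is None or record.final_phase_time is None:
        return None
    return record.final_phase_time - record.deletion_timestamp


def final_to_removed(record: VMIRecord) -> Optional[timedelta]:
    """Proposed interval: final phase to the object actually being removed."""
    if record.final_phase_time is None or record.removed_time is None:
        return None
    return record.removed_time - record.final_phase_time


def stuck_in_final_phase(records: List[VMIRecord], now: datetime,
                         grace: timedelta) -> List[str]:
    """Proposed count: objects in a final phase that were never removed."""
    return [
        r.name
        for r in records
        if r.final_phase in FINAL_PHASES
        and r.final_phase_time is not None
        and r.removed_time is None
        and now - r.final_phase_time > grace
    ]
```

The length of the list returned by `stuck_in_final_phase` would play the role of the "sheer number of objects" metric, and `final_to_removed` would feed a histogram for the second half of the deletion path.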

A

So, okay, the other half of the process is outside of KubeVirt, so the question is: should we track it? Because at this point in time the control plane has done all the work it can, and now it's completely up to Kubernetes to complete the garbage collection.

A

In terms of what we want to measure, this might be a little bit outside of it, and maybe there's already a metric for it that we can use.

A

But in terms of what we need, I think this will do: the time from when the user issues a delete, that first update call where we get the deletion timestamp, all the way until the finalizer gets removed. That's our control plane delete time.

B

Yeah, so let me talk a little bit about what the current PR does. We have an audit tool here that collects all the metrics. In that audit tool I have added three percentiles, the 99th, 95th, and 50th, so in total six more data points: three that go from deletion timestamp to Succeeded, and three that go from deletion timestamp to Failed.

B

Currently I think we should expect all three of the deletion timestamp to Failed values to be zero, because we are not introducing any failures.

B

All of our VMIs are successfully garbage collected, and only then does the test succeed. So for a successful run it should be zero, but I have it in there in case we want to add a test that introduces a failure and also checks the failure path.

A

We could do that, yeah, it makes sense. We could generate this: we just need the guest, or maybe it's the launcher process, to exit 1. The container basically just needs to exit 1, and then we get the Failed state. Otherwise we get Succeeded. So yeah, it makes sense, because we're going to get both of these cases. This one is probably the most common in our testing, but we could very easily generate the other one.
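
As a rough illustration of the failure path A describes, the sketch below maps a container exit code to a final phase and splits deletion samples accordingly. It assumes that any non-zero exit code leads to Failed (the discussion only mentions exit 1), and both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def final_phase_for_exit_code(exit_code: int) -> str:
    """Assumed mapping: exit 0 ends in Succeeded, anything else in Failed."""
    return "Succeeded" if exit_code == 0 else "Failed"


def split_by_final_phase(
    samples: Iterable[Tuple[int, float]]
) -> Dict[str, List[float]]:
    """Group (exit_code, deletion_seconds) samples by the phase they end in."""
    by_phase: Dict[str, List[float]] = {"Succeeded": [], "Failed": []}
    for exit_code, seconds in samples:
        by_phase[final_phase_for_exit_code(exit_code)].append(seconds)
    return by_phase
```

In a run that injects no failures, the "Failed" list stays empty, which matches the expectation that the deletion timestamp to Failed values are zero.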

B

Yeah, and so we have the 99th, 95th, and 50th percentiles. This is exactly similar to what we have from creation timestamp to Running state.
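
A small sketch of how six such data points could be computed from raw durations. It uses the nearest-rank percentile definition, which is an assumption; the audit tool may derive its percentiles differently, for example from Prometheus histogram quantiles. The key names are made up for the example.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], p: int) -> float:
    """Nearest-rank percentile; returns 0.0 when there are no samples."""
    if not samples:
        return 0.0
    ordered = sorted(samples)
    # Nearest rank: smallest rank whose share of the samples is at least p%.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def deletion_summary(to_succeeded: List[float],
                     to_failed: List[float]) -> Dict[str, float]:
    """Six data points: p99, p95 and p50 for each final phase."""
    summary: Dict[str, float] = {}
    for phase, samples in (("succeeded", to_succeeded), ("failed", to_failed)):
        for p in (99, 95, 50):
            summary[f"deletion_to_{phase}_p{p}"] = percentile(samples, p)
    return summary
```

With no failure samples, the three `deletion_to_failed_*` entries come out as zero, which is the behavior expected for a successful run.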

A

Awesome, that's awesome, yeah, cool. All right, that's a great addition. This is going to be really good in that 600 VMI job where we're actually waiting for the cleanup. That'll be awesome, to see how these do by comparison to the creates. Cool, okay, nice.

B

Okay, really quickly, regarding the first topic, on whether this is a Kubernetes or KubeVirt thing: I am actually not really sure whether it's a Kubernetes thing alone, because there are some informers that KubeVirt has written, like informers in virt-handler, which deal with cleaning up the files. From what I remember, that happens after the final state and before releasing the object.

B

So let me try to dig up whether there is some action or some kind of garbage collection that happens in between the final state and releasing the object. If there is something happening, then I think we should track it in terms of metrics.

A

Okay, see what you find. That's interesting.

B

Because the bug I remember, where objects stayed in the final state, as in the object was in the Failed state but the finalizer was not removed, was precisely the bug where a race was happening in this state. My recollection is a little bit off right now. I need to go back and dig into it, but I feel there is some logic that is executed.

A

That's interesting. My understanding was that any sort of KubeVirt garbage collection was done in this period, basically between the final state beginning and the final state ending. I guess the only exception to that would be if the pod needed to be removed after.

A

Or maybe the guest has to be terminated first. I don't know, I'd have to think about it. I guess if we're waiting for the guest to terminate, because we don't want to cause any damage to it by doing garbage collection, then we're waiting until that state and then doing the garbage collection. So I guess it could be possible, and I guess it would be some things in the file system, perhaps, anything in...

B

Yes, exactly, the ghost records are the ones. What I remember is that virt-handler has a bunch of files on the node that represent a particular VMI. It does a watch on those socket files in order to look for events, and once the VMI is Failed or Succeeded, it will go ahead and delete those ghost records and release the object from the finalizer.

A

Ah, right. So it's holding on to the finalizer until the ghost record is cleaned up. But then we are still technically in the final state, though, right? Because the finalizer hasn't been removed.

B

Correct, but that's different from the Failed state, right?

A

It's different than this.

B

Although, I mean, I still need to go refresh my memory on whether the workflow I am talking about actually happens, or whether that bug happened because something in between the Failed state didn't reconcile properly. So that's something I need to check, but yeah, I'll do it this week.

A

Okay, yeah. See, there's a tricky part here. This is the end, so I see how there is, because basically we would declare something as exiting the final state, and the next thing we're supposed to do is remove the finalizer. But if there's any garbage collection that's being done and it's failing, and we fail to do this, then what state are we in? We're actually sort of in between here.

B

Okay, and then if for any reason a VMI gets there, I think chances are that it would stay there. I don't know.

A

Yeah, I remember, right. We've run into this bug. This was where the ghost records were not cleaned up, and essentially what happens is the node restarted and a bunch of VMIs were just sitting around, and it got restarted because we weren't able to launch any more VMIs.

B

The finalizer logic was waiting and was not able to read that file, so yeah, a bunch of things went wrong. But I think these are the metrics that would bring out these error cases, like in the PRs themselves.

A

It would be interesting to see. It's tough, the removing-the-finalizer step. Maybe you can check: I don't remember, when the delete metric was written, whether it hangs off of this or whether it hangs off the same function that does this.

B

I think the metric hangs off the state. What the metric does is this: it is an informer on the VMI, and as soon as it sees the deletion timestamp, that's the old time it will use. Then the phase transition to Failed or Succeeded is the new time it will use, and it will observe the difference.
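
The behavior B describes can be sketched roughly as follows. This is a simplified stand-in, not the actual KubeVirt collector: `DeletionPhaseObserver` and `on_update` are hypothetical names, timestamps are plain floats in seconds, and the buckets hold per-bucket counts rather than the cumulative counts a Prometheus histogram exposes.

```python
from bisect import bisect_left
from typing import Dict, List, Optional, Set


class DeletionPhaseObserver:
    """Observes the time from deletion timestamp to a final phase, per VMI."""

    def __init__(self, buckets: List[float]) -> None:
        # Upper bounds in seconds; one extra slot at the end acts as +Inf.
        self.buckets = sorted(buckets)
        size = len(self.buckets) + 1
        self.counts: Dict[str, List[int]] = {
            "Succeeded": [0] * size,
            "Failed": [0] * size,
        }
        self._observed: Set[str] = set()

    def on_update(self, name: str, now: float,
                  deletion_timestamp: Optional[float], phase: str) -> None:
        """Called for every update event delivered by the (assumed) informer."""
        if deletion_timestamp is None or name in self._observed:
            return  # not being deleted yet, or already measured
        if phase not in self.counts:
            return  # deletion requested, but no final phase reached yet
        elapsed = now - deletion_timestamp  # "new time" minus "old time"
        self.counts[phase][bisect_left(self.buckets, elapsed)] += 1
        self._observed.add(name)
```

Because the observation fires on the phase transition, nothing in this sketch covers what happens between the final phase and the object's removal, which is the gap raised earlier in the meeting.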

A

What I remember, though, is that when this happens, these things are supposed to happen very close to one another. They're almost in the same function call, that's what I remember. I can look it up real quick, because it's in the PR.

A

Because I remember looking at this, yeah.

B

So from the control loop, it's just the next state, that's what we're trying to say, like immediately the next state.

A

Yeah, it's almost like we're talking lines of code apart. Let's see here, various transitions. Oh okay, so yeah, no, I don't see it. So, let's see.

A

Oh okay, no, I don't have it here. I thought I did, but I don't. I just remember looking at it; I remember looking at the line where this occurs.

A

Yeah, okay, I think we'll just have to take a look at it again, because I just don't remember. But what I recall is that it was the same function call. Basically the thing that processes Failed and Succeeded, that sets it into Failed or Succeeded, was in the same function that was supposed to remove the finalizer. But there might be some steps in between, which is what you're getting at, and if...

B

Those fail, then...

A

We don't get to this.

B

That makes me think these steps may be happening asynchronously. So the controller is marking the VMI Failed and removing the finalizer immediately, and virt-handler is also removing those ghost records after Failed or Succeeded, with both of them happening asynchronously. That could be one workflow, but I'm not sure. I need to go dig into it.

A

Yeah, okay, it's a good open question. Okay, sounds good. All right, thanks, Elena, that makes sense.

B

Then I have another small update. You remember we were seeing the endpoint metrics being reported in the audit tool. I looked at the test clients, and the test clients do not use those metrics, so those metrics are purely coming from virt-controller, virt-handler, virt-api, and one more, the webhook.

B

The reason is that, even though both of them use the same client, you need to import the monitoring client Prometheus package, and that's only imported in those four packages. So I posted an update on Slack asking those questions, and I think I've found answers, so I just wanted to give a heads up.

A

So this is the one where we were talking about, I think it was back here, which clients were doing these. Is that what it was?

B

Yes, correct. At first I thought that the end-to-end test clients and the clients used in the actual API server, or even the controller and handler, all shared the same logic of monitoring the metrics by intercepting the REST call. But those metrics are only enabled for the four packages that I mentioned, which are on the server side. They are not enabled for end-to-end tests or any other packages.

A

So how does it affect the way we read these metrics, then?

B

It does not. That's what I was trying to say: there was an open question whether it does or does not, and I'm saying that it does not. These metrics are only coming from four components: virt-api, virt-controller, virt-handler, and the webhook.
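
A simplified sketch of why only some users of a shared client produce request metrics. In the discussion the opt-in happens by importing a Go package in four server-side components; the sketch below swaps that for an explicit recorder passed to the client, which is an assumption made to keep the example self-contained. `RequestRecorder` and `RestClient` are hypothetical names.

```python
from collections import Counter
from typing import Optional


class RequestRecorder:
    """Counts intercepted REST calls by (verb, resource)."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def observe(self, verb: str, resource: str) -> None:
        self.counts[(verb, resource)] += 1


class RestClient:
    """Shared client; calls are only recorded when a recorder is supplied."""

    def __init__(self, recorder: Optional[RequestRecorder] = None) -> None:
        self._recorder = recorder

    def request(self, verb: str, resource: str) -> str:
        if self._recorder is not None:
            self._recorder.observe(verb, resource)
        # Placeholder for the real call; returns a description of the request.
        return f"{verb} {resource}"
```

A server-side component would construct `RestClient(recorder)` and show up in the metrics, while an end-to-end test client built as `RestClient()` issues the same calls without contributing any samples, so the test traffic does not change how the reported numbers are read.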

A

All right, that makes sense, cool. Thanks for the updates on the clients. Okay, cool, I don't have anything else. Hey Andre, I saw you joined. I don't know if you've got anything else you want to add.

C

I tried to reach you because I would like to tell you that we are finally releasing ddesk.

A

Congratulations. That's exciting, yeah.

C

The only issue that I would like to talk to you about is that it is without GPUs for now. Let me explain why. Let me put the link here in the chat for you.

C

GCP doesn't allow us to enable VT-d, IOMMU equals true, on the kernel. It doesn't work. GPUs don't work without that, correct? There is no workaround.

A

Yeah, I think you need to enable IOMMU.

C

We fought hard on this, as you can see in my last one. They say it's not working. It's on the roadmap, but not working yet. So okay, we're going to release the service as is, without GPUs.

A

What does this mean, then, for you guys?

C

All the games don't work, and many other things. AutoCAD isn't going to work. But we need to start with something. We cannot wait anymore.

A

I totally understand, yeah. Okay. So I guess it sounds like they're working on it?

C

No. I'm open to discuss even moving to another data center.

C

Any suggestions? We tried Amazon. The reason we are not on Amazon or Azure is that they charge for the traffic between regions, and this is too expensive for us.

C

The files are stored in the US, and the users are logging in, for instance, with their desktop in Australia. Through the pipes of Google, we access the files inside the US storage, for governance and compliance as we understand it, and there are lots of pains.

C

Any suggestions are welcome, Brian. Okay?

A

I'm not sure, Andre. I don't know, I'd have to think about it a little bit.

C

Who are the guys that have lots of GPUs from Nvidia? That's my question for you.

A

Are you asking who the customers are who run clouds with GPUs? Is that what you're asking?

C

Other than Amazon and Azure.

A

I don't have a list off the top of my head. Okay, but I can try and connect you and find that for you, if you want.

C

Wonderful. Anyway, we have Nvidia G4S in the lab, and it works perfectly.

A

I think what would help me, Andre, because you've talked about your use case: would you send me an email with more detail about your use case and the requirements? That would help me further the conversation and understand exactly what you're looking for. Like, what were the requirements you had on using GCE, what were your requirements on AWS, any gaps, and things like that. That would help.

C

Yeah, I'm going to write it down for you, okay. Anyway, wonderful job you are doing. My technical team is seeing the performance parts that you are in charge of. This is an amazing job. Please keep doing it.

A

Sure. Oh, I'm happy you guys are seeing some improvements. We've got a bunch more things planned in terms of other metrics and visibility, and trying to discover more problems. So yeah, happy to hear the feedback. Thank you. And credit to the team too. I mean, it's not just me. It's everyone that's been...

C

You're just the coordinator.

A

Yeah, we're all working together on this, so the community's done a great job.

C

Yeah, I completely agree with you. Anyway, if you can help me in any shape or form, it is very welcome.

A

Sure, yeah, just send me an email with what I asked and I'll try and figure something out.

C

Wonderful. Well, in the meantime, I'd appreciate it if you can test the solution yourself, as a person. Okay?

A

Sounds good, cool. Andre, thanks for the info, and congratulations again. That's exciting. It's always hard work to get a product release out the door. That's cool. Happy to hear it, it's pretty awesome.

C

Ben, thank you so much. Anyway, thank you, Ally.

A

Thank you. All right, I think we're good for the meeting today. Thanks, folks. Bye-bye.