From YouTube: SIG - Performance and scale 2022-05-05
Description
Meeting Notes:
https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.tybh
A: Okay, all right, welcome to SIG Scale. It's May 5th, 2022. The meeting notes are shared in the chat. So today we're going to talk about a few things that we're seeing in performance right now.
A: We have this performance job that we run three times a day, and basically what it does is it creates 100 VMIs and measures how long each one took to go end to end, from start to Running. We have some thresholds where we expect a few things to complete within, you know, a certain amount of time; a VMI needs to start within a certain amount of time.
A: We also have a bunch of thresholds around a few other metrics, like HTTP requests, things like create requests. For 100 VMIs we expect a certain number of creates, and we expect a certain number of GETs, PATCHes, etc. So what this job does is measure those and make sure we're within the thresholds.
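As a rough sketch of the kind of check being described, a minimal example is below; the threshold names and numbers are made up for illustration and are not the job's real configuration.

```go
package main

import (
	"fmt"
	"time"
)

// thresholds is a hypothetical stand-in for the limits the periodic job
// enforces: how quickly the 100 VMIs must reach Running, and how many API
// calls of each verb a run is allowed to generate.
type thresholds struct {
	vmiStartupP95 time.Duration
	maxCreates    int
	maxGets       int
	maxPatches    int
}

// measurement holds what one run of the job observed.
type measurement struct {
	vmiStartupP95          time.Duration
	creates, gets, patches int
}

// violations returns a description of every threshold the run exceeded.
func violations(m measurement, t thresholds) []string {
	var v []string
	if m.vmiStartupP95 > t.vmiStartupP95 {
		v = append(v, fmt.Sprintf("startup p95 %v > %v", m.vmiStartupP95, t.vmiStartupP95))
	}
	if m.creates > t.maxCreates {
		v = append(v, fmt.Sprintf("creates %d > %d", m.creates, t.maxCreates))
	}
	if m.gets > t.maxGets {
		v = append(v, fmt.Sprintf("gets %d > %d", m.gets, t.maxGets))
	}
	if m.patches > t.maxPatches {
		v = append(v, fmt.Sprintf("patches %d > %d", m.patches, t.maxPatches))
	}
	return v
}

func main() {
	// Illustrative numbers only, not the job's real limits.
	t := thresholds{vmiStartupP95: 60 * time.Second, maxCreates: 400, maxGets: 2000, maxPatches: 1500}
	m := measurement{vmiStartupP95: 45 * time.Second, creates: 320, gets: 1800, patches: 1200}
	if v := violations(m, t); len(v) > 0 {
		fmt.Println("FAIL:", v)
	} else {
		fmt.Println("PASS")
	}
}
```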
A: So this is an example. The thing I wanted to talk about is that we're actually seeing this job fail as of April 23rd, and what's interesting is that this has been debugged and we're suddenly short some memory on these jobs. That's a bit surprising, because I'm not sure what happened; something changed on that date.
A: All of a sudden we need a lot more memory in order to launch these hundred VMIs and actually measure them, which is a little strange. I did some investigation into KubeVirt to see what was wrong, and from what I found on the dates this occurred, around the April 22nd to 23rd time frame, I didn't see anything that looks particularly suspicious. So roughly 12 days ago, right around that date.
A: Eleven or twelve days ago, around the 23rd, something happened here. I think one of two things could have happened: either something changed in the CI that's hogging more memory, or we merged some code that is simply taking up more memory. I'm not quite sure which it is; it could be one of these here, but I'm not sure.
A: April 22nd, okay, yeah, something happened; I'm not sure. So what I think we'll have to do at some point is go through these PRs and audit them, just to see. Because we actually have two jobs: we have this periodic job, and we also have a pre-submit job, but the pre-submit job is optional, so we have no way of actually knowing, since someone may not have run it when they did their PR.
A: We have no way of knowing which one actually caused it to fail; the periodic job caught it. So we're just going to have to play a guessing game and see if one of these suddenly caused a massive memory increase. And it's a decent amount, a few gigabytes, is roughly what we're seeing.
A: So we don't have to do that now, but as a follow-up I think what I'm going to do is create an issue and tag a bunch of these PRs for tracking, to try to get an idea of which one of them actually caused this.
A: All right, so we don't know the issue, and we won't answer it today, but it's something we'll figure out over time, I think. Okay, so on to the next thing.
A: This is something that's interesting. I don't know if you know, but I work at NVIDIA, and we run KubeVirt internally, so I do a lot of experiments on scale and performance based on our clusters. One of the things I've noticed is a little bizarre.
A: I've got three graphs here. We go through periods of churn in some of our data centers, where we'll basically have a high virtual machine instance count, and then we delete a few of them and recreate them. That's what I mean by churn: we're replacing running virtual machine instances. During that period of time there's more pressure on the control plane, so things just kind of take a little bit longer.
A: Naturally, because obviously more things are happening. So that's what this investigation was: we have a bunch of metrics, and we have this period where we have a little bit of churn, and we see a few interesting things from the components. The original observation was that I saw a high number of virtual machine PUT requests during low churn. You can kind of see it here; I've marked it, this gold line, and you can see it.
A: This is from Prometheus. On this gold line you can see that there's a regular cadence, and you have this very high number of PUT requests returning 200 from the controller, while during this time there are very few create requests. You can actually see the corresponding graph in Grafana, and there aren't that many VMs actually being created during this time. There are some, but not many, and it's a little strange that we have this.
A: We see so many PUT requests from the virt-controller at that cadence; it's a little bizarre. And then you can see the other part of this: this graph is from the virt-api, and you can see that it matches up with this period right here.
A: This is when we have some churn, so we're creating a lot of VMs and deleting a lot of VMs during this time period, and all of a sudden we see a flurry of requests to the virt-api. If we look over at the virt-controller, this purple line here represents PUT requests returning 409, a conflict. That's interesting, because we're seeing high 409s, which means the API server is getting a lot of traffic; that kind of makes sense, since we're sending a lot of requests and there are some conflicts. But we're also seeing a reduction in the number of 200s for PUT requests.
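One way to slice this kind of data is a per-verb, per-code Prometheus query. The sketch below assumes the dashboards are built on the standard client-go rest_client_requests_total metric and uses a placeholder Prometheus address, so treat it as an illustration rather than the actual dashboard query.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder Prometheus endpoint.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// Break PUT traffic from virt-controller down by response code, so the
	// 200s and the 409 conflicts can be compared against the VMI churn on the
	// cluster. The pod label depends on how the scrape is configured.
	query := `sum by (code) (rate(rest_client_requests_total{pod=~"virt-controller.*", method="PUT"}[5m]))`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```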
A: So it's kind of interesting, but it's contributing to things being a little bit slower. We're doing a lot more things, so, with time on the y-axis here, it's taking longer for these VMs to actually get processed. So there are two observations, just to summarize: first, lots of 409s seem to have a positive correlation with scheduling times, which is interesting; and second, that the virt-controller has this regular cadence.
A: It has this regular cadence where there are high amounts of PUT requests, and I don't know what it's doing, but it's doing something, and it's doing it at a regular cadence when there aren't many VMIs being created, which is a little bizarre. I guess the way I would put it is: I expect this gold line to be low, and then kind of increase along with this.
A: I expect these lines to follow the cadence that the API is showing, and they don't; they've got a completely different signature.
A: Yeah, right, because what this tells me is: okay, we have some activity going on. Here we'd expect a lot of activity, and here we do see a lot of activity, and then we see a little bit less activity; that could be explained by the 409s.
A: But then we see very little activity, and yet we see a ton of activity from the controller. It's a little strange, yeah.
A: I don't know; this is something I'm still going to look into, I'm still digging to find it. I think my next step for this one is to try to find what could possibly be causing this to occur at a regular cadence, and it's not just the PUT requests that returned 200; it's all these lines.
A: Every single one of these, POST 201s, GET 200s, all from the virt-controller, is at a fairly regular cadence, and it just seems strange that that's what's going on.
A: So I'll just have to do some more digging in the code, but I think I'm getting closer to actually getting an answer on this one. To go up a level here: one of the things that we eventually want, like I showed with the periodic performance job, is to limit the number of requests that we make, because these all go to the API server, and that affects our ability to scale in general.
A: Okay, so that's all I had for today. Do you guys have anything else? I could also talk about tracing; I think that was a topic I had last time. I don't know if anyone's interested in tracing, but if you want, I can talk about that. There are some folks in the community who are interested in working on tracing.
A: All right, I'm bringing it up here, so I'll do a brief discussion on this. With tracing right now, there's one thing that's interesting for us. Tracing has a lot of good use cases, right? It can tell us where slow code paths are.
A: It can identify when certain things are misbehaving at a very fine-grained level. And the other part is that we can also visualize that; there are a lot of good UI tools. Right now KubeVirt's support for tracing is very limited: the virt-controller supports tracing out to standard out, it's a very simple library, and it's just around a few hot code paths right now.
A: So the idea that we've discussed a little bit in the community is possibly expanding this, so it's not just something we write to standard out, but actually uses something like OpenTelemetry, which is something that Kubernetes has actually done a lot of work around.
A: Kubernetes has done a lot of work on integrating some of those solutions, and there are a few use cases for it. One of them is being able to trace inside individual components. Like I said before, it's important that we can see hot code paths within, say, the virt-controller, things that are taking a long time, and be able to visualize those.
A: The second use case, which I think is a lot more challenging, would be tracing across components.
A: That would be, say, if we sent a request to the virt-api: tracing it from the virt-api to the controller, down to the virt-handler, down to the launcher, and seeing all the paths it takes and how long it took in each of them. It's difficult, because these are separate components; they're different pods.
A: We would basically have to find a way to pass the trace all the way through. It's not impossible, but I think it would just require a lot of work to get it working; it would be very cool to see, though. And then the third use case is really, like I said, integrating with a UI, and Jaeger has been one of them.
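As a minimal sketch of what that could look like with OpenTelemetry in Go, assuming nothing about KubeVirt's eventual design: a span around a hot path in one component, plus injecting the trace context into an outgoing request so the next component can continue the same trace. The component and endpoint names are placeholders.

```go
package main

import (
	"context"
	"fmt"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Minimal in-process setup; a real deployment would configure an OTLP or
	// Jaeger exporter instead of leaving spans unexported.
	tp := sdktrace.NewTracerProvider()
	otel.SetTracerProvider(tp)
	otel.SetTextMapPropagator(propagation.TraceContext{})

	tracer := otel.Tracer("virt-controller")

	// Use case 1: trace a hot code path inside a single component.
	ctx, span := tracer.Start(context.Background(), "sync-vmi")
	callNextComponent(ctx)
	span.End()

	_ = tp.Shutdown(context.Background())
}

func callNextComponent(ctx context.Context) {
	// Use case 2: cross-component tracing. The trace context is injected into
	// the outgoing request headers so the next component (e.g. virt-handler)
	// can continue the same trace. "virt-handler.example" is a placeholder.
	req, _ := http.NewRequestWithContext(ctx, http.MethodGet, "http://virt-handler.example/healthz", nil)
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
	fmt.Println("traceparent header:", req.Header.Get("traceparent"))
}
```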
A: There are a bunch of these. I think one of the goals was that at some point I was hoping to come up with a design for this, once I have a little bit of time.
A: Okay, well, I don't have any more topics. Do you guys have anything you'd like to talk about? I think it's the first time I've seen you in this meeting. Is there anything that you want to hear about regarding KubeVirt scale and performance, any interesting topics you want to explore?
B: Hi Ryan, I do have a topic; I have a question about the API latency. You said you saw a lot of API calls from the KubeVirt controller.
B: I also noticed there are a lot of enqueues happening in the controller and the handlers. In our case it's because the secret is deleted after the VM is bootstrapped, so the secret controller always monitors and watches the secrets to check whether they're ready. If the secrets controller fails that check, the key gets re-enqueued, so there are always tons of secrets controller errors in the log. So I'm wondering if that is related to the large number of PUTs and other API calls.
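For reference, the re-enqueue pattern being described typically looks like the client-go rate-limited workqueue sketch below; the key name and the failing check are made up for illustration, and this is not the actual KubeVirt controller code.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// A rate-limited queue like the ones KubeVirt's controllers use.
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	key := "default/my-vmi"
	if err := syncSecret(key); err != nil {
		// Failure path: re-enqueue with backoff and try again later. If the
		// secret stays missing, this repeats and keeps generating API traffic.
		queue.AddRateLimited(key)
		fmt.Println("re-enqueued", key, "after error:", err)
		return
	}
	// Success path: reset the backoff counter for this key.
	queue.Forget(key)
}

func syncSecret(key string) error {
	// Placeholder for the real check against the API server.
	return fmt.Errorf("secret for %s not found", key)
}
```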
A: Oh, that's interesting. Do you think this might be the monitoring of that secret? Yeah, that's a good point, that could be it. That'll be interesting to find out.
A: Okay, we'll have to explore that then; let's see if we can. So this isn't just PUTs; it's PUTs, GETs, POSTs. I mean, I guess it could be, right? I don't know what we would be putting, but at least for the other ones, I think I could see it.
A: I could see that being the case. Okay, well, it's interesting; we'll have to continue to explore that, I think.
B: Yeah, I think we can start checking these parts of the code to see if they really cause the PUTs. If they do, that might be related to this latency.
A: Okay, makes sense, cool. All right, well, I don't have any more topics; thanks very much for joining. We meet weekly at this time, so we'll continue. I think we'll see if we can find out what this is, and then hopefully by next meeting get a little bit more information about this performance test, to get a sense of what's going on.