From YouTube: SIG - Performance and scale 2021-06-10
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit
A: One of the topics I wanted to talk about, and it was particularly interesting because it's something we were actually looking at in the last few days, something we were dealing with. So, just to give some background: we were doing some load testing recently around the 700 VM mark. We were doing it between releases, because we were looking to upgrade. We were running 0.24 and we were moving to 0.35, and we were noticing some latency with API calls. This is a partial snapshot of the Kubernetes API call latency, and you can see that there's a huge difference between 0.35 and 0.24: with 0.24 there's basically no activity, and then with 0.35 there's actually quite a bit of activity, to the point where, when we hit 700 VMs (I think that's roughly about 200 nodes), the LIST API call latency explodes up to a minute. I think it even goes higher as soon as you go higher; I don't know, in this graphic it just caps at a minute, but everything balloons. Let me even zoom in more: you can see how UPDATE balloons to four seconds. Normally the baseline, you can see it down here along with some of the other ones, is in the milliseconds, and it explodes quite a bit, as opposed to 0.24.

I thought this was interesting and figured I'd mention it, because we don't know the cause, what it was in 0.35 that caused this. But it was interesting to actually measure this and then see a major impact, because one of the goals we set, at least for this SIG, was to try and stay at less than one second of latency for the API calls; that's generally the goal that Kubernetes has set. So it was interesting to see this. And I've even provided more information: we see from events that the virtual machine instance events are exploding. The number of events is astronomical compared to other things; there are very few pods, very few of anything else, and there are just tons and tons of LIST calls coming from the virt-handlers that cause this. So I figured I'd mention it in case people had some ideas, or we could talk about some of the details; I'd love to.
B: The watcher, yeah. I think even you did the fix, or was it Marcus? Could be, too.
C: So you said you pulled it in. Were you able to rebuild and then deploy? How do we know that this bug fix made it into your test environment?
A: Yeah, so we did, we pulled it in. We then built with it and deployed with it.
C: And are we certain that the build manifests are referencing the newly built containers that you had? I guess I'm just trying to make sure that there wasn't a scenario where we rebuilt everything, but the manifests still reference an old container version or something like that, instead of the one that you all built.
A: Yeah, we went through the normal build process on this, and yeah, I'm pretty sure it's in there; I'm pretty confident it's in there. So let me see, where is it... here it is, this one. Is this the one you're talking about? This was the one we kind of backported.
A: Okay. Kevin, are you asking me?
A: Yeah, sorry, your question was: you want to see what else slows down?
F: Yeah, which processes: do some processes have a huge load, does garbage collection explode, on a lot of the machines? Like, where the slowdown might come from.
A: Where were they slowed down, too? So, all I have here is just the API call latency, so everything was affected in the cluster; because this is the kube API, everything was affected. I mean, I'm not sure I'm answering the question, but I don't have the metrics for every other load. The only thing that changed in this cluster was launching a lot of VMs really quickly.
C: Do we know where the list calls are coming from? In that metric that you're looking at, there should be a way to say what pod the requests are coming from, so maybe we could do a filter over it.
E: Yeah, so we have the metric, so maybe we can just see which metric you are actually checking, and also which labels; then we can check which labels the metric has, and maybe we can figure out some more information.
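For reference, a minimal sketch of how that kind of filtering could be done programmatically: it queries Prometheus for the 99th-percentile LIST latency per resource using the client_golang HTTP API. The Prometheus address is a placeholder, and the exact label set on apiserver_request_duration_seconds depends on the Kubernetes version, so treat the query as an assumption to adapt.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder address; point this at the cluster's Prometheus.
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// 99th-percentile LIST latency per resource over the last 5 minutes.
	query := `histogram_quantile(0.99,
	  sum(rate(apiserver_request_duration_seconds_bucket{verb="LIST"}[5m]))
	  by (le, resource))`

	result, warnings, err := promv1.NewAPI(client).Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```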
A: Yeah, hold on, it's going to take me a little bit to get to some of this. So let me go to the first one. You wanted to see some of the events, so we can see... like, from the API server, I can look at some of the list calls. Is that what you're looking for?
A: Yeah, let me... there you go.
C: We can maybe follow up on this.
C: Yeah, we'll find out. That's certainly a good finding; I think that would be an interesting thing to present as well, sometime in the future.
A: Oh, you think it doesn't have... is that with that list call, you're saying?
C: Well, we're going to get a list of virtual machine instances due to the informer on all the nodes, so you'll still see...
B: Yeah, yeah, that's true. You're right, yeah. There is.
C: For sure, right. We're seeing multiple lists; this is all within a second.
A: I think, from our estimates, we'd see a decent amount from handler, just because there are so many nodes and so many VMs; all together it just kind of combines into this symphony, a deluge of requests.
B: But as soon as it would then retry, retry until it finally reaches the node, the API server, then it says: okay, I have to do a list again, because I don't know the actual state. But that should be it, yeah.
A: Yep, yeah, I agree. Okay, I guess we don't need to spend time debugging here; if we have some issue about it, we can talk about it offline on Slack and see if we can find it. But yeah, I agree with that conclusion: the number of lists we're seeing is sort of surprising.
A: There shouldn't be that many anyway. Okay, but I'll leave that in here, because we can follow up on it, circle back, and see what exactly we're missing with this, or what's going on. Okay. So, let's move on to some of the other work that's going on. We have a few things. Let's start with this one, measuring performance; this one's... oh, actually, this one, the work that you're doing, David, in this PR.
A: So Dave is working on VMI phase transition times. Do you want to talk a little bit about this? I don't know how much you want to highlight from here, or if you want to have some discussion.
C: Sure, I'll just highlight my goal with this real quick. The goal is, I wanted a way, when we're doing these stress tests, to be able to monitor... sorry, I'm a little bit distracted, my daughter wants my attention. One second, give me like five seconds.
C: I wanted a way to monitor the total time until Running when we're doing these stress tests. The way I was looking at it, I just wanted to track the time between creation and Running and then be able to look at the outliers. So I wanted to see the p95 outliers for the time between creation and Running, and when I do a stress test, I see that p95 go up, similar to if you're doing a stress test on an HTTP server.
C: You would see the request latency go up; here we can see that it's taking longer and longer for identical VMs to go from the creation state to the Running state. So I can do that with the gauge, and that was the original idea I had. Through the discussion we began talking about some more advanced ways of getting something similar that might give us finer granularity of detail on the inter-phase transitions, so between, for example, Scheduling and Scheduled. That's not reflected in my creation-to-Running time, but with a histogram...
C: ...we can track the transition time between every single phase as well, which gives us a finer-granularity view of exactly which phases, at least, we're spending the most time between.
C: With that histogram, I found that it was really difficult to get to my ultimate goal of just the p95 of creation to Running. You can get something pretty similar, kind of; it's just a lot of calculations, and it's presented a little bit differently, and I was never really quite satisfied with the histogram for my goal. But I see the usefulness of the histogram. So I think what makes sense, and what I've landed on, is that both the histogram and the gauge would make sense, so we would get both metrics.
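A minimal sketch of the two metric shapes under discussion, using Prometheus client_golang; the metric names, labels, and buckets here are hypothetical, not the ones KubeVirt actually ships.

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Gauge: the most recently observed time spent reaching a phase.
	phaseTimeGauge = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "vmi_phase_transition_time_seconds",
			Help: "Last observed time spent reaching the given VMI phase.",
		},
		[]string{"phase"},
	)

	// Histogram: the full distribution of every phase transition.
	phaseTimeHist = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "vmi_phase_transition_seconds",
			Help:    "Distribution of time spent between VMI phases.",
			Buckets: prometheus.ExponentialBuckets(0.5, 2, 10), // 0.5s up to ~256s
		},
		[]string{"phase"},
	)
)

func init() {
	prometheus.MustRegister(phaseTimeGauge, phaseTimeHist)
}

// RecordPhase would be called by the controller when a VMI enters a phase.
func RecordPhase(phase string, elapsed time.Duration) {
	phaseTimeGauge.WithLabelValues(phase).Set(elapsed.Seconds())
	phaseTimeHist.WithLabelValues(phase).Observe(elapsed.Seconds())
}
```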
C: What I'm curious about, and the discussion that would help me right now, is understanding how people would use the histogram, because I'm still not completely seeing the value. I think I might be seeing the value, but how would people use this histogram, and what would it give them that's interesting? Because it's not obvious to me.
E: Right, so one of my comments here about the histogram is: with the gauge, you only get the last metric value that was reported. So, for example, consider that you have a 30-second scrape interval, or even longer, depending on the cluster; maybe you don't want to have Prometheus scraping all the time.
E: Say everything is at one minute, for example. Then you have one minute, and when you scrape, you just get the last metric value; you don't get all the VM transitions or creations that happened in between. The histogram actually shows everything: you get all the VM phase transitions in the histogram, and not just the last value that's there. So yeah, that's why I mentioned that.
C: It will always hit Running, because Running, assuming a VM stays up longer than the scrape interval, will eventually be hit. So for the Running condition, it probably gives me what I'm looking for; the individual phase times are less accurate, because we'll miss them. That makes sense to me. Okay, so really, my metric is only valuable, I would say, for Running.
A: What about... a histogram is one way to represent the data, right, and the gauge is sort of another way, right? Can't you still... so what this is showing is a representation of a gauge, right? Can you have multiple lines on this? Sure, so representing each phase.
C: I mean, the problem is that we'd miss it; that's what he was just pointing out, because of the scrape interval. We'll get Running, because Running is going to last longer than our scrape interval.
A: The only things you'd supply are the phase and the name, and we just use those to differentiate. So we'd have a line for each phase or something. Maybe we don't need the name; we just have one for everything, and then we have a phase to differentiate, and we just record, and eventually we record into the same gauge, but then Prometheus just scrapes it all and gets a few different metrics from a single gauge.
A: Yeah, because then... isn't it that the histogram is just one data format and this is another data format? Then we can choose; we could have both. We could technically report both if we want to, I guess, depending on how we want to look at it. But I think we can get the data either way. I mean, like I said here, you can have a histogram-backed one and a gauge-backed one, either one.
C: The histogram works with buckets; that's the thing that kind of threw me a little bit. We're only getting the level of detail down to which bucket the metric falls within, so you're not getting exactly how many seconds it is; you're getting which bucket it landed in, you're getting a count. So let's say it took 30 seconds to transition from Scheduled to Running: you're not necessarily getting the value of 30 seconds recorded, you're getting whichever bucket that fell into in the histogram, which, maybe, let's say that was...
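A tiny illustration of that bucket behavior, with arbitrary example buckets:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	// Example buckets, in seconds; values here are arbitrary.
	h := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "example_transition_seconds",
		Help:    "Illustration of histogram bucket semantics.",
		Buckets: []float64{1, 5, 30, 60},
	})
	prometheus.MustRegister(h)

	// A 30s observation bumps the cumulative le="30" and le="60" bucket
	// counters, adds 30 to _sum and 1 to _count; the raw 30-second value
	// itself is not kept, only which buckets it fell under.
	h.Observe(30)
}
```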
E: Yeah, but the quantile is computed over the samples that you have in Prometheus, and with the quantile on the histogram it will be, you know, over all of them; we don't miss any, you know.
E: Yeah, so... and I'm not sure I understand what the problem is with the histogram, for the data representation that you guys mentioned before.
F: The problem I have with... sorry.
C: I'm getting the create... I'm not getting from creation to Running, I'm getting it between the different phases. It's a different metric, though.
B: So, I mean, you could also create a histogram from creation to Running. But yeah, if we talk about having just the phase itself, then you would get the quantile for each phase, which I personally think is very appropriate for scale testing.
B: So that's always the point I don't get. I mean, with the gauge versus the histogram: if it's really just for the Scheduled and Scheduling phases, it's really just per phase, but we just have a few, and the data which you get is exactly what I would consider to be the thing we want to see, but obviously not for the...
C: Well, I'm just thinking about what the thing is that we're tracking in real life. If somebody posts their VM, what they're impacted by is the time from when they post it to the time that it becomes available. So I just want to track exactly that. I don't want to interpolate or have some sort of interpretation of what that might be; I just want to actually track it.
E: Yeah, yeah... I would say that's true, so maybe you should have a histogram for just the whole thing, you know, from create, from submission, to Running, and we don't have the phases in between, yeah, because that...
B: And so, for instance, you wouldn't care if, when starting 1000 VMs, there are some cases where it then takes some 200 seconds to start; it doesn't really matter if it's 200 or 100 when you say it should be below 50. So...
A: We can see that there was an increase in the time it took to get to Running, and then with the histogram here we can see that over time we have a lot of really slow ones that are taking this... I think they're both valuable. They're both valuable.
B: Right. All I want to repeat again is: if you create a histogram, too, in addition to the histograms per phase, one where you go from creation to Running, the histogram is also collecting the sum and the count, so you get exactly the same thing with it, too.
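That point can be made concrete with query strings against the hypothetical histogram from the earlier sketch: the _sum and _count series it automatically exports recover the same aggregate a gauge would give.

```go
package metrics

// These PromQL strings assume the hypothetical metric name used in the
// sketch above; they are illustrations, not shipped queries.
const (
	// Mean time spent per transition over the last 5 minutes,
	// reconstructed from the auto-exported _sum and _count series.
	meanTransition = `sum(rate(vmi_phase_transition_seconds_sum[5m]))
	                / sum(rate(vmi_phase_transition_seconds_count[5m]))`

	// p95 per phase, estimated from the bucket counters.
	p95PerPhase = `histogram_quantile(0.95,
	  sum(rate(vmi_phase_transition_seconds_bucket[5m])) by (le, phase))`
)
```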
C: I don't know yet; I need to play around with it. It's kind of frustrating to me, just because I know exactly what I want.
C: As far as what I want, I want to see the performance increase, and I know how to represent that, and I feel like I'm going through a lot of hoops to get what I want, just in a different way, and I don't entirely understand why it's useful. Like, I understand why it presents the information in a different way with a histogram, and how you can get some interesting, more detailed results, especially with the per-phase transitions and things like that.
C: Yeah, I see the histogram as definitely being useful; I want to make sure I'm not shooting it down. I think it's like the next step, or maybe it's the first step for some people. But when I'm running this, I want to see, okay, latency increased, and the histogram is going to directly correlate that to what phase we spent the most time in, and that's all great. I like both views, yeah.
C: ...helps for the view that I currently have? Is it performance-wise, like from a metric-collecting perspective, or...
B: It would reset, and Prometheus would realize that it did a reset.
C: Okay, I'll mess around with it a little bit more and see if I can get the kind of results that I'm looking for.
A: Okay, cool, thanks. Okay, let's go to another one. So, this is on the mailing list: Fan proposed this, how to improve performance in the virt-controller. There's a link to the thread, and he put together a pull request.

It's a little bit of an example of what he's looking to do. I don't know, Fan, if you're here, but I think it would be good to kind of talk about some of the ideas. Maybe we can learn some things from this, because Fan saw some improved performance from it. Fan, are you here? Yeah? Oh, hey, okay, yeah.
D: Yeah, yeah, thank you. So, basically, it has three topics. One thing is, I want to reduce the enqueues that happen in the virt-controller. The third...
A: One second, one second; just to preface this a little bit more. So, in the virt-controller, we have our... oh, I wish I had the picture for that, but okay. So, in the virt-controller, we have a reconcile loop we're going through; we're doing an update, and that's where this updateStatus is, and there's this informer. That's kind of the background on this. I'll find the picture and put it in the background.
D: Yeah, so basically this is what we observed in practice: the latency between the pod creation and the VMI creation. When I looked into the printed-out logs for the pods and the VMI status updates in the virt-controller, you can see a lot of enqueues happening in the virt-controller, even though these enqueues are not related to the status updates. So when I reduced the number of enqueues, by enqueueing only the events...
D: ...the events relating to the status changes, the queue length here was reduced very much. So that's the first part of this proposal: I want to reduce the enqueues that happen, so we only enqueue when something like a pod create or delete, or an update to the VMI status, happens, or something like that. And the second thing is... so, right now, yeah.
D: I agree that using the queue and worker-queue model is good for concurrent processing, but I think we should have some supplement for pre-processing, like "rolling the ball" fast before enqueueing to the worker queue. That's the purpose; that's the third part of this proposal, or the second part of the proposal. So, when we create a pod, the created pod can trigger... yeah, when we have a VMI event happen, this will...
D: This will trigger a pod-create event, so before enqueueing that, we can use this; and when some failure happens, we can still enqueue this event to the worker queue anyway. So this is the basic idea of the "rolling ball": we still keep the major logic of the worker queue, but we just speed up the processing. So that's basically the logic.
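A rough sketch of the first idea, filtering in the informer event handler so that only status-relevant pod changes get enqueued. This is hypothetical illustration code, not the actual virt-controller handler; podIsReady and the exact relevance test are assumptions.

```go
package controller

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

type Controller struct {
	queue workqueue.RateLimitingInterface
}

// podIsReady is a hypothetical helper checking the PodReady condition.
func podIsReady(pod *corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}

// updatePod is the informer UpdateFunc. Instead of enqueueing on every
// pod change, it drops updates that cannot affect the VMI status.
func (c *Controller) updatePod(oldObj, newObj interface{}) {
	oldPod := oldObj.(*corev1.Pod)
	newPod := newObj.(*corev1.Pod)

	// Periodic resyncs hand back identical objects; nothing to do.
	if oldPod.ResourceVersion == newPod.ResourceVersion {
		return
	}
	// Skip changes that leave phase and readiness untouched.
	if oldPod.Status.Phase == newPod.Status.Phase &&
		podIsReady(oldPod) == podIsReady(newPod) {
		return
	}
	if key, err := cache.MetaNamespaceKeyFunc(newPod); err == nil {
		c.queue.Add(key)
	}
}
```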
C: If I can ask a question: what I don't understand here is, the work is still being done. The logic is still being processed in the exact same process; we're just moving it around.
D: Yeah, the basic issue is that the work queue has a very long queue length. As I look into the queue length, you can see hundreds of events. So if we don't change the threads, using the default, for example, then in practice we always have hundreds of VMIs, hundreds of keys in the queue; it's keeping hundreds of keys in the queue for a creation of 500 VMs.
D: These keys in the queue come from the pod events and the VMI events, so they're mixed.
D: So, as I illustrated in the issue: when we created a VMI, the associated events, like the pod updates and the VMI updates, queued up. They are not sequential as the keys get distributed, so in the next round, waiting for an available worker to pick up this key would take a while. That's where the latency happens.
C: I see that the queue backs up, and yeah, you're right, that's caused by us enqueueing from lots of different places; we're looking at the pods and the virtual machine instances and stuff, and lots of places can put something on the queue. I'm more interested in why the worker queue isn't able to keep up with that. Given the fact that we're seeing it backed up, I'd like to understand why our worker queue is inefficient.
C: Even when we expanded it, when we gave it lots more threads: what's going on in our worker queue that means it's backed up? Because I would think, for example, if we have, let's say, 100 VMI keys queued, we should be able to chew through that in, like, milliseconds. It should be nothing, especially when no API calls are being...
B: ...invoked, yeah. And especially considering that, for instance, when we look at the enqueues coming from pods, we have to consider that at the same time that we seem to collect all this backlog in the queue, Kubernetes was able to process the VM, update it, and even send the watch update to us so we enqueue it, and we are behind. So this is really weird.
C: We could be doing something silly, like making an API call every time a reconcile is popped, and if we...
C: ...that means that the work queue is probably doing something inefficient. I mean, I think that's really valuable; I question more where we're targeting our efforts on improving the performance. I would want to understand more details about the actual work-queue execution and where it's slow, and perhaps even get some pprof on that during a stress test, instead of moving logic out of it. Because I understand that that improves things, but I don't know why.
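Getting pprof out of a running controller during a stress test is the standard Go pattern; a minimal sketch, with the port and wiring as placeholders:

```go
package debugging

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

// startProfiler exposes the built-in Go profiler on a local port.
func startProfiler() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}
```

A 30-second CPU profile can then be captured during the test with `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30`.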
E: Right, I see it like hiding the problem, you know: you are bypassing something, and then you just don't see the problem anymore. And also, I think bypassing the work queue is like removing the whole fundamental idea of Kubernetes: that you have things asynchronous, send them to the queue, and process everything asynchronously.
E: I think it's going in a different direction from what Kubernetes suggests to do.
A: Would you say this... so we'll kind of break down some of these. So, like, number three: if we go from, say, Pending... like, are we saying we shouldn't skip phases? Is that what we're saying, or...
D: Oh, okay, yeah, sorry. So in the description, number three, I'm talking about the updateStatus. Currently, for example, when the VMI is created, it automatically goes to unset, so it's waiting for the next available worker to pick it up, check the VMI status, and move it forward to the next stages. So consider the case where the pod gets to running very quickly, and by then there isn't a worker...
D: If there isn't an available worker when the pod has updated the VMI, the VMI is still in unset, waiting for the next available worker to pick it up. In the current logic, it only checks whether the pod exists and is not ready, so it will go into Scheduling, not Scheduled, and then it also needs to wait for a third available worker to pick it up and process it into Scheduled. So the latency happening is t1 plus t2; this is the latency in the virt-controller.
B: See, I mean, we have a limited amount of workers, and it has to go some places, so of course we increase the API load when we have more phases and everything. But, as you see, the API server as such seems to be perfectly happy: it can start the pods fast, and the API server reports the updates to the watches fast.
B: So you should see a delay when you start more VMs, and if you start a lot of them at the same time, you should maybe see quite some delay for a very short amount of time, but not over such a long period. It really looks very weird, from a logical point of view, in our controller.
A: What is it with the work queue that could possibly be causing it to be slow? I think maybe one thing that would be useful is if we could see a diagram of the keys and how they balloon, and see, okay, over time, how quickly they're being processed, how many there are relative to the number of VMIs.
A: I think that would help, maybe a picture of that, and then maybe we can start looking at some of the code snippets. How does that sound for a next step?
B: Yeah, I'm not sure; there should be some metrics for controllers already. I'm not sure if we expose all the controller metrics which are built in by default; you should see some. But at least you would need some measurement points regarding the sync logic and the updateStatus logic, and you would probably want to watch the REST API calls, the PUTs, to see whether they take long or not, stuff like this.
A: Yeah, I think this was one of the... one of our goals we wanted to get to.
B: ...collects how long the key itself takes, or how long the processing of the queue entry takes in our logic, since we have this controller framework. If not, it should be pretty trivial to add one or two measurement points in the controller itself, and then you can probably see at least immediately in which parts of the controller, roughly, the time is spent, yeah, because that...
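A sketch of the kind of measurement point being suggested: wrap the existing per-key sync with a timer histogram. The metric name and the Controller wiring here are hypothetical.

```go
package controller

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/client-go/util/workqueue"
)

// syncDuration is a hypothetical measurement point, not an existing metric.
var syncDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "virt_controller_sync_duration_seconds",
	Help:    "Wall time spent processing one work-queue entry.",
	Buckets: prometheus.DefBuckets,
})

func init() {
	prometheus.MustRegister(syncDuration)
}

type Controller struct {
	queue   workqueue.RateLimitingInterface
	execute func(key string) error // the existing per-key sync logic
}

func (c *Controller) processNextWorkItem() bool {
	key, quit := c.queue.Get()
	if quit {
		return false
	}
	defer c.queue.Done(key)

	// Time the sync so slow spots show up per queue entry.
	start := time.Now()
	err := c.execute(key.(string))
	syncDuration.Observe(time.Since(start).Seconds())

	if err != nil {
		c.queue.AddRateLimited(key) // retry later
		return true
	}
	c.queue.Forget(key)
	return true
}
```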
A: We can look at doing something like that, kind of looking at implementing some of these metrics and trying to get some of this information. Or, if you want, look at the faster way and kind of just generate it again, and generate a graph from whatever data you can gather, whatever you think. But we definitely do want to get to these eventually, get some of these working.
D: Yeah, I have some tests on my local development machine. I just print out some logs for each... I just directly print out the log in the code, so I can watch the log file of the controller to see how the queue works and how the event handler works. For example, in mine I can see that, for each VMI update, the virt-controller is called 30 times.
D: There are 30 events, or 30 actions, on the VMI update, and 70 percent of them are enqueue activities. So when...
D: Oh yeah, so that means every time the event handler is called, it will enqueue to the worker queue, right? That's what we call the enqueue activity, so the enqueue happens 30 times.
B: Yeah, so... but that should not necessarily be a problem. The part which we have to get right is that we don't do any REST calls when they're not necessary, and that we don't hot-loop on status updates, like changing timestamps or whatever. Did you see something there like that, where we are really doing status updates of the VMI which should not happen?
D: Oh, I think... yeah, that's my point. Currently, we do nothing except enqueue into the worker queue and wait for the main logic to process. So if I reduce the number of enqueues, so that we just enqueue when something relating to the status happens, we can still update the VMI status, but the queue length is reduced very much. So we do need to...
D: We don't need to enqueue so many keys for the same VMI. And also, I see that, in the current state, when we start creating VMIs, for 500 VMIs the queue length grows very fast, and we have to wait for a while, like three minutes, for a VMI to be picked up and updated, starting the update.
A: Yeah, so that's one of the problems, with the queue length and how slow they are. And then the other thing you mentioned, and maybe this is, like: do we need to re-enqueue every time a pod is updated? Like, for example, what you're doing here; is this something that we want to run through all the... yes, yes, we do, every single time, even on a scheduling event for a pod.
B: Yeah, I mean, you would have to embed here the logic of when something is interesting or not, which means that you have to keep it in sync with the main processing logic. If they are looking for scheduling conditions and want to update them on the VMI status, you would have to keep that in sync here, too, to really enqueue it because it's interesting; and you actually do not. And you would also have to look at the VMI, whether the VMI has the data already or not.
B: So it's really hard to tell here, and this should really not be the thing which we look at right now. There are other things with the callbacks, which I'm happy to explain in detail if there's interest, regarding goroutine scheduling, where we have issues if we do too much in the callbacks due to the lock mechanism. But the key thing for me is this.
B: I can understand if things become slightly slower if we really start a lot of VMs. But what I did not see so far is an indicator of how long we spend on processing the VMI, for instance, even if we don't have to do anything. Because it should just not matter too much whether we enqueue it 10 times or 20 times when it doesn't have to do anything with the objects; those are pretty much all in-memory actions.
B: We do a few memory lookups, we do a few comparisons of objects, and we decide, okay, we don't have to do anything; and that should be really fast. Again, think about the fact that the kubelet and the API server are doing, at the same time, all the work with the same model, and giving us all the updates on those objects, and they really did something there, including REST API updates, and we are supposed to do pretty much nothing on many of them, but they have the same model.
E: Did you check the CPU utilization of the node that is running the controller? Maybe you're getting saturated, you know; I don't know, just guessing. You know, 100 percent of the CPU. Because if you are processing many things, for example, when you do the bypassing you decrease the CPU utilization, and then you see things going fast.
B: This is actually a very good hint, because I think we are not setting CPU limits or CPU requests in this old release by default, and you may end up in the best-effort quality-of-service class, which means that you may share the CPU with a lot of other processes. And what Kubernetes does here is, it gives the process just a very small amount of CPU time very often; if nothing else is there, you get a minimum amount of time.
A: Yeah, okay, I'm writing a few things down. So: let's diagram the queue length, let's measure the cost of each enqueue, let's do pprof of the virt-controller, and then let's look at... what was it you wanted? The CPU...
B: CPU requests, okay, CPU requests. You should set one; you can do that pretty easily by specifying some defaults on the KubeVirt namespace, you know, some request defaults. Okay, then you get them immediately there. Just set them to something reasonably high, to just rule it out. Let's see.
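The namespace default being suggested can be expressed as a LimitRange; here it is sketched with the client-go types, where the namespace and the 500m value are placeholders, just to rule the QoS issue out:

```go
package defaults

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// defaultCPULimitRange gives every container without an explicit request a
// default CPU request, which moves the controller pods out of the
// best-effort QoS class.
func defaultCPULimitRange() *corev1.LimitRange {
	return &corev1.LimitRange{
		ObjectMeta: metav1.ObjectMeta{Name: "cpu-defaults", Namespace: "kubevirt"},
		Spec: corev1.LimitRangeSpec{
			Limits: []corev1.LimitRangeItem{{
				Type: corev1.LimitTypeContainer,
				DefaultRequest: corev1.ResourceList{
					corev1.ResourceCPU: resource.MustParse("500m"), // placeholder
				},
			}},
		},
	}
}
```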
A: Okay, all right, so we have some options. Okay, I think those will be the next steps for this one, okay, and we can kind of circle back on that. So there's a mailing list thread; maybe during the week we can use the mailing list thread to see what we find, and then, in two weeks' time, we can talk about some of the findings for this one. Okay, cool. All right, we're at time.
A: So, in the last few seconds here: are there any other open items, anything else anyone wants to bring up that isn't listed here?