From YouTube: SIG - Performance and scale 2021-09-16
Description
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.qs7aweajr18k
A
Okay, let's start with the first item. There are some new bugs that I reported pretty recently, and they're kind of interesting, so we're going to talk about two of them. There are actually some other related items I saw, but I want to investigate a little bit more before I post those issues, so we'll just talk about the two that I think I understand the most. Just to show some context on this:
A
We're doing some testing internally at pretty large scale, and we're seeing virt-controller panic when we delete a bunch of VMs. Not a ton, maybe a few hundred or so, but at that scale there are a lot of events occurring, and with that many events some of the edge-triggered events like deletes can be missed by the watch. Here's how the controller handles this:
A
The informer puts a special key on the queue, and that key is a different type than what virt-controller expects. It expects a VMI type, tries to do a type assertion on it, and that actually causes a runtime panic.
A
What actually ended up happening is that two of our controllers would basically alternate between panics, and it was hard to tell whether, or exactly how, it eventually healed. Perhaps there were enough events that the deleted final-state-unknown key was eventually flushed and the panics went away. It's hard to really say, but eventually it does heal.
A
So
there's
a
pr
open
to
to
fix
this
one.
That's
that's
here
and
and
roman
has
already
reviewed
it.
So
that's,
that's
one.
Are
there
any
questions
on
this
one
I'll
go
to
the
second
one.
If
there
aren't,
this
was
kind
of
neat.
A
Okay, so about this key: there are multiple object types, and we actually handle it for the other objects we watch; pods and data volumes have handling for this. I think it was just virtual machine instances that were missing it, and it only occurs in rare cases. I've never seen this before; it just showed up at this kind of scale, and it doesn't occur all the time.
A
So I don't know exactly what it is, but I think we just hit a point where we started missing some deleted events, and then it started to pop up.
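[For context, this "deleted" key is the DeletedFinalStateUnknown tombstone that client-go hands to a delete handler when the watch missed the actual delete event. A minimal sketch of the usual handling pattern, with illustrative names and import paths rather than the exact KubeVirt controller code:]

```go
package controller

import (
	"fmt"

	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"

	v1 "kubevirt.io/api/core/v1"
)

// deleteVMI sketches a DeleteFunc that tolerates missed delete events.
// When a delete was missed, obj is a cache.DeletedFinalStateUnknown tombstone,
// not a *v1.VirtualMachineInstance, so a bare type assertion would panic.
func deleteVMI(obj interface{}, queue workqueue.RateLimitingInterface) {
	vmi, ok := obj.(*v1.VirtualMachineInstance)
	if !ok {
		tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
		if !ok {
			fmt.Printf("couldn't get object from tombstone: %+v\n", obj)
			return
		}
		// The tombstone wraps the last known state of the deleted object.
		vmi, ok = tombstone.Obj.(*v1.VirtualMachineInstance)
		if !ok {
			fmt.Printf("tombstone contained unexpected object: %+v\n", tombstone.Obj)
			return
		}
	}
	// Re-queue by key as usual once we have a real VMI object.
	if key, err := cache.MetaNamespaceKeyFunc(vmi); err == nil {
		queue.Add(key)
	}
}
```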
A
Okay, let's go to the second one. This is actually from the same incident, the same scenario of deleting VMIs at large scale. In the virt-controller logs there are tons of these StatusReason: Invalid errors, which is a 422 error: basically the request you're making is well-formed, but the server is not processing it. I was looking around a little bit on this.
A
At this point virt-controller doesn't own some of these objects anymore; virt-handler has them, so virt-controller is trying to do a patch on the status field. It does that for the conditions and for this activePods field under status.
A
My thought is that it's this activePods field: something is failing when we try to patch it, it's not working, so we get these 422s. It isn't harmful, it's just that there are tons of them in the log, and since we're cleaning up the object anyway it would be nice to get rid of them. Here's an example of the error.
A
You can see it right here from just a few different objects, but it completely fills the entire log.
A
Yeah, and this is what I think is the code behind the error; you can see the matching error here. We try to do this patch with these patch bytes, and the two things I was mentioning are on the status field: we're patching some conditions and we're patching this activePods field.
A
I haven't fully tied it together, but it only really showed up when we were deleting. I hadn't noticed this prior, and it seemed like it only occurred at large scale; something just seemed off. We've done some other error-code analysis and we haven't seen this many 422s before, and now there's sort of an explosion of them.
C
A
No, I don't know who's rejecting it. It could be the validating webhook; I don't know, I couldn't tell.
D
A
Maybe it's in the API server, on the kube-apiserver side; I don't know. The reason I posted this one is that I think I have a somewhat better understanding of the first one than of this one, and there's still some more investigation to do to pinpoint it exactly. I can see the error.
A
I can see what we're trying to do, but, like you said, I don't know who's rejecting it, and I don't know how often we're doing this. I also don't know why; that wasn't quite clear.
B
Here's my theory. In this specific instance it's a patch that we're doing, correct? Yeah.
B
Yes. In our patch we're doing a JSON patch, and we have a test condition followed by the actual replace or add or remove, whatever we're wanting to do. The test condition says: here is the way we think this struct should look, based on what's in our informer and the information we have, and here's what we want to change. If that test condition fails, the whole thing fails, which is saying that the reality is the thing you're trying to patch no longer looks like what you think it looks like.
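[A minimal sketch of the kind of JSON patch being described, with made-up values; the exact paths and the way virt-controller builds them may differ. The point is that if the "test" operation no longer matches what the API server stores, the whole patch is rejected, which is what surfaces as the 422/Invalid errors above:]

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is one RFC 6902 JSON-patch operation.
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"`
}

func main() {
	// Hypothetical old/new values; the real controller derives these from its informer cache.
	cachedActivePods := map[string]string{"5c3b6c97-uid": "node-a"}
	desiredActivePods := map[string]string{}

	ops := []patchOp{
		// "test" guards the patch: it must match what the server currently stores...
		{Op: "test", Path: "/status/activePods", Value: cachedActivePods},
		// ...otherwise the following replace is never applied and the request fails.
		{Op: "replace", Path: "/status/activePods", Value: desiredActivePods},
	}

	patchBytes, _ := json.Marshal(ops)
	fmt.Println(string(patchBytes))
	// patchBytes would then be sent as a JSON patch (types.JSONPatchType) against
	// the VMI. If the informer data used to build the "test" value is stale, the
	// server fails the test and rejects the whole patch.
}
```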
B
What could be happening is that our informers are behind: we're constantly trying to make a change on the VMI, but the information we're using to make that change is inaccurate because the informers haven't caught up to reality yet. I made an optimization for the update path in the VMI controller that says: don't try to update this VMI again until we've seen the previous update arrive. I didn't do that for the patch, so it could be the same problem.
A
Sorry, what was that one? I don't remember the error code for it now, but I remember it.
B
But it was specific to the update API call, not the patch one. We do both, and I only addressed the instance where we're doing an update; I did not address the one where we're doing a patch, so it could be the same problem.
B
It's
just
manifesting
itself
differently
because
in
the
case
of
an
update,
for
example,
we're
going
to
fail
in
the
back
end,
because
the
updates
revision
or
whatever
is
not
accurate,
but
in
the
case
of
the
patch
we're
rejecting
the
update
or
the
patch,
because
the
patch
condition
is
failing
that
we've
supplied
so
we're
going
to
get
a
different
error.
I
think,
but
the
same
cause
could
be
invoked
for
both
like
it
could
be
the
same
underlying
cause
that
we're
trying
to
modify
an
object
and
the
object
that
we're
trying
to
modify.
B
A
Yeah, I'm trying to find your patch, if I have it here somewhere. Oh, "reduce VMI collisions", this is the one, yeah.
A
Yeah, that sounds plausible to me. Okay, 409, I said 429, okay, so yeah, it sounds possible. For additional context, like I said, this is happening at the same time the panic is going on, so we're restarting virt-controllers a lot and the informers are catching up a lot. That kind of led to another thing I wanted to explore
A
a little more, that is: the time it took for virt-controller to catch up. It took a while, because there are a lot of requests where it's like, hey, we want to update or do something with this object, but we're behind. There's a ton of those.
A
So yeah, that sounds like a good point. I'll tag this in here just as a reference, so we have it marked. Okay.
B
And the fix is relatively simple: we just have to follow the same expectation logic. It's only maybe 20 lines down that we're doing it for the update; we just need to do something similar for the patch. I need to look at it a little bit to make sure we're only issuing the expectation when the patch is actually going to change something, but I think that's the way it would work.
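[A rough sketch of what gating the patch behind an expectation could look like. The expectation tracker here is a toy stand-in written just for illustration; the actual fix would reuse the controller's existing expectations machinery:]

```go
package controller

import "sync"

// pendingChanges is a toy expectations tracker: per VMI key, it remembers
// whether we are still waiting to see our own last change come back through
// the informer.
type pendingChanges struct {
	mu      sync.Mutex
	pending map[string]bool
}

func (p *pendingChanges) satisfied(key string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return !p.pending[key]
}

func (p *pendingChanges) expect(key string)  { p.mu.Lock(); p.pending[key] = true; p.mu.Unlock() }
func (p *pendingChanges) observe(key string) { p.mu.Lock(); delete(p.pending, key); p.mu.Unlock() }

// patchStatusIfNeeded only issues the patch when we are not waiting on a
// previous change, and only records an expectation when the patch would
// actually change something. doPatch stands in for the real API call.
func patchStatusIfNeeded(p *pendingChanges, key string, patchBytes []byte, doPatch func([]byte) error) error {
	if !p.satisfied(key) {
		// Our previous change hasn't shown up in the informer cache yet, so
		// anything we compute now is likely stale; skip this round.
		return nil
	}
	if len(patchBytes) == 0 {
		return nil // nothing to change, don't record an expectation for a no-op
	}
	p.expect(key) // record before sending, so the informer event can clear it
	if err := doPatch(patchBytes); err != nil {
		p.observe(key) // the patch never landed, drop the expectation
		return err
	}
	// The informer's update handler would call p.observe(key) once the patched
	// VMI is seen, re-enabling further patches for this key.
	return nil
}
```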
A
Anyway, cool. So those are two new bugs, and like I said there are some other ones I was thinking of. Like I briefly mentioned, the time it takes for virt-controller to catch up could be another one, but I want to get a bit more measurement on that, a rough amount of time and some more logs, to get a better picture before I file an issue. It's something I've noticed, though.
A
Okay, all right, let's go to the next bullet point: reduce memory overhead for the launcher. This is a discussion you wanted to have here. We have a few issues open relating to all of this.
A
And I want to kind of consolidate here. Let's see, where's the first one? Okay, this one: "reduce memory overhead of launcher" is the first one, and the ones I thought we had some overlap on were this one, removing the monitoring process from the launcher to reduce its memory footprint, and then also the profiling of the control plane under high load. So my point is, I kind of wanted to talk about the goal here and see if we can outline some of the tasks. This sounds to me like one of the tasks that could actually go in here as a possible optimization for launcher, but I also want to talk about some of the others. Is it Daniel? Is that who this user is?
A
Hey, do you want to elaborate a little bit on some of your goals with this? Maybe we can enumerate some of them and see where we can go. Yeah, go ahead.
F
So the context is that there are the Outreachy internships, which are paid open-source community internships for diversity in tech, and we thought we need some projects the interns can work on, and this topic has been around for a long time. So we thought, why not do it? We don't really have a set path forward.
F
We just know that we can probably reduce the size with a separate binary for this, or if we rewrite it in a different language, and the other option would be to remove the forking altogether. But I guess that needs to be discussed, whether we can keep the guarantees that the forking process is providing.
A
C
G
Yeah, exactly. So I've tried to estimate removing this virt-launcher forking, and when I went back to when it was introduced, there was some special case that I think it covers. I think it's for container disks. I haven't really used this feature and I'm not sure we ever use it on our side.
G
So
we
thought
that
if
it's
not
critical
for
us,
we
could
just
just
remove
it,
but
I
think
if
it's
in,
if
it's
yeah
the
the
problem
with
with
forking,
is
that
I
I
don't
see
that
that
it
really
does
what
what
what
it's
supposed
to
do
because
like
there
is
always
a
chance
that
that
this
and
this
monitoring
process
can
go
away
like
I
don't
know
it
could
be,
killed
or
or
whatever,
and
then
there
there
is
no
monetary
process
and
there
is
no
one
to
to
clean
up
or
or
wait
for
for
those.
G
So
I
don't
see
really.
I
don't
really
see
a
reason
why
we
can
just
do
it
in
in
like
special,
go
routine
or
or
something
like
that,
that
that
would
do
this
like
watching
and
yeah
like
as
daniel
mentioned
it
this
this.
This
monitoring
just
just
adds
a
lot
of
memory
and
and
to
overall
so
yeah
that
should
be
discussed.
A
Okay, so maybe we can define some of this. How do we want to talk about the reason for the forking? Does anyone want to speak to that, in terms of what the benefit of it was? Okay.
D
Okay, great. So I want to add that the main purpose of it is as a precaution: it does very little by intention, which means it's very unlikely that this process crashes, while our virt-launcher process, the one which is actually talking to libvirt and to QEMU and so on, does a lot more things. If that crashes and it's our main process, the container is down and all the other processes would be stopped immediately. So yeah, it's a precaution.
A
Right,
so
we
don't
want
it
as
pit
one
right.
We
wanted
something
else
so
that
we
just
don't
immediately
once
it
fails.
We
just
go
away
like
so.
In
other
words,
we
it
makes
like,
because
I
I
understand
like
like
we
want
some
sort
of
something
else
to
to
to
be
there.
So
I
guess
the
question
is
like,
so
you
knew
you
said
created
another
go
routine
like
if
we
have.
A
B
Excuse me, sorry. If the virt-launcher pod crashes today, the point of the PID 1 and the forker is that we do some graceful cleanup of the QEMU process if it's still around, so we'll attempt to shut it down in a way that's not going to cause disk corruption and things like that. I think maybe that's the underlying reason why we even have a catch-all kind of thing like that. Originally it was a bash script.
B
What
you
said,
yes,
that's
different,
so
originally
we
had
a
bash
script.
That
would
do
this
kind
of
clean
up.
It
was
really
unwieldy
and
we
just
made
a
function
and
vert
launcher
that
would
do
it
and
then
we
just
decided
to
fork
launcher
from
vert
launcher.
I
think
we
can
achieve
the
same
thing
in
a
go
routine
and
have
it
catch
panics.
So
very
first
thing
we
do
when
burnt
launcher
starts,
is
we'll
create
this
go
routine,
we'll
have
it
catch
panics.
B
A
B
We don't have to fork; I think we can do it in a goroutine. When a panic occurs, we have the opportunity to still execute something. I think that will work, but maybe somebody else has more thoughts on that. If we don't do that, the other alternative is to create a really stripped-down forker.
B
That
only
does
exactly
what
we
need
that
fork
logic
to
do,
rather
than
loading
everything
that
vert
launcher
needs
because
there's
just
a
lot
of
dependencies
and
everything
to
get
started
with
vert
launcher
and
if
we
just
have
a
really
really
small,
thin
binary,
that's
just
in
charge
of
launching
burnt
launcher
and
then
sharing
a
vert
launcher
exits
that
the
commuting
process
is
torn
down
as
gracefully
as
possible.
Then
that's
great
and
the
result.
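[A toy sketch of what such a thin forker could look like, assuming the cleanup is just "signal whatever is left in the child's process group"; a real one would need the actual QEMU shutdown sequence, timeouts and escalation to SIGKILL:]

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Run virt-launcher as a child; the path and args here are illustrative.
	cmd := exec.Command("/usr/bin/virt-launcher", os.Args[1:]...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	// Give the child its own process group so we can signal the whole group later.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}

	err := cmd.Run() // blocks until virt-launcher exits, cleanly or not
	if err != nil {
		log.Printf("virt-launcher exited with error: %v", err)
	}

	// Best-effort cleanup: ask anything left in the child's process group
	// (e.g. a lingering QEMU) to terminate gracefully.
	_ = syscall.Kill(-cmd.Process.Pid, syscall.SIGTERM)

	if err != nil {
		os.Exit(1)
	}
}
```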
B
It's the same thing we do in all of our controllers, isn't it?
A
Okay, so do you think you have a path forward, with what was brought up about recover? Do you think that makes sense? Do you want to explore that?
B
Look at the recover command in Golang. I think we set a recover at the very top level of the hierarchy that will basically be a catch-all for everything; it's the very first line we have in the code. I would think that anything that causes a segfault from there would get caught. I could be wrong, but that's my expectation. It's not mine? Okay, maybe I'm totally wrong there, we should investigate that. I'm only like 50% sure now.
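[A minimal sketch of that idea; cleanupQemu is a hypothetical stand-in for the graceful teardown the forker does today. One caveat worth checking during the investigation: recover only works from a deferred function and only catches panics raised in the same goroutine, so it can only ever be best effort, as noted below:]

```go
package main

import (
	"fmt"
	"os"
)

// cleanupQemu is a hypothetical placeholder for gracefully tearing down the
// QEMU process before virt-launcher goes away.
func cleanupQemu() {
	fmt.Println("tearing down qemu as gracefully as possible")
}

func run() {
	// recover must be called inside a deferred function; it catches panics in
	// this goroutine only. SIGKILL/OOM kills or crashes of child processes are
	// not covered, which is why this remains best effort.
	defer func() {
		if r := recover(); r != nil {
			fmt.Fprintf(os.Stderr, "virt-launcher panicked: %v\n", r)
			cleanupQemu()
			os.Exit(1)
		}
	}()

	// ... normal virt-launcher work would run here ...
	panic("simulated crash") // trigger the cleanup path for demonstration
}

func main() {
	run()
}
```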
A
D
F
A
Yeah, if they don't mind. Well, so we could have something to recover from the panics, I guess, but like we said, that doesn't catch everything, right? What are the other cases we want to handle, like what if QEMU just crashes?
D
B
It's best effort. It's all best effort; it's just to maintain data consistency, I think.
A
Okay, does that make sense to you? Does that kind of satisfy this issue for you? You think you have it? Yeah.
A
Okay, all right, let's bubble up a little bit higher. We have this reduce-memory-overhead issue, and I think this is probably one of the items that can reduce the memory overhead there. So I guess what we could do is cross-tie the issues, and this could be one of them, but there are probably more. Maybe we can also do the profiling that we want to do.
A
Okay, all right, let's go to Marcelo with another evaluation report.
E
Yeah, so I ran the updated one with the main repository this week. It's running from 100 to 800 VMIs in the three clusters, and, I would say, maybe not to go into too much detail here right now, but something that's interesting here is from 800 VMIs.
E
Yeah, the first image now. Okay, the VM creation time. It has 600 and 800 as the two last runs, and the first one took about five minutes at the 95th percentile, you know, the worst case to create the VMI, and then with 800 it jumped to 10 minutes, so double the time to create the VMI. And it was not double the number of VMs, it went from 600 to 800. So I would say the creation here is not scaling very well.
A
Oh, I see it, okay. Six hundred, four hundred, two hundred, okay. So you're saying we double the time from six hundred to eight hundred: the slowest VMIs take ten minutes instead of five, so five minutes slower for the last two hundred roughly, or for the slower ones in that last run. Okay.
E
A
E
Yeah, we still have this stuck thread in the work queue that will need some investigation later; we can check that in the...
A
E
Yeah, but you see that it doesn't grow too much when we have more VMs being created; from 600 to 800 the difference is small. So it looks like it's not because of the scale of the VM creation, but because something is stuck in the code. That's the definition of unfinished work, the metric from Kubernetes: when this time grows, it means some threads are stuck. Not necessarily that that's what's happening in the code, but it might be that something is just running very slowly, something like that.
A
Okay, so we might be able to catch these with profiling, right? Maybe that's something we can look at. All right, I think this is a good one to create an issue about, and we can see if we can locate these stuck threads when we do some profiling.
A
Okay, let's see what else. Yeah.
E
Also, the work queue add rate, down here. I think we already have an issue for that, but it's still the case that the VM disruption budget controller is the most intensive one. We had some discussion about that previously here, and I don't know who mentioned that this controller, the disruption budget one, shouldn't actually be that intensive.
D
E
B
E
B
It shouldn't be; it shouldn't be creating that much work. Yeah, it'd be interesting to profile that. It's creating pod disruption budgets for virtual machines that have evictionStrategy: LiveMigrate, to ensure that the VMI can't be torn down while we're trying to drain a node, so we're forcing the eviction to fail and creating a live migration as a result.
G
B
A
Okay, all right, I can add this as another picture to that issue then. And then this one is kind of interesting: we have some virt-handlers that are a little bit higher.
A
You have a virt-controller up there too. What's virt-controller-vmi versus virt-controller-node? Is this just virt-controller, but different informers or different control loops or something inside virt-controller?
A
C
E
A
From the smallest to the largest, and they're all 10 seconds.
A
Let's see, so virt-controller-node gets pretty large, and then second place is virt-handler-vm, which looks like it's down here, pretty tiny. So virt-controller-node has a big retry rate.
A
Yeah, this one might be a good picture to go into the efficiency issue as another data point.
C
I think I've seen that retry rate before. I think it's when both virt-controller and virt-handler want to update the node object with labels, so whoever wins, the other has to back off and retry, or something along those lines. I saw lots of retries as well, anyway.
A
I'd have to look at our diagram again, and we've talked about that there, but that is kind of interesting. I mean, this is three nodes, yeah?
E
A
Yeah, with 100 nodes, if we're seeing this, it could jump quite a bit.
A
Okay,
I'm
going
to
add
I'll.
Add
these
to
the
to
that,
like
catch-all
card
that
has
like
the
efficiency
of
the
control
loop
efficiency,
whatever
it
is,
that
I
remember
know
what
it's
called
that
shoes
I'll
find
it
on
that
we
have
it
in
six
scale
document
but
I'll
add
those
to
it,
but
it
has
additional
data
points.
Okay,
unfinished
work.
A
We kind of saw this earlier; I know we already talked about it. So we have a bunch of unfinished work. The node and VMI controllers do not have much, but still, it's 37 seconds, eight minutes. I guess this is the total count, right? The total amount of time spent waiting or stuck; it's an accumulation, not just one thing. I guess it just depends how the metric is put together.
A
Okay, memory. We jump up here.
A
I wonder, Marcelo, after you delete right here, how long this takes to go down, what it looks like over time. We can see a slight dip here, but it's still kind of high and we've deleted the VMs at this point. Where's the count, where's our VMI count to compare, right here?
E
What seems to be increasing is the purple one, actually. We can...
A
Okay,
I
was
looking
at
them
as
if
they
were
they
were
combined,
but
I
think
the
purple
one
is
the
one
like.
So
it
goes
at
a
max
of
484,
but
it's
actually
like.
We
see
good
peaks
and
it
looks
like
it
comes
down
to
pretty
close
to
what
baseline
was
so
that
actually
looks
fine,
because
these
are
like
again
we're
saying
these
are
stacked.
Then
this
is
then
they're
all
doing
that
some
of
these
aren't
as
well,
but
it
looks
like
they
eventually
get
there.
B
Memory being what? Is that metric the memory of the process, or is it what Golang reports? Memory is weird.
A
E
B
So threads get spun off by the Go runtime depending on how many goroutines there are; is that accurate? I think that makes sense to me, right?
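[A quick way to sanity-check that from inside the process is to dump the Go runtime's own numbers and compare them with the container metric; a minimal sketch (the dashboard value may well be container RSS, which is not the same as any of these):]

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	// HeapAlloc is live heap memory; Sys is the total memory the Go runtime has
	// obtained from the OS, which includes goroutine stacks and spare heap.
	fmt.Printf("goroutines:      %d\n", runtime.NumGoroutine())
	fmt.Printf("threads created: %d\n", pprof.Lookup("threadcreate").Count())
	fmt.Printf("heap alloc:      %d bytes\n", m.HeapAlloc)
	fmt.Printf("runtime sys:     %d bytes\n", m.Sys)
}
```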
D
B
E
B
B
E
And as I mentioned before with Ryan, the etcd latency, you see it's fine, everything is under 10 milliseconds. I think you were doing some evaluation where there was a huge spike in etcd latency before, weren't you?
F
E
It's related to the previous one, also for the storage operations. I have another file showing what happens when I increase the rate limit, actually the burst and the queries per second. We can see here that, as I mentioned, deleting the VMs in this experiment had very low performance: when I delete the VMs it sometimes takes hours, or they don't delete at all, so I need to force the deletion of the VMs later.
E
By hand, you know. And this is directly related to the number of storage operation errors; you can see them in the figure on the right, the storage operation errors. The VMs that we're creating are using ephemeral volumes with emptyDir; maybe different volume types have different performance, but that's what we have here, and there are a lot of emptyDir errors where it gets stuck trying to unmount, and it doesn't delete the pod and it remains forever.
E
I don't know exactly what's related to that, but I actually don't see any VM deletion problem anymore after increasing this rate limit. So maybe we can go to the next link.
A
F
E
Yeah, I'm going to show it, okay. If you see here... no, this is not the one either. Sorry, it's not displaying very well.
A
E
Just a little bit more... oh yes, okay. So this is the configuration that I changed: I increased the burst to 100 for these components here, and the queries per second to 50. I think the defaults were five and ten, something like that, before. Maybe a question for Roman: are there other components here that we should also tune, or are these all the components? And the other question is, when I changed that, only virt-controller rebooted, well, restarted.
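[For reference, those two numbers map to the QPS and burst of the client-go rate limiter that each component's API client uses; a minimal sketch of what raising them means at the client level, using the same values mentioned above (how KubeVirt wires this through its own configuration is a separate question):]

```go
package main

import (
	"fmt"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load a kubeconfig; in-cluster config would be tuned the same way.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	// client-go throttles every request from this client: QPS is the sustained
	// rate and Burst is how far it may briefly exceed it. The historical
	// defaults (5 QPS / 10 burst) are what show up as client-side rate limiting
	// once a controller has a lot of work to do.
	config.QPS = 50
	config.Burst = 100
	fmt.Printf("using QPS=%v burst=%v\n", config.QPS, config.Burst)

	// Any clientset or KubeVirt client built from this rest.Config inherits the limits.
	_ = rest.CopyConfig(config)
}
```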
E
The pod restarted, but virt-handler didn't restart, so I don't know if it actually got the configuration, or if it doesn't need to restart to pick up this configuration anyway. That was my doubt when I applied it.
E
Yes. I didn't have time to put all the figures here, but the point is that it didn't change the VM creation time, and it also didn't remove all of the rate limiting. With this configuration it improved: it's not 500 milliseconds anymore, it's actually 250. If you go to the other figure, you see it's still there.
E
Exactly. So, surprisingly, I would say... I don't know what the relationship is there, but it's interesting, I would say.
A
I see, and yeah, here we got none, and then while it's over here we get quite a few, okay. And also by comparison, yeah, okay, I see, interesting. So yeah, it's a thinner graph, interesting, and none for that rate. Okay.
A
I wonder if it was literally one of these that did this. I'm guessing it's maybe the handler.
A
Cool, okay. And then did you want to talk at all about this one? This was the other one I saw, this one, I think.
E
Yeah, this is the one, yeah, let me go through it very quickly. This was the experiment of trying to run 500 VMs, and it was actually hard to do. I increased the timeouts for QEMU that we had before, and then many other things, changed the number of devices and so on, but in the end I did create 500 VMs at once, and actually the VMs...
E
It's
when
it's
get
like
close
to
the
cpu,
very,
very
close
to
80
or
90
percent
of
the
cpu
utilization
in
the
old,
and
also,
I
would
say,
not
also
eight
or
nine
percent
of
memorization
and
all
things
get
very
nasty.
So
the
operations
start
to
kill
you,
even
though
it
has
one
giga
of
memory.
You
know
you
know:
version
systems
start
to
kill
some,
some.
No
some
containers
and
the
via.
So
the
the
the
operation
system
also
is
very
slow.
E
It's
it's.
I
saw
I
log
into
the
node
and
even
though
it
drops
the
cpu
utilization,
for
example,
250
something,
but
it's
everything
is
very
slow.
I
see
a
lot
of
I
animation
interrupts.
You
know
calls
in
the
kernel
and
and
some
containers
being
killed,
but
I'm
not
I'm
not
sure
exactly
what's
happening,
but
when
it's
the
it's
saturated,
the
node,
it's
get
like
unstable.
That's
the
what
I'm
saying
I
also
test
with
three
different
runtimes:
the
docker
container
g
and
cryo
and
container
d
had
better
performance.
E
It's
the
cryo
and
docker
were
tying
me
out
to
create
the
containers.
With
far
you
know,
much
less
could
create
less
vms,
less
pods,
I
would
say,
and
and
then
I
could
create
safety
550.
E
A
So this creation time, Marcelo, that we're seeing here: it looks like fifty, a hundred, two hundred, and then from two hundred all the way to four hundred it's almost the same. It's like we hit a threshold here and then kind of leveled off, which is kind of interesting.
E
B
Let me double-check this, yeah, so if that's...
A
B
We can add more buckets there, or revise the buckets. I based them on what I thought would be realistic; well, that's an indication.
B
I mean, what's more data going to give us here? It's pretty terrible if it takes 10 minutes to get a VMI to running. And is that the p99 or something, or is that the average?
B
Okay, okay. Well, that's terrible. So I don't know if it matters whether we add more buckets; we need to figure out why it's taking so long.
A
Whatever we call that, I think this would be another one. I'll add it, if I don't have it already, to our profiling list, Marcelo, as a thing to look at, since you're kind of doing this already. It would be cool, when you hit right here, I mean, this seems like it's going to be exponential; I'm guessing we're going to be up here.
A
Yeah, you can actually see that this peak is not much of a difference, which is interesting, and then we basically double. This unfinished work metric is really an interesting one; maybe I'll attach these unfinished-work ones to that profiling issue. I think getting this information is going to be really helpful, it might give us some good stuff.
A
Okay, pretty cool. There are a lot of good charts in here, thanks Marcelo. All right, any last finishing words? We're at time, if anyone wants to bring anything up.
E
Yeah, just a very quick update on the continuous performance evaluation jobs there. The job's metrics were not being collected. I'm working on that with Frederico; we are debugging the configuration to see what's happening, why the metrics were not being collected. What we have is a global Prometheus that's running in the cluster, and then there is another cluster that is created for the job to run.
E
You
know
the
tasks
and
that
also
creates
a
promises,
and
this
global
parameters
needs
to
collect
the
metrics
of
this
local
prometeus
and
it
was
not
being
collected
so
so
so
we
don't
have
results
for
that,
and
the
new
graffana
dashboard
is
there.
I
think
maybe
we
will
have
it
married
soon.
So
let's
hope
for
that
yeah.
A
Awesome, yeah. I was going to say I have some ideas of what we could add to it, so that's awesome. Okay, all right everybody, have a good day, talk to y'all later, bye.