From YouTube: SIG - Performance and scale 2022-06-16
Description
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.tybh
A
Okay, this is SIG Scale. It's June 16th, 2022. I'll share the link to the notes in chat, and please add yourself as an attendee. Okay, first thing: so, Marcelo... yeah, this is actually good, I want to review these. So first is... which one, the QPS one? Yeah. So this one, this experiment. Marcelo, why don't you talk through it, do a little high-level overview, and then we can have a discussion on it.
B
Okay, so, well, the experiment is to test the VM creation latency. The VM object is, well, the object used to, you know, start and stop...
B
...the VMI. And I'm following the whole workflow: I create the VM object, the VM object will automatically create the VMI object, and then, of course, the VMI creates the pod. The focus here is on the VM creation time. It's interesting because I wasn't seeing this huge latency in the VMI creation time; it shows up in the VM creation time.
B
So
that's
that's
the
basically
the
experience,
the
experiment
and
the
experiment
it's
actually
in
is
medium-sized
cluster
or
maybe
can
be
considered
smaller.
You
know
it
has
12
worker
nodes
and
three
masternodes
and
I
am
creating
you
know.
1200
vms,
I'm
also
doing
another
experiments
in
the
report
there,
that
is,
to
creating
more
than
2
000,
vms
and
and
it's
which
means
it's
100,
vms
per
node
and
then
200
vms
per
hour.
So
and
the
the
experiment,
I'm
focusing,
I
tried,
I
changed.
B
...actually, you know, the queries per second (QPS) and burst of other components, but the one that makes the most difference is the virt-controller. So the PR is focusing on the virt-controller, especially to, you know, narrow down the discussion. So I'm changing the virt-controller QPS and burst from the default configuration, which is 20 and 30, and then I increase that to 100, then 200, and then 400 queries per second, to see how it impacts performance.
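To make the knob concrete: in a client-go based controller like virt-controller, QPS and burst are fields on the rest.Config the clientset is built from. A minimal sketch, assuming the standard client-go wiring rather than KubeVirt's exact code path; the 200/400 values are just the numbers under discussion:

```go
// Sketch: client-go rate limiting lives on rest.Config. Every clientset
// built from this config shares the resulting token-bucket limiter.
package client

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func NewTunedClient(cfg *rest.Config) (*kubernetes.Clientset, error) {
	cfg.QPS = 200   // sustained client-side requests per second (default here was 20)
	cfg.Burst = 400 // short-term burst above QPS (default here was 30)
	return kubernetes.NewForConfig(cfg)
}
```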
B
Especially because I was seeing, you know, some huge VM creation times when creating 1,000 VMs at the same time, like 22 minutes, which is not acceptable. And then, I should say, the latency also impacts the throughput, so how many VMs are created per unit of time. So it turns out that maybe... not maybe, but one of the reasons it's taking a lot of time to create the VMs... okay.
B
So there are two things here: the time to create the whole batch, for example how long it takes to create one thousand VMs, and then the per-VM time, and I'm also analyzing that here. So, of course, since creating one VM is very slow, the total time, if we're considering the whole amount of VMs being created...
B
It's
also
very
big
in
the
first
scenario
the
default
one
is
this:
is
the
default
configuration
okay
and
and
then
it
turns
out
that,
with
the
current
you
know,
configuration
burst
configuration
it
scan,
creates
only
one
vm
per
second.
So
in
it's,
there
is
a
bottleneck.
B
So
it's
it's
because
the
vm,
the
it's,
not
the
vmware
okay,
so
it's
the
vm
controller,
it's
not
being
able
to
do
too
many
requests.
So
just
comment.
This
query.
You
know
20
quarters
per
second
here
it's
shared,
be
in
between
all
the
controllers
in
the
bridge
controller.
So
internally,
in
the
virtual
controller,
it
has
the
vm
vmi.
B
You
know
node,
and
I
don't
remember
now
where
I
list
there.
So
there
are
many
other.
You
know
cues.
That
means
different
controllers
controlling
different
things
and
all
of
them
share
the
same
pairs
per
second
configuration.
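A hedged illustration of that sharing, using plain client-go types (the controller names are placeholders, not KubeVirt's actual structure): one rate limiter set on the shared rest.Config backs every client derived from it, so all controllers draw on the same 20/30 budget.

```go
// Sketch: one token-bucket limiter on the shared rest.Config means the
// "VM", "VMI", node, etc. controllers all compete for the same QPS budget.
package client

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/util/flowcontrol"
)

func SharedClients(cfg *rest.Config) (vm, vmi *kubernetes.Clientset, err error) {
	cfg.RateLimiter = flowcontrol.NewTokenBucketRateLimiter(20, 30) // one limiter for everything

	if vm, err = kubernetes.NewForConfig(cfg); err != nil { // used by one controller
		return nil, nil, err
	}
	vmi, err = kubernetes.NewForConfig(cfg) // used by another; same limiter instance
	return vm, vmi, err
}
```

Giving each controller its own budget, the option raised next, would mean a separate rest.Config or RateLimiter per controller instead of this shared one.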
B
Another
option
could
be
maybe
have
separate
characters
per
second
for
each
of
them,
and
then
we
have
more
control
of
that.
But
the
way
that
it's
implemented
now
it's
something
that
shared
between
everything
with
something
it's
internal
to
the
virtual
controller
yeah
it's
here,
so
you
can
see
highlighted
here
all
the
controllers.
B
You
know
evacuation,
so
it
will
also
impact
like
migration
that
you
know
at
some
point.
B
So
when
I
increase
that,
I
put,
you
know
for
the
maximum
of
course,
this
pr
after
some
discussion,
I'm
not
going
to
increase
to
the
maximum
curse
per
second
that
I
tried
because
some
concerns
I
I'm
going
to
get
a
middle
ground
with
something
in
the
middle
of
that,
but
with
the
maximum
carriage
per
second
that
I
test
400
and
600
for
burst,
I
could
get
up
to
17
gm's
being
created
at
the
same
time
and
note
that
my
test
I
configured
to
create
20
per
second.
B
Yeah
and
and
then
the
question
here
was
okay,
so
what
happens
if
I
increase
that?
Isn't
it
in
the
cluster
so
to
understand
that
we
need
to
analyze
first
the
number
of
requests
that
it's
been
generated
and
they
also
the
number
of
in-flight
requests.
The
current
things
request.
That
is,
the
current
request
that
is
being
processed
in
the
work
api,
because
using
flight
request
is
the
most
important
one
because
it
takes
let's
assume,
for
example,
it's
not
happening
okay,
but
we
need
to
understand
that,
let's
assume
a
scenario
that
could
be.
B
It's
overloading
divert
api,
getting
all
the
requests
that
vertex
they're
sorry,
api
server,
getting
all
the
requests
that
the
api
server
could
get
and
then
other
controllers
could
not
access
that.
I
know
that
we
can
have
priority
and
fairness
openshift
has
by
default.
Kubernetes
it's
you
know,
it's
still,
I
think
alpha
or
better,
but
it
will
have
priority
and
fairness,
just
something
that
we
will
improve
in
the
future,
because
it's
not
there
yet.
But
the
point
we
need
to
understand
how
is
cooper,
increasing
this
overloading
the
api
server.
B
It
turns
out
that
the
current
requests
that
are
being
processed
per
second,
it's
only
40.
and
the
default
ap
a
maximum
in-flight
request
in
the
api
server.
It's
500.
so
just
understand
what
does
it
means
it
means
when
we
see
here.
You
know
the
the
api
request
total.
Then
we
see
800
here,
isn't
it,
for
example,
the
maximum
scenario?
B
It
means
that
it
is
started.
You
know,
800
requests
and
it's
waiting
for
800
requests
so
and
summer
request
takes
more
than
one
second,
and
it
means
that
the
occupancy
you
know
of
in
here
it
will
be
more
than
100
seconds,
but
the
the
number
of
requests
that
are
being
processed
at
the
same
time
in
the
in
the
api
server,
it's
only
40,
which
means
we
are
not
getting
all
the
the
api
service
still
has
a
lot
of
room
to.
B
You
know
to
reply
for
requests,
because
summer
of
requests
here
takes
a
lot
of
time.
That's
the
point,
and,
and
I'm
running
a
system
that
has
you
know
it's
very
powerful,
cpus
very
fast.
You
know
machine
with
any
energy
ssds,
so
in
the
the
other
scenarios
means
in
slower
cluster.
It's
what
takes
even
you
know
more
time
to
maybe
process
some
lease
operation
or
request
like
that
and
then
we'll
take
so
our
post
operation,
because
the
pulse,
for
example,
create
a
gpu,
sorry
create
a
vm.
B
It
goes
through
a
lot
of
process
and
then
it
takes
time
to
process
that
so
and
that's
what's
happening
here.
So
it's
I'm
just
saying
that
the
odds
are
now.
Let's
just
say
we
are,
even
though
we
increase
that
we
see
an
increase
in
the
number
of
requests.
The
api
is
surveying.
This
is
on
a
safe
mark.
Okay,
that's
that's
the
explanation
of
these
two
figures
and
then
another
question
is
okay.
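For anyone reproducing this check, a hedged sketch of pulling the two signals being compared, the total request rate and the concurrent in-flight count, with the Prometheus Go client. The metric names are the standard kube-apiserver ones; whether this dashboard used these exact queries is an assumption:

```go
// Sketch: query the two apiserver signals discussed above. Assumes a
// reachable Prometheus; apiserver_request_total and
// apiserver_current_inflight_requests are standard kube-apiserver metrics.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"}) // placeholder address
	if err != nil {
		panic(err)
	}
	prom := promv1.NewAPI(client)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Requests issued per second (the ~800 figure in the discussion).
	rate, _, err := prom.Query(ctx, `sum(rate(apiserver_request_total[1m]))`, time.Now())
	if err != nil {
		panic(err)
	}
	// Requests actually being processed concurrently (the ~40 figure),
	// bounded by the apiserver's max in-flight setting (500 here).
	inflight, _, err := prom.Query(ctx, `sum(apiserver_current_inflight_requests)`, time.Now())
	if err != nil {
		panic(err)
	}
	fmt.Println("request rate:", rate, "in-flight:", inflight)
}
```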
B
So
what's
the
impact
in
this
resource
utilization
and-
and
here
is
the
virtual
controller,
of
course
it's
increased
the
restrictionization
from
few.
You
know
cp
utilization
to
at
least
one
or
you
know
one
and
a
half
core
in
the
system.
It's
it
has
some.
You
know
the
cpu
has
some
high
frequency,
okay,
it's
a
powerful
cpu,
but
I
wouldn't
expect
to
take
like
more
than
two
cpus
and
especially
because
we
are
going
to
the
scenario
that
here
that
took
only
you
know,
100
percent.
B
Here
it
means
one
one
cpu,
so
it's
different
controller
will
be
using
only
one
full
cpu,
which
I'm
considering
also
to
be
okay,
because
it's
an
extreme
scenario
that
we
create
1000
vm
and
we
enable
it
to
scale.
You
know
in
a
reasonable
throughput
and
perform
and
latency
and
the
other.
The
other
thing
here
is
to
show
the
impact
in
the
work
queue.
We
have
some
prs
before
some
discussion,
especially
in
the
beginning
of
the
sixth
scale,
about
the
performance
of
the
work
cube,
and
we
were
not
understanding.
B
We
have
a
lot
of
you
know
a
lot
of
discussion
to
maybe
to
create,
trace,
to
understand
that
better
and
turns
out
that
when
I
increase
the
queries
per
second
with
the
high
value
here,
it's
definitely
improved
a
lot.
B
The
work
queue
so
the
the
issue
that
a
guy
from
new
videos
presented
like
a
long
time
ago
that
some
keys
were
processing
very
slow
in
their
vert
controller,
and
he
was
actually
also
you
know,
proposing
some
another
approach
to
bypass
work
queue.
Something
like
that.
It
turns
out
that
only
increasing
the
cars
per
second
and
burst.
B
It's
you
know
eliminate
the
problem
here,
so
we
can
see
the
longest
running
process
drops
to
less
than
one
second
here
you
know,
and
and
very
few
of
them
you
know
considering
the
the
longest
running
process
here
in
the
first
scenario
default
one.
We
have
a
lot
of
them
that
it
takes
six
seconds.
You
know
to
process
a
key
which
is
very
slow
and
then,
when
we
go
to
the
best
scenario
here
well
at
least
the
scenario
that
the
maximum
scenario
that
I
test-
it's
drops
a
lot.
A
Did you... do you see... I wonder if you'd see this in the traces; you should be over the threshold, which I have at one second by default in the tracing.
A
It would be interesting to see what it also shows in the logs, because I understand this metric and what you're showing here, but I'd be curious to see, because the metric is granular enough that we should be able to see exactly where in the work queue this is being slowed down. If it's a specific call we're making, it would be good to find it.
A
There's a lot of really good data here, and, like you said, we at NVIDIA have done some experiments like this, and we saw that QPS was the biggest influencer on improving performance. But the thing that's really interesting to me, at least my takeaway from this...
A
Is
that
when,
when
you
talked
about
like
okay,
so,
like
you
know,
800
requests
or
whatever
500
big
fight
requests
whatever
like
it's
it's
well,
you
know
I'm
trying
to
like
picture
in
my
mind
like
what
is
like
the.
What
should
kuvert's
footprint
be
in
in
the
in
your
kubernetes
cluster
like?
Should
it
be?
A
Should
it
take
up
like
half,
should
it
be
like
be
able
to
take
up
half
of
the
apr
service
requests,
or
should
it
be
lower,
like
you
know
what
is
like,
you
know,
what's
the
what's
our
right,
the
right
approach
like
which
what
should
be
like
the
right
way?
We
look
at
this,
it's
very
possible
that
the
defaults
for
kubernetes
should
should
be
like
hey
like
we
should
need
to
take
up
half
the
pay
request.
A
You
know
for
the
api
server,
but
I'm
wondering
if
it
could
be
lower,
like
I'm,
I'm
kind
of
interested
in
seeing
like
if,
because
the
data
you're
pointing
out
here
is
actually
like
it
to
me,
it
seems
like
you,
you're
you're,
hitting
some
some
bugs
like
the
bug
isn't
necessarily
like.
I
agree
that
the
qps
burst
is
probably
a
little
low
as
the
default
with
whatever
it
is,
10
or
20..
A
It
probably
should
be
higher,
but
but
it's
also
when
you
go
to
the
high
end
it
and
the
way
that
it
affects
the
effects
it
has
like
how
much
important
improvements
in
performance
makes
me
think
that
maybe
we're
just
making
too
many
requests
like
it
just
seems
like
it
seems
like
we're
we're
a
little
too
active
with
the
api
server
like
it
seems
like
we
may
be,
I'm
wondering
if,
like
you
know,
if
there's
a
way
we
like,
we
can
end
up
with
the
same
performance
at
around
100
or
less
than
100,
or
something
because
that
seems
to
be
like.
A
To
me,
it
just
sounds
like
it
just
seems
like
we're
using
a
lot
like
like
400,
because
you
know
like
thing
is
this
argument
can
go
on
forever
right.
We
could
say
why
not
600,
why
not
800?
You
know
why?
Don't
we
just
use
the
whole?
We
don't
want
people
to
do
that
like
we
want.
We
don't
want
people
to
have
to
say,
like
oh
I'll,
just
increase
the
qps
and
burst
forever.
You
know
and
then
also
not
get
my
performance
that
I
need.
A
We
don't
want
that
we
want
to
you
know
we
want
to
reduce
it
as
as
much
as
possible
and
that's
like
you're
exposed
to
like
a
lot
of.
I
think
you
exposed
probably
multiple
bugs
here,
the
like
the
number
of
put
requests
per.
Second,
these
things
like
seem
very
high,
and
maybe
we
can
lower
them
and
then.
B
Yeah... oh, first I want to discuss two things. Sure, can you scroll up?
B
Just to comment about, yeah, the queries per second: you know, when we are seeing the 500, the 500 means... what's the amount... now, you can go here to the figure, yeah, the one below, yeah, that's the green one, yeah. So here is the number of requests in the API server, the real number being processed at that moment. So the API server has this maximum in-flight requests setting, and the default is 500.
B
You know, around 30 requests per second. But what it means, when we go to the 500 or 800 that we're showing, is that we issued more requests; you know, the components were able to request more things from the API server. However, some of these requests are waiting, are pending, you know, because they are being processed somewhere else; the in-flight number means it was actually answering the requests. You know, it means we are not overloading, we're not impacting...
B
You
know
the
vertipi
server.
You
know
in
a
bad
way.
It's
it's!
It's
just
safe
to
increase
to
this
number
that
that's
what
I'm
describing
here
and
you're
right.
We
can
improve.
You
know,
convert
based
on
this,
the
request,
if
you,
if
you
wrote
down,
I
replied
david
quite
very
in
the
end
yeah
yeah
here,
so
david
actually
asked
the
same
thing
that
you
mentioned.
B
What's
the
number
of
requests
per
vm,
more
or
less
like
that
and
okay,
I
don't
have
it
like
specific
per
vm,
especially
because
we
can,
as
you
mentioned,
so
I
didn't
remove
all
the
to
to
get
the
exactly
number
that
it's
been
doing.
B
Maybe
we
need
to
get
like
remove
all
the
you
know,
cars
per
second
restrictions,
and
then
it
will
do
the
maximum
that
group
vert
will
need,
and
then
we
can
understand,
but
I
think
we
are
close
to
the
to
the.
Maybe
we
are
close
to
the
limit
there
for
the
the
last
scenario.
B
In
any
case
it
was
able
to
create.
You
know
it's
requested.
You
know,
22
000
put
requests.
So
since
we
I
created,
you
know
more
or
less
one.
One
thousand
two
hundred
you
know
vms.
It
means
you
know
there
is
some
rounding
here
in
bermuda's,
so
but
it
will
be
approx
12.
You
know,
19,
put
requests
for
the
virtual
controller,
six,
poles,
four
packs
and
very
few
get
so
and
considering
that
when
we
create
a
vm
object,
it
create
the
vm.
You
know:
do
some
you
know
request
to
create
the
vmi.
B
It's
request
then
request
to
create
the
pods.
I
see
that
19
put
maybe
a
little
bit
high,
but
it's
still
fine.
I
don't
know
we
can
we
can
get
this.
You
know
impression
from
more
folks
about
that.
You
know
what
they
think
about.
I
I
don't
know
I
I
think
that
maybe
19
is
high,
but
it's
doing
a
lot
of
things
to
create
a
vm.
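As a quick sanity check on that per-VM figure, using the two totals he quotes (22,000 puts and roughly 1,200 VMs):

\[
\frac{22{,}000 \ \text{PUTs}}{1{,}200 \ \text{VMs}} \approx 18.3 \approx 19 \ \text{PUTs per VM}
\]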
B
We can maybe go over it again... so you had created that sequence diagram for creating a VMI, isn't it? Or...
B
Maybe it would be nice to, you know, expand it to the VM, and then we can, you know, try to figure out where the put, you know, and post... where the put requests are actually coming from.
A
Yeah
we
can
do,
we
could
do
nothing
yeah.
I
think
what's
interesting,
so
I
always
you
know
with
what
you
have
like
19,
that's
interesting.
I
like
it
does
like.
I
was
saying
like.
Maybe
it
is
a
sensible
number,
maybe
you
know
whatever
of
any
amount,
some
sensible
number
I
mean.
I
think,
like
that.
You
write
that,
like
you
know
your
graph
about
the
api
server
and
the
weights
be
able
to
handle
it
like
in
theory
right
we
could
always.
A
We
could
always
increase
api
servers
because
resources,
you
know
so
on
and
so
forth
and
eventually
will
be.
We
could
serve
the
load.
I
think
it.
I
think
it's
totally
possible.
It's
really
reasonable
yeah.
It's
just
sort
of
the
question
of
like
you
know.
What's
the
right
default,
I
think
that's!
You
know
one
important
question.
You
know
what
is
our
default
workload?
A
You
know
what
what
should
the
right
default
be
and
then,
in
the
case
of
like
your
example
like,
for
example,
sort
of
outside
of
like
what
the
average
person
is
doing
well,
we
need
to
then
document
it
like
we
need
to.
A
We
need
to
like
you
know
your
experiment,
highlights
the
importance
of
how
you
can
achieve
performance,
because
if
you
you're,
someone
comes
along
with
your
exact
use
case,
you
know
like
like
you're,
showing
here
they're,
not
going
to
achieve
nearly
the
amount
of
performance
that
they
should
be,
and
so
these
are
kind
of
this
is
that
whole
other
area,
where,
like
the
slo,
is
document
where
we
should
spend
a
bunch
of
time
finding
these
things
and
and
documenting
them.
So
like.
A
Like finding the defaults and documenting them. And then the third thing I'd say is, maybe we can, you know, reduce some of this stuff. I mean, I think the experiment we did a long time ago was trying to skip things in the work queue. Maybe the right thing to do, instead of skipping things in the work queue, is to skip, or not do, put requests immediately, or something; maybe just to combine them. Maybe it's to skip requests.
A
Maybe
that's
you
know,
sort
of
the
same
idea
just
you
know
just
a
little
nuance
like,
maybe
it's
so
instead
of
looking
at
it
as
like,
you
know,
you
know
skipping
steps,
let's
just
maybe
we
can
reduce
these
because
it
would
be
interesting
to
see
like
this
could
be
easy
to
tell
if
we
were
to
reduce
this
value
by
one
right.
This
value
should
go
down
significantly
in
your
experiment
and
we
should
see
pretty
quickly.
You
know
these
numbers
should
decline
like
and
we
can.
A
We
should
be
able
to
easily
measure
like
the
performance,
just
with
one
less
put
request.
So
it
would
be
interesting
experiment
because
of
how
impactful,
just
one
request,
or
even
two
or
three
could
be
on
like
you're
on
your
overall
performance.
So
that
would
be
really
interesting
to
see,
because
if
we,
because
I
mean
even
if
we
were
to
scrap
just
one
of
those
requests
it
would,
I
think
it
would
be
incredibly
valuable.
Probably
even
like
you
know,
20
to
30
qps
of
value,
just
by
you
know,
maybe
reducing
one
of
these.
B
So, for example, if we go to the scenario... the maximum scenario, okay. We have, like... I have different configurations here, but considering this one, just this one here:
B
So in the virt-controller, for example, the node controller... I was not expecting its work queue to be like that: it was 80 operations per second for the retry rate. So there is probably, definitely, a bug in the virt-controller's node controller; it shouldn't be retrying to process a key that much, isn't it?
A
I wonder if the tracing will pick this up, or, if it doesn't, whether there's something we can do to improve it, because it would really be interesting to see the tracing on this. Like we talked about earlier, the amount of time spent in the work queue: maybe it's the retries that are causing it. I hope that gets picked up by the tracing, but maybe it doesn't, and that would be interesting. That would be another thing to look at; it would be really interesting.
A
I guess if we don't see anything in the tracing, then it's all in this retry, and we probably aren't supporting retries in the tracing and need to add it, and I bet we could find something, because, yeah, that's really interesting. Though, like, I forget what happens in the retry.
A
I think if we fail during one of the steps, we just, you know, send ourselves back to the queue, and it's a rate-limiting queue, so whatever the default rate-limiting time is, we wait and try again. I mean, eight retries, though... that's... yeah, probably most of your time, maybe half that time, is waiting in the rate-limiting queue.
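For context, a minimal sketch of the standard client-go rate-limited work-queue pattern being described here: a failed key goes back on the queue with backoff, which is where both the waiting time and the retry-rate metric come from. The maxRetries budget is illustrative, not KubeVirt's actual value:

```go
// Sketch of the usual controller retry loop: failed keys are requeued with
// exponential backoff and only forgotten on success or when the retry
// budget runs out. A high workqueue retry rate (like the 80 ops/sec seen
// on the node controller) means this error path is being hit constantly.
package controller

import "k8s.io/client-go/util/workqueue"

const maxRetries = 5 // illustrative, not KubeVirt's setting

func processNextItem(queue workqueue.RateLimitingInterface, sync func(key string) error) bool {
	item, shutdown := queue.Get()
	if shutdown {
		return false
	}
	defer queue.Done(item)

	if err := sync(item.(string)); err != nil {
		if queue.NumRequeues(item) < maxRetries {
			queue.AddRateLimited(item) // wait out the backoff, then retry
			return true
		}
	}
	queue.Forget(item) // success, or budget exhausted: reset the backoff counter
	return true
}
```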
B
I think someone wrote some questions.
D
Oh yeah, that was me. Can you hear me? Yes? Yeah, okay. I was just piggybacking on Ryan's point, on the previous point, that is, the 19 put requests. Let's say the ideal number is not 19, right; say the ideal number is 15, or 10. How much performance improvement do we get if we bring that number down to 15 or 10? That was the question I had.
B
Yeah, I think, you know, it would improve the overall thing, especially the throughput we were seeing. For latency it's tricky; it's not easy to say, because there are many queues involved across different components, and also I don't know if the latency is coming from the put operations; it might be. But it's hard to say without, you know, testing.
D
And then another question I had is: when you were showing those numbers for the API server, right, that it has a total quota of 500 requests that it can process, but it is only going up to 40... So, sorry, I'm new to KubeVirt, but KubeVirt has its own aggregated API server, right?
D
I see, so this was against the actual kube API and not the virt API.
A
Okay, so let me close this point. So, Marcelo, here are the things I think we'll follow up on, if that makes sense. First: fine, let's find, you know, the right defaults, like a good balance.
B
I got some feedback, you know, and I defined... actually, I put two hundred and four hundred.
B
You know... well, it's not that it will be doing that amount of requests, but it's configured to support that amount of requests. So I'm saying it's not, like, a very insane value. And the other thing is, I checked there, you know, that it's not impacting the API server too much; we're still, like, within a safe margin. And, yeah, I forgot the third one...
B
The
other
point
that
I
was
going
to
say,
but
I
think
just
this
is
a
good
value
so,
based
on
the
experiments
that
I
did
and
yeah
okay,
someone
wants
to
say.
D
Yeah, sorry, you know, this is Alay again. So, in the past, whenever I... whenever our team had done these kinds of performance runs, the guidance I received from the API server folks was that 100 QPS and a thousand burst could be okay, and maybe 200 and a thousand burst is also okay; those are the defaults that we were running our controllers with. So I just wanted to give a data point.
B
Oh, this is good; it's more or less in line with what I'm suggesting here, so yeah.
A
Well, do you have any issues that we can point to, or, like, a mailing thread that we can point to, that talks about that? Just in case, as additional evidence for why we should go with a number like this.
A
Yeah, that makes sense. Okay, yeah, that'd be cool. So let's see, let's go to the second point: documenting QPS based on performance and scale requirements. I think this will be... we'll just need to... I think that's some of the stuff I have in that SLOs doc; we'll just need to refine that a little bit. I think eventually we'll find a place for this, I guess is the point; I think there's just going to be a place where, you know, this is configurable.
A
This
is
valuable
to
documents
just
because,
like
let's
just
say
you
only
run
vmis
like
you
know,
it
makes
sense
like
you
just
want
to
give
it
as
much
access
to
the
api
server
as
possible.
Like
you
know,
what's
you
know?
What
is
it
when?
When
should
you
expect
to
do
that?
I
think
those
are
like
good
questions
that
you
know
good
answers
that
we
can
provide
to
those
questions.
Yeah,
okay,.
B
Maybe, like, write a blog post on the KubeVirt blog or somewhere, and then...
A
Okay, yeah, that sounds cool. Okay, and then the last one: reduce the put requests. This would be an awesome experiment to do, just because of how... this is a small number, and if we reduce it by one, we should see some fast improvement, so that would be cool if there's a way we could do it. I think I saw in the chat, or in that PR, that the number of requests can be caused by conflicts, yeah. We've seen this with the number of put requests.
A
We've
seen
a
lot
of
conflicts.
We've
had
some
code
around
to
improve
this
over
time.
I
don't
know
marcelo
you're,
just
you're
dying.
Does
your
diagram
show
the
number
of
conflicts,
because
it
would
be
interesting
to
see
if
you
know
we
might
be
able
to.
A
...be able to pinpoint, like, when we do the put requests, whether that's why we're seeing... whether the puts are having conflicts, and that's maybe why we're getting retries, or why it's requiring 19 on average. Because, you know, it could be any of those things; it could also have to do with the work queue; it could obviously have to do with the number of requests, I think.
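If conflicts do turn out to inflate the PUT count, the usual client-go mitigation is to re-read and retry on 409 Conflict instead of re-issuing stale updates. A hedged sketch; the object and the mutation here are placeholders, not KubeVirt code:

```go
// Sketch: RetryOnConflict re-fetches the object and re-applies the change
// when the apiserver rejects an update with 409 Conflict (stale
// resourceVersion). Every conflict costs an extra GET+PUT, which is one way
// per-VM PUT counts climb above the happy-path number.
package controller

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

func markConfigMap(ctx context.Context, c kubernetes.Interface, ns, name string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Always start from the latest version to avoid repeated conflicts.
		cm, err := c.CoreV1().ConfigMaps(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if cm.Data == nil {
			cm.Data = map[string]string{}
		}
		cm.Data["touched"] = "true" // placeholder mutation
		_, err = c.CoreV1().ConfigMaps(ns).Update(ctx, cm, metav1.UpdateOptions{})
		return err // a Conflict error triggers another attempt
	})
}
```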
A
In any of your diagrams, do you have the HTTP error codes for the put requests?
B
I can comment there, you know. But anyway, if you discuss it, you know... if you agree with the numbers, things like that, it would be very valuable if you write it down.
A
Yeah, well, so this is what I'm saying: to me, this seems like a bug. Let me take what we've talked about here and I'll paste it into the PR as a comment, just so we have all the follow-ups. But I think, overall, to me it seems okay: 200/400 seems okay for balance. It just seems like we're too low now; those might be sane defaults, so I think I'd be on board with that.
B
I think it's a good start, because this is one thing that I said to David as well: you know, by increasing that, we're not hiding the problem; actually, we are highlighting the problems. Having a very low queries per second, that's where maybe we were hiding the problem; at the beginning we were not seeing what was actually limiting us, we were not understanding that. So it definitely relates to that.
A
Just, I mean, look how long this is, like how much time; you can see it right there, and then it gets thinner and thinner, and then when you look at these two, there are kind of diminishing returns in some way. I mean, it's good to see that it gets faster, but... I mean, look how much better of an improvement there is between just these two; I mean, it's like a fourth of the time.
A
You
can
just
see
it
right
there
I
mean
it's,
that's
passive,
so
yeah
that
yeah
there's
another.
This
is
a
really
good
illustration.
The
distance
between
this
line.
This
line
is
absolutely
massive.
I
mean
that's
such
as
that's
just
free
speed
that
we
can
get
with
a
simple
improvement
that
just
doesn't
start.
A
Yeah, and the third one is... we have 200/400, yeah, which, I mean, there's another jump in here that halves it again, and so, yeah, okay. Yeah, I mean, I'll write a comment on there; I think it makes sense to me. Okay, let's go to the next one, so we have enough time to get through these. So you added VMI migration phase transition times; this looks good, exactly.
A
Okay, did you find any... what did you find in here? Anything interesting in the results? Anything stick out to you?
B
Yeah, this is the latency; I think it might be in seconds. So, you know... the time that a migration takes, the whole time, depends, you know, on the size of the VM, but what we can see here is preparing the target.
B
It took like 36 minutes to prepare the target, which means...
B
No, no... seconds, I'm sorry, oh yeah, okay, it's 30 seconds, more or less; 36 seconds here, or a little bit more than that. I think this experiment was migrating 100 VMs from one node, and I had, like, a very high configuration: I could migrate in parallel, having 20 parallel migrations. And, yeah, so I don't have, like, a big conclusion on that yet, but what I'm saying here is, maybe, you know...
B
I
need
to
get
the
pod,
maybe
creation
time
latency
that
might
be
to
about
20
seconds
or
10
seconds,
actually
but
creation
time
and
then
20
seconds
to
prepare
the
pod.
But
maybe
it's
too
much
isn't
it.
I
don't
know.
I
don't
know
it's
just
like
a
few
seconds,
so
I
think
we're
fine
migrating.
A
I
I'm
so
I'm
not
I'm
not
not
familiar
with
how
the
like
what
the
expected
times
are.
So
it's
hard
for
me
to
comment,
but
if
you
I
mean,
maybe
you
have
probably
a
better
idea
than
I
do,
but
it
would
also
be
interesting
to
hear
from
like
I
I
don't
do
you
know
this.
It
would
be
interesting
to
hear
to
show
them
this.
You
know
they
can
see
if
this
meets
their
expectation.
I
mean
that
would
just
as
another
data
point.
A
I
don't
know
who
did
it,
but
I
don't
know
it
would
be
interesting.
I
mean
I
I'm
not
like
yeah
I
mean
it
would
be
interesting
to
see
just
because
I
mean
they've,
probably
maybe
they've
never
seen
this.
I
mean
they've
probably
done
some
migrations,
but
maybe
they've
never
seen
it
like
the
way
you're
doing
it
like
with
100
of
them.
B
Yeah, so KubeVirt has some migration metrics, you know, but they just count things, like, you know, how many VMs were migrated, things like that. Maybe it's also possible to create a gauge to see how many are being migrated at the same time, but maybe...
A
Maybe... maybe we should start a mailing-list thread on this, because we can kind of get some feedback from the people who, you know, have certain expectations around this, just to get a better idea. I mean, I don't know... like, preparing the target, 30...
A
To do: mailing list on performance... see, okay. Next one, let's go to...
A
Yeah, I think what made me realize... like, there's maybe another way to do this, like there may be some improvements that we can make on the current structure. So, yeah, I mean, we can do this in another one, I think, yeah.
A
Yeah, okay, yeah. I think, overall, when I saw this, it was fine; like, I think it's fine to proceed, and so I'll put my plus one on there. We can get this out and we'll do a refactor after, something like that. I think what we should do, maybe as a follow-up, is have a discussion here in terms of how we want some of these classes to look, because I have an idea of what they should be, but, like...
B
Yeah, I think there are different ways to implement that. For example, instead of having, like, a burst job and a steady-state job, we could have only one, but inside we could have, like, you know, actions. So it means we create, we delete the object, it waits between the deletions, and then we generate, like, you know, the steady state; and then it's only one kind of job, but it depends on how we configure the actions. But anyway, we can discuss that later, and if you can...
A
Yeah, I'll give you a plus one. I think I'm okay with what's there, and we can do it as a follow-up. I think what I'll do is book some time in this meeting next week and we can discuss, do just an overview of it, and kind of...
A
...do a little bit of design, see how we can properly structure this, just to make sure it's very clear what our interfaces are.
A
Okay, all right, the last one is... so, just looking at the performance job results. So I made a change: last week we talked about how one of the VMIs was stuck and didn't have enough memory, so I increased it, and that merged. I think I did it correctly... like, I mean, I did this, but I didn't increase the KubeVirt-allocated memory, right; I adjusted the...
B
It's actually interesting, and it might be an alarm for us, because that's the kind of thing that we want to see, isn't it? It means that the memory footprint increased; the overhead, the VMI overhead, increased.
A
Yeah, there was... I think it was pointed out, the issue where the minimum amount of VMI memory was increased, like the buffer just so that the launcher processes can run; that was increased, and we're starting to feel that effect. But I'm actually surprised by how much: we're only launching 100, and we've already had to increase to almost, like, 10 gigs now, and so it's a little...
A
Much
it's
it's
so
yeah.
It
doesn't
really
add
up.
Actually,
so
it's
a
bizarre,
but
I
so
anyway,
that's
what
I
did
because
it
should
should
have
fixed
this.
So
I
am
still
seeing
a
failure
here.
So
what
was
the
date
of
this?
I
just
I'm
sure
again
correct.
C
You
need
to
actually
increase
the
keyboard
memory,
because
I
think
that
that's
the
indicator
on
how
much
the
vm
should
have
by
the
way
the
memory
you
increased
should
be
only
for
the
job,
so
it
will
not,
for
example,
crash
on
the
cluster
and
so
on
and
inside
the
job.
The
the
vms
are
created.
A
Sorry
wait
which
memory
do
I
need
to
increase
it's
it's.
Not
this
you're
saying.
B
A
C
The
pod
we
are
creating
the
vms,
and
so
the
keyboard
memory
size
is
the
indicator
of
how
much
memory
each
vm
should
get
so
right
now
you
should
see
four
vms
with
10
gig,
and
so,
if
you
want
to
increase
the
memory
for
the
vm,
because
you
cannot
settle
the
vms
that
like
kuberiums
there,
you
need
to
increase
the
parameter
yeah.
Also.
C
A
Okay,
got
it
okay,
so
that's
that's
probably
why
I'm
getting
some
failures
here?
Okay,
so
this
probably
needs
to
be
12
or
something
then,
okay,
that
makes
sense.
I
was
getting
suspicious
as
to
like.
I
didn't
really
do
the
right
thing
here:
okay,
so
that
would
explain
this.
Okay,
let's
probably
explain
all
these
I
think
like
I,
I
didn't
go
look
through
them,
but
I
have
a
feeling
that
I'm
gonna
find
that
there's
there's
still
an
room
in
here
somewhere
like
it's.
A
Not
it's
not
gonna
work
for
what
to
have
okay,
so
it
all
increases
to
12,
and
I
think
that
that
should
that
should
fix
this
okay,
I
think
so
we've
done,
I
think
we
were
originally
eight
is
what
we
were.
Let
me
check
the
blame.
I
mean
I
think
we've
gone
up
almost
so
it
would
be
like
four
gigs
by
the
end
of
this.
I
think,
let's
see
what
it
was.
A
Well, I think it's... I mean, Lubo, I think you were the one who pointed this out to me, that it had to do with...
C
Yes
correct:
we
had
a
pr
that
recalculated
the
overhead
which
is
needed
for
the
pot
for
the
build
long.
This
one.
A
Yeah, I was just wondering where we started at, because... oh, okay, so it was nine, okay. So we went nine to ten, and then we're going to go from 10 to 12, okay. So we're going to go up three gigs just to deal with this issue. Does that make sense? Because I've heard 30... 30 megs per VMI, and that doesn't quite add up.
C
So if you increase it by one, then you actually have four more gigabytes of memory, because you have four VMs. So in the cluster you will have four gigabytes more memory.
A
Yeah, like, there was one out of 100 that wasn't... so, I mean, I guess maybe it should be 11. I mean, so that, and now we're raising it... but, I mean, four-ish should have been enough, right? Because it should have been roughly three gigs, since it was about 30 megs per VM.
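For reference, the overhead arithmetic being sanity-checked here (assuming the ~30 MiB-per-VMI overhead figure quoted in this exchange):

\[
100 \ \text{VMIs} \times 30 \ \text{MiB} \approx 3 \ \text{GiB}
\]

so the per-VMI overhead increase alone accounts for roughly 3 GiB across the test, which is the scale of the 9 to 12 GiB adjustment being discussed.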