From YouTube: SIG - Performance and scale 2023-03-23
Description
Meeting Notes:
https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.tybh
A
Okay, all right, welcome to SIG Scale, everybody. It's March 23rd. Please add yourselves as attendees, and please enter topics. Okay, we're going to start off with analyzing the performance job. Specifically, I think what we should do is: B, why don't you start by talking about some of the work we've been doing, some of the graphs, and kind of the new way we eventually want to analyze these things?
B
Sure, do you mind if I share my screen? Yeah.
B
Can you see my IDE? Yep. Okay, so today I would like to talk a little bit about some of the new tools that we are working on to analyze the scale results, that is, the results produced by the sig-performance jobs.
B
So there are two major buckets of jobs that we run. One is the pre-submits, which can optionally be run on every PR, and the other is the periodics, which run every day.

B
So instead of going and manually reviewing, we've been trying to work on a tool which can give us a nice graph. I have opened the tool here, and I just wanted to go through some of the phases of this tool and the results from it. The way I'm thinking about this tool is in three phases. Phase one will go look at each job, collect its results, and put them in this directory, so the output format will be the directory name, slash results, slash the job name. So this is here: output/history, then the periodic KubeVirt end-to-end sig-performance job, and then under each, this is the job name and this is the job ID that was run. For each job ID we will get the VMI results and the VM results. This is directly scraped from the build-log.txt. In the future, this phase might go away, because we might get the ability to dump the observed values into an artifacts directory.
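To make that phase-one layout concrete, here is a minimal Go sketch of the collection step, assuming results are still scraped from build-log.txt. The regex, the vmi-results.txt/vm-results.txt file names, and the collect helper are illustrative assumptions; only the output/history/<job-name>/<job-id>/ layout comes from the discussion above.

```go
// Package history sketches phase one: scrape perf values from a job's
// build-log.txt and write them under output/history/<job-name>/<job-id>/.
package history

import (
	"fmt"
	"os"
	"path/filepath"
	"regexp"
	"strings"
)

// Assumed log format: lines such as "VMI creationToRunningSecondsP95: 42.1".
// The real build-log.txt format may differ.
var perfLine = regexp.MustCompile(`(VMI|VM) (\S+): ([0-9.]+)`)

func collect(jobName, jobID string, buildLog []byte) error {
	dir := filepath.Join("output", "history", jobName, jobID)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	var vmi, vm []string
	for _, m := range perfLine.FindAllStringSubmatch(string(buildLog), -1) {
		line := fmt.Sprintf("%s: %s", m[2], m[3])
		if m[1] == "VMI" {
			vmi = append(vmi, line)
		} else {
			vm = append(vm, line)
		}
	}
	if err := os.WriteFile(filepath.Join(dir, "vmi-results.txt"),
		[]byte(strings.Join(vmi, "\n")), 0o644); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, "vm-results.txt"),
		[]byte(strings.Join(vm, "\n")), 0o644)
}
```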
B
This was suggested in one of the threads, so this phase might get a little bit easier: we wouldn't have to scrape the build-log.txt for results. Moving on from there, we can do lots of things with this data. One of the initial steps that I took a shot at is to aggregate this data into weekly averages and then plot them on a graph.
B
So phase two of this tool is another subcommand that will aggregate the results per resource, and it will give a summary of the average. The output goes into the output/history/weekly subdirectory, and in that, let's check the VMI.
B
One of the interesting metrics here is creation-to-running P95. If you look at this directory, the subcommand creates multiple directories, one for the starting Monday of each week, and within that we have the results. So the start date is here, the average is here, and within this data structure there are data points for each date, so you can go through this and figure out what the data points are and what their averages are. But really, this is pre-processing for the next phase, which is plotting this in a chart.
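A minimal Go sketch of that phase-two aggregation: group the per-job data points by the Monday that starts their week and average them. The Monday-keyed weekly grouping is from the talk; the dataPoint type and function names are assumptions.

```go
// Package weekly sketches the phase-two pre-processing described above.
package weekly

import "time"

// weekStart returns the Monday (UTC, midnight) of the week containing t,
// matching the per-week directory names mentioned above.
func weekStart(t time.Time) time.Time {
	t = t.UTC().Truncate(24 * time.Hour)
	daysSinceMonday := (int(t.Weekday()) + 6) % 7 // Monday == 0
	return t.AddDate(0, 0, -daysSinceMonday)
}

type dataPoint struct {
	Date  time.Time
	Value float64 // e.g. creation-to-running P95 in seconds
}

// weeklyAverages groups raw per-job data points by week and averages them.
func weeklyAverages(points []dataPoint) map[time.Time]float64 {
	sums := map[time.Time]float64{}
	counts := map[time.Time]int{}
	for _, p := range points {
		w := weekStart(p.Date)
		sums[w] += p.Value
		counts[w]++
	}
	avgs := make(map[time.Time]float64, len(sums))
	for w, s := range sums {
		avgs[w] = s / float64(counts[w])
	}
	return avgs
}
```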
B
CI Health already had a good tool that would put this data into a single plot: it would draw scatter points for these values and then a line chart for the averages across the weeks. So now I would like to show phase three, which is the result of this aggregation.
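For the static-graph variant of phase three, the plot is scatter points for individual observations plus a line for the weekly averages. Below is a hedged Go sketch using gonum/plot; the actual tool builds on the existing CI Health plotting code (and can also emit a Plotly HTML page), and the sample values here are made up.

```go
// Static-graph sketch: per-run scatter points plus a weekly-average line.
package main

import (
	"log"

	"gonum.org/v1/plot"
	"gonum.org/v1/plot/plotter"
	"gonum.org/v1/plot/vg"
)

func main() {
	// daily: one point per job run; weekly: one point per Monday (averages).
	daily := plotter.XYs{{X: 1, Y: 32.0}, {X: 2, Y: 39.5}, {X: 3, Y: 40.1}, {X: 4, Y: 33.2}}
	weekly := plotter.XYs{{X: 1, Y: 32.0}, {X: 4, Y: 37.6}}

	p := plot.New() // gonum/plot >= v0.9; older versions also return an error
	p.Title.Text = "VMI creation-to-running P95"
	p.X.Label.Text = "day"
	p.Y.Label.Text = "seconds"

	scatter, err := plotter.NewScatter(daily)
	if err != nil {
		log.Fatal(err)
	}
	line, err := plotter.NewLine(weekly)
	if err != nil {
		log.Fatal(err)
	}
	p.Add(scatter, line)

	if err := p.Save(8*vg.Inch, 4*vg.Inch, "creation-to-running-p95.png"); err != nil {
		log.Fatal(err)
	}
}
```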
B
You can see that right around this time, the weekly creation-to-running performance degraded a little bit, because there are some observations where it went to around 39-40 seconds, and then right around here, oops, right around here is when we got it back. There are actually two ways to plot this in this subcommand: one is to get a static graph, the other is to get a Plotly graph, which is much more dynamic.
B
It creates an HTML output, and here you can figure out where exactly things started to get bad and where they got better. One thing we have been trying to do over the past week: we know when things got bad, which is December 22nd-23rd, 2022, and we know when things got good, which is, sorry, January 22nd-23rd, 2023, so we're trying to vet each PR in that range and find the culprit. We do have a couple of PR pairs that we have a suspicion on, but we were not able to confirm one way or the other. So yeah, that's the VMI creation-to-running. I do have a couple of other interesting observations; this is the second chart.
B
This chart is the weekly pod patch count for the VMI. Right around this week, the patch count for the VMI doubled: initially we had one pod patch per VMI; now we are having two.

B
Similarly, sorry, wrong one. Similarly, if you look at the patch count on pods from the VM, it also follows the same trend. So this lets us see that something in the code base changed right around this time which increased the patch count per VMI by one. We were actually able to pinpoint what that change was, and we'll share more details in the sig-scale talk next week.
B
Another observation from this tool was the patch counts for virtual machines versus virtual machine instances. On the left, the chart is the patch count for virtual machine instances when the VMI was created via the VM controller. You can see that there was an increase in the patch calls in the first week of February and the second week of February; there were two spikes. But if you, I do not have the right chart here, let me get that.
B
So if you look at the same plot for the VMI, this plot is: if a user creates a VMI manually, the patch counts for the VMI remain stable, right? So there were two changes that went in, in the way the VM controller manages virtual machines, and because of those two changes this patch count increased.

B
Yeah, so if you have any questions or feedback on things to change or things we can do better, please share.
C
Hi, I mean, this is great, and thank you for taking the task on. I just have a question: do we actually track, or bake in, which PRs the pre-submits are for?
B
Yes, so the results that I have churned right now are for the periodics. There is some improvement to be done for running the same thing against PRs.

B
The pre-submits data is organized a little bit differently than the periodics in the GCS bucket, so I haven't gotten a chance yet to modify this tool to look at the pre-submits. But once I do, we should be able to get the PR number in here, and I think the better place to put it is in this interactive chart.
C
Perfect. And I think even what B is writing could maybe have the PRs in the periodics: for example, which PRs went in within the window we are plotting the graph for. That would also be interesting to see.
C
So if we are going to plot the graph once per week, say every Friday, we can use the GitHub client to actually query which PRs got in that week and then include it in the plot.
A
Yeah, you'll have to, yeah, okay, so you, oh, you could do this on the client side, I see. Okay, because I was going to say we could do this based on a time frame, or you could include it in the periodic, but I think what you just said...

A
It makes sense: the time frame where you query GitHub. If we're doing it weekly, we check what's been merged in a week, and if someone wants to do this every three days, that's their time frame, which I think is allowed in your tool the way you set it up; then it'd be the same thing, we would just grab what PRs were merged in those chunks of time, every three-day period or whatever. Yeah.
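The GitHub query being discussed could look roughly like this with the google/go-github client, using the search API's merged: date-range qualifier. The repository and the exact week shown are illustrative:

```go
// List the PRs merged into a repo within a given window, so they can be
// attached to the weekly plot.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/google/go-github/v50/github" // any recent major version works
)

func main() {
	ctx := context.Background()
	client := github.NewClient(nil) // unauthenticated; use a token-backed client for real use

	// GitHub search syntax: all PRs merged into kubevirt/kubevirt in one week.
	query := "repo:kubevirt/kubevirt is:pr is:merged merged:2022-12-19..2022-12-25"
	result, _, err := client.Search.Issues(ctx, query, &github.SearchOptions{
		ListOptions: github.ListOptions{PerPage: 100},
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, pr := range result.Issues {
		fmt.Printf("#%d %s\n", pr.GetNumber(), pr.GetTitle())
	}
}
```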
B
So I think that might be, so CI Health might already be doing this, right? We might have some code written to pull these PRs; I'm not sure, but I'm trying to think aloud, because I know CI Health does some processing on the PRs that went in. It might have a utility, and we can take it from there.
B
Okay, is there any other call, I mean, metric, that you would like to see? I've tried to look at the major ones, and these are the ones that popped out, but I'm not sure if I missed anything; there are lots of them here.
C
I actually wanted to ask if we do want to track all of it. Maybe, I don't know, maybe use a few which we know might affect the performance a lot, and the others would not be collected; maybe we could just take those on demand if we see that something spiked and it's not what we would usually expect.

C
We don't have garbage collection on the job, so technically we would not lose it.
C
And did you think about which graphs you want to publish, or do we want to publish all of them? I mean, maybe the most interesting ones would be the P90/P95 for creation-to-running, right? Yes, that's most interesting, and then I think the operations, at least patch.

B
I was actually thinking of adding some more metrics to this, for example, the CPU and memory utilization of virt-handler and virt-launcher.

B
Those are the two things that would be nice to have plotted over time, just so we can have an idea of how the memory consumption and CPU consumption of our components are evolving at scale. Those might be metrics we would have to figure out in the audit tool and then populate here.
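As a sketch of where such numbers could come from, the following queries Prometheus for the working-set memory of virt-handler and virt-launcher pods via the client_golang API client. The Prometheus address and the exact metric/label names are assumptions, not necessarily what the audit tool will end up using:

```go
// Query Prometheus for per-pod working-set memory of KubeVirt components.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// cAdvisor working-set memory, summed per component pod.
	query := `sum by (pod) (container_memory_working_set_bytes{pod=~"virt-(handler|launcher).*"})`
	val, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		log.Println("warnings:", warnings)
	}
	fmt.Println(val)
}
```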
A
No, it's okay, you can move it up, yeah, there we go. Okay, what else do we think makes sense in here? I mean, limit ranges? I don't think so. Jobs? No, doesn't make sense. ConfigMaps and most of this stuff is...
A
We have to do some, I forget what phase it is, but whatever it is, when we commit it to git, I think what we do is we keep only the ones that we have designated here in the, can you share your URL? I'm highlighting something, and sure, I'll grab it from here.
A
There. I think this is kind of like our regexes, and we keep these, and then, if we see any other weird things by just analyzing the job, because, like was said, if we wanted to, we could go back and look at the actual job, look at the plain text, and see it there, we can always add more later if we find anything strange.
B
Those are nice, that's cool, yeah. My only thought would be to be careful with the data: if we want to get other calls in the future, even if we don't publish them, we should leave the door open for us to process them later. And I think it was mentioned that we don't garbage-collect at the bucket layer, so we should be fine.
A
All right, and yeah, like you said, we'll be talking some more about this: at the KubeVirt Summit we'll have some more on this and some graphs and stuff, or in some PRs that go along with the graphs, okay. What else do we want to have? That was the topic I know was at the top of the list. Do you want to talk about anything to do with the Windows VMs?
D
Yes. So I can discuss in general what we do today around KubeVirt. The last workload we have is called boot storm. We test the time, similar to your test, with 1 and 100 VMs, but we test in parallel the total memory allocation per VM and per node and the total CPU; we fetch it...
D
...from Prometheus for the specific run, and we run it twice a week to see that we get a stable result against the latest OpenShift cluster. We started to test the density at around 300 VMs, and our limitation in general is memory: we configured the request memory to 128 MB in order to run the maximum number of Fedora VMs.
D
So the minimum memory we... the memory is two gigabytes; yes, two gigabytes per Windows Server 2019 VM.
D
So if you launch more than that, you will get memory-insufficient errors, because you reach the limit.

D
In general, when we run it, we run it in balance according to the number of CPUs.
D
So each time we launch 20 VMs, and in order to see that they start at the same exact time, we run the YAML with running: true, so all the VMs will start at the same exact time. And we do it as a ramp-up; what I mean by ramp-up is that each time we add 20 more VMs until we reach the limit on each node, and then we continue with the next node.
D
So at the end we have only the already-running VMs. As I said, we collect the Prometheus data for each run, push the data into Elasticsearch, and share the results in Grafana, so we have a clear result for each nightly run.
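A minimal sketch of that batch launch, assuming the KubeVirt Go API types: each VirtualMachine is created with spec.running set to true so a whole batch of 20 boots at the same time. The namespace, VM names, and the omitted template are placeholders; creation would go through the KubeVirt client.

```go
// Build one ramp-up batch of VMs that all start as soon as they are created.
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kubevirtv1 "kubevirt.io/api/core/v1"
)

const batchSize = 20

func newVM(name string) *kubevirtv1.VirtualMachine {
	running := true
	return &kubevirtv1.VirtualMachine{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: "boot-storm"},
		Spec: kubevirtv1.VirtualMachineSpec{
			// running: true makes the VM start as soon as it is created,
			// so the whole batch boots at the same time.
			Running: &running,
			// Template (disks, memory requests, etc.) omitted for brevity;
			// a real VM needs it to pass validation.
		},
	}
}

func main() {
	for i := 0; i < batchSize; i++ {
		vm := newVM(fmt.Sprintf("windows-vm-%03d", i))
		// A real run would create vm through the KubeVirt client here.
		fmt.Println("would create", vm.Name)
	}
}
```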
A
I didn't understand: how many cores per node would you say it was?
D
That's in the place where we share the results, but it depends. So we create the distribution of the results, and in general the times look stable. Sometimes we need to investigate why there is intensive memory usage and all this stuff, but in general we install the nightly CNV build of OpenShift, okay, so we take the latest version, which I guess should be the latest version of KubeVirt, right?
D
I don't know exactly which version; I actually tag the CNV nightly version inside our Grafana dashboard.
A
Okay, so that's cool. I guess, like, so we're...
D
By the way, we are keeping all the logs; we store all the logs in S3 buckets. We discovered before that you don't have enough space to keep the logs, so we upload the targeted data to Elasticsearch for future analysis, and next to each entry we have a link to the S3 bucket. So with one click you get all the logs onto your local machine and you can do local analysis, without having to, you know, keep all the logs locally.
A
Cool, well, that's pretty awesome. I mean, I think that's cool, and yeah, as you get some results from this, you know...
A
Let us know, and yeah, like I said, we have a similar hardware topology on the dedicated cluster, and the tests we're doing aren't quite the same; they're not the same. We're just, you know, creating VMs in certain quantities and analyzing the results. So, I mean, what I'm actually curious about is, as you do some of this stuff, it would be cool to see...
A
We have the audit tool that basically scrapes the metrics and analyzes them in a format such that they can be consumed by the graphing work that B is doing. So eventually, when B has that published, you should be able to use the audit tool and then build graphs, and it would be cool to see, based on what you're doing here, what else we find. Because you're doing nightlies, we should be able to compare with what you're seeing, and maybe we can get some additional data points from it.
A
I mean, so, like with what B was showing earlier, this stuff with these metrics, like the HTTP request counts: you mentioned you're looking at the amount of memory and CPU on the node and how that gets affected by the scale that you're going to. What I was suggesting is that you could also incorporate, as part of your analysis, the HTTP requests and the VMI phase-transition times.
A
Both of those things will give you more insight into performance, and one of them, this one in particular, into scale on the Kubernetes side. And so, when you use the audit tool for this, and I can send you a link afterwards, it's something you can just run locally as part of your job; it's pretty easy to add in, yeah.
D
Do I need to run it against each VM to get the memory? No?
A
No, you just, so what you'll do is, the way it works is: after you do your test, and this is exactly what we do, say, we run our tests and then we run this audit tool.
A
So you do exactly what you're doing today; after you're done, you run this audit tool, and it'll capture a bunch of this stuff from Prometheus and organize it in such a way that it will be helpful for doing performance and scale analysis. And then, when B has this stuff at...
A
If you have an old cluster, one you've used in the past, it doesn't matter, but say you create one VM before you start, and then you do your whole scale-up here with like 300 Fedora VMs: you run your audit tool afterward, after you're done, over the time period that you want to analyze, and it will give you a bunch of data about the client-go HTTP requests and the VMI transition times.
A
You know, it organizes it in such a way that you can get the P90s, the P95, P99, the P50, and a bunch of that stuff. And then B is working on this tool that he presented earlier where you can graph this stuff; since it sounds like you're storing the data somewhere, we should also be able to get this tool pointed at your persistent storage.
D
It gives us the percentiles across all the VMs that are running on the specific cluster, right? Yeah.
A
Yeah, you could get this, and right, you can get it in Grafana if you want. The point, the reason I was mentioning it, is: if you wanted to do this programmatically, like if you wanted to write a bunch of code around the results, you'd run this tool and run a bunch of code around it. But you could obviously do this in Grafana if you want to review each job one by one.
D
Nice. So is it something embedded in the client?
A
It's not, it's a part of upstream KubeVirt. I can point you to it. You just use, you just use, let's see here...
A
The tool is right here, so what you do is you compile it yourself; I think there's a command for it here, and you can run it.
A
Oh no, okay, we compile it in that repo, then. Okay, so what you do is, in here, you can compile it and it'll get you a binary, and you can run it like I was saying; you can just run it. There's an example; I can point you to an example, I can put it in the doc or I can send it to you on Slack afterwards, of how this looks inside a test. It's pretty straightforward.
A
Yeah, so it ends up looking like this; I'll show you exactly.
D
You know, for loading: if the time to load takes 30 seconds, I want to know the distribution of those 30 seconds inside KubeVirt itself, how much is spent in each block of the code.
A
So no, we don't have that yet. This is actually where some of the enhancements we want to make come in. Just to give you an idea of where we are: here's what the output looks like, so you'll get a plain-text JSON dump to your local terminal.
A
You can put it into a file or whatever. This is based on the time period over which we ran it: here's our P50 for create-to-running time in seconds, our P95, and so on.
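If you wanted to consume that JSON dump programmatically rather than read it in the terminal, a hedged Go sketch might look like this. The JSON key names below are invented for illustration (the discussion only mentions P50/P95 of create-to-running in seconds); check the audit tool's actual output keys before relying on them.

```go
// Parse an audit-tool JSON dump that was redirected to a file.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

type auditResults struct {
	// Assumed keys; the real report may nest or name these differently.
	VMICreationToRunningP50Seconds float64 `json:"vmiCreationToRunningSecondsP50"`
	VMICreationToRunningP95Seconds float64 `json:"vmiCreationToRunningSecondsP95"`
}

func main() {
	raw, err := os.ReadFile("audit-results.json")
	if err != nil {
		log.Fatal(err)
	}
	var r auditResults
	if err := json.Unmarshal(raw, &r); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("p50=%.1fs p95=%.1fs\n",
		r.VMICreationToRunningP50Seconds, r.VMICreationToRunningP95Seconds)
}
```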
A
So, if you look at it, there are little nuances to this. We have the phase transition from scheduled to running; that's the period when we're booting the domain until it's showing ready. So that's a whole time period, and you could technically track that time period, right? Maybe in this case this is the whole thing, but maybe it took eight seconds or something; so for that time period, technically, we could write some metrics that would let us track, okay...
A
It took eight seconds: what was the breakdown of those eight seconds? Was it the domain being created? Was it the callbacks between the handler and the launcher? Was it something else, you know, whatever it is? That kind of stuff is what we can create. But my point is just to show you kind of what we have today, and eventually where we want to break this down, where we could break it down more, and how we can output it in a useful way.
B
So let me jump in a little bit here.
B
You can scroll up a little bit. So these values...
B
But that's it, that's it, okay. So the value of this tool, in my opinion, is that you get an aggregation, right: that creation-to-running P95 is 45 seconds or something. You can get that even without this tool, but in the KubeVirt stack a lot of time is being spent making these API calls.
B
These API calls are made by KubeVirt to the Kubernetes API server, and you can correlate whether the creation-to-running P95 has gone up, gone down, or stayed level based on the aggregation of these API calls. So you can say: okay, the P95 creation-to-running has gone up because we are making more LIST calls, which are expensive, in our reconcile loop, and that's why it has gone up. So this is the first level.
B
Well, no, so it's not really a total breakdown; it's more a way to understand the scaling behavior of KubeVirt itself, so you can correlate data. The next thing, which Ryan was mentioning, is to actually do a breakdown of that creation-to-running, right? So the...
B
We can get that, and it's something on the roadmap, but the value of this is that you can understand the scaling behavior of KubeVirt. So we can look at this data and say: okay, if we scale up to 1,000 VMIs, we are getting a 4,000 GET-endpoints count, and Kubernetes will not work well in our environment with 4,000 GET endpoints.
D
And if you see something more than expected, you know that something is going wrong in this section, right? Yes.
B
It helps us understand the scaling behavior and predict whether Kubernetes will be able to handle this, or whether the stack will be able to handle this. The performance behavior that you are talking about, the exact breakdown of each phase into smaller components, is something that we would have to work on. So...
B
Now, this aggregation is monitored in the KubeVirt client. So what happens is: any time you create a bunch of VMs, the KubeVirt client will increment each of the values when it makes the call. So when it makes the create-pod call, it will increment the Prometheus counter. At the end, the audit tool, like Ryan said, will look at all the incremented values and report the data back.
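The counting mechanism described here could be sketched like this with client_golang: a counter labeled by resource and verb, incremented on every API call, which the audit tool can later read and report. The metric and label names are illustrative, not KubeVirt's actual rest-client metrics:

```go
// Package client sketches per-call API counters of the kind described above.
package client

import "github.com/prometheus/client_golang/prometheus"

var apiCalls = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "rest_client_requests_total_example",
		Help: "API calls made against the Kubernetes API server.",
	},
	[]string{"resource", "verb"},
)

func init() {
	prometheus.MustRegister(apiCalls)
}

// createPod stands in for the client call; the real client increments the
// counter inside its request path.
func createPod() {
	apiCalls.WithLabelValues("pods", "create").Inc()
	// ... actual POST to the API server would happen here ...
}
```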
A
The total calls, right. The way I'd characterize this, like, our thesis, is that we're looking at performance and scale as related things, and that if you were in the traditional virtualization world, when you're launching your virtual machines you're very focused on the performance of how long it takes from when I create the domain to the point that it's available to me, right? That's critical. But what I'm saying is, part of our thesis...
A
Is that there's so much more happening here, and that's what this illustrates: we have to deal with the Kubernetes layer, the KubeVirt control plane, and then we have to deal with the actual virtual machine itself. And so all those things together are what ultimately affects our quote-unquote performance, because that's what we're dealing with on Kubernetes. And so what we're doing here is we're taking the Kubernetes and the KubeVirt control-plane portions and we're sort of factoring that into our analysis.
A
And then we're also doing the part which you're emphasizing: we're trying to get to the breakdown of just the domain and the virtual machine, how quickly the guest is ready, from when we actually, you know, go to define the domain to when it's actually available to use.
D
So I think that this is not embedded inside KubeVirt, so I thought maybe it should be like a configuration...
A
No, it's, this is on by default. These are metrics already provided by default when you install KubeVirt; all we're doing, like, so you can get this in your cluster; I know you can get this in your Prometheus, you can look at these values. What we're doing here is analyzing them in a useful way. And, by the way, getting this stuff is not easy.
A
Well, anyway, I just wanted to mention this as something you can add as an addendum, because there's a bunch of data that we can capture, and I just wanted to mention it because I think it'll be useful to see: when you bring your data, like you did here, to this meeting, it would be cool to also see this as a comparison to what we're seeing in some of the tests we...
We.
D
Have
for
sure,
because
we
can
have
more
details
around
across
all
the
blocks
code
blocks
that
calling.
D
When we use them, maybe it can, yes. When we run, say, the 100 VMs, depending on how many calls we have, I think the numbers will be the same; but in the end, we want to add it to our nightly run so we can see the behavior across runs. So if there is a one-run peak, we can find that there is an issue, maybe in this method or something like that. Okay, so if...
A
We have, I mean, that's what we have today. We just use the transition times to get the P90s, P95s, P99s, but, like I was saying, there are further breakdowns we can do here in a lot of different directions: just the domain, the launcher, the networking, there's lots of stuff we can do. We started with this because it gives us an easy on-ramp, but yeah, what you're talking about is things we want to add.
D
This week there is also the performance side and also the KubeVirt Summit, so there are a lot of things going on.
D
I think next month, the beginning of April, there will be more time to do the investigation around it. First, this tool can help us to know the distribution, and for sure I will try it on my side, and also, as we discussed last meeting, to enhance the upstream side: to try to enhance and understand more about the performance tests on the upstream side. Sure.
B
Yes, so this was one of the questions raised in the last discussion, and I took the time to figure out what's happening here. So, really quickly summarizing this: the kubelet controls a set of conditions on the node status.
B
The node controller only looks at the heartbeat time, and if the kubelet has not posted a heartbeat for some timeout period, then the node controller will add a taint to that node, and it will basically terminate the pods running on it via the NoExecute taint. So basically, what I'm trying to say is: the kubelet owns those status conditions; the node controller reads those status conditions and takes a bunch of actions.
B
The reason this does not interfere with the KWOK controller is that the KWOK controller becomes the fake kubelet and owns those status conditions instead, and because it runs a regular heartbeat on those status conditions, the node controller thinks that the node is healthy and does not do anything.
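A minimal sketch of that heartbeat mechanic, using the Kubernetes API types: whoever owns the node's status conditions (the real kubelet, or KWOK acting as a fake kubelet) keeps refreshing LastHeartbeatTime on the Ready condition, which is what keeps the node controller from tainting the node. The Reason string is made up:

```go
// Refresh the Ready condition's heartbeat on a node object.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func heartbeat(node *corev1.Node) {
	now := metav1.Now()
	for i := range node.Status.Conditions {
		if node.Status.Conditions[i].Type == corev1.NodeReady {
			// Refreshing the heartbeat (without flipping Status) is enough
			// to keep the node controller from applying the NoExecute taint.
			node.Status.Conditions[i].LastHeartbeatTime = now
			node.Status.Conditions[i].Status = corev1.ConditionTrue
			return
		}
	}
	node.Status.Conditions = append(node.Status.Conditions, corev1.NodeCondition{
		Type:               corev1.NodeReady,
		Status:             corev1.ConditionTrue,
		LastHeartbeatTime:  now,
		LastTransitionTime: now,
		Reason:             "FakeKubeletReady", // illustrative
	})
}

func main() {
	n := &corev1.Node{}
	heartbeat(n)
	fmt.Println(n.Status.Conditions[0].Type, n.Status.Conditions[0].LastHeartbeatTime)
}
```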
B
Yeah
I
actually
did
not
get
time
to
prepare
a
demo
for
this.
Maybe
we
I
can
get
it
going.
The
next
next
time.