From YouTube: SIG - Performance and scale 2021-07-22
Description
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.i2ab65exaot0
A
Okay, all right, welcome to SIG Scale, everybody. A few announcements for today: next week, next Thursday, July 29th, I'm going to be out the whole week. So we could have a meeting if folks want to, but I'm going to need a volunteer. If someone wants to host the meeting, I can give you the admin control so you can start the recording and so on, or we can not have a meeting.
B
Yeah, I don't mind running it if we have things to talk about. Let's see how this meeting goes, and if we feel at the end that we need to sync again next week, then I can run it.
A
Right, okay, sounds good. All right, let's get started with the first things. I don't know if Marcelo joined; maybe he's got a conflict today. So last time we talked about this document from Marcelo's experiments, where we got a bunch of information from Grafana. I took this and, oh, it looks like Marcelo might be joining right now. So I took these and created a bunch of issues from them.
A
So I posted the graphs here based on the metrics, and I just kind of drew a conclusion. This is the first one: API requests to the kubevirt.io virtualmachineinstances endpoint return 404s. Our expectation on this one is that we don't see this; we don't expect a bunch of 404s on this API. And it really got worse every time: as we created more VMIs, we just saw a lot more of these requests.

I mean, that kind of makes sense: we're making a lot more requests to this API, and it's just returning a 404. Does it make sense to have this as an issue, or is it something we expect? I don't think so, right?
C
B
A
Yeah, I saw, I think, because he does have the Kubernetes metrics in here, and I thought I saw that Kubernetes also gets these. It's not a good thing, but it's just another data point.
A
Well, maybe we can define exactly what's happening here when we're hitting this 404. Is it that we're trying to request a virtual machine instance and it's just not there, or what is it that we're doing? Because from this metric alone we don't know.
B
A
It's a GET on a resource and the resource is not there; what resource we're calling we do not know, so it could be anything. We have like 12 controllers, and anything that's keying off a VMI and calling a GET on something could be causing this. For example, the pod disruption budget controller: if we were calling a GET on pod disruption budgets every time a VMI gets queued, then that would cause a 404. Or the snapshot controllers and things like that could possibly do it as well. We have to investigate it. Yeah.
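A minimal sketch of how that investigation could start, assuming the dashboard panel is backed by the standard apiserver_request_total metric (the metric choice, label names, and Prometheus address below are assumptions, not something confirmed in the meeting; identifying which controller actually issues the GETs would still need audit logs or per-client data):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumed in-cluster Prometheus address; adjust to the actual deployment.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.monitoring:9090"})
	if err != nil {
		panic(err)
	}
	prom := promv1.NewAPI(client)

	// Break the 404 rate against kubevirt.io APIs down by verb and resource.
	query := `sum by (verb, resource) (rate(apiserver_request_total{group="kubevirt.io", code="404"}[5m]))`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := prom.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```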
A
Okay, all right, we'll keep it in, sounds good. All right, we'll go to the next one. This one is: for the disruption budget controller, the workqueue add rate is very high, so when creating a lot of VMs we just get this.

We see it in the workqueue add rates: it's just so much higher than the other metrics here for virt-handler and virt-controller. And this is, yeah, another one: why is that happening? I think I defined it, yeah, I did, okay: the total number of adds handled by a workqueue. Why is this one so much higher than any other one? There might be another issue buried underneath this, but this could at least get us a little bit closer to what's going on.
B
A
Yeah, that's a good one. Let's get a little more information on this one. Okay, the next one is the workqueue performance. This one's kind of general; I thought about splitting it up into a bunch of different issues, but it all seemed related in some way. So the first metric was the workqueue latency. This is worst case, I think, right, if I'm not misreading it.
C
Yeah, for that I would really just leave it as one issue until we have the possibility to configure QPS and the rate limiter and run the tests again, because right now they pretty much have to be high; there is no other option. As soon as we have it, we can then see what remains high and investigate that.
A
All right, let me see, you have your pull request somewhere in here, right? Yeah.

All right, let's do that then. Let's see the changes, yeah, okay. Next one: the goroutine count and memory remain high after VMIs are removed. This one was weird. I mean, the level of goroutines just stays that way even when we've got no VMIs, and memory too. And there's even another question here, but I didn't zoom in.

Let me zoom the picture a little bit; it might lose some quality, but yeah. What I was saying is, you can see the goroutines line: the baseline when we have no VMs steadily increases as you create more, which is interesting. CPU usage stays down. This was a weird metric; this is, oh, I spelled it wrong, time spent using the CPU, as opposed to, like, a percentage or a number of CPUs used or something. Memory seems like it's generally on an incline.
A
That would be one thing that made me think of this: I know at least internally we set requests, we have some requests on it for memory, but I don't know if it's something that's currently set.
C
Yeah, we set some requests, but no limits, yeah.
A
Yeah, and then the other question I brought up here was: if we fill the nodes, do we still see the memory climb? Let's say we have three nodes and they have a hundred-VM limit, and we fill them.

Do we see increased memory for that? Just to know whether it has any correlation with the number of VMIs, or whether it's simply that we have virt-handler at max capacity.

So that's another question, but we haven't figured that out; that's just something it kind of looks like here, something we'll need to figure out at some point. But that's kind of what I wanted to get into with some other tests that we can do.

Okay, so that's that one. All right, so those are the four issues. Were there any other ones we could think of that need to be created from this? I mean, should I break it up, I don't know?
B
I had a thought after the meeting last week about the 409s; that's what my PR was about, I had a PR that helped address that. Under load we actually see fewer of them, because the queue is backed up, which gives the informers more time to catch up. So the 409s would actually only become an issue when we become more efficient.
A
Okay, yeah, that's interesting. Okay, that's another one. Well, I already have, this is kind of what I wanted to do in the evaluations section: I want to start defining things that we can evaluate, because now we've got this data.

So let me just write this one down: the 409s, probably after the QPS change. So we do the QPS change and measure workqueue efficiency, and then, following that, we want to see whether the 409s are affected with an efficient workqueue under high load. That's another one. Okay.

Okay, yeah, I don't think we need to have issues for these. I think we could just keep them like this, or maybe I could just create one issue, load these all in there, and we can check them off or something, I don't know. All right, we'll just go with this for now, and if it becomes a problem I can create an issue for it. All right, yeah, okay. Next item: baseline thresholds.
A
I created this this morning. I talked about this recently; I was trying to find the right way to make it usable. I'm just going to start really simple with this. Basically the goal is to have some source of truth so that CI can read performance and scale metrics per release.

So, every time we cut a release, say 0.43, we'll have some sort of code that hangs around that just holds these thresholds in place, so that when we run CI for backports and such, or if people want to consume it or have CI that runs externally, we'll have this.

This is what our expectation is, and all I figured we'd do is add a bunch of constants in here that builds our list of things, and we just have a process for approving these thresholds based on what we know about the release and what we want to measure.
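As a rough illustration of the "bunch of constants" idea, a sketch of what such a per-release source of truth could look like; every name and number below is an invented placeholder, not an agreed threshold:

```go
// Package thresholds sketches the per-release expectations discussed above.
// All identifiers and values are illustrative only.
package thresholds

import "time"

// Expectation ties a measured quantity to the worst value accepted for a release.
type Expectation struct {
	Metric      string        // dashboard / Prometheus metric the number refers to
	Description string        // what we are promising
	Limit       float64       // numeric ceiling (unit depends on the metric)
	Window      time.Duration // period over which the value is evaluated
}

// Release043 would be frozen when the release is cut and only changed through review.
var Release043 = []Expectation{
	{Metric: "vmi_phase_transition_time_seconds", Description: "p95 time from created to running", Limit: 60, Window: time.Hour},
	{Metric: "kubevirt_io_api_404_rate", Description: "sustained 404 rate against kubevirt.io APIs", Limit: 1, Window: 5 * time.Minute},
}
```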
B
As well, so I think it makes sense to commit at least some sort of expectations into the code base around this. Are we thinking about using that tool I was creating, the one that would retroactively gather and report results, in order to determine these thresholds? Or what were you envisioning here?
A
We know what these are; when we cut the release, we set them in stone, and here's what we expect: if you don't change the code base, this is what you should see all the time. So we use this as that source of truth, and we gather that data, on the day we do the release, from your tool, to set all of these up.

So I kind of see it as our way of communicating exactly what our expectations are for this.
F
Yeah, I would say I had this thing in my PR before, the huge PR that I had, and then we simplified it. The idea was actually the thing that you mentioned. So, for example, we have jobs running for PRs or daily; I think we should start daily now, just to make it easier, you know, to run those tests, and then we define this threshold somewhere.

I think the place you did it is good, and then we were discussing the framework to collect this and to compare the results against it. So maybe we can combine everything: the thresholds that you define here, plus David's tools, to compare the experiments, and then it should, you know, write some alert or fail a task. Then we should discuss what's the better way of representing thresholds.
B
I think I can create some sort of config that we can pass into that perf tool I was working on, the one that gathers results, where we can say these are the thresholds we want to meet, did we meet them or not, and have the results tell us whether we passed or failed our thresholds as well.
B
F
Yeah, so my initial idea, in the first version of the PR, was to put this in a YAML, a configuration file, and then we could easily change all the thresholds, and everyone can maybe change that according to the environment it's running in, you know.
B
So I think I like the thresholds; can we move that to the tests package, maybe have a perfscale subdirectory in the tests package, and then have these threshold configs per environment? That's what I think would make the most sense.
A
Well, one question about this, let me clarify: per environment, I understand that, so we use it as a config, but what about... oh okay, I think I understand, you mean per environment. So, let's say, I'm in the mindset of when we release this, how we're going to communicate it. Would this be like: on 100 nodes, here's what we expect to see for your perf-scale thresholds? Is that kind of...
A
C
B
You just pick the config you want to run against, your threshold config, when you're running the test. So for the perf test that's running daily or whatever, when we create that automation job, maybe there's a CLI argument or an environment variable or whatever where we specify the threshold config we want to compare against, one that matches the environment we're testing against, and then it's just always used.
A
F
Because if we have it hard-coded, then every time we change the environment we need to change this too. Also, for example, let's assume someone wants to run KubeVirt in their local environment and wants to test as well. If this is configurable, they can, you know, use this test there, so it's more generic. Right now our focus, of course, is our CI environment, but it makes it useful for more people. Just saying that.
A
I can move this over to where you're working, David. And then the other thing, kind of the other question, because I just wanted to clarify: what do people think of this? Because this is where I'm working toward with it: we have some way to say, with the release, what our expectations are. Should we create that sort of expectation?

I guess that's sort of the question here, because what we're saying is we're aiming toward per-environment threshold configs for testing. Should we even go that route and say, okay, here's what our expectation is for performance for this release? Is that even somewhere we want to go?
F
A
C
When you run it on a bare-metal machine, or on IBM Cloud, then it maybe has faster CPUs or something. I mean, for the release we of course have to compare against the same machines, but when you're doing local experiments you're not running on the same machines as in CI, and you may want to change some parts. I think that's all this is about.
F
A
No, no, yeah, I get that; I get the configurable aspect of it. All I'm saying is, like I mentioned, one of the goals I laid out here was that we want it to be used by CI to evaluate, and what we talked about last time was that the tool David's using is configurable and that the performance is going to be specific to the environment it runs on.
A
That makes sense to me, because we want to compare apples to apples. The other part of this, though, is, like I'm saying: for every release that we do of KubeVirt, do we want to say what our expected thresholds are? Can we say that without specifying any sort of hardware requirements, or saying you have to use this script or something like that?
B
It's the expected numbers based on our CI hardware, that's it. It's not "publicly, this is what you would expect"; I mean, if you completely reproduced our environment, then yeah, that's what you would expect.
A
Yeah, okay, I just wanted to clarify so we're all on the same page. This is what I was trying to find: Kubernetes has a whole section where they wrote about their SLIs and SLOs.

Okay, so let me take this: I'm going to move this over. I'll wait until your patch merges, David, and then I'll move it over there and test, and we can just make it configurable. That's fine.
B
So I will add thresholds to my PR: the ability to pass in some sort of YAML and define your thresholds, and then in the reports file you'll see whether you've met your thresholds or not. Then you can consume that when the patch lands. I'll try to get that done today; maybe we can get it in this week and then start using it in CI soon.
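A minimal sketch of how such a threshold YAML could be consumed, assuming a simple file layout; the field names, file name, and the sigs.k8s.io/yaml dependency are assumptions, and the actual PR may look quite different:

```go
package main

import (
	"fmt"
	"os"

	"sigs.k8s.io/yaml"
)

// ThresholdConfig mirrors a hypothetical per-environment YAML file, e.g.
//   thresholds:
//     - metric: vmi_creation_to_running_p95_seconds
//       max: 60
type ThresholdConfig struct {
	Thresholds []struct {
		Metric string  `json:"metric"`
		Max    float64 `json:"max"`
	} `json:"thresholds"`
}

func main() {
	raw, err := os.ReadFile("perfscale-thresholds.yaml") // assumed file name
	if err != nil {
		panic(err)
	}
	var cfg ThresholdConfig
	if err := yaml.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	for _, t := range cfg.Thresholds {
		// The perf tool would compare the measured value against t.Max here
		// and mark the corresponding report entry as passed or failed.
		fmt.Printf("threshold: %s <= %v\n", t.Metric, t.Max)
	}
}
```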
B
When are you leaving, on PTO I assume, Ryan?
A
Yeah, I'm going to be out; this Friday is my last day.
D
A
B
Yeah, wait, so next week... this Friday you're leaving? Okay, I'll see what I can get done today. Maybe we can make some progress right before you leave, or at least get things somewhere.
A
For me, I think, so I understand: baseline thresholds will have our definition, like I said. Oh, someone's got it, yeah. So we basically define SLIs and SLOs per release, based on CI, and that's how we can communicate it, and then the other tool we'll use more as a developer tool, per environment. Yeah, that makes sense, okay, cool.

So let's go to evaluations next. All I wanted to do with this was come up with a list of other tests that we can do here, now that we already have the different performance phase changes for VMIs and a bunch of things that have merged.

I just wanted to enumerate a list of tests that we want to do, to start building toward these baselines and start finding them. Does anyone have ones to add to this list? I think the first one we've got to do is bring in that QPS change and measure the workqueue efficiency again, and then it sounds like the 409s would follow, to see how that changes things. Do we have any other ones?
A
F
I can run that when it gets merged. Ideally I'm going to prepare the prow job to run daily for the density test that we have, so that we can start to check it at a high level, in a way that we can go through our public Grafana dashboard and check things.
H
A
Okay, and this is how many VMIs? Like, what's...
F
A
This actually brought up another thought (maybe we're not there yet): I'm thinking about how we should define how we test each of these, so we're consistent. Marcelo, I like what you did; for these, definitely, we should do the same thing, the same tests you did, Marcelo, that generated these metrics here, because we want to compare exactly those dashboards before and after, I think, for those two. So should we come up with a name for this?

This was that 10, 20, 30, 40, whatever, up to 100, 300 test. I don't know what we would call it; Marcelo's test, the 10-to-300 VMI ramp-up. So this is what we want to do for these.

Okay, all right, I just wanted to clarify that. And then we have our daily tests, so we can start on things and get that going. Okay, so let's go to other items. Roman, you've got a PR here.
C
Yeah, this would just be my initial proposal on how to make this stuff configurable. Basically, right now we have four clients in use: we have two clients in virt-api, one for console connections and that kind of thing and another one for validations, so that the webhooks are fast, and we have one for...
G
C
The 400 and 200 numbers on the webhook configuration at the bottom, that's what we have already; that's why you never had any issues with the validation webhooks. But for the controller configuration I increased it here in this example, or rather I set the same defaults Kubernetes uses for the controller manager, the 30/20, and for the rest I left it.
C
B
The client defaults, yeah. I see that you created (I'm just briefly looking at this) a package called ratelimiter. This is more complex than I thought it would be, because I thought it would just be setting something in the clients.
C
Well, I mean, I am setting the rate limit on the client configuration. The nice thing is, in the client configuration you can set burst and QPS directly, and then, when you create the client, a rate limiter will be created for you with these values.

But you can also just directly pass in a rate limiter, and this has the advantage that I now created a wrapping rate limiter for the token bucket rate limiter from Kubernetes, where it's passed in, and I tied it together with our KubeVirt config, so you can change the values on the fly. Okay, so that's what you did; that's where the complexity is.
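A minimal sketch of the wrapping idea described here, assuming client-go's flowcontrol package; this is not the actual PR, just an illustration of swapping the token bucket limiter under a lock so that QPS and burst can change without restarting components:

```go
package ratelimiter

import (
	"context"
	"sync"

	"k8s.io/client-go/util/flowcontrol"
)

// Dynamic wraps a token bucket rate limiter and lets callers replace it at
// runtime, for example when the KubeVirt config changes.
type Dynamic struct {
	mu    sync.RWMutex
	inner flowcontrol.RateLimiter
}

func New(qps float32, burst int) *Dynamic {
	return &Dynamic{inner: flowcontrol.NewTokenBucketRateLimiter(qps, burst)}
}

// Set swaps in a new limiter with the updated values on the fly.
func (d *Dynamic) Set(qps float32, burst int) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.inner = flowcontrol.NewTokenBucketRateLimiter(qps, burst)
}

func (d *Dynamic) current() flowcontrol.RateLimiter {
	d.mu.RLock()
	defer d.mu.RUnlock()
	return d.inner
}

// The methods below make Dynamic usable wherever a flowcontrol.RateLimiter is
// expected, for example in rest.Config.RateLimiter.
func (d *Dynamic) TryAccept() bool                { return d.current().TryAccept() }
func (d *Dynamic) Accept()                        { d.current().Accept() }
func (d *Dynamic) Stop()                          { d.current().Stop() }
func (d *Dynamic) QPS() float32                   { return d.current().QPS() }
func (d *Dynamic) Wait(ctx context.Context) error { return d.current().Wait(ctx) }
```

The extra read lock per request is exactly the concern raised next in the discussion.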
B
Because it's dynamic, yeah; otherwise I would have to restart all the components, and that's slow. And are we worried at all about the locking or anything like that?
C
Okay, so yeah, we don't get slowed down. I guess we added a delay of one additional lock lookup, but it's an even shorter lookup than the default rate limiter does, so we have a very small delay in general per request, which should not really be measurable, but no additional throttling. You know what I mean.
A
C
Yeah, I mean, I could have exposed it also via command-line flags, for instance, or something like that. Okay.
E
C
You could just apply the changes and reboot the components, but the issue with that is, if I, for instance, tell virt-handler to reboot itself when it detects changes there, then we would have to add delays so that not all of them reboot at the same time and so on. When I just do it this way, it's an absolutely simple change, and fast and safe.
A
Okay, cool. I don't remember what we did internally for these; I know this one was bumped up, I just don't remember what it was.
C
When you look at the API docs, or rather the Kubernetes docs, you can read what the default values are for the command lines: for the kubelet it's 10/5, for the controllers 30/20, and for the API server they don't have it, because the API server is directly the receiver.

If you want to test with different values, we also have helper functions in the tests package to automate that easily. You can just fetch the KubeVirt config, change the values you want, and then we have an update-config-and-wait helper for propagation; once that function is done, you can run the test again and all components have the new value, guaranteed.
F
For example, I was thinking it's easier if you can just see whether it's throttling the requests, and how many requests are arriving, things like that.
A
You've got a metric you added for this, right? It was the rate limiting one or something like that.
C
Which metric? Yeah, there is the rate limit metric exposed.

So if you want to find the optimal value for a specific size, you could, for instance, just start with the default values and run the test; you would see the rate limiter kicking in in this metric. You could increase it, run the same test again, and keep increasing until the rate limiter doesn't get hit anymore. That would be a possibility, and that's probably what you talked about, right?
A
Yeah, so that's, what was it, 5963, Marcelo? I'm actually thinking that's another test we could do: we could change the QPS to see how it affects things. Maybe there could be a tell for us, like, okay, we're just being rate limited like crazy.

Let's see how the rate limit metric is affected. It can at least get us to a point where we could figure out, okay, what should we be at? It would be interesting, too, to see how this changes based on scale, how it moves based on our environment.
E
If we do those kinds of tests, we should also have, maybe, a separate dashboard for it, and also look at at least the API server metrics, to see at what point we put too much load on it versus what's better for us, because it's always going to be a balance.
A
So yeah, that'll be helpful. Okay. My only last question was, I don't know, maybe for Tomas: do you guys remember what we set the QPS and burst to for all of these?
H
I think it was around 30 or 40. I don't remember the exact number, but I think it was something around 30.
A
Was it just on the controller, or was it for...

Okay, all right. Well, I think maybe we can do some testing and see. I think we definitely need this; it's a question of whether these two values change. We can do this in testing and find out whether it should be higher than the defaults or whatever. Okay, cool, all right, thanks, Roman. All right, this is the last item, do we have...
F
Maybe the one related to the maximum number of VMs per node; I included it here. It was the last item, actually.

So, just to contextualize very quickly, I think we already discussed this: in Kubernetes, the kubelet has a pod limit, which defaults to 110, and we can increase that easily. However, virt-handler also has a parameter for that, the max devices, which is effectively the number of VMIs; even though the name doesn't suggest it, it implies the maximum number of VMIs per node.

However, the virt-operator that actually creates the virt-handler daemonset has some very strict reconciliation. So if we change the virt-handler daemonset, the virt-operator will overwrite that. So we cannot change things unless we apply something like, for example, HCO, which we are not using: it can patch and change the default values of the controllers the virt-operator manages. Roman pointed out some way to do that directly; I didn't check that, sorry.
C
When you go down a little bit more into the patch section here, you see, for instance, that you can just do a JSON patch on whichever controller, like described here. It's more or less similar to what HCO is doing; I think HCO is just passing it through to that section, but I'm not sure.
E
F
Yeah, so not sure; we can discuss whether it makes sense or not. It's similar to what we discussed, but this PR is actually also something David mentioned: I look up the max number of pods and use that as the maximum number of devices. Right now it's hard-coded to 110, but we can just look up that value for the maximum number of pods and use it; that's what the PR is doing.
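A hedged sketch of the lookup described here, deriving the device count from the node's pod capacity via the Kubernetes API; the function name, node name, and wiring are illustrative, not the PR's actual code:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// maxDevicesForNode derives a device count from the node's pod capacity
// instead of a hard-coded 110. Illustrative only.
func maxDevicesForNode(ctx context.Context, client kubernetes.Interface, nodeName string) (int64, error) {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return 0, err
	}
	pods := node.Status.Capacity[corev1.ResourcePods]
	return pods.Value(), nil
}

func main() {
	cfg, err := rest.InClusterConfig() // virt-handler runs as a daemonset pod
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	max, err := maxDevicesForNode(context.Background(), client, "node01") // node name is a placeholder
	if err != nil {
		panic(err)
	}
	fmt.Println("max devices:", max)
}
```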
C
I've checked the PR a little bit. Historically, my opinion was to just set the number very high, to one thousand or two thousand, instead of 110; I'm not sure whether that is an option with this approach. My main problem is only that we have a few edge cases which are hard to catch. One is when the default configuration path is changed, obviously, and the other one is when the value is changed while the daemonset is running; we're also not picking that up.
F
Yeah, that makes sense, especially because you see some tests are failing because it's complaining about the path of the file; for some reason the environment doesn't find the path of the file. And the point about changing the parameter makes sense too. I think I would kind of leave the priority with the flag, but you know.
C
F
C
It's not like one consumer requests one /dev/kvm and another one requests another /dev/kvm; basically we just have this number because the underlying device plugin API wants us to give it a number, a quantity, but there's no real limit behind it. I mean, we will see it on the PR. If in theory there could be some inefficiencies in the kubelet when we set it to a higher number, then we would probably just go with a thousand or something, but yeah.
B
F
E
F
C
D
A
All right, cool. We've got nine minutes left. Are there any other things we want to discuss?
F
Yeah, I had the Grafana dashboard one. So if you deploy kubevirtci now with Grafana, you can see the dashboard that I was using for the tests that I did.
C
When it's merged here, we have a periodic job which, twice a day I think, takes the latest release from kubevirtci and creates a PR in kubevirt.
F
C
A
So this would be, if I did... right now Prometheus isn't part of make cluster-up, right? So if I did make cluster-up and then checked this, I'd see it?
F
A
Oh yeah, that was what we had discussed on Slack: I found a few more, that's what you're talking about, right, like inside the controller, the workqueue metrics. There was, oh, I wish I had the link, let me see if I can find it; there was a ton of them that I found.
F
The retries one? I included that one, but I can double-check. So if you can just highlight those metrics again, I'll double-check that everything is in there.
A
Yeah, it was, I found it here.

Yeah, that's where I got a bunch of the descriptions for some of these, but you had most of them; there were just a few that were not there. Yeah.
F
A
Total number of retries handled by the workqueue, yeah, this one. And there's some other stuff, like rate limiter metrics and things in here; I don't know if we're hitting any of these, but there were some interesting ones, so yeah.
A
Okay, all right, thanks, Marcelo, that's pretty cool. All right, are there any other open items for the last minutes?
E
What I wanted to bring up was that I wanted to look at the amount of leftover goroutines, because it's quite concerning that (a) we're growing exponentially and (b) we have so many leftover ones. I've been looking at a few parts of this and other things, and I'm going to have a look.
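For that follow-up, a small sketch of one generic way to see where leftover goroutines are parked, using the standard runtime/pprof goroutine profile; this is plain Go tooling, not anything specific to the virt components:

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// dumpGoroutines prints the current goroutine count and an aggregated stack
// dump, which is usually enough to spot the code path that leaks goroutines
// as VMIs come and go.
func dumpGoroutines() {
	fmt.Fprintf(os.Stderr, "goroutines: %d\n", runtime.NumGoroutine())
	// debug=1 aggregates identical stacks; debug=2 would print every goroutine.
	_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 1)
}

func main() {
	// In a controller this could be wired to a signal handler or a debug
	// endpoint; here it simply dumps once per minute.
	for {
		dumpGoroutines()
		time.Sleep(time.Minute)
	}
}
```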
A
Okay, yeah, I've got that in this issue, Kevin: how the goroutines climb like a staircase when we scale.
E
Down, but also the growth seems disproportionate to the number of VMs we create. Yeah, okay.
A
Cool, okay, any other things? Four minutes left.
A
Okay, all right. Like I said, we can revisit the first topic from the meeting: I mentioned I'll be out next week. If folks want to have a meeting, David, you said you're okay with hosting it; that makes sense if you have some items you want to discuss. I'll leave it to you, and I'll send you the admin code so you have the ability to record and everything. If you have an agenda, then it makes sense to have it.