From YouTube: SIG - Performance and scale 2021-08-05
Description
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.lmkwmnkao9j0
A
Okay, all right. Everyone, welcome to SIG Scale, August 5th. The link to the doc is in the chat, so everyone, please add yourselves as attendees.

A
All right, so we've got a bunch of agenda items to go over, and before we do: how was last week? It seems like we had a bunch of things that were discussed.

A
Looks like things progressed pretty well. Any leftover items for this week? Assuming people added them, I'll bring them up, or is there anything we want to discuss from last week?

A
No? Okay, all right, we'll just keep going on then with today's agenda. Okay, and thanks, David, for hosting it, I really appreciate it. Okay, first item: discuss how to handle memory and CPU requests on pods for different scale requirements. A different scale requires different resource requests for the control plane components.
B
Great, so this is something that Ramon actually pointed out last time, to discuss in this meeting. So it's coming from the last meeting. Okay, yeah, so I was trying to look at the metrics.

B
You know, this, especially for memory, is fine. What's negative here means that the request is smaller than the usage. And yeah, although it looks small, it's significant for the CPU requests, and it shouldn't be like that. So, of course, we need to do some more tests to define what should be, at least, a desirable CPU request.
B
It shouldn't be too big, but it also shouldn't be too small in the way that I think is happening now. And Roman mentioned something else; I don't remember now what he mentioned, but one thing that this impacts for sure is scheduling. If it's requesting less than it uses, then we are placing it with the scheduler based on those requests, and that could be putting more interference on the workloads, and so on. So, again, I don't remember now; I'd want to hear that from Roman, but I think he's not here today.
A
I'm trying to understand: so we have these negative numbers here. Does this mean that we're not quite using what we've requested, or is it that we're using over?

A
I see, okay. I'm trying to, so this green line is virt-api. So this is, I don't understand, what's the lowest one we have? I can see the orange one here, is this... Yes, I'm assuming this is the controller.
A
And these numbers, I'm just trying to quantify: do we consider this very small? Just looking at them, the numbers look small, but I don't know, just based on what the metric is. Is this something we can write off, or is it something that's pretty significant?
B
Maybe on a smaller machine these values might be more significant. But yes, although it looks small, you know, I think it's something that maybe impacts the performance somehow.
A
Okay, so we go a little bit over. Okay, so maybe we need a little bit more investigation to understand the metric, to see what this is. There's also this: it doesn't go to zero, looking at this, right? Like it does here. Or even just these: our controller goes right back up to, I guess it looks like just underneath what it requests, and then we're...
B
Yeah, so this might be related to what Kevin mentioned before: when we clean up the cluster, it still has things going on, and it doesn't completely go back to the initial state, at least for a while. Maybe this is something that Kevin mentioned he's investigating, but maybe, you know, exactly.

B
Yeah, the interval between the experiments, maybe. It must be, because, I think Kevin will talk about that for sure, but maybe it was the garbage collector working, and yeah. We maybe need to wait more time between runs.
D
...the garbage collector, because we have a lot of JSON decoding, which creates a lot of resource objects that seem to get cleaned up during that time. But that's just an assumption based on flame charts and traces. So we should wait more time if we do step-by-step tests.
D
It was at full CPU load trying to do whatever; I didn't have profiling then, and it only came back down after I deleted all the VMs. I think it got stuck on something, I don't know.
A
So this garbage collection, to me it just, I don't know: does it sound to anyone like that's abnormal behavior? Like, we shouldn't be spending so much CPU and memory to do this garbage collection. I don't know what Kubernetes has, just to compare, but I mean, it seems like if you're working with JSON serialization or deserialization, that seems like something that's been optimized significantly.
D
Yeah, so in Kubernetes we had a few issues where we built something new that did a lot of decoding, and garbage collection was a huge issue at some point, and we had to rework that decoding to use different ways of solving the problem. But I think to some extent it's okay for us. We should, or could, still look at how much we decode, and whether our caches need to be the way they are.
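For context on what GC pressure from decoding looks like in practice, here is a minimal sketch (not KubeVirt code; the payload and loop are illustrative) of measuring how much garbage a decode-heavy loop generates:

```go
// Minimal sketch: quantify GC pressure caused by repeated JSON decoding.
// The payload and iteration count are illustrative, not KubeVirt code.
package main

import (
	"encoding/json"
	"fmt"
	"runtime"
)

func main() {
	payload := []byte(`{"kind":"VirtualMachineInstance","metadata":{"name":"vmi-0"}}`)

	var before runtime.MemStats
	runtime.ReadMemStats(&before)

	// Decode into a fresh map each iteration, mimicking cache refills
	// that allocate new objects instead of reusing them.
	for i := 0; i < 100000; i++ {
		var obj map[string]interface{}
		if err := json.Unmarshal(payload, &obj); err != nil {
			panic(err)
		}
	}

	var after runtime.MemStats
	runtime.ReadMemStats(&after)

	fmt.Printf("GC cycles during decode loop: %d\n", after.NumGC-before.NumGC)
	fmt.Printf("bytes allocated: %d\n", after.TotalAlloc-before.TotalAlloc)
	fmt.Printf("fraction of CPU spent in GC since start: %f\n", after.GCCPUFraction)
}
```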
A
Okay, well, we can talk about it more, Kevin, because I think you have a section here, when you bring up what your work is. Okay, so kind of the takeaway here is: we're using a little bit more CPU than we requested.

A
We need to further understand exactly what this means, so we can decide what we want to do next. Okay, okay, all right! Does that sound good, Marcelo? We'll go to the next one, then!
A
Okay, next is a bug fix enhancement. So this is, oh, this is you, Kevin. Okay, you're next, all right. We can just roll over into the discussion, then.
D
Yeah, I think the document is only the stuff I found; I took to using it as a notebook. I think everything there should be taken care of now. That was just the goroutines, but I shared a few of those snapshots showing, in general, what's going on in Slack, I think, yeah.
A
It was the block that we were doing here, in this routine, and it was just leaking? Yeah, okay, yeah! Okay! Is there anything more you want to say about this?
D
Just about the resource usage, maybe, a little, or the CPU load I saw after deletion, and the numbers we see on virt-handler. I think I only created 100 VMs, and then 300. We have to see how high we can get, but in general, we should look into whether we can optimize it; still, the numbers we reach on CPU load and also memory load seem pretty okay.

D
If you think about it, this thing is managing, let's say, 300 VMs per node; I think that's still a fair load, to some extent. I mean, the people actually running it would have to say that, but it doesn't look like we're doing something completely wrong, only small parts that we could optimize.
D
Let me pull up my snapshots, but I think we were still in the 0.0 or 0.1 areas of CPU load, and I think we can live with that, yeah. Even the high load afterwards is like 0.02 CPU load, and never more than, I don't know, 250 megabytes of RAM, which still sounds fair, as long as we don't see problems with stuff taking too long.
A
Okay, how should I classify this, then? So, just after deletion we have a little bit more load; we have some cleanup to do.

D
Yeah, the garbage collector gets load from deleting all the VM objects from its cache, I think.
B
If the load is too high, using too many CPUs for too long a time, then it might be an issue. But if it's just a little bit, and for five minutes...
D
I would say that, yeah. So obviously, a reason I would see to fix that would be if we see virt-handler not doing things fast enough because the garbage collector takes too many cycles, or it's taking away CPU from more important stuff, which I think it shouldn't. Like with the phase transition stuff you're investigating: if we see that virt-handler is severely impacted at a certain scale, that might be why. I'm pretty sure we have burst problems; that could cause it. But without any problems it's causing...
D
And in the real world, it's also only a problem if you have, like, it could only be a problem if your use case requires you to have a high turnover: you create a lot of VMs and you delete a lot of VMs, and you will have that load of cleaning up behind you. But if you just run a lot of VMs, it won't be one.
A
All right, there we go. Okay, I think that covers it. So yeah, we're basically seeing this at, I mean, usually 300, is this 300 tests? And we see a non-zero level. I mean, is this 200, or is this 100? 200, 200. It's just a blip above zero, and then it's fairly steady all the way in. I mean, 100 or less is probably what the majority use case is going to be here.
A
Good, okay. Let's go to the next one. Got you, Kevin: are you all set with this next one?
A
Okay, and the goroutines: so, in the latest, after your two patches merged, are we seeing no more leaks? I think after that second one you did another test, right? And you shouldn't see many leaks.
B
Yeah, it's still continuing. So actually, I think the fix, I think Ramon also mentioned that it's related to migration, and I don't remember the other thing. But, oh, you don't see any more leaks, yeah. This is after the...
D
Sometimes the virt-handler goes up on its own, and what I was looking at was that it's generally using a lot of goroutines, because, I don't know, we do a lot of stuff in the background: we watch a lot of stuff, the node labeller runs with 10 threads, stuff like that. But that's fine. I mean, we're still building programs that run threads; some are to be expected.
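As a rough illustration of the kind of check that distinguishes expected background goroutines from leaks, here is a minimal sketch; startWorkload is a hypothetical stand-in, not KubeVirt code:

```go
// Minimal sketch: detect goroutine leaks across an operation in a test.
// startWorkload is a hypothetical stand-in for the code under test.
package main

import (
	"fmt"
	"runtime"
	"time"
)

func startWorkload(done chan struct{}) {
	go func() {
		<-done // a well-behaved goroutine exits when signalled
	}()
}

func main() {
	before := runtime.NumGoroutine()

	done := make(chan struct{})
	startWorkload(done)
	close(done) // tear the workload down

	// Give exiting goroutines a moment to unwind before counting again.
	time.Sleep(100 * time.Millisecond)

	after := runtime.NumGoroutine()
	if after > before {
		fmt.Printf("possible leak: %d goroutines before, %d after\n", before, after)
	} else {
		fmt.Println("no goroutine growth observed")
	}
}
```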
A
Okay, great, all right, we'll go to the next one. So: a new metric to monitor request counts by resource and operation.
E
Yeah, so what I did here was add a new metric that's parsing URLs from the clients for all our control plane components. So when we do any sort of client operation, this hook intercepts it, parses the URL, and figures out what the resource was, the resource being a pod, a virtual machine instance, whatever, and then the operation: not the HTTP verb, but the actual Kubernetes operation, meaning a list, a watch, a get, a patch, or a put, or rather an update instead of a put, things like that.
E
So what we have now is that we can say, across our entire control plane: let's figure out how many gets we're doing on VirtualMachineInstance objects, or how many updates, for example, we're doing on VirtualMachineInstance objects.
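A minimal sketch of this kind of client hook, to make the mechanism concrete; the metric name, the transport type, and the path parsing below are illustrative assumptions, not KubeVirt's actual implementation:

```go
// Minimal sketch of a client-side hook that counts requests by resource and
// Kubernetes operation. All names here are illustrative.
package metrics

import (
	"net/http"
	"strings"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical metric; it would normally be registered with a registry.
var requestCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "rest_client_requests_by_operation_total",
		Help: "Client requests by resource and Kubernetes operation.",
	},
	[]string{"resource", "operation"},
)

type countingTransport struct{ next http.RoundTripper }

func (t *countingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	res, op := parse(req)
	requestCounter.WithLabelValues(res, op).Inc()
	return t.next.RoundTrip(req)
}

// parse maps a request onto (resource, Kubernetes operation). It handles
// paths like /apis/kubevirt.io/v1/namespaces/default/virtualmachineinstances/vmi-0.
func parse(req *http.Request) (string, string) {
	parts := strings.Split(strings.Trim(req.URL.Path, "/"), "/")
	if len(parts) >= 2 && parts[0] == "api" { // core group: /api/v1/...
		parts = parts[2:]
	} else if len(parts) >= 3 && parts[0] == "apis" { // named API group
		parts = parts[3:]
	}
	if len(parts) >= 2 && parts[0] == "namespaces" { // namespace scoping
		parts = parts[2:]
	}
	resource, named := "unknown", false
	if len(parts) > 0 {
		resource = parts[0]
		named = len(parts) > 1 // a trailing segment names one object
	}

	switch req.Method {
	case http.MethodPost:
		return resource, "create"
	case http.MethodPut:
		return resource, "update"
	case http.MethodPatch:
		return resource, "patch"
	case http.MethodDelete:
		return resource, "delete"
	case http.MethodGet:
		// LIST, WATCH and GET are all HTTP GETs; the URL disambiguates.
		if req.URL.Query().Get("watch") == "true" {
			return resource, "watch"
		}
		if named {
			return resource, "get"
		}
		return resource, "list"
	}
	return resource, "unknown"
}
```

A transport like this would be wrapped around the rest client's configuration so that every outgoing request passes through it.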
E
That gives us an idea, for our density tests, of how many writes we're doing for these objects, and we can figure out exactly which resources we're writing to the most, and things like that. And we can create thresholds to say: hey, we expect, in this density test, to call update and patch on virtual machine instances X number of times, and if we go over that, then we failed our threshold. So that's kind of the idea. I integrated this metric into the perf audit tool, and we can create thresholds and all that good stuff for it.
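The threshold idea reduces to comparing observed counts against expected maximums; a minimal sketch follows, with types and names that are illustrative rather than the perf audit tool's real API:

```go
// Minimal sketch of a threshold check over request counts, in the spirit of
// what the perf audit tool does. Types and numbers here are illustrative.
package audit

import "fmt"

// ResultKey identifies one counter, e.g. {"virtualmachineinstances", "patch"}.
type ResultKey struct {
	Resource  string
	Operation string
}

// CheckThresholds compares observed counts from a test run against expected
// maximums and returns an error for every exceeded threshold.
func CheckThresholds(observed, maximums map[ResultKey]int) []error {
	var errs []error
	for key, max := range maximums {
		if got := observed[key]; got > max {
			errs = append(errs, fmt.Errorf(
				"%s/%s: %d calls exceeds threshold %d",
				key.Resource, key.Operation, got, max))
		}
	}
	return errs
}
```

A density test would then fail if, say, the update count on virtualmachineinstances exceeded the budget recorded for a known-good run.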
E
And we've had mistakes occur here, where some subtle code path kind of causes an update storm or something like that with our VMIs, and that puts quite a bit more load on the API server, and it also impacts our time from creation to running. So we'd probably see lots of things occur when issues arise: when we have an update storm, we'll see that the time to get to running is increasing, but then we'll also see that certain thresholds around API calls will probably get hit as well.
B
Yeah, so it's pretty good; I would say this is nice. And I just commented, I made some few comments. It's not that the comments, well, maybe they're important, but it's just something to discuss. So the first thing is "operation".
B
So, I would say, the other metrics related to that, I don't know if it's super related, but in the same section, they actually call it "verbs", and they also have this list and watch. And I don't know, maybe we should also keep "verbs", to match the nomenclature that's already being used. And I don't know if it has "update", but I think it's still used, maybe as "put", in the HTTP nomenclature, yeah.
E
So "verb" there is referring to the HTTP, I guess, as opposed to "operation" or whatever; it's referring to the HTTP spec itself. So we're going to get puts, patches, gets, deletes, creates, things like that. Or, I'm sorry, not delete and create: you do get delete, but instead of create you have put, and things like that. So the reason this shows "operation" is, we can pick a different term or whatever, but I didn't want to confuse what we're getting here, since we're not getting the HTTP verb.

E
So when we look at the rest client request latency seconds metric, I think that one has a verb in it as well, and it's reporting the HTTP verb, not the Kubernetes verb.
E
Are we okay with these terms meaning different things for what look like similar client-type behavior or monitoring?
A
So this is the verb, so, the Kubernetes verb they were talking about: like, when you create something in Kubernetes, we have a create event in Kubernetes. Is this like that request being caught?

A
We call it a create; are we calling it like a post or something? Is that the confusion here? So you get a create and not a post, right?
E
I wouldn't call it an event, because an event is something specific in Kubernetes.
A
Okay, the verb will be all those creations. I was just saying, all the verbs, like create, list, get, are verbs here. That's all I wanted to say, but yeah.
B
Yeah, my last comment is just that, right, it's not very clear to me, the whole difference about this. I know that it's getting different information, but especially because you mentioned that the rest client request metric is for HTTP, and your metric is getting something else. So what is the something else? Which other protocol are you getting here that is not HTTP? It might be nice to describe the difference, just to be clear about the metric.
E
Yeah, we're getting the resource and then the Kubernetes verb; that's the difference. The rest client request latency seconds metric is just getting the HTTP method and then a kind of normalized URL, which kind of has the resource in it. But you can't do things like ask how many lists I did on that resource, or how many watches, and things like that, because those are all gets.
E
So a list, a watch, and a get, as far as the Kubernetes verbs go, are all the HTTP method GET, all three of those.

E
They get more information; they mean different things, right? Yep. Also, the latency seconds metric is not reporting watches, because a watch is a long-standing, long-poll HTTP request.
E
So we don't have any visibility into, for example, whether we were seeing lots of watches occur during our stress test, which would mean that informers or something are failing a lot and we're getting a lot of errors.
D
I'm a bit disappointed with the Kubernetes client that it doesn't give us that information on the request somehow, through context or so; instead, you have to do regex. I expected more from them, but I think I like what the metric is now.
B
However, maybe, you know, with this new metric we don't even need to care about this other one, isn't it? If you collect the latency right now, you are only counting...
E
...for that. Because I don't have a lot of experience with what's too much collection, and whether we need to be tighter or more restrictive about creating new metrics, or even re-evaluating the metrics that we have today to make sure that they are all valuable. I think there are some things, specifically around what we collect about every individual VMI, that might be pretty intensive as well, probably the most intensive, but I don't know. Something like: what should we be tracking?

E
Is there a way, if we wanted to understand this better, to measure the Prometheus load and whether we're bumping up into any sort of limits? And what would the limits even be? Are we talking about bandwidth limits, or are we talking about just the strain on the database of keeping all these time series? Or where would you...
B
...start to fall apart. I think all of this, yeah. Maybe it's not just the bandwidth, but, you know, when Prometheus is scraping the service, the service needs to bring up all these metrics, and if it's too much, maybe it's using too much memory, or too much CPU, to compute that. So this might be a problem in the service itself, and also for Prometheus.
A
Yeah, I mean, where do you think we might hit this issue? Because even with what we're doing now, at least the stuff that I'm aware of, I don't think we're really causing too much usage for Prometheus, I mean.
A
Like, you might run into trouble if you do things per VMI: say, for example, we were to track every single VMI and gather data from every single VMI, or something like that. You can run into some trouble there, at scale. Well, and I guess it depends, in some ways, on which ones, for example. Do we gather information on an individual VMI basis?
E
There's a lot of information that we expose in virt-handler that aggregates metrics for every VMI on the local node, and that's how it's exposed. I'm curious!
D
So far, the only load issues with Prometheus I saw were really storage-related: if you create too many labels, at some point you can't store a week of history, only three days, and at some point Prometheus just takes a while scraping and runs into timeouts, if the metrics endpoint gets to be too big a file. But that really only happens if you, I mean...
A
Like, ultimately, you can disable them; you don't have to scrape all of them, and we can always, if it's something we're doing too much of, it's not like we're totally going to be cornering anyone. But I mean, even the ones that we've worked on here, I don't think are bad; especially, like, the transition times that David did: the number of tags that we export there is not many.
D
Yeah, but as a rule of thumb, I think you can say: if you create a metric and the amount of labels and distinct label values is a fixed number, you're fine. And I think with this we are there; it's not growing, the amount of labels and label variations isn't growing with the amount of objects we have.
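That rule of thumb is easy to see side by side; a minimal sketch follows, with made-up metric names (the real KubeVirt metrics differ):

```go
// Minimal sketch of the cardinality rule of thumb. Metric names are made up.
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Bounded: "phase" only ever takes a handful of values, so the number of
// time series is fixed no matter how many VMIs exist.
var phaseTransitions = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "vmi_phase_transitions_total",
		Help: "Count of VMI phase transitions, by target phase.",
	},
	[]string{"phase"}, // e.g. Pending, Scheduling, Running, Succeeded
)

// Unbounded: keying a label on the object name creates one time series per
// VMI, so storage and scrape cost grow with the number of objects. This is
// the pattern to avoid, or to make optional, at large scale.
var perObjectMemory = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "vmi_memory_working_set_bytes",
		Help: "Working set per VMI; cardinality grows with object count.",
	},
	[]string{"namespace", "name"},
)
```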
A
Yeah, that's exactly when you run into trouble: when the number of tags you have depends on the number of objects you create, and then it can become overwhelming. So if that's not happening, it's fine. And then, again, people can always, you know, disable them, if it's something that's just very granular.
E
I was saying that we do something, so, "bad", I don't know if bad is the right word: we're doing some things that have potentially performance implications. I don't think this thing does, because the labels are pretty set; we're not going to get a lot of new ones. It's not an infinite number that gets created, but...
D
Yeah, with the dashboards and the Prometheus that we have in our test environments, it should also provide us metrics about Prometheus itself, and we can also have a look at those and see how much we kill Prometheus with some changes, and measure our impact that way as well.
A
Yeah, so this is something we can, I guess, kind of take away from this: it's something we keep in mind when we generate metrics, or if we're reviewing things like newly created metrics. If we have things where, I guess, the number of labels or tags scales with the number of objects created, we just need to be aware of that. And maybe, you know, it might be okay to have it in some cases.
B
I think maybe it might be a good idea to come up with a plan to analyze that, you see, every time that we do a scale test. I don't know how, so we need to think about that. Yeah, we could just verify, you know, whether we are getting too bad on that or not. And, you know, people can still keep introducing metrics, and if we have a way to evaluate that, we can raise it, you know.
E
Yeah, so what you're talking about, Marcelo, is essentially monitoring our monitoring, which, I think, makes a lot of sense. We're monitoring the load that our monitoring puts on the cluster. It's all load, no matter where it's coming from.
A
This might be a good one. I think, you know, if at some point we come up with some guidance in terms of how to run KubeVirt at scale, this would be a good one: say, okay, we have these metrics here, and, you know, maybe they're important, maybe there's a legitimate reason for them at smaller scale, but at a larger scale, like, if you're talking a thousand-plus nodes, you probably want to disable these, because they can affect your scale, or something.
B
Yeah, one thing I just came up with: Prometheus has this targets page, and it at least says how long it's taking to scrape, you know, the metrics, and we can maybe have a look at that. So if a target is taking too long to scrape metrics, that's something that can become a problem for latency. And also, we can check, like, how much the Prometheus database is increasing, you know, during the test, so...
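Both of those signals are exposed as ordinary metrics, so the check can be scripted; a minimal sketch, where the address and the two example queries are assumptions about a typical setup:

```go
// Minimal sketch: query Prometheus's own meta-metrics after a test run to
// watch the scrape cost. The address and queries are illustrative.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	for _, q := range []string{
		// How long each target takes to scrape; rising values suggest the
		// metrics endpoints are getting expensive to serve.
		"scrape_duration_seconds",
		// TSDB growth is visible through Prometheus's own metrics.
		"prometheus_tsdb_head_series",
	} {
		result, warnings, err := promAPI.Query(ctx, q, time.Now())
		if err != nil {
			panic(err)
		}
		if len(warnings) > 0 {
			fmt.Println("warnings:", warnings)
		}
		fmt.Printf("%s => %v\n", q, result)
	}
}
```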
E
And ultimately, I guess, what we care about here is the response time when we query Prometheus, is it not? So we want to know, isn't that where it would fall apart? I'm making stuff up, but my expectation would be that if I made a query to Prometheus to get some metrics, and it had to do a lot of really intensive calculations across the database, that would show up in the request latency of giving me back my results.
A
...a problem. So, I think, like, I have two notes here; I think this kind of captures it. First of all, let's just keep this in mind, because it's true: a lot of the essential pieces of what we're doing, of what we're measuring, go through Prometheus right now. So we need to be very conscious of what we're doing; that's true. We need to monitor our load.
A
So, whenever we see a situation where we're adding metrics and the number of tags or labels scales with the number of objects created, we just need to be aware of it. But I think the real way we kind of communicate this is when we talk about having a general guide on how to scale with KubeVirt.
A
I think that's where we capture this, as a best practice or something people need to watch out for. And that's, yeah, I mean, I think that's at least the best thing we can do, just to...
E
Like myself, yeah, it looks great. Oops, yeah.
B
Yeah, so it's, well, it's regarding the framework that we have been discussing for a while. David created the, you know, the profiler, no, not the profiler, but the report generator: the tool that collects metrics and generates a report. And now what's proposed here generates the load for the different tests. So we have the density test, but the idea is to add more tests later.
B
For example, you know, a stress test that has constant load and ramp-up, and keeps creating and deleting through the lifecycle of the VM, so creating and deleting in the system. And actually, those are the tests where we can see the system under stress, you know, and see how much pressure it can support. Anyway, so, and then, I think I saw that Kevin and you guys made some comments, so I will go through those and see.
E
I made a comment about this, and I don't think it's something that needs to be done immediately, but in the future, when we think about expanding this tool, and I haven't looked at this in great detail either, the one thing that I noticed is that it looks like it's got an internal way of structuring the VMIs, and the things that we have control over are primarily the image and things like that, which comes in as an argument to this. So, in the future...
E
...maybe we should look at a templating mechanism: something as simple as taking an existing VMI and knowing how to use that as the base for our load, and just creating lots and lots of VMIs, with maybe different names, from the same thing. Because we are probably going to want to begin load testing in different ways, like using different types of storage, or different types of CPU and memory, and maybe even topologies with that, like dedicated versus non-dedicated.
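The templating idea can stay very small; a minimal sketch, where the helper, the file path, and the counts are illustrative rather than the load generator's actual code:

```go
// Minimal sketch of the templating idea: take one VMI manifest as a base and
// stamp out N uniquely named copies. Helper names and the file path are
// illustrative, not the load generator's actual code.
package main

import (
	"fmt"
	"os"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/yaml"
)

// renderCopies decodes a base manifest and returns count copies, each with a
// unique name, leaving everything else (storage, CPU, topology) untouched.
func renderCopies(manifest []byte, count int) ([]*unstructured.Unstructured, error) {
	base := &unstructured.Unstructured{}
	if err := yaml.Unmarshal(manifest, &base.Object); err != nil {
		return nil, err
	}
	out := make([]*unstructured.Unstructured, 0, count)
	for i := 0; i < count; i++ {
		obj := base.DeepCopy()
		obj.SetName(fmt.Sprintf("%s-%d", base.GetName(), i))
		out = append(out, obj)
	}
	return out, nil
}

func main() {
	manifest, err := os.ReadFile("vmi-template.yaml") // hypothetical path
	if err != nil {
		panic(err)
	}
	objs, err := renderCopies(manifest, 300)
	if err != nil {
		panic(err)
	}
	// Each object would then be created through a dynamic client; printing
	// names stands in for that here.
	for _, o := range objs {
		fmt.Println(o.GetKind(), o.GetName())
	}
}
```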
B
Right, this is a good idea. So, like, I tried to start it as simple as possible, because, you know, with less experience, a big PR is hard to move forward with. However, yeah, I think this is a good idea. Maybe I can already change it, you know; you mean a template that has, like, the VMI YAML.
D
My first thought, and I commented that: right now it's VMIs, but don't we maybe also want to test VMs, to test the controller part, to load test that part as well? And I'm honestly surprised; I looked for a bit, and I was so sure there is something already providing that for Kubernetes, like you provide a folder of YAML templates and it does exactly what this is doing, just not specifically for VMIs. It just creates some X times and deletes some X times and does it over and over.
E
Kube-burner does that, but it does a lot more as well, and it doesn't have some of the tight integration with VMIs and VMs that we might want in the future. I mean, I've had decent results using kube-burner just with a VM, or, excuse me, a VMI template, and then creating a bunch of them and deleting a bunch of them. But when we look at creating lots of VMs specifically, we'll probably want to begin doing actions on those VMs.
E
Like, query a bunch of VM objects, start them all, then restart them all: things that are VM-specific would be difficult. And then we're going to probably look at migrations at some point in the future as well, being part of the density test. So I think, as much as I don't like writing our own code unless we really have to, I don't dislike the idea of creating our own tool to generate this, it being specific.
B
Yeah, I actually, like, just, you know, I was thinking to do that in the beginning, but then, yeah, I think it's good. And also, maybe I can do something, as you guys also mentioned:

B
Instead of calling them VMIs, have the YAML with the template and actually call it an object, and it can be whatever the template is, and then it's just: create me as many objects as we want.
E
Sure, yeah. I think the input config is the thing that's going to serve us great in the future, because I can see this getting really complex if we try to make a repeatable CLI command out of it. As far as the template thing goes, that's something we can follow up on. I want you to make progress on this and be able to get it in fairly quickly, so whatever you think is a minimum that's usable. Mm-hmm.
E
Then you just have a repeatable config that you run through here. And I want us to remain flexible with this stuff too: like, if we find that the structure, like, I don't want us to treat this as a versioned API immediately, yeah.
E
Be flexible, and if we figure out that we want to restructure things in the future, not try to figure out how to make things backwards-compatible or whatever. These are just tools to help us, so there's something.
A
Okay, all right, thanks, Marcelo. Yeah, oh, the other thing I brought up: eventually, like you have in here, we can eventually get to extending it to this stuff as well, which would be cool. Because I was kind of, like you have in here, we could also get to more config, and that's what Kevin was saying: like, if we could drop in a test file and configure something, we can do other types, and then you'll get all sorts of different tests and different results.
A
Yeah, this was just to summarize; I was just looking at this, and I need to wait for him, but well. The only thing, what I was talking about with this: one of the tests is, like, it's creating VMIs and measuring their performance, and then increasing the QPS, well, getting a baseline, increasing the QPS, and then doing it again and measuring the difference.
A
Yeah, and my take on this was that this is just a little complex, because there's just a lot that can happen when we're trying to measure performance here, in this test. And I'm kind of hoping that we do it entirely outside, in the tool, so that we can just, kind of, yeah, like, just so that we don't... My fear is, like, we don't...
A
...we should do it very deliberately, instead of kind of just in this one functional test. I think it doesn't always necessarily get the best results, just doing it like this, because what we're after is making sure that we're not getting rate limited, not necessarily the performance, because the QPS could change in the future, and all of a sudden this just breaks on us.
E
This specific test is meant to target whether the client's rate limiter configuration gets picked up. That's really all it's doing. We just want some sort of indication that, when we set values, some performance noticeably changes. It's all about the rate limiter configuration being propagated; that's all we care about.
E
The thing that Roman's doing here that gives me confidence that this will remain at least somewhat viable is that he's using percentages. So he's running some sort of scenario with one configuration for the limiter, and then he's making a pretty drastic change to that configuration, posting it, and running the same scenario, and just measuring the percentage of change, and he expects a certain percentage of change.
B
The test is flaky, so that's why we started to talk a little bit about that, and I made some suggestions about it; so, if you can, you're welcome to look. The last comment is actually related to what, you know, Ryan is saying: so, instead, we just check, like, how much slower it gets. For example, what's flaky here is that, actually, Roman's rule is, like, it should be five times slower, but actually it was only three times slower, you know, something like that. I don't remember exactly what it was, but something like that.
B
So then, instead of making, like, a true relative check, you know, which is maybe hard to verify, we could maybe count how many times the requests got throttled. You know, because, what I mean is, in the rest client it has this throttle latency logging, and when it hits those things, it writes it in the log. So maybe, just, you know, when we make the configuration stricter...
B
...we might see more of these things in the log, and then we can see that the requests are getting throttled, you know. And if we increase the throughput, we expect to actually maybe not see any of these things in the log. And then, you know, it's just a way to count whether it's doing better or worse, and we don't need to play with relative, you know, performance.

B
You know, like five or three times slower, things like that, that make it very tricky; the test gets flaky because of that.
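A minimal sketch of that counting approach; the log message substring is what recent client-go versions emit when the client-side rate limiter delays a request, but the exact text, and the log file name, should be treated as assumptions to verify against the client-go version in use:

```go
// Minimal sketch of the log-counting idea: scan component logs for client-go
// throttling messages instead of asserting on relative timings.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func countThrottledRequests(path string) (int, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()

	count := 0
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Assumed client-go message; pin it to the vendored version.
		if strings.Contains(scanner.Text(), "Throttling request took") {
			count++
		}
	}
	return count, scanner.Err()
}

func main() {
	// Hypothetical log file captured from virt-controller during the test.
	n, err := countThrottledRequests("virt-controller.log")
	if err != nil {
		panic(err)
	}
	fmt.Printf("requests throttled by the client rate limiter: %d\n", n)
	// A test could then assert: a stricter QPS config makes n rise well
	// above the baseline, and a relaxed config keeps n at or near zero.
}
```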
A
Okay, well, that's something we can just go through with Roman when he's back. Okay, we are almost out of time here, so we can cover the last two, probably, hopefully, pretty quickly. The next one is the performance thresholds. So I wrote this originally, and then I just wired it up to the audit tool Dave wrote. Basically, the only thing I want to say about this is that, so, I took your density test...
A
I kind of split it up a little bit, into a few different things; well, a few additions. One of them is that we take a look at the Prometheus data: we reach out to Prometheus.
A
We run the audit tool after we run the test. And then I took your test, and I took some of your common functions and split them out, so that things that are for the framework are just common functions in here, and then things that are for the VMI we can do in here. And kind of like you were saying earlier: when we can generate VMIs from a template, that would be cool here.
A
Actually, I was thinking that would make these tests even easier to write. But we're going to go back to your density test. So once I have it hooked up, where's your density test, right here: yeah, I mean, this will just run the same, and then we just get the information at the end, by just running the perf audit tool, and that's it.
A
So I'm still testing this; I'm just fighting with the cluster-up now, to make sure everything is looking correct, but this is just the work-in-progress PR that I have right now.
A
So I have, yeah, right before each test I get a time, and at the end I get the time again.

E
The period, it does help.
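Those two timestamps are what make the later audit query precise; a minimal sketch of a range query bounded by them, where the address, the query, and the step are illustrative:

```go
// Minimal sketch of why the start/end timestamps matter: they bound the
// Prometheus range query the audit step runs after the test. Address, query
// and step are illustrative.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	start := time.Now() // recorded right before the test body
	runDensityTest()    // hypothetical stand-in for the actual test
	end := time.Now()   // recorded right after it finishes

	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	// Only the samples inside the test window are audited.
	result, _, err := promAPI.QueryRange(context.Background(),
		`sum(rate(rest_client_requests_total[1m]))`,
		v1.Range{Start: start, End: end, Step: 15 * time.Second})
	if err != nil {
		panic(err)
	}
	fmt.Println(result)
}

func runDensityTest() { time.Sleep(2 * time.Second) }
```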
A
Yeah, so, yeah, then that'll get us our metrics and make them available, and then we can see. There's, like, so much I could see we could do with this. Like, I mean, I could see us, if we could have it take a Grafana snapshot or something, and then have it available, that'd be so cool too. But yeah, I mean, this will get us something that wires together, so that we can very quickly create more of these.
E
Marcelo, in the future, do you see your performance tool, the strategy...
A
It says, what, yes, what you're saying is, like: we'd basically replace this with the tool, like, we'd basically make a call out, we'd wire this up to the tool or something. Or would we not? Like, I mean, is it that our density test would be triggered by something, like, you know, in here, like we do, or would we run it...?
A
Well, I guess, I mean, I guess it doesn't really matter; we could talk about it when we get there. But I guess, what we want is for this eventually to be run in CI. So your load, I mean, this is basically your load test, right, and this is our gather-information step, our audit. So yeah, I mean, this would basically be just kind of the way it is now.
A
We would just call out to the tool right here, replace this part entirely with your tool, and then, yeah, and then we do our audit at the end. So it looks something like that. Okay, okay, plenty of...
A
On these, on these PRs, then, I might just wait for yours to go in, and then I can decide what to do: probably just pull this out, or maybe I'll just wire it right up to what you do.