From YouTube: SIG - Performance and scale 2022-03-10
Description
Meeting Notes:
https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.tybh
A
Okay, welcome to SIG Scale. It's March 10th. I'll put the meeting notes in the chat; please add yourself to the attendee list. The first thing is just an announcement: the next meeting for SIG Scale will be April 7th, so three weeks from now. I'm going to be out of the office for three weeks, so that will be the next time I'm back. Let me actually double-check... I said April 7th; one, two, three weeks; yeah, that's right.

A
Okay, so I'll be back in the office at the beginning of April, so April 7th is right. I'll put something on the mailing list again as a reminder, but that's when we'll meet again.
A
Okay, some PRs. I want to get to a few of these. The first one, which I've brought up a few times: I just want to decide on the fate of this PR, see what we want to do with it.

A
It's been here for a little while, and we've had a little bit of review on it. Do people feel comfortable merging this as is, or should I mark it as work in progress, and we spend time with our tests and our test framework and build this PR out over time? Is there any preference for that, or do we want to just merge it now? What do people think?
A
All right, I'll have to ask; I'll get you to attack it then. Okay, that's fine, we'll go with that. Okay, second one: the load generator. I wanted to spend a minute or two on this because I made another change to it. I changed the interfaces a little bit to make them, I think, a little more friendly for different types of jobs that we could add. So what I did originally was design a few interfaces.
A
Let me see, for a job. Originally I called it a load generator, but now I call it a job. These are the things I consider necessary to manage any type of workload.

A
I'm thinking that eventually we could allow different types of jobs to override this stuff. For instance, I could see different steady-state jobs that handle the refill differently, or handle the delete differently, and so we can override these things. Then there is the interface for actually doing these things; I just use it with a Run and a Delete, and that's it. So I changed the API a little bit.
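A minimal sketch of the kind of job interface being described here, assuming Go and two methods only; the names are illustrative and not necessarily what the PR uses:

```go
package loadgen // hypothetical package name

import "context"

// Job is a sketch of the interface described above: anything the load
// generator needs in order to manage a workload. The real PR may use
// different names or extra methods; only Run and Delete are mentioned.
type Job interface {
	// Run drives one pass of the workload (for example, creating VMs).
	Run(ctx context.Context) error
	// Delete tears the workload back down (for example, deleting the
	// churned VMs).
	Delete(ctx context.Context) error
}
```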
A
I think it's a little bit easier to use based on my previous change. And then the other change I added is, let's see here, in the steady-state job.
A
It's just a min churn sleep. This is a configurable amount of time that we're going to wait, using that wait API I just highlighted, every time we go through this. Let me see if I can find the steady state; it's easier to show this.
A
Steady state, okay, here. So every time we go through an iteration, my expectation is that we'll do some sort of wait in between the creates and deletes. This could be any value, and it could vary quite a bit depending on how we want to run this job, but I simply have it as a configurable.
A
So what it'll do is calculate the amount of time spent creating and subtract that from the configured value. Say it took 20 seconds to do the creates and I set my min churn sleep to 30 seconds: it'll then sleep for the remainder of that time, so 10 seconds. That gives us just a little bit of a buffer between creates and deletes. And if we're creating right up until the point that we need to delete, then that's fine as well. If you really wanted control over this, you could set the value very high if you want some sort of sleep in between during the periods of churn. So it's configurable; I think that'll be a better interface for testing.
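A small sketch of the remainder-sleep behaviour described above, assuming Go; the field and function names are assumptions, not the PR's actual code:

```go
package loadgen // hypothetical package name

import "time"

// sleepRemainder waits for whatever is left of minChurnSleep after the
// creation phase. If creating already took longer than minChurnSleep,
// it does not sleep at all.
func sleepRemainder(createStart time.Time, minChurnSleep time.Duration) {
	elapsed := time.Since(createStart) // e.g. 20s spent creating
	if remaining := minChurnSleep - elapsed; remaining > 0 {
		time.Sleep(remaining) // e.g. 30s - 20s = 10s
	}
}
```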
A
Okay, I've been doing some testing with this internally; so far, so good. I still have some ideas on how to explain it, but so far I like it. Eventually I'll take this and expand it to what we have right now with the burst job. I'd like to include it as a periodic, but I want to get the configuration right first.
A
I think once I have the right configuration, because I haven't tested this at the same scale that we would run that job at, so I want to make sure I can configure this correctly. I don't have that yet; once I do, I'll make the change to the periodic to add it for steady state. But here's what the results look like. You can see we get this steady create. This is with a churn of two, so we create five, we delete two.
A
Yeah, okay, all right. The only thing I wanted to bring attention to is those changes. Marcelo, you're the one who mostly reviewed this, but if you have any other comments, let me know.
B
I think it's mostly ready, yeah. The only thing I commented on is some hard-coded parameters. I really think we should avoid hard-coded things. If we have flags, maybe; and if you don't want to make it configurable in the YAML, maybe have a flag. I don't know; I'm just against hard-coding.
A
My thought on this, Marcelo, is that I wanted to find a value that is not likely to change. I understand what you're saying: with some tests you want to make it different, and that's totally fair. But if we set this to certain values, I expect that the majority of tests will not change them, and that's kind of what I wanted to do. With 20...
A
It seems fairly reasonable that it's not going to change, but I totally respect that it could. What I'd like to see is: if we get to that use case, then maybe we increase this value. Or if we find that the use case is that we constantly need to change this in tests, because we've done a number of tests and we see a huge difference based on cluster size and other things, then yeah.
B
Zero, for example. It should have rate limiting in the sense of an interval between creations, but the client itself shouldn't have rate limits, because if it does, then, for example, we are not doing lists now, but if you do a list or a get or anything, it shouldn't hit any rate limit, from a benchmarking standpoint. That's why in the beginning it had this global configuration and also the per-job rate limit configuration.
B
The global configuration is for the client in general, and by default it was using zero, so no rate limit for all the requests.

B
I was using the library's rate limiter, actually, to put some wait between creations, so that I could configure, for example, 20 create requests per second, or 10 create requests per second. I had some logic for that before.
B
I don't remember if you changed that or not. I'm just thinking, because now that you removed this global configuration and put 20 here, it will impact everything, not only create: gets, lists, whatever will also be rate limited at 20. And I'm thinking about the global case as well.
B
So if you, for example, run two jobs in parallel, they will also be rate limited at 20 between themselves. That's why I think the client shouldn't have a rate limit, but you can control the creations per second in the code. That's what I'm saying.
A
So you're saying this should be zero, and then I should rate limit with some other mechanism when we're actually going through and doing the creates, so after each create I should have something to rate limit against. Yeah.
B
In that case you can make the default value zero, or maybe it can still be configurable, but anyway I think the default should be zero for the global client configuration. Then you should have some control for creations, for the deletes, and a control for updates. For the steady state you kind of have this control with the sleeps.
B
But there is a library you can use, it's called a rate limiter, and you just configure burst and requests per second, and then it does the wait for you. I thought I implemented that in the original load generator, but I'm not sure now. So if you didn't see it, maybe it was not there.
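This is presumably a token-bucket limiter like golang.org/x/time/rate. A minimal sketch of pacing only the create calls with it, independent of any client-wide limit; the function name, parameters, and the placeholder create call are assumptions:

```go
package loadgen // hypothetical package name

import (
	"context"

	"golang.org/x/time/rate"
)

// createPaced issues count creates, pacing them at createQPS requests per
// second with the given burst, instead of relying on a client-wide limit.
func createPaced(ctx context.Context, createQPS float64, burst, count int) error {
	limiter := rate.NewLimiter(rate.Limit(createQPS), burst)
	for i := 0; i < count; i++ {
		// Wait blocks until the limiter allows the next create.
		if err := limiter.Wait(ctx); err != nil {
			return err
		}
		// ... issue the actual create request here ...
	}
	return nil
}
```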
A
Yeah, okay, I see what you mean; I do it with waits. It is a good question how that interface should look, but this just seemed like an easy way to do it. I can look into it, but I think I agree with you that this should be zero; we shouldn't rate limit at all here.
A
No limit, yeah, I guess that makes sense. Okay, so I could do zero for this, and then we'll leave it. My approach was that I just wanted to try it and see how this goes, because what I have seemed fairly basic.
A
So I'll make this zero. The sleep kind of works, but I totally acknowledge that there could be a better way. I kind of want to revisit that, though; if we can make it more powerful, I definitely want to revisit it. I kind of want to let it evolve as we build up the use cases and use it more.
A
Yeah, okay, I'll change this to zero. Okay, sounds good, cool. And then here's the config I wrote to go with it. Marcelo, I forget if I added you; I did, okay, and I added Daniel too. So when you get a chance, this will go with it; it'll make it work the same way that you have now with your burst test.
A
Okay, sounds good. So the other thing I had for this meeting: I wanted to do a little design, because one of the things we've talked about is this tool where we generate load, and we've talked about a tool we have to audit.
A
The other part we've kind of talked about is how we can measure pressure. That has been our larger question, and we talked about it within this presentation, and this is kind of the way I wanted to think about it.
A
We need to know certain things about the cluster, like how many nodes it has, and how many other things it has that could be causing pressure, so that when we do tests we know the difference: we know how much pressure we're causing with our workload.
A
Does that make sense to people as a way to measure? Basically, what this would do is give us a way to understand, when someone is running our tool with load generation and auditing and they're telling us "here's what our scale is", what their cluster topology is, so that we can put the topology against the numbers and understand what their scale or their performance actually means.
B
Yeah, I'm not sure if I got it. You mean to visualize and analyze the data? Because, for example, you have a cluster with 100 nodes, and then you create some pressure. For example, you want to do a burst test creating 1000 VMs, or the steady state, and you configure it, say, creating 500 with a churn of 20, and then you run this analysis, and then you check the...
B
For example, you need to check some SLO, like the VM creation time. If you see that the creation time is too high, you need to decrease the pressure, because the cluster doesn't hold that kind of pressure. But I don't know what kind of topology the benchmark tool should analyze.
A
Say we want to publish SLOs. We say that in a given release, say 0.51, this is what the SLO is: we expect you to be able to do a thousand VM start-to-create, or create-to-running, times, each VM in less than 20 seconds on average, or something like that. That's our SLO, and then someone goes and runs this in their data center and they don't hit the SLO.

A
Why didn't they hit it? Well, maybe they had a hundred thousand PVCs just sitting there. Maybe they had a thousand namespaces. Their topology is totally different from what we're testing in, so their SLO is totally different. It's not even a fair comparison.
B
Yeah, I don't remember now, but I think Kubernetes in their documentation doesn't put numbers either; they just describe what the SLO is. For example, for the VM creation time, we describe what that creation time means: the VM is in the running state, which means the libvirt domain got created and then received the run command, things like that. So we describe scenarios; we don't need to state numbers.

B
So if we have a large batch of VMs, a big burst of VMs, in the worst-case scenario some VMs will be slower because they are waiting in the work queues, things like that, and of course there are many more things behind the scenes that slow down VM creation. Right now we cannot guarantee, and we also should not guarantee, numbers in our official documentation.
B
So we can just say what you should consider regarding latency in the official documentation. Then, if we have some report, for example an NVIDIA report, or a Red Hat or IBM report, and in that report we say, okay, these are the numbers that we measured in our environment, that should be fine. But not on GitHub in the official KubeVirt documentation.
C
Yeah, just one question here. I did not hear everything, so you may have answered it already. Wouldn't it still make sense to come up with some numbers for the hardware we have, to state: this is what we want to have, and this is not a regression? Just to see if we have non-hardware regressions.
B
Yeah, so that can be a discussion. I was just thinking about what Ryan was saying when we described the SLOs. In the document where we describe the SLOs, I don't think we should put numbers; we should just say what the SLOs are and how we can measure them, and then later, as I commented... I don't know, you guys can disagree with that.
B
It's fine if, for example, we have the KubeVirt blog and we describe what we see in our KubeVirt CI, on the hardware that we have, with the SLOs that we defined and the numbers that we see, or some other experiments that we might run. We can report that as a kind of report.
B
But maybe we don't need to state official numbers in our SLO document and say, okay, creating a VM should be lower than this.
A
Yeah, I understand what you're saying. I guess what I'm saying is that, theoretically, it might be possible, because, assuming this theory that we can measure a cluster's pressure, if we're able to quantify it, then that should be a consistent number. I could say, for instance, our CI system has this measurable amount of pressure at rest.
A
Then, the moment we run our tests, this is what we'd expect within some plus-or-minus range for performance, and this is what you should get, and that would give us a lot of confidence. That would be very similar to what I think you're saying, which is, for example, if we were to just say in our CI here's the performance we expect; I would say it's sort of the same thing.
A
It's the same thing as that, except when someone else, outside of the CI environment, wants to run this load-generation test and audit tool and do a performance test, they're going to see different numbers, right? So what's changed? Well, the only thing that we could give them to tell what's changed is a measure of their pressure at rest, so that they can know that, okay, their cluster is different.
A
Here's how it's different, and if we know that, we might be able to estimate what their expected performance would be within some range, if we had that number. In other words, instead of just documenting what we've tested, I'm saying we might be able to provide a way for other people to estimate for themselves.
A
Do you follow me at all, or do you disagree with that? I would say it's a little more difficult, but I think it's the same thing; it solves the same problem you're describing, which is: we want to have a number that we could state for our CI, the performance we expect with our CI, but we could also estimate other different forms of performance based on pressure.
A
We could gather data like that. For instance, say internally NVIDIA had some high pressure, totally different from CI. It might provide more justification for the performance that we're seeing, which might be totally different from CI, and it might just be that because our pressure number is higher, our performance number is lower, or something like that.
B
Yeah, so I get it; you want to have some baseline. I think a baseline is fine. I'm just thinking that maybe we are not ready yet to have baselines. It should be fine if we create a report or a blog showing these results, but to put down what the target pressure is that we want to have... especially because the CI is very small, so we can't even say much about scalability there.
A
What I'd say about this, Marcelo, is: this would actually be good to test in CI. I think this would help the theory: if we test consistently in CI, the way we do now, on a clean cluster, we get the same results.
A
We can put those points on a line. Whereas if we don't do that, then what we're doing is just putting numbers out there and saying, okay, here's what we see in CI, here's what you should expect, but we're not comparing it against anything. So we could do this; we could actually test the pressure in CI as well.
B
I tested the pressure in the CI before; I did the presentation on it, to find, for example, what's the maximum number of VMs that we can create with tiny VMs. I could create around 500 VMs per node. Okay, we have only three nodes, but it was a big pressure; it was very slow with 500 and it reached some limits. It was a huge pressure, so that was the maximum, but that was with tiny VMs.
B
Deep performance evaluation, I think, should be in some other system. In the CI we have this test, and the results we have there we can understand as a baseline, especially to compare how things evolve with the code, because people are changing the code and we need a way to verify whether the performance is being impacted or not.
B
Things like that. But finding the limits that you're describing now, that's something we are doing internally at Red Hat with this kind of testing. Maybe we can find another cluster, and if it's possible to publish some of this data and make it public, then we can write a blog or create a report saying what the limits are.
B
And what pressure we see. As I'm saying, I think this kind of analysis is a deep performance evaluation on a specific cluster. The CI shouldn't have this deep performance evaluation with an extreme stress test; it should be tests that we can reproduce and use to see how the code evolves.
A
Well, what's interesting to me is that it might be reproducible. That's what you saw, right, when you did your pressure tests: you probably saw that at 500 VMs on a node it was consistently slow. It was not unexpectedly fast; it was always slow like that. That's what I'm saying: defining those expectations, and then seeing whether they're predictable.
A
If those are predictable, then we might be able to measure them, in the same way that we're measuring at low pressure.
A
Because, honestly, sometimes things like code changes can have no effect at low pressure but have an effect at higher pressure. So testing the extremes might not be a bad idea; we might find things that are different with this type of test. But really the important thing here is: is it predictable? That's the question. Can we predict what is going to happen?
A
Based on the pressure. And I think, just from some anecdotal evidence from your testing, I've seen it myself, and I think even this presentation talks about it, that you can predict it: it is predictably slow. But is it quantifiable?
A
Can we put it into a mathematical equation so that we can actually measure it and plot it? If we can, then we could measure it and test both extremes, because we might get different results based on code changes, so it may be useful in CI just for that reason. Do you agree or disagree?
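Purely as an illustration of what such an equation could look like (this is not something proposed or agreed in the meeting; the weights and the choice of terms are assumptions), the simplest model would be a weighted sum of the object counts discussed elsewhere in the call:

\[
P_{\text{rest}} \approx w_{\text{nodes}} N_{\text{nodes}} + w_{\text{pods}} N_{\text{pods}} + w_{\text{pvc}} N_{\text{pvc}} + w_{\text{ns}} N_{\text{ns}}
\]

where the weights would have to be fitted from repeated, controlled runs before the number could be used to compare clusters.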
B
I'm just saying we need to be careful with the kind of tests we put there. Right now we have a test that I think is already a lot of stress: it's a small cluster of only three nodes, and it's creating 600 VMs, so 200 per node.
B
Officially, I think OpenShift recommends something like 250 VMs per node, so 200 VMs per node is already very high, and we have this test and it's slow. So if we compare creating 600 VMs against creating 100... I don't remember the exact times now, but let me check; I have it here. So for 100 VMs it's less than one minute, and when we have 600 VMs we reach the...
B
It's loading, but anyway, we can see that. Can you guys see it? Yeah. For some reason it's super slow now; it might be my internet. I don't know what this is.
B
Okay, this is the Zoom panel here, so you're probably not seeing it open. Anyway, we can see here the VM creation time. These are many tests; some old executions were reaching up to 10 minutes, but later, if you look at the first graph at the top, they start to be around five minutes.
B
In the worst-case scenario. So something changed in the code that made it better, and I'm actually planning to write maybe a blog about these results, something I will try to do later; I don't have the time now, but I will do that.
A
So you saw that there was an improvement, and it was noticeable when you ran it at this amount of pressure; there was a noticeable improvement. Exactly.
B
And then we see, for creation, the worst-case scenario is like five minutes.

B
Yeah, not five, eight minutes. Okay, we can see here there are some spikes, some variations, and we don't know what those variations are.
B
The tests here are not all the same, because sometimes we run the test creating only 100 VMIs, and other times we range through 200, 400, and up to 600 VMI creations. So we're not sure if we can say there was a performance improvement here, because there is some variation. We need to check that over a longer term to see whether we can trust these improvements or not.
B
So if we look here, for example, when we are creating 600 VMs it's taking, in the worst-case scenario, eight minutes, which is very high, so I think we already see some pressure here.
B
Of course we can push the limit even further, to maybe 1000, because I think the cluster is fine up to 1000. More than 1000 it doesn't create; it actually breaks.
B
It reached some limits. As far as I know, before enabling the jobs I was trying to do this performance analysis, and I have documents for that. They should be open; I will open them and share them again.
B
It was a long time ago that I stopped that documentation, but anyway, when creating more than one thousand I think it was reaching some limit; I don't remember now which one. Theoretically we can create 400 per node, so yes, we can create 1200 maximum, but with 400 per node it already puts a lot of pressure because it's creating too many pods per node. So that's another limit.
B
The container runtimes start to be overloaded because they're creating too many containers, and then we start to hit other bottlenecks that are not KubeVirt. That's what I'm saying: up to 1000 should be fine in this cluster that we have, without seeing other bottlenecks that are not related to KubeVirt. And then we can analyze the metrics, especially the KubeVirt work queue; I saw something interesting there.
B
I want to point out that we have this virt-controller work queue; it should maybe have less pressure now, because there are some PRs that might reduce the number of gets, that kind of thing. It might be interesting, if those PRs get merged, to see things like this in the KubeVirt components.
A
Like you're saying with KubeVirt pressure, it would be interesting to see if we're within the range that Kubernetes expects with pods, because, like you said, if we're just loading up nodes, Kubernetes already knows that's a lot of pressure, so we expect...
A
We expect what you're seeing, right? And so, I guess what would be interesting to see is: if we have, say, three nodes or whatever, and you're loading them up to 300 or so, what would the...
A
What would your expected performance be for Kubernetes? What would we expect it to provide? We might be able to measure that, and then: does KubeVirt add any pressure? If we're measuring it consistently, we should know whether it does at this pressure.
A
Because at a baseline rate we might see it a little bit, but it might not be noticeable; we'll definitely see it here if KubeVirt is adding anything, especially as it changes between code changes, like you were saying earlier. If we saw a code change and it had some sort of improvement, any improvement would be amplified here, and anything that made it worse would be amplified as well.
B
It depends on the pressure. At that point it's no longer the KubeVirt components where we'd see a performance problem; it will be something else, the container runtime or the kubelet adding pressure, things like that, because we are already officially...
B
I don't remember now how many pods per node Kubernetes officially says they support; the default is 110, isn't it? And I think the document you showed before also recommends around 100. OpenShift recommends, I think, 200-something by default, and more than that we are beyond the limit. So we should be careful, that's what I'm saying, because maybe we will not see what we want to measure.
A
I agree, within reason; I think that's sort of the limit here, within reason. But a higher-pressure job can yield some new information that could be helpful to us. Yeah, I think so. And I like the other example you showed, that's really interesting. Okay, I wrote that down; I added it. I think it would be cool to see if we could do a high-pressure job, just to see what the measurements are there.
B
Okay, but we see things that are getting better. For example, here, in the write request duration, it's the delete. Before, we can see here... what's this date? It looks like March 5th, yeah, okay, so it's March 5th, and we see the delete was getting as high as three seconds; it depends, it was varying.
A
It would be interesting, Marcelo, to see what happened in the cases where it was high. I guess what I'm wondering is what would cause the delete to be high and what would cause it to be low. Because it's possible that between these tests there's no code change, or maybe there is, but maybe there isn't; and it also could be the way the job is run, maybe there's something that's different.
A
The pressure may have changed in such a way that one of these tipped it over; that could also be the case. That's why I was saying it would be interesting to know the pressure right when you measure each of these. If there were a way to do that, it would provide a lot more information than just looking at the graphs; we could have a little bit more than "okay, the code has changed".
A
It could be that, oh, wait a second, the pressure has actually changed, and maybe it wasn't the code. What do I mean by the pressure? What I'm saying is: I don't know the tests that you've run here, but let's say that each of those little bars is a different test.
A
Well, yeah, okay, I guess I'm just proposing a theory: maybe the pressure is different. It is possible, maybe in your case that could be what's happening, but in any general test it would be good to know what the pressure is at rest, because then we would know, if there is a difference between two tests...
B
So what happens here in the test is: it starts, deploys KubeVirt, runs the tests, and undeploys KubeVirt, and then it waits. Each test is just running once a day, so the next day it does the same. It sits idle for hours in between, and we have just two jobs: one that creates 100 VMIs, and another one that ranges through 200, 400, and 600.
A
Yeah, I understand, Marcelo; you have it set up so that the tests should all be clean in between, right? I understand. All I'm saying is that it's another data point. If you are doing testing in your data center and you're not doing this, if you just want to test whenever, you'd want to know, as another data point for your test, what your pressure was when you tested.
B
Yeah, I think so. I think the tool that we are writing can cover many different tests; I'm just thinking that maybe the KubeVirt CI shouldn't have too many stress tests, unless they're really essential to see some performance problem that we want to see.
B
And, as we were saying before, a stress test that tests Kubernetes objects, I think those kinds of tests shouldn't be in the KubeVirt CI; it should only test the KubeVirt objects.
A
Okay, sure. I think that's all; let me see, yeah, that's all I have. So I wrote this down; it would be good to have at some point, and we can talk about it in the future, but I think we covered the topic pretty well. It's something to think about; it would be interesting to see if this is something we could do. Like I said, I don't know.
A
I don't know if this is quantifiable, it's hard to say, but it would be interesting to know. It might be a good data point that we can add when we're talking about SLOs and we're telling people to measure their clusters, something helpful for people.
B
When it's idle, so no pressure, isn't it? So it's just to check resource usage, that kind of thing?
A
Requests, yeah, right. You could have one API server and a thousand nodes, and that's quite a bit of pressure for one API server. So there's a number of things like that which would affect the numbers that you're seeing. And the way the tools we have work is for people to use them to test.
A
Yeah, that's basically what I'm saying: this is something helpful. It's about stability.
A
Well, that one might be difficult. Again, it depends how we define pressure, but I was thinking of things that we know how to find. Some of the things listed here are all different forms of it: the number of nodes, the number of pods, the number of PVCs, the number of namespaces.

A
Those could be numerical values, and then during our test you could measure pressure again and say, now we have a new number of VMs on top, and so on, so our pressure values should change. Based on the pressure value during the test, we might be able to predict what you would expect for performance, just from the amount of pressure.
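As an illustration only of the kind of "pressure at rest" snapshot being discussed (this is not an agreed design; the struct, function name, and the choice of counts are assumptions), counting those objects with client-go might look like this:

```go
package loadgen // hypothetical package name

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// PressureSnapshot holds the raw object counts mentioned above: nodes,
// pods, PVCs, and namespaces present in the cluster at rest.
type PressureSnapshot struct {
	Nodes, Pods, PVCs, Namespaces int
}

// snapshot lists the objects across all namespaces and returns their counts.
func snapshot(ctx context.Context, c kubernetes.Interface) (PressureSnapshot, error) {
	var s PressureSnapshot
	nodes, err := c.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return s, err
	}
	pods, err := c.CoreV1().Pods("").List(ctx, metav1.ListOptions{})
	if err != nil {
		return s, err
	}
	pvcs, err := c.CoreV1().PersistentVolumeClaims("").List(ctx, metav1.ListOptions{})
	if err != nil {
		return s, err
	}
	nss, err := c.CoreV1().Namespaces().List(ctx, metav1.ListOptions{})
	if err != nil {
		return s, err
	}
	s = PressureSnapshot{
		Nodes:      len(nodes.Items),
		Pods:       len(pods.Items),
		PVCs:       len(pvcs.Items),
		Namespaces: len(nss.Items),
	}
	return s, nil
}
```

Taking one snapshot before a run and one during it would give the two "at rest" and "under test" data points referred to in the discussion.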
A
If we could test it, right, that's one way: we'd have to get access to a certain number of nodes, and then run a test against it, and we'd probably have to test it continuously.
A
There are some challenges there, but that would give us a way to say, okay, it works with this many nodes. And the other way, which is what I'm saying here, is: if we knew the amount of pressure and the performance, we might be able to get a better idea of how well it scales, or how well it performs at different scales, just by a measure of pressure.
A
Yeah, makes sense. Okay, I'll keep thinking about this; I think there's a lot to unpack here, so I'll keep thinking about it, and maybe it's something we can talk about in the future as I get a little more clarity on some of the different forms of pressure and how they can affect things. What I'll probably do is a little bit more testing.
A
I want to use that steady-state job, do a little more testing, and continue to form this theory based on what I see and the results from that. I'll go with this. Okay, cool, all right, I don't have any more points. Do you guys have anything else you want to discuss? Yeah, so Roman...
A
Roman, you've been quiet on some of this stuff. Do you have any opinion on what we were talking about?
A
That's fine, okay, that's all right. The TL;DR was that we're trying to figure out a way to measure pressure based on the number of nodes in a cluster, the number of VMIs, and so on, as a way to sort of normalize the performance numbers that someone gathers, and possibly have a way to predict or estimate scale based on someone's setup.
C
Yeah, you've probably mentioned it anyway, but we would see some pressure in general if we test at scale and have Prometheus properly deployed, right? So I guess we would just ensure, on our deployment, that with certain goals we want to meet we don't see disk pressure or whatever. But as I said, I didn't fully listen, so it may only be a little bit of what you're saying.
A
No, yeah, that's mostly it. What we want is based a little bit on what we've seen from testing, like Marcelo talked about, when we see things slow down. For instance, when we have, what was it, 500 VMs on a node, Marcelo, we see that the 499th VM is a little bit slower than the first; there's a difference there, and we're trying to see if there's a way to quantify it.
C
Yeah, I guess we have a lot of metrics there already: you see how the watches are performing, you're seeing the rate limiters in the clients, and that's mostly important for seeing whether you're hitting some limits there, I guess. But yeah, we may also be missing some, and of course disk pressure, CPU pressure, and memory pressure need to be monitored.
A
Okay, all right guys, so the next meeting will be in three weeks, April 7th. I'll be out for three weeks and then I'll return and it'll be our next call. Okay, thank you, everybody.