From YouTube: SIG - Performance and scale 2022-01-27
Description
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.yg3v8z8nkdcg
A
Okay, this is SIG Performance and Scale, January 27th. Okay, so I'll go with this again. Let me open the explanation. So, the periodic job: we talked last week about what has been bothering us with the periodic job — trying to get the counts for the HTTP requests and trying to figure out what we should be seeing. So what I did was: Marcelo had that explanation where he talked about, you know, should we prime — should we have to prime our job so that we can?

A
Actually, when we measure, when we do an increase metric, do we need an initial VM to prime the time-series database? In other words, have a first item to compare against, so that we don't increase from no value — we have something to increase from, some initial value, so we can actually measure this. And what I found was that every time I ran this test, where I created a cluster, I did this.

A
Like a dozen times I created a cluster and ran the perf audit test, and it would be missing the create events, the create requests. And I did this both with the range selector hard-coded to five minutes — which, all that does is increase... I think a good way to look at it is that it increases the amount of time, the set of samples, that we measure from.

A
So it actually gives us an opportunity to find the data, and this was actually another problem: if this number is too small, you can miss it. That was one other problem. And you need it to be large so that you can avoid aliasing, which I'll explain in a second — that's another problem. But I did this with both the five minutes hard-coded and the original value, which was a varying value based on the amount of time it took to run the test, which is roughly one or two minutes.

A
I never found the create requests showing up, so I added the primer. This is a picture of the primer running every time, and you can see there are no create pod requests that show up in the primer. Then I run the density test after, and we do get the create pod count showing up. So that verified the theory that we need to prime this. That was helpful — that was relieving to see.
A
I have a whole explanation here as to what's going on. Basically, increase is based on rates — the rate function — and here's a little summary: the problem with the first sample of a new metric series is that the rate is attempting to compare against a non-existent previous value. Prometheus does not have enough data with which to interpolate — increase and rate are both interpolation.

A
So we're trying to figure out what we expect to happen: we're trying to look at this set of data and predict what it would look like if it was run over this series of times. So we're kind of guessing — that's what these values are, that's what these metrics are supposed to do — but we have nothing to compare against. So we absolutely need to prime to figure this out.
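The first-sample problem described above can be sketched in a few lines. This is a simplified stand-in for Prometheus's actual behavior, with hypothetical sample values; the point is just that `increase()`-style queries need at least two samples of a series inside the range window, which is what the primer provides.

```python
# Illustration of why a brand-new counter series yields no increase():
# rate()/increase() need at least two samples of the series inside the
# range window to interpolate between. Sample values are hypothetical.

def increase(samples, window_start, window_end):
    """Naive stand-in for PromQL increase(): the difference between the
    last and first sample falling inside the window, or None if fewer
    than two samples are visible."""
    visible = [(t, v) for t, v in samples if window_start <= t <= window_end]
    if len(visible) < 2:
        return None  # not enough data to interpolate -> empty result
    return visible[-1][1] - visible[0][1]

# A counter series that first appears at t=100 (the first create request):
unprimed = [(100, 20)]            # only one sample in the window
primed   = [(0, 0), (100, 20)]    # a primer wrote a baseline sample earlier

print(increase(unprimed, 0, 120))  # None: nothing to compare against
print(increase(primed, 0, 120))    # 20: the primer gives a baseline
```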
A
Okay,
that's
so
that's
the
that's
why
we
weren't
seeing
it
so
priming
does
work.
The
other
thing
that
I
mentioned
earlier
is:
is
this
problem
right
here?
A
Why
do
I
see
40
when
I
use
smaller
numbers,
like
so
smaller
interpolation
times,
smaller
range
vectors,
and
I
found
an
explanation
for
that.
While
I
was
actually
looking
for
answer
this
and
it
turns
out
that
this
is
a
known
issue-
it's
it's
has
to
do
with
signal
processing.
It's
called
aliasing,
there's
a
link
to
it
here,
but
you
can
see
here
like
this,
this
guy,
who
is
commenting.
I
think
he
put
comments
on
both
of
these
he's,
like
one
of
I
think
he's
one
of
the
core
contributors
to
prometheus.
A
If,
if
a
query,
you
know
that's
executed
whatever
at
this
time
only
sees
580
at
you
know
at
this
time
stamp
whatever
this
and
then
581,
and
that's
an
increase
of
one
over
two
minutes
and
that
gets
extrapolated
over
four
minutes,
which
is
an
increase
of
two
so
like
you
can
see
like
how
this
gets
measured
like
it's
actually
only
increase
of
one,
but
the
value
that
it
spits
out
is
two
because
of
how
the
value
gets
extrapolated.
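The extrapolation arithmetic above can be written out directly. This is a back-of-envelope simplification (real Prometheus only extrapolates up to the window edges), but the scaling idea matches the numbers from the discussion: a delta of 1 seen across 2 minutes reported as 2 over a 4-minute vector, and 20 creates caught in one 30-second scrape reported as 40 over a 1-minute vector.

```python
# Back-of-envelope version of what rate()/increase() do: the observed
# slope between the first and last visible samples is scaled up to the
# full range-vector duration. Simplified relative to real Prometheus.

def extrapolated_increase(observed_delta, observed_seconds, range_seconds):
    return observed_delta * (range_seconds / observed_seconds)

# Counter goes 580 -> 581, samples span 2 minutes, range vector is
# 4 minutes: reported increase is 2, not 1.
print(extrapolated_increase(1, 120, 240))   # 2.0

# 20 creates caught in a single 30s scrape, 1-minute range vector:
print(extrapolated_increase(20, 30, 60))    # 40.0 -- the bogus value seen
```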
A
So
the
recommendation
was
that
if
you
want
to
get
an
accurate
value,
you
have
to
have
a
large
and
large
enough
range
vector
otherwise,
you're
gonna
get
these
these
extrapolated
values
that
are
incorrect
like
40.
and
that's.
Why
and
that's
what
we
observe
like
when
we,
when
we
run
this
at
five
minutes,
it's
enough
of
a
of
a
range
vector
that
it
that
the
extrapolated
value
is
close
to
what
the
expected
value
is
it's
because
that's
like
21.
A
and
and
this
to
safely
measure
this
we
need
to
in
our
excuse
me.
Our
range
vector
needs
to
be
longer
than
our
test.
It
needs
to
be
much
longer
than
our
tests.
Our
test
right
now
is
like
one
minute
or
so,
and
so
setting
it
to
five
minutes
is
fine,
but
we
have
longer
tests.
We're
going
to
need
this
to
be.
We
need
this
to
be
a
little
bit
longer.
We
don't
want
it
to
be
considered
short,
otherwise,
the
extrapolated
values
we
can
run
into
analysing
again.
A
Yeah, you mentioned that. So I tried, Marcelo — I tried to do the range query, and I ran into all sorts of problems with it; I just couldn't get it to work. I'm not saying it can't work, but I had a lot of trouble getting it to work.

A
Pretty much whenever I ran the increase query — the increase metric with the range query — when I would set the time, no matter what I changed the metric to, it just failed with the range query. It was just wrong, no matter what I did. I wasn't sure what I needed to set it to, but if you have an example of how this would look, that's fine.
B
Yeah, that's what I think I mentioned to you. First of all, sorry — I misunderstood the code before, and this works fine. The only thing is, as you mentioned, it depends on the metric that we are collecting.
B
If it's only for a counter, where we do the increase and just check the last five minutes, it should be fine, shouldn't it? We just want to take the biggest value that comes from the metric. I think maybe the last five minutes might be fine — I'm not really sure, I didn't think very hard about that — but I think maybe we can approve that. So actually, it's ready to be merged.

A
I've tried it; it just failed. I don't know how it should look. If you want to try this after, Marcelo, I'm all for it, because if we could get rid of the offset, that's fine. I mean, the offset works fine — it just brings us to the end of the test that we just ran, and we look back.
A
I mean, it's functionally equivalent, but if we need to do more advanced things, I could see how the range query could help us, you know, isolate the time frame.

B
Yeah, basically what the range query does is: for example, you take an interval of 20 minutes, and then you use steps of five minutes, and then you get four results. So it will be like many requests over five-minute steps, and then you need to take the average or the max of these results — it depends what you want to see here — and the steps can also be big.
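The range-query idea Marcelo describes can be sketched as follows. The sample values are made up; the shape is what matters: evaluate the same windowed expression at each step timestamp across the interval (here, a 20-minute interval at 5-minute steps, giving four results), then reduce over the per-step results with max.

```python
# Sketch of the range-query idea: evaluate a windowed counter delta at
# several step timestamps across an interval, mimicking Prometheus's
# query_range evaluation, then take the max. Values are hypothetical.

def per_step_results(samples, interval_start, interval_end, step, window):
    """One windowed-delta result per step across the interval."""
    results = []
    t = interval_start + step
    while t <= interval_end:
        visible = [v for ts, v in samples if t - window <= ts <= t]
        results.append(max(visible) - min(visible) if len(visible) >= 2 else 0)
        t += step
    return results

# Hypothetical counter samples (timestamp_seconds, value):
samples = [(0, 0), (300, 0), (600, 21), (900, 21), (1200, 21)]
steps = per_step_results(samples, 0, 1200, step=300, window=300)
print(steps)       # four results, one per 5-minute step
print(max(steps))  # the single number to report, as discussed
```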
B
So it can be pretty much what David did before, and then we will return only one value. But the query itself will still use this five-minute interval, so we don't need to change the query.
A
Yeah, I mentioned the limitations to doing this — I put it at the top, "limitations to this." What I see is: we first need to find the right range vector, which is going to be based on the time of the test. That's one concern I have; using a range query could help there, I think. But I think this is at least a good start, yeah.

A
The other theory I have, which I think was along the same lines, is about how many primers we need. I only do one — I only prime one time, when the suite starts, and that's it, and we only have one test right now. But if there's too much time between tests, then we would need another primer.
C
So the entry actually existed, though — that's the thing. It's got something to compare against forever, once the entry is actually put into the database. It's just that for an increase, we have to start from something to calculate what the final result is. If we don't have something, then it gets weird; but we do have something, if it was ever primed.

C
If there was ever a VMI started, then it should remain consistent from that standpoint, because what does that increase mean? If we're talking about VMI creations, the increase could be: at the start of the interval, let's say there were 30 VMI creations from a previous run, so the series existed there, and then when we do our test, we'll see a difference — it starts at 30, and then maybe there were 100 or something after that, and the difference would be 70. So that's what we'd get returned.
A
Yeah, I guess that's true, given the fact that every time we ran the test after the initial one — like in the afternoon, after it went back, hours later — it was okay. I mean, I think I was running the primer, but I think we were seeing this earlier with our tests: every time we ran it, no matter when it was, it was working.

A
I don't know; I left it as an open question, just in case. I think it might be okay, but I don't know — there's an open question, just in case. But anyway, that's kind of where this is. I think this is good enough to merge, though.
A
I
think
this
gives
us
like
what
we're,
after
in
terms
of
the
measurements
for
for
create
requests,
but
but
yeah
I
mean,
I
think,
range
effect
range
query
could
be
something
we
we
improve
on
marcelo.
Okay,
I
mean:
do
you
guys
have
any
other
questions
about
this
like?
Does
this
make
sense
to
you
guys?
What
is
what
I
have
here.
C
The only thing I'm uncertain about is that five-minute interval — whether it should be something dynamic, where we somehow calculate it based on the time of the test, and just ensure that it's large enough to avoid interpolation. I don't know where this interpolation problem occurs. I think it might occur based on the scrape interval; it's unclear what it's related to. So, when does interpolation happen?

C
Yeah, I understand when it would happen for one minute, because you had like one sample, and then it's trying to interpolate what would happen when you don't have more results. But over five minutes we should have lots of samples.
A
Well
like
for
the
create
requests
like
where
is
it
so
if
we
do
so,
if
we
our
create
requests
happen
like
yeah,
I
mean
so
they
they
happen
pretty
quickly
like
they're,
we
don't
have
any
samples
after
like
a
few
seconds
like
it's.
It's
done
after,
like
maybe
the
first
10
seconds.
I
think
so.
C
This
one
minute
makes
a
lot
of
sense
to
me
why
it
hits
40
because
we
scrape,
I
think,
every
30
seconds
so
most
likely.
We
got
one
scrape
in
that
minute.
Unless
everything
was
time
perfect,
you
might
get
two.
So
you
only.
B
C
And
since,
if
you
know
the
scrape
intervals
every
three
seconds
it's
going
to
interpolate
what
the
next
interval
that
it
doesn't
even
have
would
have
been
and
say:
well,
it's
probably
the
same
as
the
first.
So
if
we
got
20
then
it
would
say
that
over
a
minute
that
it
would
be
40.
and
then
for
two
minutes.
Let's
see
it
probably
looks
a
little
bit
more
accurate
depending
on
the
timing,
yeah
see
and
then
for
I
would
say
once
you
get
past
two
minutes,
it
should
probably
start
leveling
out.
Would
it
not
yeah.
A
Pretty
much
like
three
minutes:
it
is
like
the
difference
between
two
and
five,
and
you
can
see
it's
three
yeah.
C
A
Yeah,
I
don't
know
if
it's
based
on
the
scrape
interval
yeah,
I
see
where
you're
going
with
the
math
on
this
in
between
yeah.
I
don't
know.
A
Okay, well, yeah, so I guess on this one — do you think five's okay? I mean, I think five is fine for our test; I think it works, but it gets...
C
Weird, when we have multiple tests running, because...

C
Not to nitpick — I would try: how close did we get with just the range of the test? Could we just sleep a little longer to give us more samples, something like that? I'd like for it to encompass the test, or else we're going to have problems as we add new tests, because they'll begin overlapping and so on.
A
Yeah,
so
okay,
so
you're
saying
like
well
so
this
you're
saying
we
want
our
test
to
run
for
this
amount
of
time
for
five
minutes.
A
B
C
C
B
Okay,
so
maybe
if
we
make
it
like,
you
know
configurable
again,
and
we
document
that
you
know
just
say:
if
a
test
takes
like
less
than
five
minutes,
you
need
to
wait
at
least
five
minutes
to
collect
the
metrics.
She
doesn't
don't
have
like
an
interpolation
problem
and
we
do
that
in
our
test.
So.
C
I
think
that's
the
right
approach
to
make
sure
every
performance
test
it
takes
at
least
five
minutes.
A
What
about
well
see
we're
we're
gonna
have
tests
that
will
run
longer
than
this
and
then
we're
well.
So
what
we're
saying
is
we
make
a
dynamic
and
we
just
we
do
a
minimum
of
five
minutes
and
anything
longer
than
that.
We
we
just
set
this
dynamically
to
that
value
whatever
I
think
the
test
was.
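The "dynamic range vector" idea can be sketched as a tiny helper — hypothetical naming, but it captures the rule just discussed: never go below a five-minute floor, and grow the window with the test duration so longer tests are still fully covered.

```python
# Hypothetical helper for choosing the range-vector length: a
# five-minute floor (which avoided aliasing in practice) that grows
# with the test duration for longer tests.

MIN_RANGE_SECONDS = 5 * 60

def range_vector_seconds(test_duration_seconds):
    return max(MIN_RANGE_SECONDS, test_duration_seconds)

print(range_vector_seconds(60))       # short 1-minute test -> 300s window
print(range_vector_seconds(45 * 60))  # 45-minute test -> 2700s window
```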
C
I
think
for
the
test
to
take
longer,
I'm
not
convinced
that
we're
going
to
see
this
problem
and
I
think
it's
with
it's
a
combination
of
the
ratio
between
the
scrape
time
and
that
duration
that
range
that
we're
looking
over
so
it
as
that
ratio
gets
like
more
distance
between
it
or
I
don't
know
how
to
describe
that,
but
it
becomes
less
and
less
of
a
problem.
So,
at
five
minutes
with
a
30
second
scrape
interval,
it
seems
like
it.
C
We
get
pretty
accurate
results
and
I
imagine
that
would
continue
to
be
the
case
as
we
get
further
like
longer.
Tests.
B
I don't know — you need a longer interval if you have a problem, but I think for now, just make it configurable again for the time that the tool executed, and put in a sleep of, I don't know, four minutes maybe. I don't know how long the test runs — one minute, maybe. And then we also run it with a longer interval, and the offset is also good.
A
So
let
me
write
this
down
so,
okay,
one
of
the
things.
Let
me
look
at
this
based
on
the
the
scrape
interval,
because
I
want
to
see
I
because
I
could
do
that
just
based
on
the
exact
test
that
I'm
saying
I
have
there
so
like
how
just
scraping
the
hole.
B
A
B
A
Yeah, no, I agree with you — there are just limitations, as I mentioned. For now, for the one test we have, it's fine, but it's not ultimately what I think the long term should be, yeah. Okay, I think that's pretty much it; we'll get some more answers there. Okay, all right, yeah — I can do that, and let me see what I find. What do you want to do with this? Should we wait, since this is ready?

A
Truly — do you want to wait to include it in here? Do you want to do this? Should I do this separately? Do you guys care?
C
I'm a little nervous about the five-minute interval. Can we put in a sleep of five minutes, instead of doing the kind of forced look-back over five minutes? Would that...
A
Yeah, all right, I can do that, and then we know it's five minutes. And what I'll do is — because I just want to get this kicked off; I want to start getting some results in that job as soon as possible. So I can do this right away after this meeting, we can hopefully roll it out soon, and then I'll do a follow-up PR with the rest of the stuff from the investigation.
A
Okay,
cool
good,
all
right.
So
let's
go
to
next
item
so
fabian
mentioned
on
the
mailing
list.
He
was
asking
if
we
have
a
general
statement
of
keyword
scale,
I
kind
of
wanted
to
get
you
what
everyone
thinks
about
this.
I
totally
agree
like
we
do
want
one,
but
I
just
want
to
see
what
you
guys
think
because
scale
is
like.
A
I
think,
as
everyone
knows,
it's
more
complicated
than
just
the
number
of
nodes,
but
I
mean
do
we
want
to
have
a
general
statement
like
this
like
right
now
or
do
we
like
you
know?
Basically
we
can
pull
the
community
or
something
or
do
we
want
to
like?
B
We
are
doing
you
know,
tests
for
openshift,
but
I
don't
know
if
it
can
be
open
or
not,
but
we
weren't,
you
know
to
open
that.
We
need
to
do
kubernetes
tests
and
get
access
to
resource
so
because
you
know
this
scale
test,
I
don't
know
what
would
be
like
the
best.
You
know
something
that
we
can
show.
100
nodes
should
be
enough.
So
for
now,
what
do
we
need?
We,
we
don't
have
this.
You
know
it's
actually
very
good
to
discuss
that
here.
A
Yeah,
well
I
mean
mainly
marcel
what
I'm
wondering
is
like
I
think
like
there
are.
Multiple
people
are
using
hubert
right.
I
mean
like
we're
using
internally
like
what
is
like
the
scale
people
are
reaching
like.
We
have
a
scale
number
that
we're
reaching,
but
our
use
cases
is
different,
say
you
know
what
you
guys
are
doing
internally
with
openshift,
and
so
I
would
imagine
you're
going
to
reach
a
totally
different
number
of
nodes,
like
that's
how
I'm
interpreting
this
is
like
he's.
A
B
A
B
C
What
problem
we
have
with
this
discussion,
I
think,
is
any
numbers
that
we
give
out.
That's
a
bar,
and
I
want
to
make
sure
that
we
look
favorable
like
I
want
it
to
reflect
reality,
but
I
want
to
make
sure
that
you
know
we're
reflecting
like
numbers
that
are
good
if
they
aren't
good.
I
want
to
make
them
good,
like
I
want
to
improve
performance
before
we
release
anything.
I
don't.
C
Because it impacts other stakeholders — yeah, it impacts customers, it impacts our ability to market the stuff we're talking about, like vendors and things like that. So we have to be careful.
A
Okay,
well,
I
mean,
I
guess
so
I
guess
where
I
I
mean
at
least
I'd
like
to
go
with.
This
is
like
I
want
to
keep
this
in
mind.
It's
like
a
goal.
We
want
to
get
to
yeah
and.
A
But
yeah,
maybe
we
just
need
to
pull
in
fabian
one
point
and
talk
about
it.
I
mean
because
we
we
don't
have
like.
What's
the
I
mean,
if
we
were
to
ask
fender,
ask
anyone
who's
using
kubert
right
now,
wouldn't
find
what
the
largest
scale
is.
You
know
I
mean:
do
we
want
to
use
that
number?
I
mean
that's
kind
of
the
question.
I'm
asking.
Would
we
use
that
number
or
would
we
wait
for
us
doing
these
tests
that,
like
I
have
here
in
the
slo
document,.
B
A
C
About
the
test,
first
right
and
one
of
the
things
that
kubernetes
has
that
makes
things
more
difficult
for
convert
is
kubernetes.
We
can
test
a
ginormous
scale
very
quickly
by
bursting
into
the
cloud
and
like
the
instances
that
might
be
used
might
cost
a
fortune
if
you
left
them
online,
but
just
running
a
hour-long
performance
test
periodically
like
it's
not
going
to
cost
that
much.
So
it's
the
cost-effective
way
to
validate
kubernetes
at
scale.
C
We
don't
have
a
cost-effective
way
of
doing
that
with
cuvert
at
the
scale
that
we
would
really
want
to
be
talking
about.
We
have
like
at
red
hat,
there's
some
internal
scaling,
that's
going
on
and
there's
huge
numbers
of
nodes
and
huge
results
that
we
get
out
of
these.
That
would
be
really
interesting
to
publish
some
day
if
we
can,
but
we
can't
reproduce
it
because
we're
borrowing
that
environment
and
it's
going
to
be
given
back
to
somebody
else
eventually
or
we
don't
have
it
forever,
and
it's
also
based
on
downstream
products,
not
kubert
upstream.
C
I think that would be interesting. We have a conformance-type test for KubeVirt that is guaranteeing behavior — feature behavior — but it's not really exercising scale.

C
The idea of a performance test that exercises scale would be interesting, and it could have multiple variables — like, are we testing scale with ephemeral virtual machines? It would let you alter or tune the test for your environment, for what you want to exercise.
A
For our tests, we need to know — because scale is all about pressure, and pressure can be applied in all different ways. If we can describe the different ways we're going to apply pressure with our testing, and if we have a way of consistently applying pressure no matter what the environment is, we can at least get some numbers. Then, as he says, in our CI this would at least give us some numbers based on our CI.
A
Excuse
me
what's
the
scale
that
it
achieves,
I
mean
then,
like
I'd,
feel
more
comfortable
because
yeah
I
mean,
I
think,
like
that's,
that's
the
minimum
requirement
before
we
can
get
to
these
feeling
comfortable
about
this.
I
think
we
have
to
have.
We
have
to
agree
on
this.
Like
you
know,
we
have
to
agree
on.
I
mean
really
the
tests
that
are
listed
here
and
we
have
to
have
a
test
framework,
that's
consistent
and
we
need
ci,
like
that's.
A
B
Yeah, so the way I see it, we need to define the test framework. For example, kube-burner might be a candidate for that, or not, so we can discuss that. Actually, Kubernetes has its own tool for running its performance tests.

B
ClusterLoader2 — the "2" is part of the name — and everything together. You can also find it on GitHub.
B
All the tests they run are inside this, so they made it very configurable. You define the tests in YAMLs.

B
They are maintaining it, but I also don't know if they are willing to, you know, accept changes — probably we cannot put CRD-based resources inside it, because Kubernetes doesn't want to support third-party code for that. But they are stressing pods and all the other official resources, and this is pretty much how they run their tests, using this tool. Also, it's a toolbox — I think it also creates the cluster, so it's doing more than running the test; they deploy...
B
And just to conclude what we were discussing before: we can have a defined set of tests, and the tool that we want or that we recommend people to use. But the tests that someone else might run, and the limits they give, would be non-official — you know, limits that people can provide, but the official ones must be something that we define here in the meetings, and we'd need to find a cluster to run them on somehow. Something like that, because if we ask someone else, it's just an unofficial limit that people can help us with, but we cannot assume it's an official limit.
A
I'm thinking of three things. One: we need to describe our tests, so it's clear what pressure we're applying — that's what I want to do here. Second: I'd like to verify that the behavior of a test in the current KubeVirt release is what we expect — that the pressure we're applying is doing what we're expecting it to do against the current release of KubeVirt. So our tests are just — because we can't really measure scale.

A
If we have three nodes, we just want to make sure the tests and everything are functioning correctly, and then I think at that point we can...
A
We
give
this
test,
like
I
I'll
happily
do
this
internally
like
run
these
tests
and
then
come
up
with
some
numbers
of
infra
information
about
like
how
scale
is
defined
like
in
my
measurement
and
that
would
like
whatever
those
are
like
we
need.
We
need
all
the
measurements
that
define
pressure
and
then
we
at
the
end
of
it,
we
spit
out
a
number
number
of
nodes
and
then
well
I
mean
because
we
need
we
actually
need
all
of
them
like
we
need
nodes.
A
The
number
of
vms
number
of
vms
total
vms,
that's
the
rate
that
they're
being
created
like
churn
and
so
on,
like
we
need
like
we
need
all
those
things
actually.
A
We need all of the pressure points, and that's our combination, and then we can create our little headline: okay, here's the number of nodes we've seen it scale to. But we want to have the detail, like: okay, NVIDIA reaches whatever this amount of nodes, this is the summary of their pressure; we know KubeVirt can scale to this amount of nodes, given this amount of total pressure.
A
So
I
I
think
yeah,
so
we
we
do
need
to
talk
about
pressure,
then
at
some
point,
maybe
it's
something
we
can
do
for
next
week,
like
marcelo,
we
talked
about
it
previously,
like
in
some
of
the
the
other
kubernetes
scale
meetings
like
they.
They
had
some
stuff
that
talk
about
it.
A
We
should
gather
all
the
information
that
they
have
about
pressure
that
we
know
of,
and
we
should
try,
and
I
think
we
need
to
add
it
to
this
document
and
that's
what
our
test
should
talk
about
it
and
what
they
should
focus
on
and.
A
All right, we'll do that next week. Okay, do you want to talk about this, Marcelo — this change? Have you...
B
Yeah, I've actually been using that for a while, and I recently created a PR for it. I extended kube-burner to create VMs and VMIs as well, and also replica sets — VM replica sets — so it can understand those kinds of resources. Kube-burner has a way, you know — it was also inspired by the test that you did.
B
You
know
burn
has
a
way
to
track
the
quad
latency.
It's
actually
create
a
map
and
just
have
some
watts
and-
and
it's
like
take
the
the
timestamps
of
different
pods
conditions
when
it's,
for
example,
initializing
the
containers.
B
I
pretty
much
extended
that
for
the
vmi,
also,
so
actually
for
the
vms,
it's
not
cmi
but
anyway,
so
it
you
can
create
a
vm
and
then
we
will
have
like
all
the
detailed
latency,
the
latest
breakdown
for
all
these
steps
that
goes
inside
I
right
now,
so
it's
just
another
thing
that
maybe
we
should
discuss
with
david.
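The latency tracking Marcelo describes can be sketched roughly as follows. This is a simplified, hypothetical version (the condition names and timestamps are made up, and the real kube-burner implementation differs): record a timestamp when each lifecycle condition is observed via a watch, then report each condition's latency relative to creation time.

```python
# Rough sketch of the condition-latency tracking described above:
# timestamps of observed lifecycle conditions, reported as seconds
# since creation. Condition names and values are hypothetical.

def latency_breakdown(condition_timestamps, created_at):
    """Seconds from creation to each observed condition, in the order
    the conditions were observed."""
    return {cond: ts - created_at
            for cond, ts in sorted(condition_timestamps.items(),
                                   key=lambda kv: kv[1])}

observed = {
    "Created":   100.0,
    "Scheduled": 102.5,
    "Running":   110.0,
}
print(latency_breakdown(observed, created_at=100.0))
# {'Created': 0.0, 'Scheduled': 2.5, 'Running': 10.0}
```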
B
Okay, let's first describe that today, and then I can go to the next topic that I want to talk about. So this is pretty much it: it creates that, and then it collects Prometheus data.

B
I also included here a file with all the metrics that I think are relevant to analyze, so it will have VMI metrics, cluster metrics, the TCG metrics, and the control-plane metrics that I think are important, and it gets them from Prometheus.
B
It
dumps
it's
done
so
the
way
that
could
burn
actually
do
is
dumps
to
a
file
or
you
can
push
your
elastic
search
this
this
data
and
then
you
can
have
like
graphone
connect
to
the
elastic
search
and
then
you
can
just
visualize
the
data
too,
but
you
don't
need
to
do
that.
So
you
can
use
kubernetes
to
generate
the
load
and
see
the
information
in
your
prometeus
and
grafana.
B
A
A
What does this bring us in terms of our current audit tool? Does it bring us pretty close? Could you swap this out right now — if you had this merged — for what you have right now in the periodic job?
B
Yes,
so
the
audit
tool,
actually
it
has
like
more
friendly.
You
know,
output
from
the
the
matters
that
we
are
collecting.
The
kubernetes
can
collect
those
metrics,
but
will
be
like
a
more
query
format.
You
know
the
the
output,
so
it's
I
don't
know.
If
we
we
can.
You
know
yeah,
but
I
think
it
would
be
nice
to
use
that
to
generate
load
later
so
the
cook
burner,
and
then
we
try
it
as
as
you
mentioned.
A
So
what
about
like
the
tesla
we
defined
and
like
the
slo's
here
like
the
with
oh,
not
this
one?
No,
I
don't
have
it
the
the
the
steady
state
and
the.
B
B
A
A
So yeah, okay — so that's why. All right, so this gives us our framework, and that's fine. Okay, so we just need to — you've got this PR, and then we can look at adding something. I can help you with this if you like: if you want me to look at the steady state, I can help do this in a separate PR if you want, or if you're already looking at it, that's fine too.
C
Cool
okay,
I'll
give
my
thoughts
real
quick.
I
would
like
to
converge
on
key
burner
if
we
could
get
all
the
functionality
that
we
have
today
and
converge
it
into
the
keyburner
that'd
be
great.
C
I
think
we
just
need
to
see
how
open
that
community
and
how
easy
it
is
to
work
in
that
code
base
we're
finding
a
lot
of
friction,
for
example,
to
get
the
things
that
we
want
in
that
kind
of
serve
our
purposes,
then
maybe
we
only
use
key
burner
to
generate
load
and
continue
to
use
like
the
audit
tool
for
the
metrics
and
stuff
collecting,
or
maybe
some
combination
of
it,
but,
like
I
like
the
idea
of
beginning
to
converge
on
this
tool,
if
we
can
replace,
maybe
just
the
performance
load,
part
that
would
be
cool.
A
Okay,
well
so
marcelo
I
mean
I
guess
we
can
well.
I
mean
it
looks
like
you're
getting
some
attention
and
we'll
see
when
this
converges,
when
this
gets
merged.
B
Yeah,
so
it's
seems
to
me
like
the
raul.
Is
the
guy?
That's
responsible
for
that,
and
I
I
was
you
know
I
just
checked
the
contribution
of
the
you
know,
people
on
this.
They
could
burn
and
since
you
mean
like
raul,
is
99
responsible
for
that.
So
it's
pretty
much
one
one
guy
so
and
he
he
said
he's
reviewing
that
and
he's
very
much
like
very
giving
a
lot
of
attention.
So.
B
C
C
B
C
A
C
C
I don't know — that's what I'm nervous about: we are at his mercy, unless we get some kind of ability to merge.
C
How,
let's
see
how
things
are
getting
merged?
Is
he
pushing
a
button
or
yeah
yeah
he's
literally
pushing
a
button?
Yeah,
here's
one
where
another
page.
B
A
I
mean
your
code
like
that
you've
written
here
is
this:
are
you
what
like.
B
B
Also, I was doing much larger tests — that's why waiting a set amount of time makes sense to me. This is the template — a very simple template with an ephemeral disk — but we might want to test it also with real PVCs and maybe more network, because, as we saw in some other experiments, PVCs and more network NICs potentially increase the number of API requests and overload the system, and that's something we can catch.
A
Okay, all right, let's see where this goes, then. I don't want to waste the effort that you've already done. Let's see where this goes, and let's see if things pick up there as you get these contributions in.
A
Okay! That's all I had. Do you guys have any other final thoughts before we conclude?