From YouTube: SIG - Performance and scale 2021-07-08
Description
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.8taxjc2uv4bg
A
Okay, all right, welcome everybody to SIG Scale. I'll mention this just like I did last week in case anyone wasn't here: we moved to weekly meetings, so the cadence now, instead of bi-monthly, is every Thursday at this time.
A
Okay, so let's kick things off with the agenda. Marcelo, this is your PR. Do you want to give an update on how things are going with that?
B
Yeah, so we had, I would say, a large discussion about that. I simplified the PR, mostly, though of course it's not super simple, as Roman mentioned. In the PR I have now removed all the parts that were collecting metrics from Prometheus, reporting on them, and verifying them. The PR also had some configurations coming from YAML, for example, so that the test would be configurable instead of hardcoded.
B
I removed that as well, so the test is kind of hardcoded now and there are fewer things in it. I also updated the part you mentioned, which was actually a very good recommendation: I was checking the status of the VMs for updates and for deletion by doing GET operations, for example when running a simple test creating 500 VMs.
B
So it's now watching the VMs, both for changes and for deletion. For the changes I'm just getting the timestamps from the VM phase and computing the running time of the VM. For deletion it still needs a bit more: deleting an object in Kubernetes doesn't mean it's gone, right? So we need to check that the VM actually disappears from the cluster. That's why there are two calls, with a separate one to check when it's deleted.
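A minimal sketch of the watch-based timing collection Marcelo describes above, assuming a watch.Interface over VirtualMachineInstance objects obtained from the KubeVirt client; the import path and helper names are illustrative, not the PR's actual code:

```go
package vmitiming

import (
	"time"

	"k8s.io/apimachinery/pkg/watch"
	kubevirtv1 "kubevirt.io/api/core/v1" // import path varies by KubeVirt version
)

// vmiTimings holds the timestamps of interest for one VMI.
type vmiTimings struct {
	created time.Time // object creation timestamp
	running time.Time // first time the Running phase was observed
	deleted time.Time // time the Deleted watch event was observed
}

// collectTimings consumes VMI watch events until the channel closes,
// recording when each VMI reaches Running and when it is actually gone.
func collectTimings(w watch.Interface) map[string]*vmiTimings {
	timings := map[string]*vmiTimings{}
	for event := range w.ResultChan() {
		vmi, ok := event.Object.(*kubevirtv1.VirtualMachineInstance)
		if !ok {
			continue
		}
		t := timings[vmi.Name]
		if t == nil {
			t = &vmiTimings{created: vmi.CreationTimestamp.Time}
			timings[vmi.Name] = t
		}
		switch event.Type {
		case watch.Added, watch.Modified:
			// The transition to Running gives the startup latency.
			if vmi.Status.Phase == kubevirtv1.Running && t.running.IsZero() {
				t.running = time.Now()
			}
		case watch.Deleted:
			// Only the Deleted event says the object is really gone,
			// not just that deletion was requested.
			t.deleted = time.Now()
		}
	}
	return timings
}
```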
B
That's pretty much it. Regarding the latency that I'm testing: Roman also mentioned he was expecting, at the beginning, a test that was just creating VMs and not testing the performance or the latencies for now. Well, I already have it implemented, I would say, and I think it's good to have the test.
B
You know, it makes sure that if some PR or something happens, the verification will fail. By the way, here's how I'm measuring the latency. What I'm doing is I run the test; actually, I run it 15 times.
B
It would be better to run more, but I run it 15 times, I take the highest time it takes for the VMs to get created, and I define the threshold at about 1.5 times that, you know, 50% higher than the observed latency, just so we have some range and the system is defined to be okay.
B
I didn't define it very tight around the latency that I'm collecting from the system, just to avoid introducing a test that maybe starts failing many times, and to have a comfort zone for now. Then we can adjust that later.
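A minimal sketch of the thresholding just described (the 15 baseline runs and the 1.5x factor come from the discussion; everything else is illustrative): take the worst startup latency observed across the baseline runs and allow 50% headroom before the CI test fails.

```go
package threshold

import "time"

// startupThreshold derives the failure threshold from baseline runs.
func startupThreshold(baselineRuns []time.Duration) time.Duration {
	var worst time.Duration
	for _, d := range baselineRuns {
		if d > worst {
			worst = d
		}
	}
	// 1.5x the worst observed latency, so normal variance doesn't flake CI.
	return worst + worst/2
}
```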
A
So you actually have something that fails; you have a threshold right now, is what you're saying?
A
Okay, well, I guess it's worth talking about, because I'm wondering: I have it here as "baseline", and we talked about thresholds and other stuff. Do we want to do this here, or not?
B
Right, the proposal of this test was to actually have thresholds and fail, and this was part of the first implementation I did. And I think this is valuable anyway, because if we don't check it, after months of that no one will care about it anymore, you know what I mean?
A
Is it in a comment, or do I have to go look?
B
It will not be verified if we just check individual VMs' latency; you need to also check the batch.
A
And this is time to running, so this is when we first see the VMI running?
B
Okay, so I get the timestamp of the VMI object being created and of the Running phase.
B
And those tests will actually be running in the same environment that I'm running in, so that's why I trust these latencies for now.
A
Roman, I saw you had a comment here. Do you want to talk about this? What do you think?
C
Well, yeah, as I said in the discussion, I was really hoping to just keep any metrics collection completely out right now. I mean, we collect all kinds of stuff with Prometheus and we would catch it in our dashboards, and I just wanted to have a few very basic scenarios which just run and are not even concerned with collecting the metrics and integrating them right now, and then, in a second step, think about a framework.
A
Is it because of this, where I just was in density.go, because it has all this and it's configurable? Is that what your concern is?
C
In there are different start functions, like a constant start rate for VMs or a Poisson density function, whatever, and they can all make sense. It's just that I really wanted to have it simple, without any interpretation right now, just to see what's happening in CI, and then later on do all the other parts. That's my main thing.
C
And also there are tons of configuration options. I just expected one VM template, a VMI template which is the smallest one possible, where we know whether the VM will crash or not, but the key thing is that it stays in the Running state, and all the rest later. That was so we can easily see things and get the metrics collection going fast.
C
We will probably then see immediately, on the collected metrics, that we run into the queries-per-second limit of our client and stuff like that. That's what I wanted to have initially, and then think about all the other stuff: what would be our values for failing the test, what would be acceptable for us, how would we want to express it in code?
D
Yeah, maybe share your thoughts there. Right now I would just mirror Marcelo.
E
Okay, so what you're saying is we would have a density test that's just going to execute the creation of a bunch of VMIs, make sure they go to the Running state, and then, I guess, delete them. And then externally we'll have monitoring, some sort of report given to us by Prometheus or whatever, that would give us an indication of how this did.
B
Right, I understand that. I just think this baseline idea that you want to have doesn't really need to be in the CI/CD, because, for example, I already have access to the nodes that we're going to run on, and I'm actually doing that today.
B
I just mentioned to you again about the metrics, which I forgot aren't complete in Prometheus, but anyway, I'm going to report that information. So I actually want to do a large test; I was trying to do that today but didn't finish, like 100 or 500 VMs, something like what the virtualization guys did, and expose a Grafana dashboard. I'm creating a Grafana dashboard with the metrics that we've been discussing here.
B
In our meetings, you know, and then we can see that, even though it's not integrated into CI/CD yet, we can already see it. But with the tests we can make sure that, for the idea we were thinking about of this continuous evaluation of the control plane, we at least do the first step: we make sure that things will fail if someone introduces something very nasty that decreases the performance too much.
C
I'm talking about just having the test and starting to collect the metrics, and then for months this will be there. I was just expecting a set of PRs very fast afterward. I think that of course makes sense in general; that's another thing.
A
I don't want to cut you off, you were just trying to talk, but we might all be saying the same thing. I have this as stage one, because we talked about it; we've been talking for a few weeks now about having this initial CI job, and having something like that actually fail, having it measure something and have it fail. If we have that now, I'm almost like, okay, that's fine, I guess.
A
Maybe what we're saying is we just stop it at that: we look at it as something we can have experimentally, that we run in CI just for a period of time while we work on these, right? That's what we're saying. It may have some framework stuff, but maybe we just kind of come back to it later as part of these, and that may be okay.
A
In steps two and three we work on some of the framework stuff that's in here, or the stuff that's generally loaded because it already has to be, and we just kind of rework it. This is just kind of our initial script to get things kicked off; that's how I'm looking at it.
E
Yeah, I was just trying to sort that out as well. I want to see the decoupling, which I mentioned a couple of weeks ago, of the load generator and the report generation. That would allow us to do things like create a density test in our CI framework and use a common reporter tool or whatever to get the results, but it also gives us the power of using the same reporting tool with things outside of our CI, so we can create load tests that don't have to be generated from our functional tests and still get reports that are consistent. So I think that's my concern: we start momentum in one direction.
E
So if we merge this the way it is right now, then we've created a direction, and reversing a direction is a lot harder than continuing in the same direction.
B
Right, so I have some comments about that. Yeah, in the beginning the PR had a lot of things in it, including the report generation. It's still printing things to stdout; however, it's not generating a report anymore, I removed that part. So, let me organize my thoughts here: I would say that we have two main areas. One is to monitor the control plane.
B
You know, in the CI/CD system, and make sure that things don't get too bad. The other is to deep dive into the performance, to do a very detailed performance evaluation and deep dive on that. Actually, I think it was a good idea that you mentioned: maybe we extend kube-burner. I actually tried to look at the code and I don't think it would be too hard to extend, and kube-burner generates this nice report.
B
That's what I was mentioning, having this idea integrated into the CI/CD system. I don't know if you guys saw, but I sent the document with the plan for that before, and we have three types of jobs.
B
As I was saying: a small scale, up to 100 VMs, that runs for each PR; a medium scale that runs daily; and a large scale, where at Red Hat we have the possibility to access a large cluster, that I want to run before each release. And we can keep that.
A
Marcelo, that makes sense, that makes sense to me, sorry for interrupting. I think we're aligned on this, on the idea of it.
A
Having a suite of jobs. I think maybe where we're not aligned is exactly how we get there, because we have these two steps where we're talking about how we want a tool to generate load and how to generate a report, and that will take us to a CI job with all the things you mentioned. But I think what we need to figure out is this first step, having something there.
A
Can we break this down: what do we consider to be an acceptable thing for this PR? Because that's maybe where at least I'm struggling. We have a bunch of things in this PR, we have some thresholds, so what is it that we wanted to do? What would we consider to be a step forward?
B
Yeah, so as far as I understood what Roman mentioned, the concern is having the thresholds.
C
What I meant is what I see there: there are just a lot of things right now, like configuration options for the tests and for how you do the creation of the VMs, which I think you can drop for the scope of the initial test. It's really just one test right now, where you start a given number of VMs.
C
All I think is: let's just throw all this out right now. You can do a similar thing with, I don't know, 40 or 50 lines of code, and then, in a follow-up PR, think about how you want to report it, how to create thresholds and all that. In the meantime we just collect in the CI job, where we've prepared everything with Prometheus.
B
Right, well, I partially agree. I don't think we should drop the structure that defines the information from the test; I think that's a good idea to keep.
A
But yes, all right: 100 lines of Go, maybe called by a bash script, whatever, and all it does is create 100 VMs.
A
You don't need any configuration to set the threshold to fail; we have those hard-coded values. That's all, and that's it. We don't do any reporting, we just kind of gather, and we say pass or fail. That's it.
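A minimal sketch of that "100 lines of Go, hard-coded values, pass/fail" idea. The createVMI and waitRunning parameters are placeholders for KubeVirt client calls; only the shape of the check is shown, and the count and threshold are illustrative.

```go
package densitysketch

import (
	"fmt"
	"time"
)

const (
	vmiCount         = 100
	runningThreshold = 3 * time.Minute // hard-coded, deliberately generous
)

// RunDensityCheck creates vmiCount VMIs, waits for each to reach Running,
// and fails (returns an error) if any of them exceeds the threshold.
func RunDensityCheck(
	createVMI func(name string) error,
	waitRunning func(name string) (time.Duration, error),
) error {
	for i := 0; i < vmiCount; i++ {
		if err := createVMI(fmt.Sprintf("density-vmi-%d", i)); err != nil {
			return err
		}
	}
	for i := 0; i < vmiCount; i++ {
		name := fmt.Sprintf("density-vmi-%d", i)
		d, err := waitRunning(name)
		if err != nil {
			return err
		}
		if d > runningThreshold {
			return fmt.Errorf("%s took %v to reach Running (threshold %v)", name, d, runningThreshold)
		}
	}
	return nil
}
```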
A
I mean, does that make sense, Marcelo? We just want to simplify it as much as possible, so we're not compromising anything.
B
Right, so for the configurations: if you can point out, for example, the arrival rate, the things that you think are maybe still too fancy for the test right now, if you can point those out, I can definitely remove those parts from the test and then we can move forward from there, yeah.
A
Okay, we can comment on the PR then. I think that covers this one. That should get this kicked off, and then we need to do some design here on these; we've already heard some ideas about this, but we can do a sort of design.
A
I don't know if we're going to have time to talk about it today, but if anyone wants to take on writing about any of these things, how it'll look, goals, or anything like that, and you want to throw it in an issue, a Google doc, whatever, or if you want to just add a bunch of bullet points here, that's fine.
A
We can look at taking this on next week, maybe take the first one on and try to build a bunch of different ideas around it, but if anyone wants to take it on, feel free. Okay, we'll move on to the next point. I talked about this last time; it's something I said I was going to do with baselines. And Roman, you actually just did this patch; last time we talked about reconcile.
A
One of the things that we saw from our internal testing was that we're being rate limited, so Roman put together a patch to measure this so we can see it. I haven't got a chance to use it yet; I have to pull it and use it, and I'll come back with some baselines for you.
A
I can do it in the middle of next week, on kubevirt-dev or something, to give you some ideas. But it kind of got me thinking about baselines, and how we will define baselines for things. It wasn't really clear to me, because one thing I was thinking of is: what are the rules here? I could say, my cluster is this big, has this many VMs...
A
How am I doing this? How am I going to define this baseline? We have these tools that we're thinking of for generating load. As I said, it sounds to me like eventually what we'll do is take these tools and use them to generate our baselines for different things, and we sort of categorize them based on load and stuff like that. That's what I'm thinking.
A
So any sort of baseline that we generate ahead of time, we can use as just a temporary placeholder. What I'm thinking I'll do is create maybe a table in here or somewhere, maybe an issue, where we can track any sort of baseline, at least until we have this to normalize all our expectations, maybe in a format like this, with the threshold and stuff, or something like that.
A
Does that make sense to people? What do you think? Does anyone have any suggestions beyond that?
B
Yeah, so regarding the baseline: this is one of the ideas, to also have jobs in the KubeVirt CI. And the baseline can really explode, you can have a lot of configurations.
B
So the first idea was to have a minimal configuration, a very specific operating system and storage, that we can show and provide some information about, because KubeVirt can run anywhere and it can be hard to define which kind of system. So we need to define that very well, I would say.
B
When we have that, okay, yeah, a summary for you.
A
I was going to say, Marcelo, I'm always wondering if we could put this in plain text somewhere, something the CI could just absorb, maybe, or something like that, because whatever this is, it could, or should, be usable by CI, and then it'll be our source of truth. That's what I'm kind of leaning toward right now, yeah, okay.
A
So at least I can find a place for that somewhere in the repo; we'll just track the stuff in plain text, and our jobs will eventually consume it.
A
And yeah, you can put your stuff in there; when I find it I'll let you know.
A
Okay, next: reducing update patch collisions. Let's take a look at this.
E
Oh yeah, that was mine. This is more me just pointing out something that I saw that probably impacts our startup times and other stuff. I don't have evidence that this reduces startup times yet; I don't see how it couldn't, I just don't know how measurable it is.
E
So when our VMIs are starting up, we hit lots of these 409s, at least two to four before a VMI gets to Running. A 409 is when we try to post an update to a VMI but it gets rejected, because the VMI that we have in our informer is different from reality: our informer hasn't caught up to what is actually persisted in etcd. This causes things to get rate limited, and it causes us to generate load on the API server that, it turns out, doesn't need to occur.
E
The reason this was happening is that we have lots of other informers that aren't our VMI informer queuing keys onto our VMI reconcile loop, for example the pod informer. If we create a pod and then update our VMI, we'll probably get notified that the pod was created before the VMI update shows up.
E
The point is: there's a way to resolve this using a really simple heuristic, an expectation that says every time we update the VMI, don't process that key again until we actually see the update has arrived in our informer. That pretty much made all of these collisions go away, so it reduces the number of reconcile loops.
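A minimal sketch of that expectation heuristic, not KubeVirt's actual implementation: after posting a VMI update, remember the resourceVersion the API server returned for that key, and skip re-queuing the key until the local informer shows that version.

```go
package expectations

import "sync"

// updateExpectations tracks, per work-queue key, the resourceVersion of the
// last update we wrote, so the reconcile loop can skip that key until the
// informer cache has caught up and we stop acting on stale data (409s).
type updateExpectations struct {
	mu       sync.Mutex
	expected map[string]string
}

func newUpdateExpectations() *updateExpectations {
	return &updateExpectations{expected: map[string]string{}}
}

// Expect records the resourceVersion returned by our update call for key.
func (e *updateExpectations) Expect(key, resourceVersion string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.expected[key] = resourceVersion
}

// SatisfiedBy reports whether the copy currently in the informer reflects
// our last write; if so, the expectation is cleared and the key may be
// processed again. A real implementation would also handle the version
// being bumped further by other writers, since resourceVersions are opaque
// and cannot be compared for ordering.
func (e *updateExpectations) SatisfiedBy(key, observedResourceVersion string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	want, ok := e.expected[key]
	if !ok || observedResourceVersion == want {
		delete(e.expected, key)
		return true
	}
	return false
}
```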
A
Okay, this is cool. This sounds like a lot of what Fanna talked about with the reconcile. Interesting, that is really interesting.
E
So two to four 409 errors result in two to four more reconciles for every VMI on startup, yeah. So that means...
A
Awesome, that's cool. I really want to try this with some of the other measurements we've done. I wonder if this might be the thing we've been looking for that's causing some of the collisions. Okay, this is really cool, we'll take a look.
E
It might not be the thing that's causing you all to have this spike as more and more VMIs are introduced, but it's probably one of the things, at least.
A
Yeah, it's a fair heads-up. It sounds like it will help for sure; we'll see how much. This is definitely one I want to see with some of the other graphs we generated, I want to see how this moves the line. Cool, okay.
A
Great, thanks David. That's the last bullet point for today; does anyone else want to bring up anything else?
E
I guess so. We have about 20 more minutes, or 15. Maybe just for the sake of discussion: forgetting the density tests and CI and all that, what would a reporting tool look like? What would we want to see in it? Would it depend on Prometheus? How would it work?
E
Maybe we could, I don't know, make an exercise out of that. Does anyone have any thoughts?
E
That's a generation tool, though; that's a density thing. Actually, I know that it does some metrics collection as well, but I wouldn't necessarily consider it the tool that we use to gather metrics, because it's generating the load as well.
B
Okay, yeah. Moving to the parts that KubeVirt is collecting, or no, that kube-burner is collecting: I like the way they're doing it, in that, basically, you have a tool to watch the Prometheus metrics and dump them into a report.
A
Yeah, so something like that, something we could consume in CI. I could go through my PR and see how I can say: okay, here's my report, or my failure, whatever. I can see my thresholds were a little bit off; maybe on my 99th percentile I was at 120 seconds because I had one or two VMs that were just slow for some reason. So, something that's consumable by CI.
A
So let's just write these down: consumable by CI and by the developer.
B
And the metrics, so I think we discussed that in the beginning. My PR was doing that; I removed the part that I call the resource collector, which shows the CPU usage per VM, the CPU usage for all the control plane modules, and memory, plus the latencies that we were discussing. It also showed the overall latency, again with the percentiles, and for CPU and memory just averages.
A
So latency is another one, but we do thresholds again. So we'll do the VMI latency thresholds, let's do the same thing.
A
VM latency thresholds, and resource usage also.
B
Yes. Normally they're related, but sometimes CPU usage can increase while the latency is still fine, and it can become a problem later. I mean the resource usage, especially when the control plane starts to be too heavy and that starts to be a problem, yeah.
A
We also want to think about what other personas there are. So we have the consumers: CI and the developer. Let's talk about tests where you'd want to get reports, what kind of tests. One is going to be when we're doing massive scale.
A
We want to get reports there. The reason I'm thinking about that is because, let's say at massive scale, suddenly virt-handler has an increase in CPU usage or something; we want to know that. Since it's going to be reported, this is one of the tests we want.
A
We want to run it when we do just a general performance test, what we do in CI, what we do with our unit tests or our functional tests.
B
Well, maybe the reporting tool should also show the system configuration, because if, for example, we have different companies using this tool and, as you mentioned, creating maybe different baselines, it would be nice to also show some report about the system: the Kubernetes configuration, cluster information, some more information about the system where it was running.
A
Would we get that? I mean, I was thinking maybe the person running it could provide that, but would we even be able to get that outside of the test? That sounds like we would need to sort of scan the system with the tool.
A
Sorry, we talked about thresholds. Does this cover all the information we want about a VMI? This gets us, yeah, how fast we are, how slow we are.
B
Those metrics, so we need to have some high-level metrics, kind of like SLOs, service level objectives, something like that, like Kubernetes has. I started to prepare a document about that; I don't remember now what I put in it, and I don't know if I shared it, I think I shared it some time ago. Because the VM thresholds are kind of high-level measures, while the API latency, I would say, is low level, it's not.
A
Another question: how should this be run? Do we run the reporting tool after we execute a test? Do we run it before? Do we run it during?
E
My thought has always been that it's kind of like a profiler. If you were wanting to profile a CPU, if you profile a process, you would start a profiler, which would begin sampling the process, and then you would stop it.
E
Maybe you'd run a load test during that, or whatever you're going to do, then you'd stop the profiler and examine the results. So for our reporting tool, I would imagine starting the profiler, our report-gathering tool, running the test, then stopping it and examining the results. It would only capture what occurred during the time period it was actually running.
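A minimal sketch of that profiler-style flow; the type and method names are illustrative, not an agreed-upon API. Usage would roughly be: Start the reporter, run the load test, Stop it, then hand the captured window to whatever performs the metric queries afterwards.

```go
package report

import "time"

// Reporter captures a time window around a load test, profiler-style.
type Reporter struct {
	start time.Time
	end   time.Time
}

// Start marks the beginning of the capture window.
func (r *Reporter) Start() { r.start = time.Now() }

// Stop marks the end of the capture window.
func (r *Reporter) Stop() { r.end = time.Now() }

// Window returns the range the report should cover, e.g. for a Prometheus
// range query issued after the test has finished.
func (r *Reporter) Window() (start, end time.Time) { return r.start, r.end }
```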
A
What about if you were to run the reporting tool afterwards and just gave it a time frame, and it just scrapes the metrics? Does it, or can it?
E
Would that work, or would it not get the information? Maybe that would only work if we're solely using Prometheus, yeah. So if there's anything we want to do that's different, introspection of the system, then it wouldn't work.
A
So let's say we run it at the start. This gives us options, I guess, is the point: options to either scrape from Prometheus or, presumably, do some sort of watching and gather the same data.
B
I would say that if we can run it later, that's better, because it doesn't introduce load into the system and can't interfere with the tests. But of course, if we think there is something that cannot be collected from Prometheus later, then we can change the approach; right now, though, everything comes from Prometheus.
A
That's something to think about, yeah, because it kind of defines the identity of the reporting tool. I'm trying to think of some use cases: what would be an example of something that we'd want to get while a test is being run?
E
It wouldn't be necessary. If we're going to completely depend on Prometheus for our reporting, then it seems like we could run this tool afterwards with the time period.
C
I also think we can get pretty far with it. I'm not sure about some things: especially right now we have a watching approach also for some Prometheus metrics, and some things may be hard to get with that, because some objects just disappear and you may not be able to watch them fast enough to collect something, and then it may be too difficult to distribute the metrics collection to the various components. (Are you talking about the granularity of the reporting?) Yeah, I mean, virt-controller is right now collecting, for instance, all the phase transitions, but if you want to, for instance, collect the time it takes to delete VMs, it could be impractical with that approach, because virt-controller is not necessarily the one that deletes the VM, and we may not get exactly the timestamp we want, because virt-controller may not be able to observe it.
A
So that would mean, if we ran this reporting tool after, would we still get the deletion time? Are we saying we can't, because we're going to miss the event?
E
Actually, we're going to get it with the Prometheus metric. We still get deletion because it's a histogram and we're updating it based on an informer locally, so the informer is still going to see that the deletion occurred, that the phase transition to the final state occurred, and it will get stored in that histogram.
C
Virt-controller is in charge of deletion, right, with the finalizer, actually, so we can get it, yeah. And even if not, I mean, we get a real delete event inside virt-controller for every VM. If that were not the case, we may not process it right now, but we should, actually. So yeah, I think we have the opportunity.
A
So, to answer this question: if we were to run it late, I could sort of envision it as the exact same idea as if we were to run it before. We're just going to gather information from Prometheus over this period of time; that's like our API, that's what we want to gather.
A
We want to gather this information for this period of time, and then, when we run it afterwards, the idea is that we're just going to query Prometheus for those timestamps. And this would give us the opportunity, if we wanted to pivot later, to say: okay, we're going to run it for this period of time. We just change it to some sort of time window that we run it in, so that wouldn't change much.
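A minimal sketch of that "query Prometheus for the window afterwards" idea, using the prometheus/client_golang API client; the address, window, and the KubeVirt phase-transition metric in the query string are illustrative and may differ by version.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// Query only the test window, e.g. the last 30 minutes.
	end := time.Now()
	start := end.Add(-30 * time.Minute)
	window := promv1.Range{Start: start, End: end, Step: 30 * time.Second}

	// 95th percentile of time-to-Running, from the VMI phase-transition histogram.
	query := `histogram_quantile(0.95, rate(kubevirt_vmi_phase_transition_time_from_creation_seconds_bucket{phase="Running"}[5m]))`

	result, warnings, err := promAPI.QueryRange(context.Background(), query, window)
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```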
C
And, in addition, that's what I kind of hinted at in the PR comment I made: it's possible at any time to just tell Prometheus to append specific labels to all metrics, starting from a specific point in time.
C
So it's also easy to just add a label with a test ID or something during the period of the test, then remove it again and replace it with the next one for the next test. Then you don't even need timestamps in the reporting tool, for instance, just as an example.
A
Okay, I think that gives us a pathway; we can start with that. I think this is easier too: we start with running it after, we just assume Prometheus, we'll make that assumption, and then we'll just put it behind some sort of API so that we have the opportunity, if we decide we want to, to come back and do something during the run.
A
All right, we've only got about one minute left. I like what we have here; I think this is pretty good. Any other last-minute thoughts? What else can we throw out here that we want in the reporting tool?
E
So, a really simple tool that, over a time period, gives us the VMI thresholds that occurred during it, and then lets us build out from there. Just make sure that we have a really solid, agreed-upon entry point for what this tool can start with, because that makes it actionable. I think it's actionable now, actually; through this discussion, somebody could go off and write this right now.
A
Okay, all right, we're at time. So thank you, everybody; this was pretty good, we got a lot done. All right, have a good day, everybody. Thank you very much and we'll see you online.