From YouTube: SIG - Performance and scale 2021-08-19
Description
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.lu45zu2oo32a
A: Okay, all right, welcome to SIG Scale. Today is August 19th. Let's start with the first item on the agenda: VMI-specific metrics. We had a discussion about this previously.
A: The discussion was about not having metrics that scale with the number of objects we create. That's important for a lot of reasons: we don't want to overwhelm Prometheus with a bunch of data, we want to be careful there since it can affect scale, and since we're relying significantly on metrics for our performance measurements, we don't want to overwhelm it.
A: So that's what this topic is about: finding areas where we'd want VMI-specific data, so we can identify a VMI and get some description of what's happening. I have some ideas around this; let's see if they make sense. For instance, right now we do a bunch of performance measurements: we report timestamps that record when a VMI enters each phase.
A: One thing that would be interesting is some sort of gauge that captures how many VMIs are in each phase. That would give us a general idea of what's going on in the cluster: if we're seeing a ton of VMIs stuck in Scheduling, and we already know there's slow performance there, and we're seeing a lot of them getting stuck...
A: You know, maybe we should investigate that. That's one idea. Another one is flagging VMIs that take longer than expected, given some threshold time per phase. Right now we're talking about thresholds anyway, and we're going to do this in CI.
A: Okay, sorry, all right: my internet dropped off again, so I'm dialed in from my phone. All right, hold on, let me see if we can cover this here.
A: Kevin, let me see if I can, okay, migrating host to you, Kevin. Okay, you're the host now. That's fine, all right, that works. So I don't know where I ended up getting dropped, but I'll just talk about that last point: VMIs that take longer than expected, given some sort of threshold.
A: The general idea is that if we have VMIs that are a little slow, and we have a general idea of what slow is, something we can make configurable, then whenever we notice it, we could capture VMI-specific data: we'd have labels that are very specific to the VMI, report those to Prometheus, and then we could build something like a dashboard around it.
A: So the goal is having subsets of VMs running in the cluster that we can get a very focused view on, for things that could be going wrong. That makes it easier to look at, or even just notice, things that are not working as expected. So what do people think of this idea of VMI-specific metrics, and the other ideas?
A: No? Okay, so there is this one then. I thought it was just a count, though, that just incremented. Right now I thought it's the number of VMIs that we have seen in a phase; I don't think we decrement it.
A: Yeah, so I'm thinking more of a gauge than a counter. If we have 50 VMIs running at any given time, our metric shows 50; if we have 15 in Scheduling, it shows 15, and so on and so forth.
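A minimal sketch of such a gauge, assuming the standard Prometheus Go client (client_golang); the metric name and the recount strategy are illustrative, not the actual KubeVirt implementation:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// vmiPhaseCount is a gauge, not a counter: it reflects how many VMIs are
// currently in each phase, so it goes back down when VMIs move on.
var vmiPhaseCount = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "kubevirt_vmi_phase_count", // illustrative name
		Help: "Number of VMIs currently in each phase.",
	},
	[]string{"phase"}, // one label value per phase, not per VMI
)

func init() {
	prometheus.MustRegister(vmiPhaseCount)
}

// recount would be called from an informer handler: recomputing the gauge
// from the full VMI list avoids increment/decrement drift.
func recount(vmisPerPhase map[string]int) {
	vmiPhaseCount.Reset()
	for phase, n := range vmisPerPhase {
		vmiPhaseCount.WithLabelValues(phase).Set(float64(n))
	}
}
```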
A: Okay, so then we are documenting it; I thought we weren't. Okay, then never mind, and this one, I think, is covered. That's good to know. How about the second one, what do people think of that?
A: Yeah, okay, that could work. I guess the idea is, Kevin, you shared that dashboard where we can see in general how long things take. If we were to do histograms, say the 99th percentile, for some of these, we'd be able to identify the super slow ones and look more closely at them, because that matters when we go to large tests, with thousands of VMs.
C: I don't know if I would do it on a per-VM basis, because, as we talked about, that can be a lot of labels. But in general, showing on the dashboard the average time spent in Scheduling, Starting, Pending... I don't know.
A: My point is that I agree with you: we don't want to have that many labels. What I'm saying is, could we get away with only labeling things in select cases? Only because it was slow, because it was over some threshold, we'd add a label for it that includes the VMI name, because we want it to stick out.
A: Yeah, because I think the idea follows the way we're doing the timestamps. Say we notice we're changing phase; that's how we do things: we have a change-phase function, and that's where we set the timestamp.
A: At that point in time, we could do a comparison, looking at the two times, because we have the objects, and ask: was that a really slow transition, or a reasonable one? I guess the question is whether that was an unreasonable amount of time spent, and if it was unreasonable, then we could flag it. We could say, okay, this is kind of strange, let's add a label for this.
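A rough sketch of that comparison, with hypothetical names; the point is that per-VMI labels are only emitted for the flagged, slow subset, so cardinality stays bounded:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// slowTransitionSeconds only ever receives per-VMI label values for
// transitions that crossed the threshold, keeping cardinality small.
var slowTransitionSeconds = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "kubevirt_vmi_slow_phase_transition_seconds", // illustrative
		Help: "Phase transition time for VMIs that exceeded the threshold.",
	},
	[]string{"namespace", "name", "phase"},
)

// onPhaseChange is a hypothetical hook in the change-phase path: it compares
// the previous and current transition timestamps that are already kept on
// the object, and flags only the unreasonable transitions.
func onPhaseChange(namespace, name, phase string, prev, cur time.Time, threshold time.Duration) {
	if elapsed := cur.Sub(prev); elapsed > threshold {
		slowTransitionSeconds.WithLabelValues(namespace, name, phase).Set(elapsed.Seconds())
	}
}
```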
B: Outside, you know, like the alerts that Kevin mentioned sound a lot more suitable for that, because you can change the alerts: these kinds of values will change for different systems. And it also sounds like something that you might monitor via Prometheus.
C: Yeah, you could also do it on a namespace basis. But one other thought on that, and I had it when we discussed this for the first time: all this phase transition monitoring is not really anything we need to do in our control plane.
C: It could also be a tool you run, like top, that anybody can run, or that you run in the cluster and expose to Prometheus, because all it has to do is watch the VMI resources. It can be done outside the control plane, in an outside process, just having a Kubernetes watch on VMIs and recording this stuff when needed.
A: Yeah, I mean, the way that I'm envisioning this is that it could be configurable, because you're right that it's going to vary per cluster and also per workload, and I could see this value being configurable based on whatever the workload is. But in terms of the use case...
A: It needs to be defined a little bit more in terms of how this would work, because it goes along with the direction of doing our performance measurements through Prometheus for our metrics. So I wouldn't expect another tool; I mean, I agree that you could do it that way, but along the same lines, we're going toward using Prometheus for all of this, so I think it's possible.
A: I think it's possible we could do it. I think there is a use case for it, but it needs to be a little bit further defined.
C: But did it make sense? Everything that's based on fields on a Kubernetes resource does not necessarily have to be a metric in our control plane, one that we expose and have to have a feature toggle for. It can be any Kubernetes client process doing that, creating that metric for us on demand. If you care about it, you just deploy that and you get the metric; if you don't care anymore, you remove it again, or you run it locally. Does that make sense?
C: No, this tool doesn't have to touch it. It just has to have a watch on the VMI resource, on the status of the VMI resource only. It can actually work for any Kubernetes resource: this tool could watch transitions for any Kubernetes resource that has a status and phase, or any field in the status, and you see how long it spends in a certain phase. It's nothing KubeVirt-specific.
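A minimal sketch of such an out-of-tree watcher, using client-go's dynamic informers so it works for any resource exposing status.phase; the group/version/resource and the use of stdout instead of a Prometheus endpoint are assumptions:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := dynamic.NewForConfigOrDie(cfg)

	// Any resource with status.phase works; VMIs are just one example.
	gvr := schema.GroupVersionResource{Group: "kubevirt.io", Version: "v1", Resource: "virtualmachineinstances"}
	factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 0)

	lastChange := map[string]time.Time{} // key: namespace/name
	factory.ForResource(gvr).Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			o := oldObj.(*unstructured.Unstructured)
			n := newObj.(*unstructured.Unstructured)
			oldPhase, _, _ := unstructured.NestedString(o.Object, "status", "phase")
			newPhase, _, _ := unstructured.NestedString(n.Object, "status", "phase")
			if oldPhase == newPhase {
				return
			}
			key := n.GetNamespace() + "/" + n.GetName()
			now := time.Now()
			if prev, ok := lastChange[key]; ok {
				// A real tool would expose this as a Prometheus metric.
				fmt.Printf("%s: %s -> %s after %s\n", key, oldPhase, newPhase, now.Sub(prev))
			}
			lastChange[key] = now
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	<-stop
}
```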
A: Yeah, I see what you mean. Maybe that's something we could discuss in this context. It wasn't really where I was going with this, but I understand the perspective that it could be sort of a tool, like what we're doing now with our audit tool, right. We could do this.
A: We could look at the timestamps and say, okay, this VM took too long; we could do that right now, but we won't get anything in our dashboards for it. I guess that's my point: the specific angle I'm looking at here is whether there are any ways we could take advantage of having VMI-specific metrics in our dashboards, with just a subset of VMIs, for the cases that we care about. That's kind of what I'm...
C: Agreed, I want it in Prometheus as well. The difference I'm pointing out is that I don't want to teach our control plane to decide what metric to emit based on a label or some toggle on a VM. If our control plane exposes a metric, it should be as safe and general as possible, and this separate process could bring in more if needed.
A: The idea of the metric is that it covers a subset of VMIs, because, as we already talked about last time, we don't want the number of labels to scale with the number of objects, so we want something to sort of limit that.
A: What we want to find is a metric that can give us a more granular view, while also not causing us to have a ton of labels that scale with the number of VMIs we have.
C: I think the reason what we have is not enough is that we count how many VMs are in status Pending and when they switch. But you don't know if there are, say, four VMs stuck in Pending forever, because the metric is still fluctuating, and you don't get the transition time on average, or at all, or per percentile.
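For the percentile side, a histogram keyed only by phase would give averages and quantiles without per-VMI cardinality, though it still would not name the stuck VMIs. A sketch with client_golang (metric name and buckets are illustrative):

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// transitionSeconds records how long each phase transition took, labeled by
// phase only, so cardinality does not scale with the number of VMIs.
// Averages come from _sum/_count; percentiles from histogram_quantile().
var transitionSeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "kubevirt_vmi_phase_transition_seconds", // illustrative
		Help:    "Time spent before entering each VMI phase.",
		Buckets: prometheus.ExponentialBuckets(0.5, 2, 10), // 0.5s .. ~256s
	},
	[]string{"phase"},
)

func observeTransition(phase string, d time.Duration) {
	transitionSeconds.WithLabelValues(phase).Observe(d.Seconds())
}
```

On a dashboard, something like histogram_quantile(0.99, sum(rate(kubevirt_vmi_phase_transition_seconds_bucket[5m])) by (le, phase)) would then surface the slow tail per phase.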
A: That's what I'm asking, because I understand this might not be the right approach in terms of how the general pattern for doing this kind of metric goes. That's where I wanted to go with this: is this something that we could see as reasonable or not?
A: Right, yeah. The first question is: can we make it more specific? And the second question is: should we make it more specific? Okay.
A: Those are the things, because I think there are use cases where there's valuable information that's still missing. As you pointed out, Kevin, you could have something that's stuck and you wouldn't know it. You could also have something that was very slow, and you wouldn't know which VMI it is. That's what this is targeting, those cases, and to see whether it's feasible, or whether maybe this is not the right approach and another tool is the right approach here.
B: I don't know, well, maybe, but I don't know if identifying which VMI is the slowest one should be the focus of the metric. We'd know that there are VMIs that are slow, and then, if they fail, we can go to the logs, check the logs of all the VMIs, see which ones failed and just debug that. That's the way I'm actually doing it. And the other thing was... oh yeah.
C: What I could imagine: the link that you sent is where the VMI transition time gets recorded right now. Depending on some condition, we just add the VMI name, and maybe the namespace, to that labels list. That raises the question of whether you'd want it triggered by a label, or a namespace label, or how you would do it.
A: We have this thing, like I wrote there, change phase, which sets a timestamp. I think that's the function you have here, VMI phase transition or something, one of those. Whenever we change a phase, we set the timestamp at that point in time.
C: Yeah, right, it would have to be, because I don't think we could hard-code something like that. So you could set a label like, I don't know, metrics.kubevirt.io/transition-time-threshold, set it to 10, and then, if the transition takes longer than 10 seconds, the label gets added.
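A sketch of that opt-in trigger; the label key follows the example above but is an assumption, as is the handling of malformed values:

```go
package metrics

import (
	"strconv"
	"time"
)

// thresholdLabel is the hypothetical opt-in label discussed above.
const thresholdLabel = "metrics.kubevirt.io/transition-time-threshold"

// transitionThreshold returns the per-VMI threshold, or ok=false when the
// VMI did not opt in, in which case no per-VMI metric labels are emitted.
func transitionThreshold(labels map[string]string) (threshold time.Duration, ok bool) {
	raw, found := labels[thresholdLabel]
	if !found {
		return 0, false
	}
	secs, err := strconv.Atoi(raw)
	if err != nil {
		return 0, false // malformed value: treat as not opted in
	}
	return time.Duration(secs) * time.Second, true
}
```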
B: Yeah, I think for this kind of specific threshold, it shouldn't be in the control plane; it should be external, as we mentioned. I think even in the Grafana dashboard, maybe it's possible to create some kind of threshold, so you can just show things that are higher than some value.
A: Yeah, I could write to the mailing list or something; maybe we can follow up on it there. Okay, cool, let's go to the next topic then. So Kubernetes 1.20, or greater than or equal to 1.20, has API Priority and Fairness, and at least as far as I could tell, there was no policy created for KubeVirt. This would be an interesting thing to do for a number of reasons.
A: One of them is that we want to make sure that our requests to the API server are not inhibited by anything else, and we also want to make sure that we're not a noisy neighbor: if for some reason one of our components is out of control, we're not hogging the API server. So it protects us and it protects others in the cluster, and we could define this per component.
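As a sketch of what such a policy could look like, using the flowcontrol/v1beta1 Go types (the API is beta as of Kubernetes 1.20); the schema name, service account, matching precedence, and the choice of the built-in workload-high priority level are all assumptions:

```go
package apf

import (
	flowv1b1 "k8s.io/api/flowcontrol/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// kubevirtFlowSchema routes requests from a KubeVirt service account to a
// dedicated priority level, so KubeVirt neither starves nor hogs the API server.
func kubevirtFlowSchema() *flowv1b1.FlowSchema {
	return &flowv1b1.FlowSchema{
		ObjectMeta: metav1.ObjectMeta{Name: "kubevirt-control-plane"}, // assumed name
		Spec: flowv1b1.FlowSchemaSpec{
			PriorityLevelConfiguration: flowv1b1.PriorityLevelConfigurationReference{
				Name: "workload-high", // one of the built-in priority levels
			},
			MatchingPrecedence: 1000,
			Rules: []flowv1b1.PolicyRulesWithSubjects{{
				Subjects: []flowv1b1.Subject{{
					Kind: flowv1b1.SubjectKindServiceAccount,
					ServiceAccount: &flowv1b1.ServiceAccountSubject{
						Namespace: "kubevirt",
						Name:      "kubevirt-controller", // assumed service account
					},
				}},
				ResourceRules: []flowv1b1.ResourcePolicyRule{{
					Verbs:      []string{flowv1b1.VerbAll},
					APIGroups:  []string{"kubevirt.io"},
					Resources:  []string{flowv1b1.ResourceAll},
					Namespaces: []string{flowv1b1.NamespaceEvery},
				}},
			}},
		},
	}
}
```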
A: We can do a lot of things with this, but the general idea is that we can create some policies around it with a FlowSchema and a PriorityLevelConfiguration. There are some good examples in the link there of what we can do, and there are already some existing examples by default in the cluster: if you have a 1.20 cluster, you'll see that there's a bunch. So what do people think about this topic?
B: I think it's something that I always think about, but I never come up with a conclusion. For example, in the tests that I'm doing in the CI environment, we have the Kubernetes master components dedicated to some nodes, and somehow I never did that, but I think that the KubeVirt controllers should be on the master nodes and shouldn't be sharing the worker nodes.
C: Priority and fairness is a bit of a different story, kind of. It's more about telling the API server what requests are prioritized, who to rate limit and when to rate limit: like not rate limiting KubeVirt as much as some random process. Okay, so it's your friend.
A: Yeah, think of this as Kubernetes' way of protecting itself. It's an API to make sure that someone doesn't just overwhelm the API server. Some of the features that come with it are that we can make sure our requests are going to get a shot at the API server, and we can isolate them.
A: You know, based on whatever: the user, the namespace, the service account, all sorts of things, the verbs that we use, the APIs. So we can make sure that those requests, by our webhook handlers for example, are not being interrupted by perhaps someone else that's using the same API.
B: Yeah, so maybe if we document how to configure those things in Kubernetes for our components, and have a straightforward demo of how to do that, it would be nice, because it's nothing to change in KubeVirt itself; it's just how to apply that for KubeVirt.
A: Yeah, so I think this one also needs a follow-up discussion. I'm already on it; it's something I was looking at, because there's a bunch of information that you can get from it.
A: In addition to the features I just mentioned, because you can filter by API, user, namespace, all this stuff, all the general traffic rules, you can isolate traffic based on what gets into these queues that are part of priority and fairness.
A: So you can see, for example, if KubeVirt is creating a ton of list requests, we'll see them get queued up, and we can actually have metrics that show this, even on a per-verb basis if we wanted to. So in addition to protecting yourself, there are also some things we can probably learn about what our traffic patterns are right now. I think there are a lot of benefits to it.
A: I'll create a follow-up discussion on this. I already have a document going with some of the ideas I have, and I want to do a study: how much memory this takes, how many requests per second we can take, what our queues should be, and stuff like that. I can start a discussion for that on the mailing list.
C: I'd love to see that, priority and fairness being implemented or used by KubeVirt, because it's kind of a big thing for the community, for Kubernetes, for the API server. It's still beta, but I think at some point it should get to where it's mandatory for some things in Kubernetes.
A: There you go. Okay, this one's cool.
A: Yeah, do you want to talk about this one? Oh, there's a PR; it's kind of extensive.
D: Yeah, so I've gathered some data on the logging, and it seems like during high traffic we see a lot of logs. Some of them are either duplicated or could be consolidated into one, so we could just save a lot, and in this PR I tried to do that. Some of the logs I've moved to a higher verbosity, the ones in the hot paths, or, like here...
D: In this execute method, I tried to consolidate these two VMI and domain logs that we previously logged as two separate lines, which resulted in lots of duplication. And right now I think it's also much better for searching for a given status, having those two in one place. And I think that's it.
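The shape of that change, sketched with klog/v2 structured logging rather than KubeVirt's own log package; the types and field names are illustrative:

```go
package example

import "k8s.io/klog/v2"

type vmiInfo struct{ Namespace, Name, Phase string }
type domainInfo struct{ Name, Status string }

// logSyncState emits one log line carrying both the VMI and the domain
// state, instead of two adjacent lines that mostly duplicate each other.
// Verbosity 2 keeps the hot path quiet unless explicitly enabled.
func logSyncState(vmi vmiInfo, dom domainInfo) {
	klog.V(2).InfoS("synced vmi and domain",
		"vmi", vmi.Namespace+"/"+vmi.Name,
		"vmiPhase", vmi.Phase,
		"domain", dom.Name,
		"domainStatus", dom.Status,
	)
}
```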
A: And did you see how much the reduction in logs was? Because I think I saw on the graph that these were pretty significant.
D: Yeah, so if you look at the graph, it was over some time, right, not a single run, because the logs are in the millions. For example, here in this for loop, I think we are saving around six million log lines, and over the whole interval we've collected 40 million. So, quick math...
D: Correct, yeah, it's around 50 percent, and this is just for this particular place. Overall, I think around thirty-five to fifty percent of the logs should be removed with my changes, assuming of course that the verbosity is set to two, because otherwise it's different.
C: Yeah, one concern I have with this specific part, if my math isn't wrong, is that this is reducing the amount of logs in this part by half, kind of, because it's combining logs. The only problem I see with it is...
C: Yeah, okay. The only thing I can see with those combined logs is that there's a not insignificant trend of doing log-based analytics and metrics, also powered by Grafana and such, where combined logs can be hard to extract data from. I don't know if anybody's doing that and has more experience with it, but for those use cases it's easier to have dedicated messages for two things that really are two things. That should make sense.
D: Yeah, I agree that there is some trade-off, and I'm not fully sure what the right approach is here. The problem is that, for example, in this else-if, where the domain exists and the VMI does not, we are kind of trying to recreate this VMI reference, right, instead of logging the domain, and that could potentially cause big confusion.
D: But on the other hand, you don't seem to get much more information from the domain than from the VMI. I think we would need to think about what we are actually getting by splitting this into two, and whether there is any particular reason to do that.
A: Okay, well, we can take it up on the PR, maybe. I think it's a great change: we're reducing a lot of logs and consolidating to single log lines. That's a really good change, cool. Okay.
A: All right, let's go to the next one, thanks! The next one is: the perf-scale load generator needs approval.
A: You don't either? Okay, we'll need David to do it then.
Okay,
all
right
the
next
one,
the
performance
evaluation.
So
it
looks
like
you
did
another
one
marcelo.
B: Yeah, I ran the performance measurements again, and this is the update for, I would say, at least this week's master branch. It was like the last one, but fair enough.
B: The update is that it failed to create 500 VMs, and I was expecting that. Let me explain first: I was doing another test, as you guys might remember, of how many VMs I can pack in a node.
B: Actually, it's the next task that I listed here in the meeting notes. I can create at least a little bit more than 300 VMIs per node.
B: However, here in this test I'm using three nodes, so I would expect at least 900 VMIs in the cluster, but it fails at 500. I didn't collect all the logs, so unfortunately I don't know why they failed, but they failed. We can check the Grafana metrics, though.
B: This is also the updated Grafana that I have; it might be interesting to see as well. And basically, I see here, just remember...
B: Okay, so yeah. First of all, we can go to this, I don't know who is sharing, if you can go to the VMI creation time, yeah, this one that is in the middle.
C: You need to increase the timeout when you create a snapshot; it asks you for a time, you type in 10 seconds or something, and it should be fine.
B: Yeah, I put one hour, and even then it fails to create the snapshot, so maybe it's my internet, I don't know; it didn't work. That's why I put this screenshot here, but I'll definitely try the snapshot again later to see if it works. Okay, so the interesting things here: let's take a look at these two last ones, which are actually 400 VMs and 500.
B: So 500 fails but 400 works, and we see this per phase, for example Running, and then Scheduled, and then Scheduling, especially for 400 and 500.
B: Did you see that it got scheduled within five minutes? This is the 95th percentile, so it doesn't mean that all VMs were super slow like that, but these are the worst ones.
B: So we get five minutes to be scheduled, and then it took five more minutes to actually start to run. It's somewhat expected, because we are trying to create a lot of VMs per node, but maybe that's too long, you know, to create that. It's just something interesting to see. The other one...
B: It's the one that's taking more time, and then we have Scheduled and Running, which look a little bit mismatched from one to the other, isn't it. But again, the VMI creation time is just taking the worst-case scenario, and we can see the long Running happens here and there. So yeah, the other things that we can... oh, first of all, if we go...
C: ...to the top one. A question on those graphs real quick: that is 400 and 500 VMs, right? Because the graphs are very close together on the numbers, which could mean that we reach a limit at 400 and 500 is only filling in a bit. It would be interesting whether 600 goes up higher, or if it also caps out at the five minutes on scheduling somehow.
B: Yeah, I tried to create more. However, as I mentioned, 500 failed, and actually the namespace got stuck, it didn't delete, and the test didn't continue with more VMs. But I'll try to run it again, to see if I can reproduce that with more VMs.
B: We are definitely reaching a limit here at 400, 500, but which limit in the system, isn't it? Because again, I have another test creating 300 per node, and it was working fine; it only failed now at 500. Here, if you see the VM count... that's good. We're missing some, you know, below.
B: And then many VMs fail with 500. Also, another metric that I put here is the rest client rate limit duration, and this should be fixed with the PR to increase that. But we are definitely hitting a lot of client-side throttling limits here for the API requests, and maybe that's related to the errors that we are seeing, and the slowdown that we are seeing.
A: Yeah, Marcelo, I was wondering, because it would be cool to see a comparison: exactly what you have here, plus an increased QPS and burst, to see what we end up with and how the graphs compare.
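For reference, the knob in question on a client-go rest.Config looks like this; the values are arbitrary examples, and how KubeVirt surfaces the setting is configured elsewhere:

```go
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func newFasterClient() (*kubernetes.Clientset, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		return nil, err
	}
	// client-go defaults are QPS=5, Burst=10; raising them reduces the
	// client-side throttling ("rate limit duration") seen during bursts
	// of VMI creations, at the cost of more API server load.
	cfg.QPS = 50
	cfg.Burst = 100
	return kubernetes.NewForConfig(cfg)
}
```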
A: Okay, yeah, maybe on the next one you do. That would be a cool one to do. It'd be awesome to see how it makes a difference.
A: Now that it's configurable. The other one is, if you scroll down, Kevin... I'm wondering, you see how we have the VM count, and we get some failed. I wonder if those scheduled VMs are going from Scheduled to Failed, and maybe that's why it levels off right there.
B: I think David tried to fix that before, isn't it? Yeah.
A: It shouldn't be there. Okay, well, Marcelo, next chance you get, you can try it with the QPS change, let's do the measurement, and then we'll see if Pending shows up for that one.
A: Yeah, okay, all right. Well, we're at time. There is one more topic; I don't know, can we cover this in like 30 seconds here, or do we need to push it to next time?
C: Yeah, it came up yesterday; that's mine. Wait, let me share a different screen. David asked yesterday, on my bug fix for the goroutines...
C: ...whether we can test that somehow, and we decided it's hard to test. One thing I wanted to bring up is that we could, or should, somehow measure this with those density tests, or be able to define a way of seeing regressions like that, by checking some threshold of goroutines or CPU load that is allowed before and after.
A: Once Marcelo's load generator merges, I've got the thresholds ready, and I'm going to hook them up to Marcelo's load generator. Then I think we have sort of the foundation for what we want to do for functional tests. And then having a functional test for this...
A: ...would be exactly that: using everything that's there and just putting in some code to say, here's before, here's after, or here's what we expect, like: we're not leaking. I guess for this one, you just create a bunch of VMs, you delete them, and we check that there's no leak, right.
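One way to make that measurable from a test is to scrape the component's Prometheus endpoint for the standard Go runtime go_goroutines gauge before and after the create/delete cycle; the endpoint URL and the threshold policy are left to the test harness:

```go
package tests

import (
	"fmt"
	"net/http"

	"github.com/prometheus/common/expfmt"
)

// goroutines scrapes a component's /metrics endpoint and returns the
// go_goroutines gauge exposed by the standard Go collector.
func goroutines(metricsURL string) (float64, error) {
	resp, err := http.Get(metricsURL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	families, err := new(expfmt.TextParser).TextToMetricFamilies(resp.Body)
	if err != nil {
		return 0, err
	}
	mf, ok := families["go_goroutines"]
	if !ok || len(mf.Metric) == 0 {
		return 0, fmt.Errorf("go_goroutines not found at %s", metricsURL)
	}
	return mf.Metric[0].GetGauge().GetValue(), nil
}

// Usage sketch: record goroutines(...) before the density test, create and
// delete the VMIs, wait for cleanup, then fail if the value afterwards
// exceeds the starting value by more than an agreed threshold.
```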
C: It could be part of the audit tool or that tool's scope, for example, as I think was mentioned. But yeah, I just wanted to raise attention that we should...
A: ...that we eventually get to this. Okay, all right, we're at time, folks. Thanks for your time, I'll see y'all online. Have a good day.