From YouTube: SIG - Performance and scale 2021-08-26
Description
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.4xsswmk8jszy
A: Okay, welcome to the SIG, everybody. It's August 26th. The link to the meeting notes is in the chat — add yourself as an attendee, please. Okay! So, today's agenda: the first item is a shared dashboard location. I thought of this after the last meeting, because we've had a bunch of dashboards that people have created around some of the metrics we've added, and I was looking around — I was looking at the kubevirt repo for dashboards, and it didn't look like some of the dashboards people were showing, which were pretty cool, had been merged into that repo. So I just wanted to bring up the idea: it seemed like that might be a place where we could all share dashboard ideas.
A: Does that make sense? What do people think? Would it make sense if we had some sort of GitHub repo where we could keep a bunch of dashboards that we could all share around this stuff?
C: There actually is a repository for that. I was planning to submit the dashboard that I'm using for the control plane there. There is a dashboard for the VM metrics, but it's missing the metrics that we have for the control plane.
C: I will submit the PR for that. It's in kubevirt — I don't remember now if the link is here in our document; it's somewhere. Let me check the link again.
B: One thing — I tried a few of those dashboards that are in the kubevirt CI, and one important point I would like to see for the shared dashboard location is that we document how to, or make sure that, the dashboards are properly exported and are compatible with the Grafana operator — or just export them as plain JSON and not as some custom resource, so we can import them into any Grafana. Because I couldn't import most of the dashboards.
B: I had to copy the JSON out of the dashboard, and that didn't work, so I just created my own.
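Note: for reference, a minimal sketch of exporting a dashboard as plain JSON via the Grafana HTTP API, so it can be re-imported into any Grafana instance. The UID `kubevirt-control-plane` and the environment variables are placeholders, not the actual dashboards discussed here.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// GET /api/dashboards/uid/{uid} returns the dashboard model as JSON.
	url := os.Getenv("GRAFANA_URL") + "/api/dashboards/uid/kubevirt-control-plane"
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("GRAFANA_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The response wraps the dashboard model under a "dashboard" key;
	// saving the whole body is enough for re-import via the UI or API.
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	if err := os.WriteFile("dashboard.json", body, 0o644); err != nil {
		panic(err)
	}
	fmt.Println("wrote dashboard.json")
}
```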
A: True. What I could see is: when we're doing measurements for our pull requests and we're trying to say "okay, this improves performance" in some way, it's good to see it in the tool, but it's also good to have a visual and the opportunity to show one. If we have an easy way to do that, I think it increases the likelihood that someone will include those dashboards, without having to do a whole lot of work — we can just bring up a cluster with make cluster-up, import them, and take screenshots. That might make it a lot easier. So this is the link that we have — that's where the current dashboards sit.
B: Oh, and a note — I shared a few of those Raintank dashboards, and somebody asked if we can host that ourselves. I looked it up, and Raintank is just any Grafana instance, so we can set one up; we would want a KubeVirt Grafana anyway. We can set it up in a way that it can also receive snapshots, so instead of taking screenshots we can actually share snapshots to the KubeVirt Grafana.
A: Okay. So for now maybe we can start with this, just so we have a shared location, and then eventually, once we have this place with CI dashboards, we can start taking some of the dashboards that we've created here, adding them to that Grafana job, and grabbing snapshots or whatever — that would be pretty cool. Okay, so that sounds good.
A: I think that's one of the takeaways: we have a place where we can collaborate on this. If we're adding a metric of some sort and it warrants a dashboard, let's make sure we have a follow-up PR with a dashboard of some sort, or at least open an issue to do it, so that whenever we're adding this stuff, we have a dashboard we can share along with it to make use of it.
B: And at some point in the future, if we build a general KubeVirt dashboard, or multiple dashboards that are also useful to admins, it might be an idea to publish those from GitHub to the Grafana marketplace, so people can search Grafana for "I want a KubeVirt dashboard" and just get one officially suggested.
A: A lot of places we can go with this. Okay, cool. Let's move to the second item: API priority and fairness. I mentioned this last time, and I wrote a document about the testing I did and some information about it, so let me talk through it for ten minutes or so. API Priority and Fairness was introduced as alpha in Kubernetes 1.18; it's beta in 1.20, so it's enabled by default. And basically, API Priority and Fairness —
A: There's a lot that we can do with this. The rate-limiting aspect of it can be broken down by API call — basically by RBAC-style rules, by user, by verb, everything — and you can create policies around all of those things to do some form of rate limiting. It's a really granular mechanism.
A: So it's actually really cool. In addition to protecting the API server, you can use it for other things, like protecting one user from another, or protecting some API: if you're a big consumer of an API, you can give yourself priority by user, or make sure those API calls get through at a higher priority.
A: So there's a lot we can do with this. The idea is that KubeVirt, on newer clusters, could ship a policy by default — just like Kubernetes does — to make sure that its traffic, the control plane traffic, is not interrupted.
A: I'll talk through a little bit of what this concept is. My goal is that if we can get an understanding of it, maybe we can discuss what that policy could be. It probably won't happen this meeting — I think it's likely to end up as a mailing list discussion — but just to give you an introduction to the topic, I'll talk through this here.
A: The first thing is that there are two APIs that come with this: the FlowSchema and the PriorityLevelConfiguration. The FlowSchema is basically the set of rules that defines what is going to be regulated by the PriorityLevelConfiguration — think of it as the "what": what is going to be rate limited. The PriorityLevelConfiguration defines the limit on outstanding requests and the number of queued requests for a FlowSchema.
A: Think of it as how something can get rate limited: the number of queues and all these other knobs. These configurations affect the priority, whether requests get rejected, and so on. Here's an example, just to give you a taste: this FlowSchema will capture VMI list requests from the service account lister-zero in this namespace. The rules look a lot like RBAC: we can specify an API group, the resource, the verbs, everything, and the subjects — service accounts, or all users if you want — and the namespace. So it gets pretty granular in terms of what you want to control.
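Note: the exact example from the meeting doc isn't in the transcript; here is a sketch reconstructed from the description, using the Go types from k8s.io/api/flowcontrol/v1beta1. The names "vmi-listers", "vmi-list", and the "load-test" namespace are placeholders.

```go
package apfexample

import (
	flowcontrolv1beta1 "k8s.io/api/flowcontrol/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// A FlowSchema that captures "list" requests for virtualmachineinstances
// from one service account and routes them to the "vmi-list" priority level.
var vmiListers = &flowcontrolv1beta1.FlowSchema{
	ObjectMeta: metav1.ObjectMeta{Name: "vmi-listers"},
	Spec: flowcontrolv1beta1.FlowSchemaSpec{
		// Lower numbers are matched first, as with the built-in schemas.
		MatchingPrecedence: 1000,
		PriorityLevelConfiguration: flowcontrolv1beta1.PriorityLevelConfigurationReference{
			Name: "vmi-list",
		},
		// Requests from the same namespace hash to the same flow.
		DistinguisherMethod: &flowcontrolv1beta1.FlowDistinguisherMethod{
			Type: flowcontrolv1beta1.FlowDistinguisherMethodByNamespaceType,
		},
		Rules: []flowcontrolv1beta1.PolicyRulesWithSubjects{{
			Subjects: []flowcontrolv1beta1.Subject{{
				Kind: flowcontrolv1beta1.SubjectKindServiceAccount,
				ServiceAccount: &flowcontrolv1beta1.ServiceAccountSubject{
					Namespace: "load-test",
					Name:      "lister-zero",
				},
			}},
			ResourceRules: []flowcontrolv1beta1.ResourcePolicyRule{{
				Verbs:      []string{"list"},
				APIGroups:  []string{"kubevirt.io"},
				Resources:  []string{"virtualmachineinstances"},
				Namespaces: []string{"load-test"},
			}},
		}},
	},
}
```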
A: What's important here is that there's a one-to-many relationship: many FlowSchemas can list the same PriorityLevelConfiguration, and that plays a role in how the fairness algorithm works, which I'll get to after I've talked about what the priority level configuration is. So, the example PriorityLevelConfiguration referenced above has a few fields.
A: There's assuredConcurrencyShares. The way to think about this is that there's a calculation on the Kubernetes API server side that determines how much concurrency you can have — how many outstanding requests. It's based on the max in-flight requests flag and another flag; those two flags basically determine how many requests at once the API server can handle, and there's a calculation that takes those values and this shares value into account, and that determines how much concurrency you get.
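Note: per the upstream documentation for the beta API, the calculation being described is roughly the following — the server concurrency limit (SCL) is split across priority levels in proportion to their shares:

$$\mathrm{SCL} = \texttt{--max-requests-inflight} + \texttt{--max-mutating-requests-inflight}$$

$$\mathrm{ACV}(l) = \left\lceil \mathrm{SCL} \cdot \frac{\mathrm{ACS}(l)}{\sum_{k} \mathrm{ACS}(k)} \right\rceil$$

where $\mathrm{ACS}(l)$ is the assuredConcurrencyShares of priority level $l$ and $\mathrm{ACV}(l)$ is the concurrency limit it is granted.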
A: That's basically your priority — one way to think about it. Then there's the queuing configuration: the number of queues, the queue length limit, and the hand size. Queues and queue length are pretty intuitive: the number of queues, and the length of each queue. Hand size is part of the fairness algorithm; I think the term comes from cards — like the hand you'd have remaining from a deck of cards. The way to think about it — the definition — is that it's the number of queues that a flow can end up in. What's important about that in this algorithm is — let me see, I'll show you, there's a good picture of it.
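Note: a sketch of the PriorityLevelConfiguration side, again assuming the k8s.io/api/flowcontrol/v1beta1 types; the name "vmi-list" and the numbers mirror the test described later in this meeting, not an agreed policy.

```go
package apfexample

import (
	flowcontrolv1beta1 "k8s.io/api/flowcontrol/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

var vmiList = &flowcontrolv1beta1.PriorityLevelConfiguration{
	ObjectMeta: metav1.ObjectMeta{Name: "vmi-list"},
	Spec: flowcontrolv1beta1.PriorityLevelConfigurationSpec{
		Type: flowcontrolv1beta1.PriorityLevelEnablementLimited,
		Limited: &flowcontrolv1beta1.LimitedPriorityLevelConfiguration{
			// Relative weight in the concurrency-limit calculation above.
			AssuredConcurrencyShares: 20,
			LimitResponse: flowcontrolv1beta1.LimitResponse{
				// Queue instead of rejecting immediately (the catch-all
				// uses Reject, which is why it has no queues).
				Type: flowcontrolv1beta1.LimitResponseTypeQueue,
				Queuing: &flowcontrolv1beta1.QueuingConfiguration{
					Queues:           10,
					HandSize:         4, // each flow hashes to 4 of the 10 queues
					QueueLengthLimit: 20,
				},
			},
		},
	},
}
```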
A: So the idea is that when there are a lot of heavy users hitting a bunch of different queues, you want to distribute those users across different queues, but the problem is that you don't want to end up in the wrong queue. Hand size is the number of queues that your flow can land in — in that example the number was four.
A: Let's say the rainbow here is a heavy user: it's going to clog up the queues it lands in, but if you're spread across four boxes, your request will eventually get through. You're not going to be run over by someone who has a lot of requests going through; you'll eventually get attention from the API server. And there's some probability math about this in this table.
A: There's a good document explaining what this is, but basically, given the hand size and the number of queues, you can see the probability that some number of high-intensity users will squish a low-intensity user.
A: Okay, so with that in mind — actually, let me back up for a second. The idea with this shuffle-sharding algorithm is that we want to have a lot of workers — a lot of flows, which are the workers for the algorithm — and we want them distributed. But we also want a reasonable number of queues, and the right queue length. They all have different behavior.
A: The way queue length was described is that it's better for burst requests: they don't get rejected, they can pile up in the queue. But all of that does cost more memory, with the number of queues that need to be maintained and so forth, so that's something to consider. The ultimate result, though, is that we don't completely overrun the API server with all these requests — which is what would happen if nothing were blocked at all and we just let the API server be overwhelmed with thousands of list requests, or something similarly heavy, until it ran out of memory.
A: Yeah, there are — if I can get to it. By default, you can see there are a bunch of flow schemas that are created, and they're all workers in this workload-high priority level configuration, which is here. So the control plane protects itself by using this configuration.
A: So you're going to get into the catch-all — well, see, I haven't tried this with the catch-all; it's a good question. You'll get into the low priority... I think what you'll end up with is a ton of rejected requests. You can see there are no queues. I haven't tried it, but there are no queues, no hand size, and no queue length, so I think you just get rejected. What I mean by overrunning the API server is the case prior to having this — having free rein, with no API priority and fairness at all. In the case of the catch-all, I would expect that these requests are rejected, since there is no queue.
D: At the same time, this is installed by default. Where do we fall in today? We fall into the catch-all today — I mean, we work with that, and our own client-side rate limiting does something to impact our performance, I guess. I'm trying to understand the difference — let's see.
A: So, take the case of a user that is, let's say, listing VMIs like crazy —
C: I also checked that on my cluster: it has just workload-high and workload-low, and the high has 40 concurrency shares — yeah, 40 — I don't know what that corresponds to, and the low is 100?

A: So let me see — I think it's priority... the 40 shares, and then low, and then where's the precedence... does anything even use workload-low, though? No, it doesn't look like it. Oh, yeah — service accounts get workload-low.
A: I don't know — I mean, this is the highest one here, except for this one. I don't know what they mean by low and high here; I'm just speculating.
A: Yeah, okay. So the testing I did was focused on trying to understand how different queues, queue lengths, and hand sizes affect things. There are a few assumptions that go into this. One is that during this testing I disable the client-side rate limiter, and the second is that I never see the API server get completely overrun.
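Note: a minimal sketch of what disabling the client-side rate limiter can look like with client-go (my reconstruction, not necessarily the exact test harness used here): a negative QPS turns off client-side throttling, so the only throttling left is whatever API Priority and Fairness applies server-side.

```go
package apfexample

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func newUnthrottledClient(kubeconfig string) (*kubernetes.Clientset, error) {
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	// Negative QPS/Burst disable client-side rate limiting in rest.Config
	// (the default is a token bucket of roughly 5 QPS with a burst of 10).
	cfg.QPS = -1
	cfg.Burst = -1
	return kubernetes.NewForConfig(cfg)
}
```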
A: The memory and CPU do go up, to different degrees, but the server doesn't get completely overrun, which is good — that's what's expected. And I highlighted some things I thought were more interesting.
A: One of the things I saw is that over the course of the tests, the latencies went down across the APIs, for different verbs — list, for example, get, and others. Things just got really fast. I noticed this, and when I went through the API server logs I saw how much faster etcd got during this — from caching or something; it was significantly faster.
A: Going from almost 2000-3000 milliseconds all the way down to 900, 500 — it gets really quick over time, to the point that it's actually interesting how it does this based on the load. So here's what I did: I ran 600 list requests per second against 50 VMIs — just a request in a namespace that gets all the VMIs. This is pretty expensive; 600 per second is extremely expensive. So the concurrency limit for this is calculated —
A: This is a Prometheus metric, so I grabbed it: it's 178, and it's the same for all of them. I did a by-namespace FlowSchema, and here's what I had for the priority level: 20 shares, 10 queues, queue length limit 20, and hand size 4. I had a bunch of metrics — I built a little dashboard to look at this — and this is basically what I pulled from it. I do see that we get 180 requests per second going through.
A: It eventually settles down at a much lower rate — at least, that's what it shows. We see a queue wait time of 1.5 seconds that becomes nothing over time. Our request execution time was very high at first, then it goes down. Rejected requests — this returns a 429 to the client — we were getting a ton of rejected requests at first, and eventually that completely goes away; this happens when the queue gets full.
A: That's one of the ways it can happen, and that's what all of these were: the queues were just filling up and requests were getting rejected. Dispatched requests: 350 per second, eventually settling down a little lower. The number of enqueued requests: 19, getting lower. List latency, like I was talking about, goes up to 10 seconds during this time; it eventually settles down around 8.
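Note: the numbers being read off here come from the apiserver_flowcontrol_* Prometheus metrics. A sketch of the kinds of queries such a dashboard could use — the priority_level and flow_schema labels exist on these metrics, but the exact panels from the meeting doc aren't in the transcript:

```go
package apfexample

// PromQL queries against the API server's flow-control metrics, scoped to
// a hypothetical "vmi-list" priority level.
const (
	// Concurrency limit computed per priority level (the 178 above).
	concurrencyLimit = `apiserver_flowcontrol_request_concurrency_limit{priority_level="vmi-list"}`

	// Dispatched vs. rejected request rates.
	dispatched = `sum(rate(apiserver_flowcontrol_dispatched_requests_total{priority_level="vmi-list"}[1m]))`
	rejected   = `sum(rate(apiserver_flowcontrol_rejected_requests_total{priority_level="vmi-list"}[1m])) by (reason)`

	// Queueing behavior: how many requests are waiting, and for how long.
	inQueue = `sum(apiserver_flowcontrol_current_inqueue_requests{priority_level="vmi-list"})`
	waitP99 = `histogram_quantile(0.99, sum(rate(apiserver_flowcontrol_request_wait_duration_seconds_bucket{priority_level="vmi-list"}[5m])) by (le))`
	execP99 = `histogram_quantile(0.99, sum(rate(apiserver_flowcontrol_request_execution_seconds_bucket{priority_level="vmi-list"}[5m])) by (le))`
)
```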
A: And you can kind of see from some of the dashboards that initially, during this time period, there's just an explosion of requests, and then it slowly, slowly settles.
A: All I did was create 50 VMIs in the namespace, and I have a pod running whose only job is to do a get against the namespace and list all the VMIs in it.
D: So, just trying to get this clear: we're seeing an increase in the queue length during the startup of the VMIs, while this pod is just issuing list requests, and then everything appears to settle down. Are we thinking it settles down because the API server — or our KubeVirt control plane — has settled down, and then all the pod's list requests can just take higher priority at that point? Or what's your theory here?
A: I'm not sure it's us doing this. At first we get a ton of requests, and we take a little while to fulfill them; it eventually gets a lot faster. But the time improvement that I saw was from etcd — the response from etcd — not necessarily from our control plane at all.
A: Yeah, that's exactly what I was thinking: eventually etcd gets so overwhelmed that it's effectively "you just keep asking me for the same things, I'm just going to have them ready for you," and it comes back really quickly. That's what I was showing up here: at the start we're as high as 2000 milliseconds from storage, and then we get as low as 299 milliseconds, and the total time improves significantly.
A: It took a little while — what do I have in my timer here — roughly, I let this one sit for an hour. There are two dashboards because about an hour went by during this process.
A: I can't tell — maybe the longer times just look off — but I remember it was about an hour. I'll show you the next one, because I do wait a little longer in each of these now, since I realized that's what happens. But you can see — here's the API server; you can see how it's processing all these requests, but it's not overwhelmed.
A: It's not completely overwhelmed, which is important, and we're still able to do lists and so forth. Eventually, you can see, when we're at eight seconds, we actually settle down.
A: So that's good. For the second test, I increased the number of queues to 32 — I think hand size was four, or six — with the same concurrency shares, so the concurrency limit stays 178, and I got slightly different results: a little higher dispatched-request rate, the same amount of rejected (rate-limited) requests, queue time went up, and queued requests went up — which makes sense with longer queues — and list latency eventually settled down much lower. This is just one dashboard now, instead of two; it looks like it's about 25 minutes or so. You can see how we initially start — I'd already created the VMIs.
A: So, to summarize where I'm going with this: KubeVirt would be able to protect itself from a really noisy user like this by finding the right balance of hand size, queue length, queues, and priority that fits.
A: What I'm doing is showing the extreme case — or a few different extreme cases — just to show how things can be affected. Let me go down to test four... or do I only have three? I did one more test; maybe I don't have it here. Yeah, it looks like I don't. One of the tests I did was with three different FlowSchemas — so three workers — with the same amount of list requests.
A: I think it was the same definition as test two here, and that was the best result that I saw. I think I forgot to copy it over to this document, so I'll have to find it. But anyway, the point is to try to find the right balance of all this — something that fits KubeVirt — so that we don't get overrun by somebody else.
A: When it goes up here, when it's very high — this is the moment when I start kicking off the list requests. And the steady state surprised me: I expected this level to continue, I expected the number of rejected requests to stay flat, but I think it's because of caching on the storage side that we see those requests return very quickly — eventually etcd is able to return things faster.
A: I don't have the etcd latency — yeah, that would be a good idea. I pulled these from the API server logs, just to get a sense of where the time was going, because after I saw this happening over and over again, I figured I'd trace them — and that's what these are.
A: But you can clearly see that there's something happening here that causes these to return significantly faster. Initially the log is littered with these three-thousand-millisecond requests, and then eventually you see a ton of these faster ones, and so on — it's really fast. That's a cool thing to see.
A: That's somewhat beside the point, though. The idea is that with 600 list requests per second — a very heavy operation — you can see how much the CPU and memory explode, but it doesn't completely overrun the API server; it's still able to serve traffic. And when I did the test with three different workers, they were still getting through — it actually had a higher dispatch rate.
A: It was somewhere a little over 400, so the number of workers we had worked better in that case and that scenario. So the idea is: if we were to define a policy of some sort — whatever the queues and hand size end up being — we'd probably end up with something where we take all the KubeVirt components and put them together. Maybe they're individual FlowSchemas, but they share a priority level configuration for KubeVirt, with some queue length — probably one like they had in the example for Kubernetes; maybe we can copy that one, or something like this that seemed to work fine. That's what we can use to protect ourselves from somebody doing this kind of thing somewhere else — to make sure we're getting to the API server, and also to make sure that none of our own controllers are doing anything we don't want them to do.
D: Can you create an issue on kubevirt/kubevirt and link this data? I think the issue should be that we should auto-generate some sort of FlowSchema, perhaps, and make that something the operator is capable of installing, maybe based on our data. It's kind of an investigative issue, but it would let people find your data easily, and maybe it's something that we can automate at some point in the future, if it makes sense.
A: Yeah, I agree. I think there's still some investigation to do — there are a lot of open questions — but that makes sense to me. For instance, another direction: we could even do FlowSchemas per API, or per verb, or per API per user account; there's a lot we could do. For example, if we know that list is heavy — and we do know that — we could isolate lists and let creates go through, have them processed separately, so that we keep fairness between the list requests without blocking any of the creates. So there's a lot we could do around this, but yeah.
A: I think we can start with an issue, and then, maybe as we get more data, I kind of want to start a mailing list discussion so we can get some consensus on what we think this should be.
D: And virt-handler is going to be calling lots of lists and watches on virtual machine instances. So I could see, in a kind of failure scenario — for example, say half of a data center has a power failure and we're bringing it back online — if we had a FlowSchema or something that gives virt-handler precedence, it might be able to bring up all of its virtual machine instances quicker, because it would be able to get the lists, the cache, and everything quicker.
A: Yeah. So, some next steps, like you said — that makes sense to me: create an issue, and we can get some attention on this. I still need to do some more tests, because there are a lot of different ones I can do. I want to do one that's a little more granular — not a single 600-list-per-second client, but maybe 10 different workers — and see how it performs, just to get an idea.
C: Yeah, okay. And also, if you can include the catch-all — what's the difference between the catch-all and the example that you did — to see how much it can be mitigated or improved by configuring this, that would be nice too. And my last comment is about etcd —
D: Ryan, are you running this on cluster-up or something?

A: No, not for these, really — this was tested on bare metal.
A: It could be this request — I don't know what it is for a normal request; I don't know what the baseline normally is. For example, in this case that I'm showing here, these entries only show up in the API server log when requests are really slow — over a certain amount of time; I think 500 milliseconds is the cutoff. So there are other requests in here that are faster; they're just not shown. Okay.
D: An etcd latency metric — maybe you had that in your dashboard and I just missed it — but you could do the p99 or whatever of that, and also see the average. Let's see.
A: I think that's what Marcelo was asking. I don't have the etcd dashboard attached here, but I think that's a good one to have, at least to show this, because we would expect to see this decline there too — or at least show up in the dashboard.
A: This latency is enormous at this point. Of the total list latency, I've got 10 seconds, and we're seeing three seconds here — I don't know where the rest of it is, but that's a huge number. And this is one second — that's four total right there. Oh, sorry, no — it's three total. But yeah, that is slow — although there are a lot of requests going on at this time.
A: Yeah, I think we've covered that. I think we're good. Okay, let's go to the next item: VMI-specific metrics. I have a mailing list thread started for this. Do we want to talk about it for maybe two or three minutes here? Is there anything people want to add, or do we want to just cover it in the thread?
D: It seems like we're looking for some very specific cases of a stuck VMI. Is it stuck from creation to Running? And then maybe in between: is it stuck between Scheduling and Scheduled, or Scheduled to Running? And then, is it stuck between termination and finalization? Maybe it's tough to represent that just in phases — I don't know; I had trouble just trying to figure out what it should be. Maybe let's collectively define what "stuck" is — what we're trying to solve or detect here. What is a stuck VMI? I know we had it on the mailing list; maybe we can just hash through that real quick.
A: Yeah. To me, a stuck VMI is one that is not progressing past its phase, whatever phase it's in — it's just not moving; it's been there too long. To quantify it: if we expect VMIs to go through Pending in less than a minute and this one's taking 10 minutes, that's how you'd quantify it as stuck.
D: Okay. I think we can represent all of this once we get your deletion histogram in there — with the collection of all the metrics that we have, we can represent this. We'd want to be looking at phases: phase transitions that take too long, and specific phase transitions — so Scheduling to Scheduled, if that takes longer than some threshold we come up with. Well, the thing that's tricky here is that we won't know about stuck VMIs until after they're unstuck.
A: Yeah, well, that's why I was thinking we could use the creation time plus the number of phases. What I'm assuming is that we pick an estimate — I'm thinking we use 10 minutes as the threshold time. So: if you've gone through this many phases, we make an assumption of some amount of time per phase, and if it's been that long since creation, then there's a high probability that you're stuck. That's how we define an unreasonable transition-time threshold — say I set it to 10 minutes, whatever it is. So it's like: okay, this VM has gone through three phases and it's reached its 10-minute mark — or 10 minutes plus the three phases, maybe a minute per phase, so 13 minutes — it's stuck, so let's report it as something that's stuck. I don't —
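Note: a sketch of the heuristic being proposed here — the thresholds are the illustrative numbers from the discussion, not agreed values:

```go
package vmimetrics

import "time"

const (
	baseThreshold = 10 * time.Minute // "say I set it to 10 minutes"
	perPhase      = 1 * time.Minute  // "maybe a minute per phase"
)

// probablyStuck reports whether a VMI created at creationTime, which has
// completed phasesSeen phase transitions so far, has exceeded its budget:
// the base threshold plus a per-phase allowance for phases already passed.
func probablyStuck(creationTime time.Time, phasesSeen int, now time.Time) bool {
	budget := baseThreshold + time.Duration(phasesSeen)*perPhase
	return now.Sub(creationTime) > budget
}
```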
B: What I think — pods do this, for example — is that maybe it's not directly looking at whether the VM is stuck, but if we can come up with the reasons why a VM gets stuck, each of those should always cause an error or an event, and that can be recorded. If a pod can't launch because of no resources, you see that in the pod's events; likewise if it can't launch because the volume is taking forever to mount.
D: The problem is that when a VMI is stuck — let's say stuck in Scheduling because there's just no way for the pod to be scheduled right now, either because you've asked for a resource that doesn't exist or whatever — the scheduler is going to keep trying to schedule it. But if it's not writing any sort of update to that pod, then nothing will queue up the VMI. If something writes a condition to that pod — anything — then we would re-trigger the reconcile.
D: So here's what I would recommend as a path forward for you, Ryan, if this is something you're going to research: figure out how we can detect a stuck VMI and what that means, and don't worry yet about how to report it or anything — just figure out how our code can practically detect that this thing is occurring, and what that would involve.
A: Okay — that's something I can research, then. Okay, sounds good. Next one: record VMI deletion time. This is the PR that I was working on.
A: The only open question I had, that I just wanted to bring up, was finding the right endpoint for recording deletions. Right now, the way I was looking at it, we use the removal of the finalizer as the endpoint. One of the assumptions I was making — and I've just had a chance to check this — is that when the finalizer is being removed, or once it's been removed, we should be in a Failed or Succeeded state, because the pod has exited.
A: That's what I was going to look for, but when I tried it, I didn't catch any. So then I also tried just catching the finalizer being removed, and I didn't see that either. So I'm not sure where to go with this — whether "no finalizer on the VMI object" is the right endpoint for recording delete time.
A: So the deletion would be measured from — when the pod is shut down, that's when we pull the finalizer off. Yes, so it would be from the time that we see the deletion timestamp — that's when the control plane has processed the delete request — up until the pod is removed.
D: You can measure accurately, right now, the time from deletion to the time that the VMI hits a finalized state. I believe that's possible, because we have a finalizer on the VMI that guarantees the virt-controller component will see the VMI one last time before it is deleted.
A: That's what I'm seeing — that's what I was saying. Once KubeVirt sees that, at the last moment — I think it's the "is final" check — it verifies the state and then removes the finalizer, and I was hoping I could catch the event of the object being updated without the finalizer, but I'm not catching it. That's the end.
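Note: a sketch of the kind of watch being described, assuming a client-go shared informer over VMIs. Whether the final update without the finalizer is actually observable is exactly the open question here, so this version measures from the deletion timestamp to the informer's delete event instead:

```go
package vmimetrics

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/client-go/tools/cache"
)

// addDeletionTimingHandler records how long a VMI took to go away: from its
// deletionTimestamp (set when the delete request was processed) to the
// informer's DeleteFunc firing (the object leaving the cluster).
func addDeletionTimingHandler(informer cache.SharedIndexInformer) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			// The informer may hand us a tombstone wrapping the
			// last known state of the object.
			if tomb, ok := obj.(cache.DeletedFinalStateUnknown); ok {
				obj = tomb.Obj
			}
			m, err := meta.Accessor(obj)
			if err != nil || m.GetDeletionTimestamp() == nil {
				return
			}
			elapsed := time.Since(m.GetDeletionTimestamp().Time)
			// In the real PR this would feed a histogram metric.
			fmt.Printf("vmi %s/%s deleted in %v\n", m.GetNamespace(), m.GetName(), elapsed)
		},
	})
}
```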
A: Okay, well — forget deleting the VMI for a second, then. I could do what you suggested, Kevin, but setting that aside: what about the pod? I could get the deleted event on the pod. Is that even going to get me anything? What would my ending time be if I tried to catch when the qemu process ends?
A: I was interested mostly in the VMI, because I thought that made the most sense — it's the object that's our interface for tracking this — but maybe it's not possible. In terms of the use case, I think both would be okay, and as long as we clarify what it is we're measuring, I think it's fine. What I'm mostly after is the right endpoint.
C: When the finalizer gets deleted, you could emit the Prometheus metric there — instead of the watch that you were doing — because when it's deleted you can check all the finalizers.
A: Yeah — let me play around with this a little bit and see if I can find a path forward here. All right, we're over time. Can we cover this next item in one minute, or should I save it for next time?
D: One minute, yeah. I just want to know the status here. I see that we have a periodic job running today, running the perf stats — is that accurate? Yeah, okay, and —
C: Yeah, so the plan is to do this, and also to have this new setup — right now it's running on nested virtualization in the CI.
D: Yeah — so do we have any thresholds set today?
C: Yeah — Roman actually requested that we remove the thresholds from the PR in the beginning, because we said we could first just check the executions and then add the thresholds, especially since it's actually running now on nested virtualization, colocated with all the jobs. But yeah, next week I'll try to prepare the infrastructure and move forward with that.
C: But definitely the idea is to replace the job that we have now with the tools that we're creating, yeah.
D: And begin using that to understand what we'd actually want to set the thresholds at for our environment. And you said, Marcelo, that you want to move the cluster or something? Let me make sure I've got that straight: we're running in nested mode right now, but you have a dedicated environment or something?
C: Yes, exactly. Right now it runs like the regular functional tests in the CI — kubevirtci. kubevirtci creates VMs with a Kubernetes cluster inside, runs KubeVirt on it, runs the functional tests, and shares the cluster. But I'm going to introduce a new cluster — actually, we already have the machines.
D: Sorry, I just wanted to summarize real quick: where are these changes being tracked? Where can I go to figure out what's being worked on here and the progress that's being made — do we have that?
C: I have it in Red Hat issues, but we should — yeah, I should create an external issue in the kubevirt repository describing it, shouldn't I? So I will do that, yeah.
B: For my part, if we are actually running those tests periodically right now, I'd like — or I'd prefer — to get something from them first. The priority is to get the dashboards and the Grafana set up so we can do something with them, because right now they're running for nothing, right?
C: Yeah, we can double-check, but I think it's already collecting the metrics; I need to see what's happening. I will follow up on that today and update as soon as possible.