From YouTube: SIG - Performance and scale 2021-09-02
Description
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.4t9i91had6to
A: Okay, all right, welcome everybody to SIG Performance and Scale, September 2nd. I added the link to the document in chat, so you can add your name to the attendees. Great, okay. So, what we'll go through: please add any agenda items, but we'll start with the first item.
A: So this is a discussion that I brought up from the mailing list a few weeks ago, but I wanted to back up and clarify one of the original goals for the discussion, in case there were any other comments about it, and clarify a few things. This was originally brought up two weeks ago. The original ask that I had was around VMI-specific metrics, and the context is that we had discussed having metrics around VMIs, but we didn't want metrics that ballooned out of control.
A: So one of the things that we currently do with metrics is we have a sort of summary of the amount of time it takes in phases. We accumulate them into buckets; we don't actually output the specific VMI labels, such as the name. And so the difference is that we have a static number of metrics that come with the phase transitions that we do.
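For illustration, a minimal sketch of that bounded-cardinality pattern using prometheus/client_golang. The metric and label names here are illustrative, not KubeVirt's actual ones:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// phaseTransitionSeconds records how long VMIs spend moving between
// phases. It is labeled only by the phase pair, never by the VMI name,
// so the number of series stays constant regardless of cluster size.
var phaseTransitionSeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "vmi_phase_transition_seconds", // illustrative name
		Help:    "Time VMIs take to move between phases.",
		Buckets: prometheus.ExponentialBuckets(0.5, 2, 10), // 0.5s up to ~256s
	},
	[]string{"from_phase", "to_phase"}, // bounded label set
)

func init() {
	prometheus.MustRegister(phaseTransitionSeconds)
}

// ObserveTransition accumulates one transition into the shared buckets.
func ObserveTransition(from, to string, seconds float64) {
	phaseTransitionSeconds.WithLabelValues(from, to).Observe(seconds)
}
```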
A: We don't actually report the name of the VMI on them, because those labels would quickly grow based on the number of VMs in the cluster. So the question that I asked here was: are there any situations where we'd actually want to know the actual VMI? When would we want to pull back the curtain and say, okay, this is the VMI that we care about, because it's doing something that we want to look at more closely?

A: One of the use cases I brought up was what I called VMIs that are stuck: things that weren't progressing, that were taking a long time. That's the mailing list thread where I brought up the topic, stuck VMIs. One of the things that came out of our discussion last week, and the thing I want to clarify, was that a stuck VMI won't be creating events. So the way I want to change the definition is: if we have a VMI that's, say, Pending, or maybe it's in Scheduling and it's waiting on a device to be assigned to it or something, and it's just sitting there, the pod just isn't doing anything, so we're not going to get any events. That could classify something that's stuck: it's not progressing at all. But there are also other cases, so I split this into a second category, which is VMIs that are slow. They're ones that are progressing, but they're just taking a long time, some amount of time that's longer than what we expect, and this could be for any number of reasons. But say we went from Scheduling into Scheduled, and we knew roughly how long it took to go between those phases, and we noticed that one VMI just took an incredibly long time; we'd see this in our dashboards. It would be useful to know which one it was, so that we can trace it and get a better look at it. So that was one of the things I wanted to clarify about this. But this is a fairly general topic in terms of other metrics we could add and how we could label them. Are there any thoughts on this, though? Does this sound like a better use case?
B: Ryan, can you hear me? Yeah? Great. Right, I think the second of the two points you mentioned is something you already discussed last time, where I wasn't there, so I might add something which you already said. But regarding these two cases, especially the second one, that VMIs which are stuck won't create events: I think for this one you summarized it pretty well, that you would want to see events there, like normal kubectl would get events. For the first case, would it be more something which should be handled via monitoring, and less via labels or something? Or do you think, and it's only a little bit unclear to me, that you wanted to see it directly on the objects?
A: We can see this show up in the dashboards. If you have, say, a 99th percentile for a histogram, you'll occasionally see that some VMs are just a little slow, but you don't know which one it is. Say there's a lot of churn, a lot of creating and deleting happening: we don't know which VMI it is. You have to do a lot of work to figure it out. So my suggestion is that if we can see these events, perhaps we can pass the actual name of the VMI, in addition to the other information we're passing, so we can locate it. Okay, so it's sort of an advanced way of monitoring, so that we can have a subset of VMIs that we can look into if we want to.
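As a hedged illustration of the dashboard view being described, this is the kind of PromQL a panel might run for the p99 phase-transition latency, reusing the illustrative metric name from the sketch above rather than KubeVirt's real one:

```go
package dashboards

// P99PhaseTransition is the sort of query a Grafana panel might use to
// surface slow transitions. Because the histogram carries no VMI-name
// label, the panel shows that *some* VMI was slow, but not which one,
// which is exactly the gap discussed here.
const P99PhaseTransition = `
histogram_quantile(0.99,
  sum(rate(vmi_phase_transition_seconds_bucket[5m])) by (le, from_phase, to_phase))
`
```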
C: On the not getting events on stuck VMIs: would it be an idea to watch events for the VMI pods and forward those, just as we forward status? Like, for a VMI pod event, we'd create a VMI event.
A: Yeah, well, so this one: I guess where we are with this discussion is that I'm not sure how we could solve it. I think, Kevin, you were the one who said that Kubernetes has this problem too. If a pod is stuck, what is Kubernetes going to do about it? It doesn't do anything about it. And so...
B: I guess we have this kind of gap where we are waiting for Kubernetes to do the scheduling action for us. We are actually mirroring pod conditions already in the VMI, but on the pod conditions you cannot necessarily see that, for instance, the scheduler tried something again. You would only see that in the events, and that is where we are probably silent.
A: So, I mean, if we could get some events, then perhaps this could be something that's solvable. If the scheduler is doing something there and it's posting something on the pod... I mean, we do see this: if you're Pending and it's "no nodes are available", it'll do that like 100 times or something, so you can see that there's something. But what about cases where you're waiting for, say, your CNI to do something, like to provide you with an interface? Will that show up on the pod?
B: I guess the main issue here is that on the pod status... so, for instance, you have an issue with mounting something, or with CNI; that would add, let's say, a condition X on the pod status. So it would add that condition on the first error and send an event. But then, when the kubelet retries with CNI to mount it, the condition in the status does not change at all, not even the timestamp, but you will see another event. That's mostly to avoid storms; you can see what you get when you're watching pods. You would get an immense amount of warnings if the status were updated on every retry. And yeah, I'm also not sure what to do if we don't listen to pod events directly. We're looking at the pod conditions and we're mirroring them, so that you see on the VMI what's currently going on. But just because the pod conditions are not changing does not mean that there weren't, for instance, 50 events sent in the meantime for that pod.
B: Controllers interpret the external situation at the moment when they are evaluating the objects, but they're not looking at any events or anything. And I think in general it's a very safe pattern to do it exactly like this. What we're doing is the pattern as it's intended to be, because it's safe and scales well and everything. But yes, it has the disadvantage that you can't always see directly on the object what's going on.
A: So the thing that's most important for me: I think the first item here would at least be a start, and then maybe we can talk about the other one as more of an advanced case, if we want to expand this to monitoring things that get stuck. I think the first one, at least in terms of calculating performance, would be initially valuable, so we can locate the outliers, and then the other one could come later, as a way to weed out any sort of external things happening; like if any of the device plugins or something is slow, we can maybe capture those here.
C: Yeah, and one more thought on the event mirroring: we don't need to do that. It's kind of solvable on the client side as well. Right now, if you do kubectl describe, you get events for the resource you're looking at, but the events are there anyway. You can also just query events, like "only give me VMI events" or "only give me events for this VM or this VMI", because we have the labels on the pods. So we should be able to query that. Not tested, but it should be possible.
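A rough sketch of that client-side query with client-go, filtering events down to a single VMI's virt-launcher pod via a field selector. The namespace and pod name are placeholders, and, as noted above, this approach is untested:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Hypothetical names: the namespace and the virt-launcher pod
	// backing the VMI we are interested in.
	ns, pod := "default", "virt-launcher-testvmi-abcde"

	// Ask only for events whose involved object is that pod.
	events, err := client.CoreV1().Events(ns).List(context.TODO(), metav1.ListOptions{
		FieldSelector: "involvedObject.name=" + pod,
	})
	if err != nil {
		panic(err)
	}
	for _, e := range events.Items {
		fmt.Printf("%s\t%s\t%s\n", e.LastTimestamp, e.Reason, e.Message)
	}
}
```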
A: Okay, I think what I'll take away from this, what I'd like to investigate next, is taking it towards this second category of things that are slow. So, assuming that we'll have the transition times, see if we can capture these in a way that's sensible. The way I proposed it on the mailing list was that we have a sort of...
A: We'd have a threshold that we expect, configurable per transition, that we can set to a large number, and if a transition goes over this number, then we can say, okay, just add the name of this VMI.
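A hypothetical sketch of that proposal; the threshold map and the label logic are illustrative, not an agreed design:

```go
package metrics

import "time"

// Hypothetical per-transition thresholds, loaded from configuration
// (for example the KubeVirt CR, as proposed below). Zero means the
// VMI name is never exposed for that transition.
var slowThresholds = map[string]time.Duration{
	"Scheduling->Scheduled": 2 * time.Minute,
	"Scheduled->Running":    5 * time.Minute,
}

// labelForTransition returns the VMI name only when the observed
// duration crosses the configured threshold; otherwise an empty label
// is used, so series cardinality stays flat in the common case.
func labelForTransition(transition, vmiName string, d time.Duration) string {
	if t, ok := slowThresholds[transition]; ok && t > 0 && d > t {
		return vmiName // outlier: pull back the curtain
	}
	return "" // normal case: no per-VMI label
}
```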
A: Yeah, I mean, do we want to discuss it? The idea is that whatever I'd do would be configurable; we'd do it in the KubeVirt CR or something, and it'd be optional. We wouldn't use it by default; it would just be something you could use if you want some sort of advanced look into how things are going with the VMIs.
D: Yeah, I think last time Kevin mentioned it could be like a standalone tool watching the VMIs and generating some warning or something, and then you have the logs of that, and you can track that and create the thresholds, but outside the control plane itself. It's just a tool that we run that watches all the VMI objects, in our namespace maybe, and then you can mark them and create thresholds.

D: It's kind of a debugging tool, isn't it? And it can go in the direction of the tool that David Vossel is developing for monitoring, too, but instead of being something run at the end of the test, something online. I think that's what Ryan wants to do.
E: Would it make sense to extend that kind of thought process to our monitoring, where we have verbosity in our monitoring? Where, if somebody increases the cluster verbosity for monitoring, we start to include more information in our monitoring labels, perhaps even more metrics that are more intensive for the logging or monitoring stack, things like that, that we didn't want by default, but maybe during certain low-stress scenarios we would. Is that a concept that we want to even consider?
B: For instance, David had the PR where he added the timestamps when transitions are happening. So it would be rather trivial, when I see an alert going off which says some amount, let's say five percent, of the VMIs are suddenly starting slower than usual, to then just run my other diagnosis tool, which would really just fetch the VMIs, look at the phase transitions, and give me, okay, those are the five ones which are slowest right now, and I see it immediately.
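A hedged sketch of such a diagnosis pass, assuming the phase-transition timestamps from that PR are available on the VMI status. The types here are simplified stand-ins; the real KubeVirt API types differ:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Simplified stand-in for a VMI carrying its phase-transition history.
type transition struct {
	Phase string
	At    time.Time
}

type vmi struct {
	Name        string
	Transitions []transition // assumed sorted oldest first
}

// slowest ranks VMIs by their longest gap between adjacent phase
// transitions, computed purely from timestamps already on the objects.
func slowest(vmis []vmi, n int) []vmi {
	gap := func(v vmi) time.Duration {
		var max time.Duration
		for i := 1; i < len(v.Transitions); i++ {
			if d := v.Transitions[i].At.Sub(v.Transitions[i-1].At); d > max {
				max = d
			}
		}
		return max
	}
	sort.Slice(vmis, func(i, j int) bool { return gap(vmis[i]) > gap(vmis[j]) })
	if n > len(vmis) {
		n = len(vmis)
	}
	return vmis[:n]
}

func main() {
	// In a real tool this list would come from a List() call against
	// the cluster; it is left empty here as a placeholder.
	var all []vmi
	for _, v := range slowest(all, 5) {
		fmt.Println(v.Name)
	}
}
```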
C: Okay, and I want to re-advertise building that. I think I prototyped it, and it was very trivial to build something that just takes the data we have in the VMI and creates metrics from it when needed, because we have all the data in there already; it's not going away.
A
I
mean
I
mean
you
could
well
I
mean
I
understand
the
perspective
like
we
could
I
mean
this
could
be
something
in
the
audit
tool
as
well
like
it
doesn't
have
to
run
as
a
watch.
We
could
just
scrape
the
time
steps
based
on
what's
currently
there
if
we
don't
want
to.
If
we
just
don't
need
the
history
we
just
kind
of
grab.
If
you
notice
something
is
wrong,
we
can
just
capture
which
one
it
is
yeah
I
mean
I
can
see
that
perspective.
A: Yeah, I mean, to be clear, I'm not necessarily sold in any direction. I'm just trying to figure out what's the right one. So it seems like folks like the client-side way of doing this: Kevin, you're on the client side, and other folks are on the client side too, it sounds like. So it seems like that's the consensus.
A: To me, those are the two options: going through the audit tool and doing it after I notice something is wrong, or having it directly in Prometheus, having it directly in the dashboard.
A: How about this: I think starting with the client side makes the most sense to me. First of all, I don't think it's a massive commitment to do it, and I think it would be useful in general just to have it. So to me, that's a good place to start, and then, if it ever comes time that it makes sense to expand this, that we want more history, I think then we can have the discussion of what you have, Kevin, versus having it directly in Prometheus.
B: Also, Ryan, I was wondering, since you have some experience running KubeVirt at bigger scale: what do you do with events in general? Do you mirror them to the logging tool for post-analysis or something? The events, like for the pods; same as everything you'd get with kubectl get events.
A: Yeah, we do have Kibana. I don't know if we capture all the events; I believe we're just grabbing logs from all the components. I'm not sure if it gets to the point of having the events, but that's a good point too. Okay, yeah, okay, that makes sense. I think I have a path forward: I'm going to go with the audit tool. I think that'll get a good start to this.
A: Okay, let's go to the second point. I had this earlier; this was also from last time. So, Marcelo, you created a PR for this already, which is good. Basically, just to reiterate from last time: Marcelo, you've done some really good presentations talking about the different data gathered, and you've shown a lot of good dashboards. So I thought it made sense if we could all share a dashboard that we can show whenever we're doing testing.

A: So we can just compare: we all have access to it, we don't have to build a new one each time, and we don't have different data or anything like that. We'd just have an apples-to-apples comparison whenever we show any pictures. So for any sort of changes we want to make to the performance dashboards, let's contribute to the community dashboard in kubevirt/monitoring.
D: Yeah, actually, I had been waiting to commit this new PR, so once I saw your message I just did it right after. Well, thanks. So yeah, this is an updated version of the dashboard that you guys saw before. I was improving it a lot and including many things that I think are important for the control plane.
D: So we have a few categories now: request rates and latency, then the workqueue metrics, then etcd metrics, and general process metrics, you know, memory, CPU, file descriptors and network, and Golang stats: garbage collector memory, and, well, it actually has goroutines, and threads are missing here, and storage operations, which are interesting in the tests, especially when deleting VMs. Sometimes it takes a lot of time to delete VMIs, and when it's taking a lot of time to delete VMIs, I see a lot of errors for unmounting the emptyDir directory. So this slowdown might be related to that; some of this matters, and I think it's interesting. So this is the new dashboard. I don't know if I have a picture of it right now, but...
A: You have it on your... are you able to share your screen?
D: I think I need to reopen it. Oh, can you see something? Yeah? Okay, great. I thought I would need to fix that. Okay, so this one you guys already saw: the read calls, basically GET and WATCH and LIST operations, and we have the durations. Something that worries me is the virtual machine LIST: it's taking one minute. The WATCH I was also expecting to be slow, but a LIST taking one minute... I always see that, and something there is very slow. I don't know exactly what request is taking one minute, but some LIST requests take one minute. Anyway, the goal here is just to describe the dashboard. So we have the write calls, the same thing but for PUT and DELETE; this is a metric that's interesting. I also keep the rate-limit duration, and the VMI creation. For some reason the Pending phase doesn't show up for me; I need to investigate that more. Do you guys see that?
D: Oh, it might be zero; that's right, yeah, I don't show the values that are zero. Okay, thank you, that's the reason. Okay. So I also have this VMI count, and the rate. The rate is interesting just to see, more or less, how many VMIs per second are being created when we do the density test. And then all the workqueue metrics that we already saw before, and then process open files.
D: So it shouldn't be that problematic here, but it's just something to keep an eye on. And before, we didn't have the threads, but now we do, so it's good to have not only the goroutines but also the number of threads being created. And the garbage collector; that's been problematic right now. virt-api is the one that spends the most time on garbage collection during the tests. Everything else looks fine to me here. And etcd...
D: I think we were discussing this before: the etcd performance, especially the request duration, is something that we need to keep an eye on. Anything that is higher than 10 milliseconds, the official etcd documentation says you need to look at it; it's problematic. And the storage: a lot of unmount operations when it's deleting the VMIs. Okay, maybe it's expected to happen, I don't know, but there's some correlation here. Anyway, this is the dashboard, so there it is, and the idea is to have it open so anyone can play with it.
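For reference, a hedged example of the kind of etcd latency check being described. The speaker applies a 10ms rule of thumb to request durations; etcd's own tuning docs state that guideline for the 99th percentile of WAL fsync latency, which is what the query below watches, assuming a standard etcd deployment exporting the usual histogram:

```go
package dashboards

// EtcdFsyncP99 flags slow etcd disk writes. The etcd tuning guidance is
// that p99 WAL fsync latency should stay below roughly 10ms; sustained
// values above that point to storage problems.
const EtcdFsyncP99 = `
histogram_quantile(0.99,
  sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (le, instance))
`
```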
C: I just wanted to ask if maybe we could move the VMI metrics to the top, like how many there are, because it's the main indicator you're looking at, mostly. Everything depends on how many VMIs there are: if there are no VMIs and you get a lot of errors, it's bad; if you have a lot of VMIs...
D: Yeah, yeah, the idea is that I push that to the monitoring repo, and I think you already mentioned before that we actually maybe should pull the dashboard from this repository, you know.

D: I don't remember; I think this is the up-to-date one. I ran it today with the latest master, but I need to double-check that; I don't know.
A: Okay, all right, thanks for that. Okay, let's go to the next item: metrics focused on VMIs and not VMs.
C: Yeah, I added that because it came up in my team a few days ago. I was asked if we have a count and some other basic metrics on VMs, and I noticed, yeah, I don't think so. We're focusing a lot on the VMI, because it's the main workload, but we also still have a first-class citizen object called VM, and it also progresses through stages and such, and I wanted to mention that.

C: What we looked for in that specific case was the number of VMs that are running. But right now we're talking about phase transitions and adding metrics to our dashboards, and we do all of that for VMIs, and I don't know if we should do that for VMs.
E: I think we have metrics that represent a running VM today; that's just the VMI metrics that we have. I think if there's something specific to the VM controller, that's where it makes sense to make VM-specific metrics, so, for example, VMs that are in... I don't know what else we would really want that's just specific to VMs.

E: There is a VM-specific flow: if we have a VM with dataVolumeTemplates, it's going to create that DataVolume, and I think it's going to wait for that DataVolume to complete before moving on to creating the VMI, so that would be a specific one. Beyond that, yeah, I think it's just seeing how many VMs there are.
B: For something like "not running VMs", I'm not sure. For instance, "not running VMs which should run", yes; but I'm not sure what that gives us. Do we also count how many... yeah, I don't know. I think there are ways to see how many ConfigMaps there are, and so on, right?
A: Okay, does that satisfy your ask, Kevin?
B: I think it makes sense; people are normally not using pods directly, right?
B: Manually, I think it makes sense here too. I can definitely see some scale use cases where you just do your own VMI stuff, but for the usual VM case, where you want to stop it, restart it, modify it and start it again, and so on, you probably want a VM. But there are other things; the VM is one controller on top of it, I would say.
B: What you very often need, I think, is to monitor your whole namespace to see if something goes wrong. You would probably not necessarily see that a pod has issues in the storage provisioning phase, where it's waiting for PVCs; you would just see that it takes long to start. But you would probably see in the events, and in some metrics, that the storage provisioning itself takes a long time, and you would basically monitor both and see.
A: Okay, okay! Well, one of the things I was going to bring up, or just maybe discuss: we kind of have the lay of the land right now. Marcelo's got the density test; we have that in CI, right? David wrote the audit tool; Marcel, you did the load generation tool. So that's kind of looking at what we have.
A: I think that was one of the things we had from last time, so we're getting close to being able to tie a few of these things together and start getting a bunch of valuable information on a per-PR basis. So I think the next step, and I think this one's for you, David: you're going to do the thresholds, and as part of that, this is going to be in CI, right?
E: Yeah, so initially I'm just going to export the perf results, and then, after we see a few runs of that, we can establish the pattern of what we want to set for our thresholds in that environment, and we can commit that.
A: Okay, so I guess that's good, yeah. So, as part of this, right, David, we have to integrate a bunch of the tools, right? Or is it that you want to run the audit tool and just generate the results in CI, and then maybe tying it together in CI could be done separately? Would that work?
E: What do you mean by tying it together?
A: So we have the load generation tool, we have the audit tool, and we have Marcelo's density test; so, tying together three of those. But that could be a separate task from just generating the thresholds, or yeah.
E: So I think, Marcelo, correct me if I'm wrong, you were going to be replacing the current density tests, or at least what's generating the load for the density test, with your new load generation tool, and then we can integrate, independently of that, this perf audit tool as well. And the perf audit tool doesn't have to have thresholds immediately: it can just gather results and export them, and then we can decide on thresholds after we get a few iterations of data.
D: No, it's not created... we created it already, so yeah. And I also integrated the dashboard in CI, to see the job that is running there. I had a talk with Federico, so there is a Grafana dashboard right there. But I don't know; we cannot really see the metrics with the dashboard yet. We can maybe import them; I don't know if we can edit the Grafana dashboard there, but I'll have a look at that.
D: No, I didn't. So the CI infrastructure has a Grafana dashboard, as I'm saying, but I didn't play with it, and I don't know if we can see the job that is already running there. Well, we can see it, but I don't know which metrics are exported, and whether we can dynamically include a new dashboard there, or if we need to...
B: Yeah, but the dashboards... so if we get your PR merged in kubevirt/monitoring, we can just run our deploy job every day, and it would pick up the latest change there and deploy it, from my side, to the Grafana dashboard. I guess that would be something which you'd want to do.
A: Okay, so who can... yeah. Okay, and so then Marcelo wants the dashboard.
C: What would be great for the Grafana: I don't know what other metrics we have in there, but we are exporting metrics with Prometheus jobs with our test runs. So it would be great if we could use Grafana, or if we had a Prometheus UI, to explore those metrics without creating a dashboard, because we can't access them in other ways.
A: So, just to understand: this has all the information about CI. Is this the work that you're doing yourselves, that you'd integrate some more information based on a job, like we could get the perf data from it?
B: It's there already; we're just not exposing it.
A: Okay, all right. Well, I think those are the three things here. I assume you're looking at this one; David, you look at this one. I can help with the middle one: hooking up the load generation tool to replace the way you're generating load in the density test right now. Yeah, give that one a shot. Yeah, I was trying to...
A: Fine; I mean, I think you understand the load generation pretty well, so it'll probably be quite a lot faster for you. Okay, cool, all right. Are there any other topics that we want to discuss?
D: Yeah, just one last comment. If you guys remember, I was playing with creating 500 VMs per node, and for our scaling tests we want to pack as many VMs as possible in the future, and I was hitting a lot of libvirt timeouts. I created the cluster with Kubespray, and the first cluster I created was with the Docker runtime.

D: Then I tried with CRI-O, and it actually got worse: it could create only 400 to 500 VMs when I was creating 580. And then, when I tried to use containerd as the runtime, I could create 500 VMs without any complaint about creating the containers, so without any libvirt timeout. Also, using CRI-O, I actually received a lot of events saying that the CRI-O runtime was overloaded and was delaying the creation of containers, something like that. So it might be that Docker was also seeing the same issue there, the runtime being overloaded, and that's why I couldn't create 500 VMs; but using containerd as the runtime, I could do that. Then I moved to the next issue, which was a shortage of memory, which I already discussed with Roman: we can actually allocate the minimal memory for VMs, VMIs, now, and then I can schedule more VMs per node.

D: And finally, just to say it in five minutes: I managed to create 500 VMIs per node, so yeah, I will share the new experiment with you guys soon.
A: Okay, cool, yeah, that would be cool. And then, if you do the rate-limit change that Roman mentioned, seeing the difference in create time, or just how much is being rate-limited, would be cool. Okay, all right. We have four minutes left, I just remembered, so there are a few open PRs I want to draw attention to, in case these are ready.
A: The pprof profiler: is this one good? Has there been any more review that's needed on this? Looks like you have one "looks good". Oh.
A: All right, Marcelo, are you comfortable with this, and I can do the approve? Yeah? Okay, okay.
A: Yeah, actually, I don't... yeah, Roman, I think you'll have to do the approve here.
A: Okay, all right, here's the next one. This is the monitor request counts. This one's good; it's got an approve, and it looks good to me. Looks like it'll just go in after CI passes. Okay, yeah.
A
I'm
still
working
on
the
the
failed
phase
transition
metrics,
I'm
going
to
transition.
What
I'm
doing
slightly,
I
think
like
based
on
your
last
comment
date,
I'm
almost
thinking
that
this
needs
to
be
actually
in
create.
The
only
thing
we
need
to
do
is
actually
just
catch.
A: This one merged, yeah. It did.
C: Okay, that's the CPU one: the goroutine fix merged, and so did a few of the backports, but not all of them, because it seems like somebody didn't backport everything consistently. So the backports fail for missing images and lanes and very weird stuff I don't understand, and I couldn't really get help. I can't bring up the energy to fix releases that nobody else touched for a long time, even though releases before them did get touched.
A: To the end of it, that's right; we could... I did the backport internally, so it's fine.
E: But in the future, once we approach getting into CNCF incubation and eventually GA, the predictability of our backports and our release schedule, how long the community actually supports releases, will kind of be defined. Right now, it's just kind of in the backlog.
A: Okay, well, we're at time. Everybody, thanks very much, I'll see you all online. Have a good day.