From YouTube: SIG - Performance and scale 2021-07-29
Description
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.twy9rph886f0
A
Okay, let's get started. First up on the agenda we have the perfscale audit tool. I think this one is mine. What do we have left here? Does everyone feel pretty decent about it?
A
If we look at the description, I updated it with a workflow that shows how we can give the tool a start time and an end time, capture the results during that window, and then compare those results to thresholds. So we give it an input config with our desired thresholds.
A
So we want to know that we get a VMI to the Running state within X amount of time, and then, when we get the output, we'll see both what the p95/p99/p50 results were and whether they met our threshold. We could use that for our periodic testing and things like that. Does that satisfy everyone, how that's all laid out there? Any comments or concerns?
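
As a rough sketch of the kind of input config and output being described, assuming hypothetical field and metric names rather than the actual perfscale-audit format:

    package main

    import (
        "fmt"
        "time"
    )

    // InputConfig is an illustrative stand-in for the audit tool's config:
    // a time window plus the thresholds the percentiles must stay under.
    type InputConfig struct {
        StartTime  time.Time
        EndTime    time.Time
        Thresholds map[string]float64 // e.g. "vmiCreationToRunningP95Seconds": 60
    }

    // Result is what the tool would report for one measurement.
    type Result struct {
        P50, P95, P99 float64
        ThresholdMet  bool
    }

    // evaluate compares a measured percentile against its configured threshold.
    func evaluate(p95, threshold float64) Result {
        return Result{P95: p95, ThresholdMet: p95 <= threshold}
    }

    func main() {
        fmt.Printf("%+v\n", evaluate(42.0, 60.0))
    }
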
B
Yeah, it looks good to me. My only concern now, and it's not to block the PR, so I think it should be merged, is with the new tool. I don't know how new it is, but the one that you just marked? That's the second one, the cluster profiler.
B
I think they share similarities in how they work, you know, the performance framework, and I just made some comments in the KubeVirt cluster profiler. Maybe it would be better to have everything in only one place, you know, instead of having a lot of different tools spread around.
A
Yeah, I'm fine with that. So maybe let's jump to that KubeVirt cluster profiler PR. That's the second one on our agenda, and it's kind of related to the audit tool. This cluster profiler does something a little bit different than the audit tool right now, and maybe there's a way to converge this behavior. Certainly we could make the perf audit tool do both types of behavior, but I'll give a brief rundown of the differences real quick.
A
So the audit tool takes a time range, goes to Prometheus, pulls data through that time range, and then compiles some results and compares them to a threshold. So it's something that can be done retroactively.
A
The profiler is something that we run, and it actually triggers things like tracing in our actual components to begin capturing data. It has to be started at the very beginning, before the stress test starts, and it has to be stopped and dumped at the end of the stress test, and this gives us things like pprof dumps for CPU and memory.
A
So we can figure out where we spent the most time in execution. And I added something that I think is a little bit controversial, where I'm counting what API resources we're calling and the actions. So, for example, I'm counting how many times every component calls a list or a get on a specific resource like a pod, and I'm aggregating all of those into a report, and that's something we can get in the dump as well. So that part, possibly, we could put in Prometheus.
C
So I really like the pprof part, I've said it before. I've been looking at distributed tracing the last two days and I'm still at it; the pprof part might actually be more useful than that to some extent. For the request metrics, I thought we already had those metrics somehow, or that we collect them; I asked about that in the comments.
C
I think we could use the metrics we already collect. The difference might be that they are not as granular or not as up to date. I wouldn't pull them from Prometheus here; I don't think I would mix the two tools, with one pulling from Prometheus. The only thing I would change is that we already have the counters in our code, and we might be able to read the counts we already have instead of building new counters, but this tool seems fine like that.
B
Yeah, so if it's only doing the pprof part, it makes sense, if it's only doing this profiling. However, I saw the HTTP metrics being collected, and I think those metrics, at least some of the metrics that David is trying to count, already have that information.
B
I listed two metrics there, but also in the KubeVirt code there are these reflector metrics that also count the list and watch calls, although I cannot see the name parameters for some reason; I need to double check why that's not being exposed. But we have, or we should have, some metrics there showing these API requests, and I like the idea of having a report showing many things.
B
I also commented in the cluster profiler PR that pprof has two different APIs. We can use one, which is what you use, David, and another one exposes it as an HTTP API. In that case you can also determine, for example, the amount of time that you want to profile, like you are doing with the audit tool, you know, 10 minutes, whatever you're doing, and also do some live analysis during the execution of the code. It has a user interface, so you can check that while it's running, which might be interesting also.
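
For reference, the HTTP flavor being described is the standard library's net/http/pprof handlers; a minimal sketch of exposing them (the port and the profile duration below are just examples):

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
    )

    func main() {
        // Profiles can then be pulled on demand while the process runs, e.g.
        //   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=600
        // for a 10-minute CPU profile, or /debug/pprof/heap for memory.
        log.Fatal(http.ListenAndServe("localhost:6060", nil))
    }
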
A
I'll take a look at the net/http/pprof profiling that you've pointed out; I haven't looked at that in detail. I think my initial reaction to that is how to access these endpoints. Right now I can get ingress through the Kubernetes API server and through our subresource endpoint and actually retrieve the runtime profiling data. For the real-time data, I don't know how we'd aggregate that as easily, but that's something I'll look into, I think.
A
For the sake of this conversation, it sounds like we're in agreement that pprof is useful, and somehow we should make that simpler to use.
A
All right, so we can probably table that and say we're going to do something there somehow. When we look at this profiler tool as it is now, we have pprof, and then we also have this HTTP stuff that I'm doing, and the HTTP stuff is a little bit controversial because it appears like it could be a Prometheus metric, and I've been thinking about that as well.
A
I don't know how to get the data from an existing metric. What I want is to know exactly what API calls occurred on our specific resources from our components, and I've been unable to get that from one.
A
All I want is to know exactly what API calls come from every component within our control plane over a specific period of time. So if I look at every single API call, the individual counts that have occurred for each one, and we see that, for example, we're calling 10,000 lists on PVCs and that's not expected, I want to go and investigate that. Or if we see that there are new list or new get calls occurring, I want to be able to go and figure out where that's occurring and why. So I want to know exactly what component it came from, and I want it identified as unexpected.
B
Right, so yeah, we can discuss it, and it will actually be useful for me also. So, for example, in apiserver_request_total you have the verbs, so you can see at least the watch calls, and then you have the groups, and that shows what's actually being called. So, for example, if pods or virtual machines are being called, it will show the resources, the components, and it's a histogram, so it has counts; you can count that.
C
The metric you're talking about is the apiserver one?
B
The apiserver one, and also we have this, the rest client one, yeah, that should...
C
The apiserver one is not the one he was looking for, because that's any request made by anybody, and I think what he meant was the requests we make as the control plane, and that would be the rest client metrics. But I'm not sure if we use them consistently everywhere; we should have something like that, I saw something like that. We have the rest client metrics for the calls we make, but I'm not sure they are really in every one of our clients.
A
Let's say they are: how does that give me the per-verb, per-resource information? How would I gather that?
C
I don't have a cluster right now, it's booting, so I can't try. Wait, yeah, I'll send you something close to it. Let me check.
C
Yeah, the rest client metrics: I also had a problem with the structure of them, in that I couldn't get exactly what I wanted; some resources were not a part of it. I had some confusion with them as well. Right, where are the metrics? I can't find the list anymore.
C
I think it's /api/v1, etc., if I remember correctly.
A
Yeah, so there's no URL there. So we know that lists have become increasingly latent, but we have no idea which ones, and we can't individually count what URLs are there. That's actually tricky, because it's difficult to figure out exactly what the resource was and whether it was a...
B
Yeah, we didn't hear you. I just put an example here in the chat: the URL actually has the resource. Here I have an example where it has the pods, so it's listing the pod endpoints, and whatever other resource has actually been called, and it's counting per component and...
A
A lot more, like more histograms, right? So it...
A
Yeah, it's tough. I had to really expand it, and this is also a...
A
There's no... these are the callbacks that we are...
A
I can create a new metric that gives me exactly what I want. It would be a histogram.
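
A minimal sketch of what such a purpose-built metric could look like with client_golang; the metric name and labels here are made up for illustration, not an existing KubeVirt metric:

    package main

    import "github.com/prometheus/client_golang/prometheus"

    // apiCallCount counts control-plane API calls per component, verb and
    // resource. (The discussion mentions a histogram; a counter is the
    // simplest shape if only per-label counts are needed.)
    var apiCallCount = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "kubevirt_component_api_calls_total",
            Help: "API calls issued by KubeVirt components, by verb and resource.",
        },
        []string{"component", "verb", "resource"},
    )

    func main() {
        prometheus.MustRegister(apiCallCount)
        // e.g. recorded from a round-tripper wrapper in each component:
        apiCallCount.WithLabelValues("virt-handler", "list", "pods").Inc()
    }
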
A
One thing that's also difficult here for me to understand is how we would... so I want to use this data, however we get it, to determine when unexpected API calls are occurring, whether that's the frequency of the calls or whether new ones have been introduced that we don't expect to happen during a test.
A
How would I get the things that I don't know about? I guess that's what I'm trying to ask: when I request the data, it would be a histogram with a label that I'm not expecting. So how would I get information about the thing that I don't expect?
C
You would probably pull the metric for all components and then group by component and resource, or by component, resource and verb, and then you will see the numbers that you can compare to the previous run. You can graph that, and then you'd actually see that, say, the gets for pods were much lower in the last run than the gets are now by virt-handler.
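
A sketch of that kind of grouped query against Prometheus, reusing the hypothetical metric name from above (the Prometheus address, time window and label names are assumptions):

    package main

    import (
        "context"
        "fmt"
        "log"
        "time"

        "github.com/prometheus/client_golang/api"
        promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
    )

    func main() {
        client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
        if err != nil {
            log.Fatal(err)
        }
        promAPI := promv1.NewAPI(client)

        // Sum the call counts over the test window, grouped by component,
        // resource and verb, so two runs can be compared side by side.
        query := `sum by (component, resource, verb) (increase(kubevirt_component_api_calls_total[30m]))`

        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        result, warnings, err := promAPI.Query(ctx, query, time.Now())
        if err != nil {
            log.Fatal(err)
        }
        if len(warnings) > 0 {
            log.Println("warnings:", warnings)
        }
        fmt.Println(result)
    }
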
A
Okay, so here's what I think I'll do for now with this profiler tool: let's forget the HTTP stuff and I'll go with pprof. I might clean it up, and if we think this is a good idea, then I'll probably introduce the tool just with pprof for now, aggregating that. For the HTTP part, I'm going to experiment with a new Prometheus metric that gives me exactly what I want, do whatever I need to do to get what I want, and then we can look at what I've done and figure it out.
A
Yeah, I think it would be good. What do you think? So we talked about tool convergence and things like that. I don't know if virtctl is even the right place for this. Should I just create...
A
I don't know if merging it into perf-audit is accurate either, since that's a tool that retroactively looks at this information. I can create a new tool that's just the cluster profiler, similar to perf-audit, and then in the future, if we ever decide to converge those two, like the audit tool and the profiler tool, together, we could. Since it's really just pprof, I could even call it pprof-aggregator or something, I don't know, something that's just specific to that.
C
I wouldn't have minded it in virtctl, but I think a small separate tool for now would also help. Also, in the back of my head I still play with this idea of it being useful for other projects, and if it's a small tool, it's easier to show off; you don't need virtctl. And if we want to put it into virtctl, we can just include it there, then it's just a command. If you use cobra as well, you just register it and you have it there too.
C
The profiler is mostly a dev tool you use while working on the code. The data you get from the profiler is hard to compare run by run. There is little value in collecting it per test run, for example, and pitting the runs against each other, because profiling data is hard to compare, but yeah.
A
Okay, all right. So just to summarize: the perfscale audit tool that I have now, it sounds like we feel comfortable with that and the workflow that I have there. So can we...
C
Yeah, I spent last Friday, and then I think yesterday, because I was sick in between, looking at the goroutines; more on that a bit below. During that I remembered that we talked about tracing in the past, and I was feeling kind of excited about it, so I wanted to give it a try and see how complicated it is to add Jaeger tracing and OpenTracing to virt-handler, and it was fairly easy.
C
Sadly, I don't have it running, because my Azure cluster got killed and it's rebooting, but yeah. After adding that and installing Jaeger in my OpenShift, every run of virt-handler's execute method got a trace I could look at. I saw nice charts of how long each run takes and where it spends time.
C
So it shows that virt-handler is looking at the virtual machine "test", and that it spent so much time reading from the cache, updating it, doing this and that. Compared to the profiling it's more work to add, because you actually need to annotate each function to do that, and for some of them you also need to pass a Go context, so I had to change some signatures to have the possibility to create spans.
C
So it's a bit more work than the Go profiling, but it allows us to do it on the fly and add metadata. Compared to pprof, where you see how much time is spent in a single function in general, this shows you how much time is spent in a single function for this run, which can be useful to debug those scaling issues we saw or any errors that occur. I'm not sure yet how much more it will bring us if we have the pprof stuff, so I'll keep looking at it. virt-handler was fairly boring; I only instrumented it for the tracing so far, and it was hard to get in there because you always need to put the context in, and almost nowhere in our code do we propagate context so far, so it's quite some work sometimes, but yeah.
C
Yes, so with the context, you start with your root context and you can pass it through your entire call stack, basically wherever you need it, however far deep you need it. If you change it, you get a new context based on it, so it's kind of immutable, and what tracing does is use that context; context is very useful for passing stuff along down the tree, so tracing uses the context to ride on it. Another way would be...
C
If you don't want to pass context, you could pass the spans directly and build the tree that way, but with context you already have that. So whenever you call a function, you pass it a new context based on your previous context, with the new parent span and parent trace you have, and this way you get this tree of calls that you can then annotate.
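
A minimal sketch of that pattern with opentracing-go; the function names here are placeholders, not the actual virt-handler code:

    package main

    import (
        "context"

        "github.com/opentracing/opentracing-go"
    )

    // execute stands in for a reconcile entry point that owns the root span.
    func execute(ctx context.Context, key string) error {
        // Start a child span of whatever span is already carried in ctx
        // (or a new root span if there is none) and get a derived context.
        span, ctx := opentracing.StartSpanFromContext(ctx, "execute")
        defer span.Finish()
        span.SetTag("key", key)

        return updateCache(ctx, key) // pass ctx down so callees add child spans
    }

    func updateCache(ctx context.Context, key string) error {
        span, _ := opentracing.StartSpanFromContext(ctx, "updateCache")
        defer span.Finish()
        // ... real work would happen here ...
        return nil
    }

    func main() {
        _ = execute(context.Background(), "default/testvm")
    }
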
C
Context is both a cancellation mechanism in concurrent Go, but it's also a method of passing stuff along, building on each other. I've used a logging library that also did that. So instead of calling log.Log like we do, which some people consider a bad practice because our logger is a global variable of sorts, it passes the logger through the context, and you can call something like "log from context" and then add, say, the VMI name, and if you pass the same context to the next function, it also has this VMI name in there.
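
A generic sketch of that logger-in-context idea using only the standard library (not the specific library being referred to):

    package main

    import (
        "context"
        "log"
        "os"
    )

    type loggerKey struct{}

    // withLogger stores a logger in a derived context.
    func withLogger(ctx context.Context, l *log.Logger) context.Context {
        return context.WithValue(ctx, loggerKey{}, l)
    }

    // loggerFrom retrieves it again anywhere further down the call stack.
    func loggerFrom(ctx context.Context) *log.Logger {
        if l, ok := ctx.Value(loggerKey{}).(*log.Logger); ok {
            return l
        }
        return log.Default()
    }

    func handle(ctx context.Context) {
        // Any function receiving this ctx sees the same enriched logger.
        loggerFrom(ctx).Println("processing")
    }

    func main() {
        l := log.New(os.Stdout, "vmi=testvm ", log.LstdFlags)
        handle(withLogger(context.Background(), l))
    }
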
A
Yeah, so that's pretty interesting. I think it's going to be really invasive, yeah. I'm not opposed.
A
It seems more useful for, like, an engineer who's trying to understand a live operation. So let's say I work at Uber and I need to understand why this specific customer's request keeps getting some error, or something like that, or where we're spending the time. I can look at the tracing, look at the span of that request, see it trace through our entire actual operation, and gain an understanding, live, of how that occurred.
A
Whereas with pprof, we're just running this on our dev clusters and running stress tests that aren't live, so we have a little bit more flexibility. I guess what I'm getting at is that we don't necessarily need something like that.
C
It seems more... it is more an operational observability tool than, let's say, a debugging tool, or it can be a debugging tool, but yeah. It's more useful if you experience a problem in the cluster and you can look at it and see, say, why a VM is not spinning up, and you can see it's waiting: the operator is waiting or stuck or taking forever in the reconcile loop, and you can maybe see why.
B
But that's not what's happening here, so we want to actually trace virt-handler, for example, internally and see what's happening inside it, and pprof covers that, maybe. So yeah.
C
The difference is granularity: pprof shows you how much time Go spends in a function, and OpenTracing shows you how much time it spends in every call of this function.
C
And it can give you the metadata, like: it spends more time in calls reconciling this single VMI, while the other one just tells you that in general a lot of time is spent in there. So it's more operational, more for when you have a running cluster and you want to see what's going on, while pprof is again more for us while developing. So yeah, I'm also still not sure how useful it will be.
C
And regarding the overhead...
C
Right, the calls themselves and the functions are not the problem. It's the recording that you enable or disable: there's a collector running in the background that collects the spans, and that's the part that can slow things down. You can tell it to only sample, like, every second request, or based on probability and stuff like that, and then it gets faster again.
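
A sketch of configuring that sampling with the jaeger-client-go config package; the service name and sampling rate below are only examples:

    package main

    import (
        "io"
        "log"

        "github.com/opentracing/opentracing-go"
        jaegercfg "github.com/uber/jaeger-client-go/config"
    )

    func initTracer() (io.Closer, error) {
        cfg := jaegercfg.Configuration{
            ServiceName: "virt-handler",
            Sampler: &jaegercfg.SamplerConfig{
                // Record roughly 1 in 10 traces; Type "const" with Param 1
                // would record everything and cost the most overhead.
                Type:  "probabilistic",
                Param: 0.1,
            },
        }
        tracer, closer, err := cfg.NewTracer()
        if err != nil {
            return nil, err
        }
        opentracing.SetGlobalTracer(tracer)
        return closer, nil
    }

    func main() {
        closer, err := initTracer()
        if err != nil {
            log.Fatal(err)
        }
        defer closer.Close()
    }
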
A
All right, any other thoughts on this tracing thing?
B
Yeah, maybe some PR was merged. The point is, since we are not tracking PRs, unfortunately I don't know what fixed it, so I don't know. But it's kind of good and bad: bad because we don't know, but good because we don't have this problem anymore. So yeah.
B
Okay, yeah: this was the REST call request, you know, the API server requests, but only for the convert calls, and this is the duration.
B
Isn't it? Because I thought it was very, you know, super high, one minute here. However, I see the same thing in the Kubernetes calls, I don't know. I was expecting to see this as just too high. This is only for the read, watch and list operations.
B
Yeah, so we still see some, you know, 404 and 409 requests here, but they're very low, so it should be fine, and I got this new metric, which... okay. So if you guys want to discuss something here...
B
Yeah, I need to update this new Grafana dashboard. Okay, I keep updating it to improve it, so you probably don't see the current one there, yeah. This is the number of VMIs now. Although it's only showing the running ones here, it also shows when something fails; here are the numbers of VMIs that are running and failed. I omit the ones that are zero, that's why you don't see them.
B
The legend, sure. Anyway, yeah, regarding the rest rate limiter duration: this is a metric that Roman enabled, I think. This is interesting, and this metric shows how long the rate limiter waits until, what's the best way to say it, until the request is allowed to go through. So it's the...
B
Yeah, I think it's held and then, okay, and then, you know, permitted to execute. So it means that with the PR that Roman created, now we can, you know, increase the rate limit, and these numbers here should be smaller. In the code I see two thresholds: one, which was 50 milliseconds, and another one, the long-running threshold, which is one second. So, just analyzing this...
B
It
means
for
me
at
least
that
all
the
requests
should
be
under
50
milliseconds,
isn't
it
and
what
we
are
seeing
here
is
way
higher
and
definitely
we
need
to
increase
this.
You
know
character
seconds
things.
That's
a
problem
enable
that,
especially
for
which
controller
and
vert
handler
here
and
so
on.
So
this
is
the
new
metric
and
it's
it's
showing
that
we
have
some
problem
here
and
with
the
aromas.
Pr
probably
will
fix
that
or
something
that
we
need
to
fine
tune
here.
Yeah.
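
For context, the client-side rate limit being discussed is the QPS/Burst setting on the client-go rest.Config; a sketch of raising it (the numbers are only illustrative, not what KubeVirt or the PR actually uses):

    package main

    import (
        "log"

        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    func main() {
        cfg, err := rest.InClusterConfig()
        if err != nil {
            log.Fatal(err)
        }
        // client-go defaults to QPS 5 and Burst 10; when a component exceeds
        // that, requests queue up in the client-side rate limiter, which is
        // exactly the wait time the new metric exposes.
        cfg.QPS = 20
        cfg.Burst = 40

        if _, err := kubernetes.NewForConfig(cfg); err != nil {
            log.Fatal(err)
        }
    }
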
B
That's true, and maybe the instance, you know... yeah, I think maybe it already has it, yes, but yeah. By the way, that's why I got confused before, because these metrics here actually show the components, don't they, not just the rest rate limiter. So Roman is enabling this, and this metric is actually in the same file where we enable the rest client request latency seconds metric, and Roman did a similar metric, but also enabling the name of the component.
B
Yeah, so the other thing that I want to discuss here: actually, I don't know, David, if you saw that I marked you in this document, and yeah, I don't know if you're receiving an invitation for that. Anyway, just to be sure you guys get notified if I do something like that.
A
I did, yeah. Let me... can you hit that arrow on the... or just do that.
B
Basically, what I see here, if you see, maybe I can just comment here: there is some mismatch between the VM creation time and the VM phase transition latency. See, the transition latency is way lower. I understand that we will have some mismatch here, because the transition latency is some general aggregation, but not that much, is it? This is because it's missing the Scheduling phase: it has Running and Scheduled, but not the Scheduling phase. So I don't know whether it's gone or it's just not collecting that.
A
I see. I guess I know you would expect to see Scheduling, but is this phase transitions from one phase to another, or is it phase transitions from creation, this specific graph that we're looking at?
B
What's the baseline for the scheduling time phase? Because, you know, I was expecting creation to Scheduling, then Scheduling to Scheduled, and then Scheduled to Running, yeah.
A
I don't know, it is possible we might be missing it if it occurs really quickly. But Scheduling to Scheduled seems like it should take a while, though, and that should be the one that takes a while.
B
So, first of all, I checked all the phases that are in this metric, and there are only three: Scheduled, Running and Succeeded, because after I delete them they get to Succeeded. Actually, I'm not showing the deleting here, but it takes a lot of time also.
A
Okay, yeah. That takes a long time because we're actually waiting for the pod to terminate, which could, depending on how the virtual machine was created... there's a transition, or I'm sorry, a grace period where we wait.
A
But in Pending we're going to create the pod, then it's going to go to Scheduling, and then once the pod is running it's going to go to Scheduled.
A
Yeah, I would expect to see Pending, yeah. I will investigate this. I felt like I saw them all when I tested it, but I only tested the VM creation time one. So I did Running and Scheduling and layered them all together, and it actually gives a really similar result to the VM phase transition latency. You could just use the creation time as well.
A
Anything else quickly that anyone needed to bring up that we can't carry over until next week, anything blocked or anything like that?
C
Yeah, it's the last item: the virt-handler resource leak investigation, okay.
B
Just regarding this, you know, leaking of the goroutines: I was thinking, you mentioned that it's hard to close some of the routines because we lose track of them. I was thinking, maybe if we had all the goroutines with a timeout by default, we could solve these problems.
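
A minimal sketch of that suggestion using only the standard library: a worker goroutine driven by a context with a timeout, so it is guaranteed to exit eventually (the timeout value and worker are placeholders):

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    // watchSomething stands in for a long-running worker goroutine.
    func watchSomething(ctx context.Context) {
        for {
            select {
            case <-ctx.Done():
                // Either the parent cancelled us or the timeout expired,
                // so the goroutine cannot leak forever.
                fmt.Println("worker exiting:", ctx.Err())
                return
            case <-time.After(time.Second):
                // ... periodic work would happen here ...
            }
        }
    }

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
        defer cancel()

        done := make(chan struct{})
        go func() {
            watchSomething(ctx)
            close(done)
        }()
        <-done // the worker returned because its context expired
    }
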
C
It's not that it's hard to close them; the part that was hard for some of them is to find out whether we close them. The code is a bit mixed up between where they get created and where they get closed. There is closing for a lot of them, but it's hard to see if there's a code path that doesn't close them, and I'm trying to investigate those too.
C
And about what I shared before: it would help a lot if we do tests like that. Also, if we add performance tests to our code, it would be great if we could put the exact specs we need to run those tests somewhere, because I was trying to run the density test and I didn't know what I need, like how my cluster has to look.
C
I think it's in here somewhere, and now I see it, but if we add perf tests to the test suite, it should tell us somehow what you need to do to run them.
B
You know, the CPU that is requested versus the CPU that is used in the components: the CPU request is way lower than the CPU that's being used for virt-api and virt-controller, and we have an alert for when that's happening in our KubeVirt code. It has probably already been raised a lot of times and no one has seen it, but it's something that maybe we need to improve, you know, fix the CPU requests.
C
The request should be the smallest amount the controller needs to run. What we would do otherwise is either increase them at runtime with an autoscaler or set limits with an autoscaler, and we have a story for investigating that, because if we increase them now, some people could not run it anymore.
A
We
don't
want
to
set
a
limit
because
we
might
get
yeah
it's
more
important
for
memory
that
we
don't
sell
limit,
but
we
don't
want
any
limits.
Requests
that
means
discussion
if
we
go
over.
The
problem
is
that
we
can
be
evicted
in
certain
scenarios
and
rescheduled
other
places,
and
we
don't
necessarily
want
that
to
happen
either.
So
we
want
our
request
to
be
accurate,
yeah
and
if
we
are
consistently
over
it,
that's
not
a
great
thing
yeah
we
should.
We
should
understand
that
a
little
bit
better.
C
I mean, we shouldn't set it higher than the minimum, because if you run it on a minimal cluster, you couldn't schedule it if you didn't have the resources. That's why we should look, or we discussed before that we should look, at the Kubernetes horizontal and vertical pod autoscalers to change that when the demand increases, so the autoscaler would see: okay, this operator is using so many resources, let's give it more requests, so it's reflected.
B
Yes,
I
have
some
comment
about
that,
so
it's
but
okay.
We
don't
have
too
much
time
to
discuss
that
guys,
but
there
is
another.
I
I
put
like
a
link
in
the
to
discuss.
That
actually
is
the
last
link
here
that
I
call
slos
slis,
so
in
just
doc,
man
so
cooperver
actually
kubernetes
has
some
analysis
is
the
maximum
number
of
vms
that
they
should
that
they
support
per
node
and
that's
they.
They
they
don't
overload
the
kubelet.
B
For
example,
we
don't
have
this
kind
of
analysis,
I'm
planning
maybe
to
to
to
arrive
with
jonathas
in
our
experiments,
and
this
should
show
us
the
these
limits.
Isn't
it
so
that
we
want
to
have,
for
example,
let's
assume
that
maybe
the
best
limit
that
we
have
it's
160
vms
per
node,
and
then
we
can
check
here.
B
You
know
supporting
106
vms
per
node.
What's
the
limit
that
deferred
handler
the
cpu,
the
minimum
cpu
request
that
the
version
handler
should
have
for
for
this?
You
know
things
that
we
support
and
then
we
keep
with
this
sure
vendors.
A
For
different,
like
cluster
sizes,
on
the
keyboard
cr,
if.