From YouTube: SIG - Performance and scale 2021-05-27
Description
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit
A
Okay, all right, all right. Welcome to SIG Scale, everybody. I put the document in the chat; add your name as an attendee when you get a moment. I'm going to share my screen.

Okay, you should be seeing our meeting-minutes Google doc. So for today's agenda I added a few items about tooling. We've had a number of discussions on the mailing list, and we've talked about different tools we could use to measure things, and measuring is one of the important initiatives that we want to go through.

We want to measure performance and we want to measure scale, so I could see there being sort of two tools, or at least two different verticals that we can go after: one for performance, one for scale. So I figured we could take the time today to discuss some of what's been said on the mailing list and see if we can capture some requirements, the details, prior art, everything we can think of that goes into solving some of these problems.

So I figured we could just start with one, see how far we get, and we can always see if we get to both. I think we start with performance; that was one of the topics with a lot of mailing list traffic. David, you talked about it a bunch, with some of what you looked at with the profiling, and then Fan had also mentioned he was interested in doing some work around this and some tooling work he's done. So I figured we could start there, and I think we could start with requirements. Does that make sense, that maybe we capture what the things are that we want in a tool that measures performance? So what do people think? I added four things; there's probably a lot more. So what are some other things, and I can write them down.
B
So I would move "go profiling"; there might be a third category here on the agenda. Okay, so we have tools to measure performance. I'm not sure "measure scale" would be accurate so much as "create stress at scale" or something like that. And then there's tools to, what would you even call it? It's not really measuring so much as, I consider profiling something that helps us address problems we found with performance, so it's not really measuring. I don't know what it's measuring.
A
Okay, that's fine, so we can call it that. And again, this could all be one tool, I don't really know; these could all just be features, whatever. But we could just start. I don't know if we need to start with this, but I guess so. Profiling: we could say this is a requirement, like we want to do profiling with this performance tool. Does that make sense? I think that's something that would fit here.

I think that's what we want to do: measure. We want to do some measurements of periods of time. We want to record phase changes; we want to know every time we go from pending to scheduling, we want to capture those places. We want to record... what else do we want in here, what are some other ways we can measure? Or do I consider this measuring code, David? Like we're measuring with your profiling, yeah.
C
The Go code: seeing how much time we spend in which function and stuff like that. And then there are our measurements, that's also the period-of-time stuff, that are also useful during operations, not necessarily run by a cluster admin but by us, running during tests and stuff. That's kind of two things. Okay.

For scale tests, possibly.
B
Yeah, okay, that can make sense. It's the different insights that we're looking at. So one is measuring externally, like how long it takes for certain things to occur, and then I guess the profiling is measuring internally: where we actually spend the most time in our actual functions and things like that.
B
Knowing how many times they are popping per key: so understanding, between going from posting a VMI to running, for example, how many times are we processing or syncing that virtual machine to get there. That gives us some indication of problems if the average amount of work we do is greatly increased, so we can lower the amount of work that we're doing to get it into that state.

That seems like it would improve things, possibly. So yeah, queue length, and maybe statistics on how often keys are called.
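As a rough illustration of the kind of debug metric being discussed here, the following is a minimal sketch using the Prometheus Go client. The metric names and the RecordSync helper are hypothetical, not existing KubeVirt code:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical debug metrics for queue pressure, only meant to illustrate the idea.
var (
	// Current depth of a controller work queue.
	workQueueDepth = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "debug_workqueue_depth",
			Help: "Current number of keys waiting in a controller work queue.",
		},
		[]string{"controller"},
	)

	// How many reconcile passes an object needed, e.g. from VMI creation
	// until it reaches the Running phase.
	keySyncCount = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "debug_key_sync_total",
			Help: "Number of reconcile passes performed per object key.",
		},
		[]string{"controller", "key"},
	)
)

func init() {
	prometheus.MustRegister(workQueueDepth, keySyncCount)
}

// RecordSync would be called at the top of a controller's sync/execute function.
func RecordSync(controller, key string, queueLen int) {
	workQueueDepth.WithLabelValues(controller).Set(float64(queueLen))
	keySyncCount.WithLabelValues(controller, key).Inc()
}
```

Note that the per-key label would be high cardinality, which is one reason such counters would probably only be turned on as debug metrics during a test run, as discussed later in the meeting.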
C
Okay, yeah, those topics are kind of under the API pressure group: the pressure we cause by calling the API server a lot, or the pressure we get from the API because there's a lot of objects. Watching config maps, for example, can be very horrible, so stuff like that is related to all the caches and views we have. I'd like to see more numbers about how we behave at scale and whether our code is performant enough.
B
So sometimes we have trends that look good in small clusters, because maybe we're watching all objects in the cluster but there just aren't that many of them; but then that quickly multiplies as the scale of the cluster and the number of workloads and other types of objects increase, to the point where it's not just a linear progression in performance, it's worse, and those aren't always obvious.

I think we had go profiling here, but just measuring simple things like that, the CPU and memory usage of our controllers, is useful.
D
Okay, so yeah, I think it would also be good to measure the latency in the controller. For example, the pod creation and the VMI creation are not synchronized: the pod creation will most likely be very fast, but the VMI status update will come later, because the keys are piled up in the work queue waiting for an available worker to pick them up. The later that is, the higher the latency. Also, I think it would be good to measure the event handler callback in the controller: after the event enqueues a key in the worker queue, how soon will the key be picked up from the queue?
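A minimal sketch of the enqueue-to-pickup latency Fan describes, again assuming the Prometheus Go client; the names are made up, and client-go's workqueue package can already expose similar queue-duration metrics through its metrics provider, so in practice this may partly overlap with what exists:

```go
package metrics

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical histogram of how long a key sits in the work queue between the
// event handler enqueueing it and a worker picking it up.
var queuePickupLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "debug_workqueue_pickup_latency_seconds",
		Help:    "Time between a key being enqueued and a worker starting to process it.",
		Buckets: prometheus.ExponentialBuckets(0.001, 2, 14), // 1ms up to roughly 8s
	},
	[]string{"controller"},
)

var (
	mu           sync.Mutex
	enqueueTimes = map[string]time.Time{} // when each key was last added
)

func init() { prometheus.MustRegister(queuePickupLatency) }

// OnEnqueue is called from the informer event handler, right after queue.Add(key).
func OnEnqueue(key string) {
	mu.Lock()
	defer mu.Unlock()
	if _, ok := enqueueTimes[key]; !ok {
		enqueueTimes[key] = time.Now()
	}
}

// OnPickup is called from the worker loop, right after queue.Get() returns the key.
func OnPickup(controller, key string) {
	mu.Lock()
	start, ok := enqueueTimes[key]
	delete(enqueueTimes, key)
	mu.Unlock()
	if ok {
		queuePickupLatency.WithLabelValues(controller).Observe(time.Since(start).Seconds())
	}
}
```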
A
Can you add that in there, Fan, what you just said, I think under queue length? Yeah, add that in there. Okay, yeah, sure, those are all good things. Okay, I'm thinking of some other things, like portability. That's another one I'd be interested in. Even if this is not necessarily a measurement, or just has to do with measurement, this is something that I want as a requirement: who's going to be using the tool, the different profiles. I want to use it as a developer.

We also want to use it in CI. Who else? I'm sure some QA people would probably love to use this. And then who else, and how can we hand this to people? Do we just give them a pod or something to run? That's another thing I'd be interested in, the way that we can move it around. So lightweight, something that's just portable: we just launch it in a cluster and then maybe it just takes some measurements and returns them, or something.
A
So we have, okay, so we have something, all right. So we do have some profiling, we're going to measure periods of time, we're going to... okay. Another thing: so we're measuring periods of time. How about creating... I'm thinking about how we control this. We could have, like... wait.

E
I was actually going to bring that up. I think we have a number of tools or things that do both, and I think it would be good to, I mean, we need to evaluate that and talk.
B
Having given this a little bit of thought, I like the idea of separating the two: separating the tool that creates the stress, and then having something else, because it might be disjoint how we monitor. We might be getting some statistics from Prometheus, we might be getting some from some other profiling tool. It's unclear to me exactly where all these measurements are going to live and what's appropriate, even. So, the idea of separating the two tools, or separating the ability to measure from the ability to generate load.
C
I think we already have a separate effort that focuses on creating stress and load tests. And the stuff I selected right now all kind of falls in the area of Prometheus metrics that they are already working on exporting from test runs and such. So that's all stuff they can work on in the load testing; it all feeds Prometheus-style metrics. The go profiling is more on the depth side, and maybe, if we had tracing, that could fill into that. But these here seem like usual metrics.
B
Possibly. I think I'm leaning towards that, if we could enable some sort of debug mode. I don't think we want to always export all these metrics to Prometheus, but it could be something like "enable debug metrics". And then, when we look at results over time periods, you can get that from Prometheus in the time series.

How do you all feel about using Prometheus? I'm a little uneasy about requiring Prometheus here. Is that something that people on the call feel comfortable with, or what are the thoughts?
A
What do you mean by requiring Prometheus? In other words... because I guess what I was thinking is that the output could just be anything that the person using the tool chooses. So it could be export to Prometheus, export to JSON, export to file; Prometheus would just be one format that we can use to produce the data.
B
Well, I mean, that sounds good. Here's the problem with that: when I'm looking at this list, some of this information already exists in Prometheus. So, specifically, something like API calls made: we can take a look at the API server, the metrics are already exposed there, and detect what comes from our controller pods, for example, and we can already get data about that. So it already exists in Prometheus.

C
Right now our metrics codebase seems to be very focused on Prometheus; we import the Prometheus code to generate our metrics.
C
That's a format that a lot of scraping services can read nowadays, but if we looked at something like OpenTelemetry, which is more generic around the whole metrics story, I think they support different exporters that can be configured: you say you want to export to Prometheus or to Stackdriver or Datadog or whatever. I think that's in their scope, and that's more generic.
A
Yeah, I mean, I guess what I'm thinking is that, like I talked about with the different personas, dev, QA, or whoever is using this, I'm just thinking in terms of their ramp-up: what do they need to do to get to the point of leveraging this tooling? And so, yeah, we're adding a dependency if we say you have to use Prometheus. So I guess what I'm saying is that some of the stuff, like you say, API calls, I can already see that it's recorded information, so we can capture it. Some of the other stuff I'm seeing, for instance phase changes, I could see being in almost both camps, because this doesn't exist right now; we don't have it.
If I'm going to record phase changes, it sounds like I'm going to write some code that records this, and then we're going to report it, expose it on an endpoint. And so for some of these, I'm wondering if they could go either way, yeah.

B
We can write a package of some sort that begins capturing this kind of developer insight, and then it can have different ways of exporting it: to Prometheus, or maybe locally and then aggregated via a sub-resource endpoint, or something like that.
B
There are things in the cluster that aren't going to be represented there unless we totally recreate the way these things are collected; then, sure.

A
Yeah, I don't think... I actually think we don't even need to change this. So I guess, this part right here: if I'm looking to measure performance, what I'm trying to get at is, what is the bare minimum, in terms of, okay, this is useful, I could hand this to QA, they can give me a measurement back, and there's very little on-ramp. Like, what are the two or three things I need?
Here's one, here's two, and here's three; that's it. And this stuff, that's great, but I consider it almost more advanced; I almost don't even want to touch it. I think that's good: we have that in Prometheus and we just leave it like that. And then these three things, these simple things, it's basically just a blob of text; we can just have that be our format.
C
Yeah, about the format advice: I see a slight issue with the more metrics-y stuff, because right now it's being output as a time series, it gets scraped. The Go code doesn't record a time series, and I don't think it should, because a database like that grows insanely. So right now there is an output of the Prometheus metrics every X seconds, and Prometheus scrapes that.
A
Yeah, let me also mention: this Prometheus reporting right here, what I'm hearing is that this isn't already baked into the code. Part of what I'm thinking is, if we have a tool that measures performance, would, for example, this recording happen all the time? If we record phase changes, is this something we do all the time, or only something we do when we want to specifically measure performance?

B
Something we enable as a debug metric.
C
For the phase changes, to some extent we might already have the metrics implicitly, because we update the conditions on our resources with timestamps, and there are the Kubernetes events to some extent. That could already be extracted; we would just build it into the operator more consistently.

I see different ways to do those phase changes. One could be metric style: when we reconcile, we record whether a resource switched conditions into ready, or a VM changed phase, and we record the timestamps into the time series. Or we add tracing that is annotated with the resource name and namespace, and we actually have a trace of how long it spends in which phase.
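A sketch of the first option, the metric-style recording of phase changes during reconcile, assuming the Prometheus Go client; the metric name, labels and the helper are hypothetical, not something that existed in KubeVirt at the time:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical histogram of how long a VMI spent in its previous phase before
// the controller observed a phase change.
var vmiPhaseTransitionSeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "debug_vmi_phase_transition_seconds",
		Help:    "Seconds spent in the previous phase before transitioning.",
		Buckets: prometheus.ExponentialBuckets(0.1, 2, 12),
	},
	[]string{"from_phase", "to_phase"},
)

func init() { prometheus.MustRegister(vmiPhaseTransitionSeconds) }

// ObservePhaseChange would be called from the reconcile loop when the cached
// (old) object and the updated object disagree on the phase. lastTransition is
// the timestamp already recorded on the object's conditions for the previous change.
func ObservePhaseChange(oldPhase, newPhase string, lastTransition time.Time) {
	if oldPhase == newPhase {
		return
	}
	vmiPhaseTransitionSeconds.
		WithLabelValues(oldPhase, newPhase).
		Observe(time.Since(lastTransition).Seconds())
}
```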
B
So I think you're right, Ryan, in saying that this would be something new that we would have the potential of exporting in multiple ways.

A
Okay, I guess... Okay, yeah, I'm trying to think: maybe we can break this down a little bit more. I want to change this list a little bit; for a few of these, maybe we can break them apart. So: what currently exists in Prometheus already?

B
So we have API call information; that's exported by the Kubernetes API server.
B
So we have a lot of that, but it's coming from a different view than what we're talking about. I think it's still accurate; it tells us what calls are being made and how frequently, so I think it's good enough, but it's coming from the API server side rather than from the actual component that's making the API calls.

A
Should any of these be maybe just a little bit of code on the monitoring side that we're just missing, that could suddenly be something we can add to Prometheus? Is there anything like that that we see in here? Oh yeah.
Okay, so all of it would. Okay, so then this makes sense: if we created a new tool, it would be capable of doing go profiling, measuring periods of time, recording phase changes, the things that are focused on a period of time. There are additional debugging things that we do when we're generating load that we can capture and output.
B
The way I envisioned this tool that you're talking about, the deep profiling and maybe recording some of this developer information, is that we can create a sub-resource endpoint that turns on profiling within all our components to begin capturing all this information, and then, when we stop it, it can aggregate all that information and just give us back a report. That would give us some averages and maybe p99s of certain statistics and things like that we find interesting.
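A rough sketch of the per-component half of that idea: an HTTP handler pair that starts and stops an in-memory CPU profile, so an aggregator (a sub-resource endpoint, or later virtctl) could collect the result from every component. The endpoint paths and wiring here are assumptions for illustration, not the actual KubeVirt implementation:

```go
package profiler

import (
	"bytes"
	"net/http"
	"runtime/pprof"
	"sync"
)

var (
	mu  sync.Mutex
	buf bytes.Buffer // holds the raw pprof data between start and stop
)

// StartHandler begins a CPU profile for this component.
func StartHandler(w http.ResponseWriter, r *http.Request) {
	mu.Lock()
	defer mu.Unlock()
	buf.Reset()
	if err := pprof.StartCPUProfile(&buf); err != nil {
		http.Error(w, err.Error(), http.StatusConflict)
		return
	}
	w.WriteHeader(http.StatusAccepted)
}

// StopHandler stops the profile and returns the raw pprof data; the caller
// aggregates the results per component.
func StopHandler(w http.ResponseWriter, r *http.Request) {
	mu.Lock()
	defer mu.Unlock()
	pprof.StopCPUProfile()
	w.Header().Set("Content-Type", "application/octet-stream")
	w.Write(buf.Bytes())
}

// Register wires the handlers into a component's debug mux.
func Register(mux *http.ServeMux) {
	mux.HandleFunc("/debug/profiler/start", StartHandler)
	mux.HandleFunc("/debug/profiler/stop", StopHandler)
}
```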
B
Those same statistics can be exported to Prometheus if we want, and so people can use Prometheus to gain the same insights; but you can also get your nice printouts if you want.

A
Okay, so let me characterize this section. This is going to be: measurements that developers, or actually anyone who wants to, can turn on for a period of time during load, or something like that. Yeah, that's what we want to do with this section of the requirements, and then this is what we'll export to: Prometheus, file, standard out, I don't know, something like that.
B
Really simple. I even did a proof of concept on how we could call the sub-resource endpoint, enable something in all of our components for a period of time, stop it, and then gather results. It should be kept practical to start with. If it makes sense for us to start with just exporting this kind of deep-insight stuff into a file in JSON format or whatever, great.

C
I just think the gathering and getting it into a JSON file is actually the most work, compared to exporting most of this stuff to Prometheus, because you have to make up your own storage for that. You have to aggregate all this stuff somehow, and for the metrics, even for phase changes, I don't know how we would record that properly.
B
You don't have to store it anywhere; you can keep it all in memory and then export it over the network, and then it's aggregated at virtctl or something. When the sub-resource returns all this information, it's going to aggregate it all into some sort of JSON format, and then you're just dumping it to a file when you get it. So it's all in memory until it actually lands on your local machine.
C
Yeah, but that's already the hard part: we would have to store data in memory, and that could be quite a lot, when we already have systems that can do that for us, like Prometheus or any service that can scrape a metrics endpoint. Even the tracing that I'm still a big fan of and would like to see: it just exports traces and they get collected by something else; it doesn't store anything in memory, because that's too much.
A
Yeah, I guess what I'm saying is that giving someone the opportunity to export the data... I understand what you're saying about Prometheus, and it makes sense to me: it's important, it's a great tool and it gives us a lot of power.

It's just the dependency of having to have it. I think in a lot of cases we're going to leverage it, but I also think there's value in also having another data format, you know, whatever, JSON. But I think that's something we can evaluate as we go a little bit further. Like Dave was saying, whatever's easiest; I think we'll get a reasonable idea of that when we start.
B
It wouldn't be a time-series database; it would be a metric that's calculating a single value. We're not going to be taking samples of this over and over and over and presenting all those samples back, I don't think.

For work queue length or whatever, we'd probably have a running average and maybe a max and a min, or things like that, that we keep in memory; not a sample every millisecond. It wouldn't be time series. That's not what I envisioned, at least; maybe that's what others did.
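A minimal sketch of that in-memory alternative to a time series: a running aggregate that only keeps count, sum, min and max, which can later be dumped into the report. The type is hypothetical, shown only to make the idea concrete:

```go
package metrics

import "sync"

// RunningStat keeps a running average, min and max entirely in memory instead
// of storing every sample as a time series.
type RunningStat struct {
	mu            sync.Mutex
	count         int64
	sum, min, max float64
}

// Observe records one sample, e.g. the current work queue length.
func (s *RunningStat) Observe(v float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.count == 0 || v < s.min {
		s.min = v
	}
	if s.count == 0 || v > s.max {
		s.max = v
	}
	s.count++
	s.sum += v
}

// Snapshot returns the aggregate values for inclusion in a report.
func (s *RunningStat) Snapshot() (avg, min, max float64, count int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.count > 0 {
		avg = s.sum / float64(s.count)
	}
	return avg, s.min, s.max, s.count
}
```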
A
Yeah, well, for some of the things, like phase changes, we're going to have timestamps for each of the events. Let's say we're watching for 10 minutes and we see a bunch of pod changes; we'll have however many pods that go through those phase changes, and we'll have times for them.

B
So maybe every VMI would be a key, and then you'd have a series of timestamps associated with each phase of that VMI, which would be exported.
C
Yeah, but then again, for the phase changes, I still think it wouldn't have to be something that runs in our operator or anything. This could be done by a tool that runs externally and exports this stuff, because it's getting it from Kubernetes events anyway; it's enough to watch the resource.

As far as I understood, for the phase changes it just has to be something that watches the VMI resource on the Kubernetes API server and records it itself. It doesn't have to be in the reconcile loop, in the operator itself. It can be external.
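A sketch of that external approach: a small client-go program that watches VMIs through the API server and prints phase-transition timestamps, without touching the operator. The group/version/resource and the overall wiring are assumptions; error handling and watch restarts are omitted for brevity:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	vmiGVR := schema.GroupVersionResource{
		Group: "kubevirt.io", Version: "v1", Resource: "virtualmachineinstances",
	}

	w, err := client.Resource(vmiGVR).Namespace("").Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	lastPhase := map[string]string{} // namespace/name -> last observed phase
	for event := range w.ResultChan() {
		obj, ok := event.Object.(*unstructured.Unstructured)
		if !ok {
			continue
		}
		phase, _, _ := unstructured.NestedString(obj.Object, "status", "phase")
		key := obj.GetNamespace() + "/" + obj.GetName()
		if phase != "" && phase != lastPhase[key] {
			fmt.Printf("%s %s: %q -> %q\n",
				time.Now().Format(time.RFC3339), key, lastPhase[key], phase)
			lastPhase[key] = phase
		}
	}
}
```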
A
I guess what I mostly wanted was to sort of protect us... So the idea, like I talked about with portability, is different personas, people who want to use different stacks. If I'm just doing regular development, I don't really have a Prometheus stack; I just have my kube cluster with KubeVirt on it, and, you know, what do I need to do to get all these additional metrics? Do I have to go and enable one? I don't know.

Maybe that's just the case I have to deal with, because I'm doing load testing or I'm doing performance work, and so I should have one. So maybe that's just right.
B
Sure, yeah, I think that might be the case. I know at least in the development environments that we're using, with cluster-up and stuff like that, there is a provider being developed in kubevirtci that's going to come with Prometheus and Grafana built in automatically, so it's going to be something that's more accessible to developers soon.
A
So I guess let me look at it from a different angle. If we enable this stuff to be scraped by Prometheus, I'm trying to think whether I can cover the same use cases. All right, two things. One of them is the on-ramp, which was one of my concerns; okay, if I get over that, that's fine.

The other thing is other output types. One of the things I like to do is take the data and make visualizations: with JSON I can take it and, if I wanted to, do it in Excel or any other format, so I have a little bit more flexibility. And I could just scrape Prometheus, right, because it's just JSON there anyway; I could just do that and then build it from there.
C
Prometheus, yeah, you can't just... Prometheus is not exactly JSON, I think. You can't just scrape Prometheus, or rather you can: if you don't want to run your own Prometheus, you can just build an application that understands the Prometheus format, which is very easy, and hit the endpoints yourself.
B
Right. I hate it, but I think that's the right approach: just see how far that gets us, and if we find it obviously just isn't going to give us the fidelity and response that we need, we'll create our own thing. There are some things that can't be replaced, like the go profiling, when it actually comes to understanding where we spend the most time and in what functions and things like that.

Certainly we'd have to build our own tooling to aggregate that. But really, all this other stuff I'm saying we can export as a metric. Like the phase changes: we can create that in our watch in the VMI controller, say, okay, we got a VMI, we see that the previous version didn't have this phase and now it's running, so we can export that information to whatever is exporting to Prometheus, so it gets out there.
C
Yeah, David, maybe some context on the go profiling, because you talked about aggregating. Did you have more in mind than what pprof outputs on an endpoint, or do you mean aggregating those per service?

B
Yeah, it would be aggregating the pprof information back, so that the binary pprof information would be aggregated back per component, where you could go in, okay...

You can imagine using virtctl: dev start profiling, then stop profiling and dump, and then it's going to dump all the aggregated pprof information into a zip file or something, yeah.
C
Okay, yeah. And I want to advertise this again: I think quite a few of these, including the work queues, could be something that profits from it, if we think more about adding tracing to our code in general, like OpenTracing, or whatever it's called now. You can get flame charts of what is going on in our operators. Kubernetes is doing this now to some extent, and it's pretty great.

To visualize it there are a few options; I think the most common one is called Jaeger, but I know I used one that was built into GCP for a while, and I don't know about the others. And you can annotate it; you could even have traces that let you query how a VMI progressed through the cluster.
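A sketch of what that tracing could look like in a controller, assuming the OpenTelemetry Go API; the tracer name, span name and attribute keys are made up, and the exporter/backend setup (Jaeger or otherwise) is omitted:

```go
package controller

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// Controller stands in for a real KubeVirt controller in this sketch.
type Controller struct{}

// sync is a placeholder for the real reconcile work.
func (c *Controller) sync(ctx context.Context, namespace, name string) error { return nil }

// execute wraps one reconcile pass in a span annotated with the resource's
// namespace and name, so a backend such as Jaeger can show how long a given
// VMI spends in each step.
func (c *Controller) execute(ctx context.Context, namespace, name string) error {
	tracer := otel.Tracer("virt-controller")
	ctx, span := tracer.Start(ctx, "vmi-reconcile")
	defer span.End()

	span.SetAttributes(
		attribute.String("vmi.namespace", namespace),
		attribute.String("vmi.name", name),
	)

	// Expensive steps (pod creation, status update, etc.) could start child
	// spans from ctx here.
	return c.sync(ctx, namespace, name)
}
```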
C
I had a great time with that, and it gave a lot of insight into our operations at my previous job, because things got stuck; they were managing Kafka and Kafka was slow, and we figured that out through it.

A
Okay, yeah, I think that makes sense; this opens up a bunch of tooling for us, going the Prometheus route, and even just for these things there's a lot of tooling, and then there's even what we can do here with the profiling. So then, okay, to break this down: measure performance. It sounds like we have two things.
We have one tool, and it's specifically for the profiling and for doing tracing, and these are just going to be our requirements. And then we look at the rest of the stuff kind of as metrics. Those would basically be a package inside of KubeVirt, so not necessarily a tool, just something that is there, and then these we enable, right? We have some performance sub-resource that we enable. I think we already collected these.
D
Yeah, yes, that's something relating to how we present the result. About using sub-resources of the VMI to export the metrics: a concern here is that anything that changes on the VMI object will cause an enqueue and increase the queue length, which might increase the burden and overhead on the virt-controller. So I'm thinking, yeah, just brainstorming:

Could we use a CRD, or export the metrics somewhere other than the VMI object, like using the custom metrics API? Would that be better?
A
Okay, I think this could be something we look at as part of the design. Okay, I guess the next step: to me this makes sense as a set of requirements, and it's clear to me that this is its own tool and this part is in KubeVirt code, so we have, I guess, two different things. I think the next step on this is we need to have a formal design.

Does that make sense? Let's do two designs, one here, one here. Does anyone want to volunteer for one of these, and we can look at it, or we can do it together? I don't know; what do people think, does that make sense as the next step?
B
What I want to do, I think, is that the design is enabling a path for us to enable debug metrics, and these are all just items that we would add after the fact. So: how do we approach turning on debug metrics in KubeVirt and exporting them? And then we can just add all of our metrics to that package.
C
I think only a few of those I would make enable-and-disable, because the work queue length and measuring periods of time are kind of given by the metrics concept itself; work queue length would be something that I would always export, because it's a no-brainer and it helps with our scalability testing. And I feel we already have a few initiatives around metrics and alerts and all that; maybe we should...

I don't know about the work queue metric in detail right now, but if it's more of a high-and-low-watermark thing, that could be really helpful, because people sometimes wonder why their resource is not being reconciled in huge environments, and the ops team could see why: because the queue is full.
A
Yeah, I don't know, I think these need to be able to be turned on and off. At least from my perspective as a developer persona, I only really want to look at this when, for instance, I know I'm doing a load test. I want to have a specific time period that I expect it to be running for, because I have this tool generate load, I know when it starts,

I know when it ends, and I want to capture this stuff during that time period; the rest of the time I really don't care. And then, from the operator perspective, from the end-user perspective, what are you getting from this? I guess you can see, okay, it's taking a long time scheduling, but you could see that anyway. I guess for some of this stuff, yeah, I don't know, it's hard to say.
C
So if we talk in the Prometheus-style metrics context, for us exporting metrics is as cheap as it gets. The only thing we would probably disable is the export, not the recording, because checking whether it's enabled is more work than actually recording; we would just disable exporting the metric.
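One way this "always record, only gate the export" idea could look, sketched with the Prometheus Go client: debug metrics live in their own registry, and a wrapper Gatherer only includes them in the /metrics output while a debug flag is set. The names and the wiring are hypothetical:

```go
package metrics

import (
	"sync/atomic"

	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

var (
	debugEnabled  int32                     // toggled e.g. from the KubeVirt CR or a debug sub-resource
	DebugRegistry = prometheus.NewRegistry() // debug metrics register here and are always recorded
)

// SetDebugEnabled flips whether debug metrics are included in the scrape output.
func SetDebugEnabled(on bool) {
	var v int32
	if on {
		v = 1
	}
	atomic.StoreInt32(&debugEnabled, v)
}

// GatedGatherer serves the base metrics always, and the debug metrics only
// while the flag is on.
type GatedGatherer struct {
	Base  prometheus.Gatherer
	Debug prometheus.Gatherer
}

func (g GatedGatherer) Gather() ([]*dto.MetricFamily, error) {
	out, err := g.Base.Gather()
	if err != nil || atomic.LoadInt32(&debugEnabled) == 0 {
		return out, err
	}
	extra, err := g.Debug.Gather()
	if err != nil {
		return out, err
	}
	return append(out, extra...), nil
}
```

The /metrics handler would then be built with something like promhttp.HandlerFor over this gatherer instead of the default one.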
C
It would not show up on the metrics endpoint, so we would only relieve pressure on the Prometheus that's scraping it, and the Prometheus side can say "I don't care about this metric" and not store it. And for measuring a period of time, you would either generate load in that time period and only look at that time span in your time series, or you would only scrape during that time, if you turn on your Prometheus just for that.

Yeah, because the metrics themselves are mostly atomic ints or floats.
A
Okay, I mean, it sounds to me like we kind of just captured a lot of the design, right? That's it: we just decided what we export, and doing that in a way that's configurable, say on the KubeVirt CR, sounds pretty natural to me. Yeah, okay.

Okay, all right. We're running long, but I guess the last question is: how do people want to handle this? If you want, I can write sort of a design for this; I think I can capture what we talked about today and some notes, and that could just be our design, like the KubeVirt CR, and how we export it.
C
One more addition: for some of those the amount of work required differs quite a bit. Work queue length might be surprisingly easy, but the phase changes might need more of a concept of how we actually record them and where we record them, or the latency, yeah, and stuff like that, I think.

A
I think we've carved it out enough that that's something we can discuss in the pull request. Whoever is working on it can kind of take the lead on that, because, yeah, you're right, it is a little different for each of them, but I think that's something we can discuss when we start digging into the code.
A
There is an issue; someone's on it, Roman's doing it. I don't know where it is, there's an issue somewhere; oh, I have it in another tab somewhere. It's looking to be added, I think, by default for CI. So that's actually a dependency for me to track.

B
On the developer flow: I'm not sure if you all are using cluster-up and cluster-down and things like that for development. Okay, yeah. Well, if you start to, at least for development (I understand it doesn't make sense for testing), that's an easy path to begin just gaining experience with Prometheus and Grafana and all that. Once that provider lands, it's just trivial at that point.
C
I'm using cluster-sync with my own cluster that has Prometheus on it, but I tried to get Prometheus into cluster-up before and I didn't have enough RAM.

No, I think it's 32, but it wasn't even the RAM, it was the CPU cores: there's a CPU-cores flag that defaults to two, and if you do a lot of work your two CPUs sadly die at some point. Okay, and yeah.
A
Okay, I think this is good. So I'll take the item: I'll capture everything we talked about here as a design, sort of a design document. I'll share it on the mailing list and in the doc, and we can have some people sign up to take whichever of these they want to work on, and we can start tackling it that way.

C
One more addition I have. The API calls: as I recently checked, the only way we see the API calls coming into our API is through the API metrics exported by Kubernetes, and they are not that great from what I've seen so far. One problem being that we only get latency across some endpoints, and they get polluted by the console and similar calls, so our request latency looks insanely high because people use the console.

A
Yeah, that's definitely something we could look at. Okay, great. We're already a few minutes over, so I think that's good. We'll look at this, and next time we can talk about scale or whatever; let's just focus on this, we've got to get our measuring down first. So let's do that and we'll start executing on it.