From YouTube: SIG Instrumentation 20220623
Description
SIG Instrumentation Bi-Weekly Meeting June 23rd 2022
A
Welcome to today's edition of the SIG Instrumentation bi-weekly meeting. Today is the 23rd of June. It seems that we have a couple of topics on our list for today. Let me share my screen. Oh, can you give me the permissions? Can you see it okay? Yep. So the first topic is mine. Basically, I wanted to talk about a problem that has come up a few times during reviews of PRs, essentially when people are introducing new metrics.
A
It poses a problem for a lot of them, and I keep getting questions like: what are the best practices to initialize a metric? What's the best way to register them? And things like that. When you look at the code base, there is no real standard on the way to do it; everybody does it their own way, and we keep getting more and more global metrics, as well as more and more metrics that are registered into the global legacy registry. I was wondering whether it would make sense to standardize that, in order to make it cleaner all over the code base. The end goal would maybe be to get away from this legacy registry and from all these globals that we have, but maybe a first step would just be to have a good standard, where we have, I don't know, an example for instance that we could share with the other developers to give them some idea of how they should write their code.
B
I think that definitely makes sense. Would this look like sort of a helper library that people should use to register things, or?
A
That would serve as an example of how it can be done. For me, there could be two examples: one which essentially relies on the legacy registry, because in a lot of areas in the code base we cannot move away from it that easily, but we can prepare the code in a way that makes it easier to migrate away from the legacy registry at some point; and another example that would be more like the new way, with the KubeRegistry and stuff like that.
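As a rough illustration of the two patterns being discussed, here is a sketch using the k8s.io/component-base/metrics wrappers; the metric and all names are invented for the example, and in practice a component would pick one of the two registration paths, not both:

```go
package example

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// A hypothetical metric, defined once with the component-base
// wrapper types so it carries a stability level.
var requestsTotal = metrics.NewCounterVec(
	&metrics.CounterOpts{
		Name:           "example_requests_total",
		Help:           "Number of requests handled, by result.",
		StabilityLevel: metrics.ALPHA,
	},
	[]string{"result"},
)

// Pattern 1: register into the global legacy registry, but keep the
// definition separate from registration so migrating later is easy.
func init() {
	legacyregistry.MustRegister(requestsTotal)
}

// Pattern 2: the newer way, with a dedicated KubeRegistry owned by
// the component instead of a package-level global.
func NewRegistry() metrics.KubeRegistry {
	r := metrics.NewKubeRegistry()
	r.MustRegister(requestsTotal)
	return r
}
```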
A
So it would just be some quick examples to guide them through how this can actually be done. Because at the end of the day, if we really want to create the standard and have it applied to the whole code base, then we would need some static analysis, and that would require too much time, I think, for now.
A
Because essentially, the more people add metrics in the wrong way, the harder it will be to move away from the globals and the legacy registry as a whole.
A
I can try to work on that. I just wanted to gather some feedback, to see if anyone else has any thoughts on it. And yeah, if anyone wants to look into it with me, that could also be a good way to see how metrics registration works in the Kubernetes code base, and what the other challenges are.
A
All right, if we don't have anything more on the topic, we can switch. [inaudible] is not here today, so we will switch to the OpenTelemetry exponential histogram discussion that we had two weeks ago. It's a continuation now, because David and me attended the OpenTelemetry meeting yesterday that was discussing this kind of topic with the Prometheus team, and at the end of that meeting...
...it sounded like they are going with the Prometheus way of implementing these new histograms, so they have reached some kind of middle ground now. Today we had a discussion, David and me, where we wanted to explore what the future would be for us in Kubernetes with these new histograms, and what the challenges would be. Essentially, one of the main issues was that we didn't know what could happen with the cardinality. With normal histograms, we already see some histograms with only 10 buckets that have so much cardinality; they generate so many time series. So what would happen if we have this new Prometheus exponential histogram that can have hundreds of buckets? I discussed that with one of the Prometheus maintainers, and apparently, first, the way it will be stored in the time series database is different. Essentially, instead of having the buckets as a dimension of the metric like we used to, where one bucket would be one time series, the buckets are now part of the metric in the TSDB, so they are not counted as a dimension anymore. That will reduce the memory usage of the metric as a whole; we won't have to take the bucket count into account for cardinality anymore.
But what I was told is that if the metric was already posing a problem in terms of cardinality with the buckets before, then with the new implementation that would still be the case: we would still need to worry about the increasing number of buckets that might happen, and that may still be an issue even though this has been optimized. But they have looked into adding limits per histogram. Basically, you can configure a limit and say: I don't want this histogram to have more than 100 buckets. If it has more than that, the scale changes and the buckets are reduced in precision as well. So they have already thought about some kind of mechanism to prevent that.
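For context, this limiting mechanism is what later shipped in Prometheus client_golang as native histograms; at the time of this meeting it was still a POC, so the following is only a sketch of what configuring such a limit looks like in a recent client_golang, with invented metric names:

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// An exponential ("native") histogram with a cap on the number
	// of populated buckets.
	latency := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "example_request_duration_seconds",
		Help: "Request latency in seconds.",
		// Each bucket is at most ~10% wider than the previous one;
		// this factor controls the resolution (the "scale").
		NativeHistogramBucketFactor: 1.1,
		// Cap at 100 buckets: when exceeded, the client reduces the
		// resolution (wider buckets) to get back under the limit,
		// the mechanism described above.
		NativeHistogramMaxBucketNumber:  100,
		NativeHistogramMinResetDuration: time.Hour,
	})
	prometheus.MustRegister(latency)
	latency.Observe(0.042)
}
```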
B
Well, so let me explain what I understand the migration path to be. The idea, from our perspective: let's say we have a histogram for API server latency or something, right? We take that histogram and keep the same name, the same labels and stuff, but change it from a regular histogram to an exponential histogram. Supposedly, according to what the Prometheus folks have said, if you query it with a Prometheus server that doesn't support exponential histograms, it'll return a fixed-bucket histogram representation of that histogram.
If, let's say, we had 10 buckets before, and we were able to limit it to 10 buckets afterwards, especially if that applied to the fixed-bucket representation of it, then that might be useful for us. Although the bucket boundaries would still change, yeah.
They want to be able to go from a fixed-bucket histogram to an exponential-bucket histogram at query time without it changing any of the data; they want that to be seamless. But it does mean that if we switch it over in our code, there's basically no way of easing that transition in code for us, which I think is going to make it difficult.
From the format perspective, as long as they can still get a fixed-bucket histogram out, I think we could maybe call it backwards compatible. The only question for me, again, is the number of buckets: if we're now giving them 100 buckets and they have to parse all of them, then they're going to notice.
A
Yeah, yeah, I think that's difficult, but we still have a lot of time to think about it, because they are only at the POC stage, and for now they're just trying things out and optimizing the algorithm and things like that. So yeah, I see a long way ahead, and I can't make predictions, but the sooner we have an idea of what the challenges would be, the better, I guess.
B
Yeah, yeah, that sounds great. Let's do that; I can start working on it, and we can share our concerns and raise it.
C
Awesome, good, yeah. So this is about the apiserver request duration metric. It's a known issue, I think; the cardinality is very high. So I'm wondering whether we can use traces to replace that metric; I don't know whether it's possible. Basically, the idea is that we use traces, and then we simplify the metric, so that a metric with the same name only captures the high-latency cases and drops some labels.
A
People are using all of those labels today, and even SIG Scalability is relying on that metric to build their SLOs and to verify in their test pipeline that, even with a hundred nodes, we are not going above our SLOs.
C
Yeah, so basically I want to have an aggregated metric, and that aggregated metric only captures the high-latency cases; the low-latency cases go into another value. Because for alerts, I think all we care about is the high latency; for low latency, I think we just don't need it. If we want to know more details, we can use the apiserver tracing, the trace data, to see the details, and we can also drop some labels. And yeah.
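A minimal sketch of that idea in client_golang terms; the threshold, the metric name, and which labels get dropped are all hypothetical here:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// slowRequests replaces a wide histogram with a counter that only
// captures the high-latency cases, with a reduced label set
// (resource, subresource, scope and so on dropped in this sketch).
var slowRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "example_slow_requests_total",
		Help: "Requests slower than the SLO threshold.",
	},
	[]string{"verb"},
)

const sloThresholdSeconds = 1.0 // hypothetical SLO threshold

func recordLatency(verb string, seconds float64) {
	// Low-latency requests are not recorded at all; details for
	// them would come from trace data instead.
	if seconds > sloThresholdSeconds {
		slowRequests.WithLabelValues(verb).Inc()
	}
}

func main() {
	prometheus.MustRegister(slowRequests)
	recordLatency("GET", 2.3)
}
```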
B
So what you're proposing sounds actually a lot like changing the bucket structure, right? To focus on...
A
As well, because even with this metric, as valuable as it is, you still need logs or traces in order to debug any kind of latency issue, because latency issues are complex. If your SLO is going down not because of errors but because of the latency, then it can be anything in your cluster, because the API server is at the center of everything. So that would be a good way to do it.
C
Yeah, yeah, I will look into the new histogram structure later to know more. I think we can maybe do some aggregation there. So basically, I think the...
A
Even though it takes a lot of memory, to me, and I guess to a lot of people, this metric is the most important one out of all the metrics that Kubernetes exposes. So we kind of went with the idea that, even though it's taking a lot of memory, considering the amount of things we can do with it, we can afford the fact that it takes, I don't know, twenty percent of the time series. But yeah.
B
Your metrics should cover 100% of the cases, so you have high-quality signals, and then traces are for debugging particular cases. I think exemplars will help a lot with this.
There is a school of thought that says you should turn on one hundred percent tracing and then generate metrics from your traces, but at the end of the day you need the metrics anyway, so the route that you take to get there isn't particularly important. And I think we already have this established, so we should stick with it.
A
Yeah, yeah. I think the road today is more like going from metrics to logs to traces, in that direction, and not from traces to metrics. And reflecting on that would be good, because everything is built that way today. But yeah.
A
So the idea that I had, and that I discussed with you, David, was to add exemplars to this particular metric, the one with the latency, and also a trace ID propagated into the logs, so that at least there is a way, via this latency metric, to go deeper into what's really happening. Because on its own it doesn't give you much info: you only know that some of the requests are slow, which is great for your SLOs and for what you promise to your customers and stuff like that.
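A sketch of what that could look like with client_golang exemplars, assuming an OpenTelemetry span is available in the request context; all names here are illustrative:

```go
package main

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
	"go.opentelemetry.io/otel/trace"
)

var requestLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "example_request_duration_seconds",
	Help:    "Request latency in seconds.",
	Buckets: prometheus.DefBuckets,
})

// observe records a latency sample and, when the request carries a
// sampled trace, attaches the trace ID as an exemplar so that a slow
// bucket can be followed into the trace (or log) data.
func observe(ctx context.Context, seconds float64) {
	sc := trace.SpanContextFromContext(ctx)
	if sc.IsSampled() {
		requestLatency.(prometheus.ExemplarObserver).ObserveWithExemplar(
			seconds,
			prometheus.Labels{"trace_id": sc.TraceID().String()},
		)
		return
	}
	requestLatency.Observe(seconds)
}

func main() {
	prometheus.MustRegister(requestLatency)
	observe(context.Background(), 0.25)
}
```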
A
So I guess we can get four minutes of our time back. See you, everyone, on the next call. Right.