From YouTube: SIG Instrumentation 20230302
Description
SIG Instrumentation Bi-Weekly Meeting - March 2nd 2023
A
Okay, welcome everyone to SIG Instrumentation bi-weekly. Today is March 2nd, 2023. We have a couple of things on the agenda. Hold on, let me share my screen and make sure I share the right thing, because otherwise I can get in trouble.
A
Okay, so welcome everyone. We have two items on the agenda. The first is Tim's, on categorized request metrics. Do you want to talk about it a little bit?
B
Yeah, sure. So this is — we can maybe think of it as a pre-KEP. I'm just trying to gauge interest in this idea.
B
If
it's
worth
exploring
this
further
and
maybe
getting
a
cap
out
in
128
or
129,
it's
not
kind
of
an
urgent
issue,
but
just
kind
of
the
the
problem
statement
I
with
some
frequency
find
myself
wishing
that
I
could
break
down
our
request:
metrics
by
namespace
or
user,
requesting
user
or
yeah
I,
guess
those
are
the
the
main
two
ones,
maybe
occasionally
like
resource
name,
but
that's
more
less
common,
but
we
can't
actually
have
metrics
with
labels
for
those
fields
because
of
cardinality
issues.
B
So
I
was
thinking
about
like
what
a
solution
to
this
might
look
like.
I
came
up
with
this
idea
for
having
a
static,
although
we
could
probably
discuss
a
dynamic
version
as
well
definition
of
different
categories
of
requests.
B
So, for instance, say I want to classify these requests as system requests: they're for things in kube-system, or they're made by system components, and maybe there are a couple of cluster-scoped resources that I explicitly want to include. Any request that matches any of the patterns in the include list in the example below essentially gets labeled, or categorized, as system, and then we can record — emit — a metric. What did I call it? apiserver_categorized_request_total.
B
Increment that count. And the idea would be: what every cluster operator or cloud provider considers to be system requests is going to be a little different. It depends on which plugins you've installed, and on what you consider to be workloads versus part of running a cluster. So let the cluster operator define "this is the group of requests that I care about measuring explicitly" — sort of a build-your-own-metric around it. So.
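[To make the idea concrete, here is a minimal Go sketch of what a static categorization rule and matcher could look like. All the type names and the example rule are hypothetical — this is pre-KEP, and only the metric name apiserver_categorized_request_total comes from the discussion above.]

```go
// Hypothetical sketch of static request categorization; not from any KEP.
package main

import "fmt"

// RequestInfo holds the request attributes a rule can match on.
type RequestInfo struct {
	Namespace string
	User      string
	Resource  string
}

// CategoryRule labels any request matching one of its include patterns.
type CategoryRule struct {
	Name     string
	Includes []func(RequestInfo) bool
}

// categorize returns the first matching category, or "other".
func categorize(rules []CategoryRule, req RequestInfo) string {
	for _, rule := range rules {
		for _, match := range rule.Includes {
			if match(req) {
				return rule.Name
			}
		}
	}
	return "other"
}

func main() {
	system := CategoryRule{
		Name: "system",
		Includes: []func(RequestInfo) bool{
			func(r RequestInfo) bool { return r.Namespace == "kube-system" },
			func(r RequestInfo) bool { return r.User == "system:kube-controller-manager" },
		},
	}
	req := RequestInfo{Namespace: "kube-system", User: "system:node:foo"}
	// In the proposal, this category would feed something like
	// apiserver_categorized_request_total{category="system"}.
	fmt.Println(categorize([]CategoryRule{system}, req)) // prints "system"
}
```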
A
[inaudible]
B
Yeah, that's a good argument against the dynamic version. I think with the static version the assumption is that every cluster that I care about would have the same definition of the metrics categorization.
A
So I have — like, I've thought about this problem a lot since you brought it up, actually quite a while ago, in the SIG Instrumentation Slack channel. So I've thought about it a lot, and I think I would be okay with this if it was not dynamic, and if it were exposed on a different scraping endpoint which could be enabled or disabled by a command-line flag on the API server — so, basically, enable-high-cardinality-metrics or something, and then you can enable them.
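[As a rough illustration of that suggestion — assuming plain prometheus/client_golang rather than the API server's actual metrics machinery, and with a made-up flag name and metric — the expensive series could live in their own registry behind an opt-in endpoint:]

```go
// Sketch only: high-cardinality metrics in a separate registry, served
// on a separate path, gated by a flag. Flag and metric names are illustrative.
package main

import (
	"flag"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var enableHighCardinality = flag.Bool("enable-high-cardinality-metrics", false,
	"serve the expensive metrics endpoint")

func main() {
	flag.Parse()

	// Cheap metrics stay on the default registry at /metrics.
	http.Handle("/metrics", promhttp.Handler())

	if *enableHighCardinality {
		highCard := prometheus.NewRegistry()
		requests := prometheus.NewCounterVec(
			prometheus.CounterOpts{
				Name: "apiserver_request_by_namespace_total",
				Help: "Requests broken down by namespace (expensive).",
			},
			[]string{"namespace", "code"},
		)
		highCard.MustRegister(requests)
		// Expensive metrics are only scrapeable when explicitly enabled.
		http.Handle("/metrics/high-cardinality",
			promhttp.HandlerFor(highCard, promhttp.HandlerOpts{}))
	}

	http.ListenAndServe(":8080", nil)
}
```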
B
So it doesn't have namespace — the only extra label dimension being added here is category, and category is a statically defined string. So in the example here there's a category called system, and that includes the kube-system namespace, but the namespace itself isn't being included as a label. A request that's in the kube-system namespace just gets labeled system, and it's just that system piece that—
A
Yeah, yeah, and I'm saying even this is higher cardinality, because if you include these three or four types of categories, that's basically 3x-ing the cardinality of the apiserver request metrics, which are already basically a third of our total metric volume. Yeah, it's basically like—
C
But I think that, as long as we don't have an infinite number of values for this label, it should be fine, because most of the problems we've run into so far have been due to cardinality explosion — the actual footprint hasn't been that important. People have been reporting that the latency metric is really heavy, that it actually consumes a lot of memory on their backend, but we were always able to tell them that, well, this—
C
And yeah — even then, it's such an important metric that you need to spend that much memory to keep it, because it will tell you the health of your cluster, essentially. So yeah, I think it's reasonable, and I have many use cases for that kind of usage — for example, here we are creating groups with namespace and user agent, or user group.
C
There are other use cases where it could be good, like adding new labels for an infinite number of namespaces — say you group namespaces together and put a value on them.
A
Exactly — that's exactly what I'm saying, that's exactly what I'm saying! So, if it's on a separate endpoint... like, we have this dynamic cardinality-limiter thing, basically because people have exploded metrics so many times: we've enabled this flag which allows you to bound a label to a certain known set of values, and then to cluster those and put everything else into another group.
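[The bounding logic being described might look like the sketch below. If memory serves, kube-apiserver exposes a mechanism along these lines via its --allow-metric-labels flag; the helper here only illustrates the idea and is not that implementation.]

```go
// Sketch: bound a label to a known allow-list and fold everything else
// into a single catch-all bucket, so the series count stays constant.
package main

import "fmt"

// boundLabel returns the value unchanged if it is allowed, otherwise a
// single catch-all value, fixing the label's cardinality up front.
func boundLabel(value string, allowed map[string]struct{}) string {
	if _, ok := allowed[value]; ok {
		return value
	}
	return "unexpected"
}

func main() {
	allowed := map[string]struct{}{
		"kube-system": {},
		"default":     {},
	}
	fmt.Println(boundLabel("kube-system", allowed)) // kube-system
	fmt.Println(boundLabel("tenant-4711", allowed)) // unexpected
}
```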
A
That
way
the
memory
ends
up
getting
constrained
which,
in
conjunction
with
this,
gives
you
something
very
similar
to
this
right
without
actually
having
to
implement
much,
except
maybe
duplicating
this
adding
a
namespace
label
and
putting
it
on
a
separate,
endpoint
and,
and
so
so,
basically
I
I'm,
suggesting
something
simpler
than
this
right,
which
is
you
basically
want
to
know.
The
namespace
and
you're
worried
about
the
namespace,
Cardinale
and
I'm?
Actually,
an
user.
A
I mean, look, I'm okay with even completely unbounded stuff in an external, toggleable metrics endpoint, because you're only enabling this for short durations of time, right? So your concerns about, like, OOMing and crash-looping the API server go out the window, because you're doing this on purpose.
B
Well, I mean, let me go into a little more depth on the specific use case that I'm looking at here. I'd like to be able to come up with an SLO across our fleet of clusters for: what is the error rate on system requests?
B
I think it would be useful to have information like which resource was failing, which verb. But I suppose, if we just had the bare-minimum metric of category and code, then we could pull that information out of logs.
A
And you could also look at the other metric, apiserver_request_total, to look at the distribution of requests, and then basically do an inference, right? For your use case I would almost just go with the simpler metric and bypass all this stuff, because this is too complicated. It sounds like you can achieve the thing that you want with the simpler metric.
C
We have a regex for— on the thing that we have to bound the label values: does it support regex today? No?
A
But yeah, so I can get you everything except the user, and the user would be a problem. Even if we did something more complicated, it would still be a problem, right, because you're going to have card— no, users are unbounded: there's no practical cap on users. For namespaces there is, because of etcd and the size limits and whatever, and, like, discovery falling over if you have too many namespaces — namespaces are practically bounded.
A
I mean, you can do that with a regular metric, though. You can have a metric called— okay, what do you call this thing that you're trying to measure? Like, what—
B
[inaudible]

A
To measure system error rate — system error rate, okay. System error rate: you have a metric, and then basically in that metric you pass in the user, you pass in the namespace, and you pass in the error code, right, if there is an error. And then, in your function, you take the user and you put it into a group. That requires none of this, right — it requires no request-metrics categorization manifest. You don't need that; you can just write that in code, in your metric.
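[A sketch of that "write it in code" approach, with a hypothetical metric name and grouping rules: the unbounded user string never becomes a label value, only the small fixed set of groups does.]

```go
// Sketch: map the raw user to a bounded group inline, at recording time.
package main

import (
	"strings"

	"github.com/prometheus/client_golang/prometheus"
)

var systemErrors = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "system_error_rate_total",
		Help: "Errors grouped by a small, fixed set of requester groups.",
	},
	[]string{"group", "code"},
)

// groupForUser collapses the unbounded user space into a few buckets.
func groupForUser(user string) string {
	if strings.HasPrefix(user, "system:") {
		return "system"
	}
	return "workload"
}

// recordError is the "body" being described: callers pass the raw user
// and code, and only the bounded group ever becomes a label value.
func recordError(user, code string) {
	systemErrors.WithLabelValues(groupForUser(user), code).Inc()
}

func main() {
	prometheus.MustRegister(systemErrors)
	recordError("system:kube-scheduler", "500")
	recordError("alice@example.com", "429")
}
```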
A
P&F — P&F is a valid thing, right? There are the default P&F settings upstream, so I would imagine this would be completely configurable, right? Like, you could read your P&F settings and then throw the stuff into the— you could call your P&F settings and then put stuff in the right groups. Right? No? I mean—
B
Well, funny you should mention that — that's what we're thinking about using as a proxy for this right now. I don't think we have a priority-and-fairness metric that has the error code attached to it, so we might be able to get away with just doing that. But what we're looking at right now is just aggregating a metric out of the log— the HTTP logs from the—
A
I mean, there would have to be a really compelling use case for it, and I would support building it if there was a really compelling use case where we could not solve the problem otherwise, right? But it sounds like we can probably solve this otherwise, given our conversation. Yeah — does that sort of resolve the thing for you, or—
A
Yeah, and if you want, I can show you what I mean. So in the body where you're recording the metric — like, "record this metric", you know, you pass in these parameters and then you have this call to record the metric — before that body, right, you can just make a call to the P&F settings, get all of the user groups which are in workload status, and then do, like I said, inclusion, and put it into that workload status.
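[A rough sketch of what "make a call to the P&F settings" could look like with client-go. It is deliberately partial — real APF matching also covers groups, service accounts, and rule predicates, and the flowcontrol API version you use depends on your cluster.]

```go
// Sketch: look up which FlowSchema names a user and use its priority
// level (e.g. "workload-low") as the bounded group label for a metric.
package main

import (
	"context"
	"fmt"

	flowcontrolv1beta3 "k8s.io/api/flowcontrol/v1beta3"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// priorityLevelFor returns the priority level of the first FlowSchema
// whose subjects name this user directly.
func priorityLevelFor(ctx context.Context, cs kubernetes.Interface, user string) (string, error) {
	schemas, err := cs.FlowcontrolV1beta3().FlowSchemas().List(ctx, metav1.ListOptions{})
	if err != nil {
		return "", err
	}
	for _, fs := range schemas.Items {
		for _, rule := range fs.Spec.Rules {
			for _, subj := range rule.Subjects {
				if subj.Kind == flowcontrolv1beta3.SubjectKindUser &&
					subj.User != nil && subj.User.Name == user {
					return fs.Spec.PriorityLevelConfiguration.Name, nil
				}
			}
		}
	}
	return "catch-all", nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	level, _ := priorityLevelFor(context.Background(), cs, "system:kube-scheduler")
	fmt.Println(level) // use `level` as the metric's bounded group label
}
```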
A
Yeah, I mean, I don't think we can do this with API requests — we definitely can't add namespace to it, because the cardinality is so high and unstable anyway, yeah.
A
The scalability tests keep creating and deleting namespaces. So basically, what happens when you include a namespace label is that the scalability tests tend to blow up the API server because of the namespace churn. So we might not actually ever be able to use namespace as a label value.
A
Okay, cool, thanks — thank you, Tim! Yeah, yeah, thanks, Tim. Okay, yeah, the next item is the mentorship plan. Okay, yeah, sure, let's open up the mentorship plan. So we have a bunch of volunteers for mentorship.
A
And we have several people interested, so how should we do assignments?
A
How about this, since we have six minutes left: if you guys are on— and you guys have—
A
If the people who want a mentor would like to, please fill out your desired mentor; if there is a slot, then you can take it. So, I guess it'll be first come, first served, and that way we can distribute them.
A
David — David should be out, because David is— hold on. How do I do—
A
Okay, there we go. And I feel like Elena might only be able to handle one, but I don't know — she wrote this; she's out today too. But yeah, so if you guys are interested in a mentor, please fill this out, and then over the course of the next week—
A
Well, that's basically it for today, and we have about three minutes left. Is there anything else anyone wants to talk about?