From YouTube: SIG Instrumentation 20201210
Description
SIG Instrumentation Dec 10th 2020
B
Yeah, I can take it. Okay, welcome everyone. This is the SIG Instrumentation meeting. Today is December 10th, and this is actually our last instrumentation meeting for the year, so let's make it a good one. The first point we have on the agenda is continuing the discussion of the metrics naming policy.
C
This is about whether to use the kube_ prefix for metrics from components other than kube-state-metrics, and I think we...
B
Personally, I think that's okay — I think that's just fine — but it is inconsistent.
C
I think the main follow-up is that we should update our instrumentation guidelines so that it's consistent going forward. So I believe, actually...
C
And she's not in attendance today, so I think let's defer it. Maybe — yeah, I'm okay with that; we can defer it until next year. Okay. I also think that if someone wants to open a change to the guidelines, that might be a good place to discuss it in the meantime.
B
Okay, I guess if someone wants to go ahead and open those changes, they can do that. I would say for now we can move on to our next point. We don't have anyone tagged on this, but: what's going on with stable metrics? There are lots of bugs against the website for the list being empty.
A
Yeah, the plan is to elevate some of these. etcd object counts, for example, should probably be stable in 1.21 — we could flip that one. I've also been talking to some people in API Machinery who might be interested in taking up the API server processing time for request latencies, because right now all of our request latency metrics include webhook processing time, which makes them un-SLO-able, because we don't control webhooks.
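For context, the request-latency measurement being discussed here is the upstream apiserver_request_duration_seconds histogram; an SLO-style query over it might look like the following sketch (the quantile, window, and verb filter are illustrative assumptions). Any admission-webhook time a request spends is inside this measurement:

```promql
# 99th percentile of API server request latency per verb,
# including any admission webhook processing time.
histogram_quantile(0.99,
  sum by (le, verb) (
    rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])
  )
)
```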
B
I think that's kind of silly. An SLO is intended to show user experience, right? So if we don't include it, that kind of defeats the purpose of setting an SLO in the first place.
B
Let me try to get at it. I think that kind of thing is up to a provider to define, and if they have a definition like that, then they need to enforce it — with, I don't know, resource quota or whatever.
A
Let me find the thing — I can find this. Certainly, let's move on, and then I'll find the link.
B
But in general, I think we should consider graduating the request duration metrics to stable. There are a number of other ones, too — especially those that are important for setting SLOs — that I do think should be stable. I'm not sure I entirely agree about etcd object counts.
D
Going back maybe to the original topic about metrics stabilization, or the list of stable metrics: I was thinking about this, and the main problem is that we don't have SIGs with any active motivation to move these forward, and it's also not our job to run after them and chase them. But there was a group created —
D
a working group for reliability — and I was interested in talking with maybe Wojtek, or the group overall, about whether, as part of the definition of reliability, they can motivate introducing observability — some sort of stable signals that everyone can depend on to provide reliability — and drive the stabilization of metrics that way. For this, I know that's...
D
It makes sense, yeah. That was only a very rough first idea, barely scratched out, and maybe I discussed it with someone else — yeah, but...
B
Yeah, this sounds pretty reasonable to me. I do think we need to graduate the metrics that are important for SLOs, but I don't think we should create stability for things that are internal implementation details. That's how I see it. And etcd object count — I agree it's important, but I don't think it qualifies as an SLO in itself. So anyway, that's how I think about it.
A
What are people using for charts and alerts that we cannot change without breaking people, right? If you think about the ones that are the most important: yes, API request latencies, obviously — an SLO, because of the latency thing. etcd object counts is more Kubernetes-specific, but it's not that bad, because it's bounded to the number of CRDs in a specific cluster. And for Red Hat, I would think it is very important, especially with the OpenShift stuff, so I'm actually surprised.
B
I mean, maybe someone who actually still works at Red Hat can answer this, but as far as I'm aware there's no alerting on this metric. It is being reported to Red Hat, though, so that Red Hat knows the number of etcd objects in each cluster.
E
Yes, we expanded it a bit more, to include a bit more information in what we send, but we don't alert on it — no, not from what I remember off the top of my head.
B
Precisely — because Red Hat mostly follows the rule of symptom-based alerting, right, and that metric is one hundred percent a cause. With the number of etcd objects, it's very possible that an entirely different cause is making the API latency high, and that's why we want to alert on the API latency. Not to say that alerting on etcd objects is wrong — that makes sense too — but the combination is really what leads you to the root cause, right?
E
Yeah, exactly. I think Matthias can speak about the SLO alert — I think he joined. We used the ones that
E
are defined in the kubernetes-mixin, so we use that at Red Hat as well. So that's the thing that would make sense to be stable, from our perspective at least. But yeah, I agree — the etcd metrics as well, the object count, but for different reasons, I guess.
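The kubernetes-mixin alerts mentioned here are built on those same request metrics. A heavily simplified sketch of what such a latency alert looks like is below — the rule name, threshold, and windows are illustrative assumptions, not the mixin's actual definitions:

```yaml
groups:
  - name: kube-apiserver-latency
    rules:
      - alert: KubeAPILatencyHigh   # illustrative name, not the mixin's
        expr: |
          histogram_quantile(0.99,
            sum by (le, verb) (
              rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m])
            )
          ) > 1
        for: 10m
        labels:
          severity: warning
```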
E
For us — yeah, sorry — for us, our clusters can vary a lot in size, so what counts as a large number of objects is different for each customer, right? So it doesn't make that much sense. For us, the causes are not something we alert on.
D
Yeah, so I just wanted to tell you: if you want to define an SLO, you usually want to have some boundaries. You cannot have an infinite boundary — "we support any possible number of etcd objects within this SLO." You give some boundaries, like the open-source scalability envelope: this number of objects of this type is supported, and that is within the SLO. It's not part of what we monitor with the SLO; it's something we use to decide whether the SLO can cover such a cluster or not.
B
I guess — I mean, we never discussed any rules like that, right, so it is totally up for discussion. What I just worry about is that if we accept something like etcd object counts, it kind of opens a Pandora's box, and all of a sudden everybody wants all the metrics to be stable.
B
That's just what I worry about. And I agree — I guess that's what I've been saying this entire time: the SLO ones definitely need to be stable; after that it becomes more vague.
A
Broadly, yeah. And from the API Machinery side: I am personally planning on making etcd object counts a stable metric, because it's important to me, and I will bring it up. I'm sure the other people there also think it's important.
E
I don't disagree with making things stable, but I'm curious: do we have any versioning as well? Like, if in the future we want to go to a version two but still keep the previous metric, or something along those lines — or do we just keep it forever once we make it stable, and introduce another metric in case something goes wrong?
A
That makes it — you'd have to do the latter, yeah, because otherwise you could potentially break ingestion.
B
I guess — first of all, yes, component owners can propose stable metrics, but the process does say that we need to approve them. However, the thing I was trying to get at is that there are metrics I don't think can ever be stable, and the etcd ones mostly fall into
B
this category: things that are implementation-specific, where the implementation might evolve over time in some specific component. Those kinds of metrics, I don't think, can ever be stable. I realize we're pretty far in with etcd — probably etcd will always be there — but no...
A
I mean, there's — you know, there's k3s, right: they have the storage adapter backed by SQLite. So actually, that's a great argument that before we turn it stable, we should maybe just have a storage object count — an internal storage object count. A resource-based object count.
B
Yeah, I think I could agree with that more: a resource object count, okay, instead of an etcd object count.
B
That sounds good — great. Okay, we have 10 more minutes left and we've got two more topics. Lili, take it away.
E
What did I write? I guess — recently, more recently, in kube-state-metrics we've had a bunch of issues with feature requests around metrics we don't necessarily expose, that is, things that are not a one-to-one mapping to the Kubernetes API. And since we had so many, and a lot of them do make sense — there is nothing wrong with them — we were thinking of either creating a new sub-project of SIG Instrumentation for those...
E
I brought it here, but the thing is: either we create a new project that is not part of kube-state-metrics but is part of SIG Instrumentation, and it exposes those metrics that we want, or we add it as a new metrics endpoint in kube-state-metrics that you enable via a flag. I think that was the main discussion we had on the issue.
E
Yeah, we had those ideas as well, and we had them mainly around custom resources, but not sort of...
E
Yeah, around custom resources it makes sense. But the specific metrics — calculations that we know should be there but just can't be mapped one-to-one — are common for every user, right? So it would make sense for them to be done by us, for everyone, essentially. Go ahead.
B
Yeah, I think that makes sense. I'm happy with trying this within kube-state-metrics as a separate metrics endpoint.
B
I don't get the impression that we could make use of the optimizations we did for the normal metrics endpoint in kube-state-metrics, because those are pretty specific to the current model — aggregations won't work, and that's specifically what this user was requesting.
E
In kube-state-metrics, yeah. I don't necessarily care about the implementation details, but I want those common calculations done by either us or whichever component is exposing the metrics, so that the people who have the knowledge of how those metrics should be calculated can make them. Whether that's in code or in configuration, I'm fine with either.
B
Kind of both. Some aggregations that users are asking for would just cause wild amounts of cardinality that are completely useless unless you're doing exactly that aggregation, so pre-calculating it would save a ton of resources, which would make sense. So yes, kind of both.
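To illustrate the cardinality point with a concrete (assumed) example: kube_pod_status_phase emits one series per pod and phase, but a consumer who only wants per-namespace counts aggregates the pod label away. A Prometheus recording rule — sketched here as one way to pre-compute this, outside kube-state-metrics — would look like:

```yaml
groups:
  - name: ksm-aggregations
    rules:
      # One series per (namespace, phase) instead of one per pod:
      - record: namespace_phase:kube_pod_status_phase:sum
        expr: sum by (namespace, phase) (kube_pod_status_phase)
```

Doing the same aggregation inside kube-state-metrics would go one step further: the raw per-pod series would never need to be scraped or stored for this use case at all.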
B
That's kind of the point, right — it's extremely cheap to do this within kube-state-metrics. Yeah, yeah, for sure.
E
Yeah, we had a bunch of them that we closed before, because the first paragraph of the docs says we stick to the Kubernetes API one-to-one and just do the mapping. So yeah. The only other thing is: I'm happy to have this as part of kube-state-metrics, but for CRDs, whether we want to put it under this or under kube-state-metrics itself, I'm not sure — we've been getting a lot of requests for CRD metrics.
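A hypothetical sketch of what configurable CRD metrics could look like — every field name below is an illustrative assumption for the sake of the discussion, not an actual kube-state-metrics API:

```yaml
# Hypothetical config: generate a gauge from a field of a custom resource.
kind: CustomResourceMetrics        # assumed kind, for illustration only
resources:
  - groupVersionKind:
      group: example.com
      version: v1
      kind: Widget
    metrics:
      - name: widget_spec_replicas
        help: Desired replicas of each Widget
        path: [spec, replicas]     # path into the object to read the value
```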
A
Yeah — I mean, the reason I was asking about the working group is that I actually have some interest in extending kube-state-metrics, and I was wondering what the forum for that was. There's just a lot to discuss, possibly.
E
Yeah, I'm happy to set that up. Currently it's really just opening an issue and discussing there between the maintainers, but we can do that as well — we can have a separate working group. There are definitely a lot of things we could do, especially if we accept the new aggregated format; I think there are a lot of nice metrics we could make. So yeah, sounds good — we can discuss that.
B
Yeah, the reason I was mentioning configuration for the aggregations is that I feel we're never going to be able to make everyone happy with a chosen set of aggregations — it's almost like doing queries. That's why I feel it has to be configurable, but that makes the whole problem a lot more complicated. But yeah — kube-state-metrics is a SIG Instrumentation sub-project, so I don't see why we can't discuss it here, actually.
A
Well, mostly because we have four minutes left and the extension thing would be a lot — it could easily eat up the entire time.
B
I would say we can start it within kube-state-metrics. If we realize it may outgrow kube-state-metrics, I think we can still create a new sub-project then. Should we...
E
Oh no, I just wanted to say that, for us, we just emailed SIG Scalability, and we're writing a new test for this, because it's quite unique, I think, to kube-state-metrics. But yeah, that's all — sorry!
A
Marek, were you talking about kube-state-metrics? Or were you talking about the SLO
D
thing? I mean kube-state-metrics. I just wanted to make sure, because this is something that I was also looking at for metrics-server — to make sure that we have a consistent approach, one that was discussed with SIG Scalability. If you have tests, I would be happy — I would also want to look at them, because this is something I'm looking at.
B
We haven't actually written them yet, but SIG Scalability kind of instructed us where those would go. Yeah.
B
Okay, looks like we have discussed everything on our agenda for today, and we've got one minute left. I guess we can give everybody that one minute back — it's over, so time's up. This was our last meeting for the year, so everybody have wonderful holidays and a happy new year. See you all next year!