From YouTube: SIG Instrumentation 20210819
Description
SIG Instrumentation Bi-Weekly Meeting August 19th 2021
A: All right, it's August 19th. This is SIG Instrumentation, and as Han has just mentioned, we only have one item on the agenda, which is mostly a discussion around the revisited metric stability classes. Do you want to kick it off?
B: Sure. To update: Elena and I went to SIG Architecture and brought this up. Basically, people were on board. I think there's probably going to be a little bit of bikeshedding around the actual stability classes and the guarantees, but I think that's to be expected, so we should probably get to what the stability classes are actually going to mean semantically.
A: Well, were there any comments on the existing proposal? We had previously discussed internal/debug, alpha, beta, and stable, I think, right?
B: They liked the two additional stability classes, the internal (or development) one and beta. I also brought up the lagging question.
B: So basically, we also went to WG Reliability, and Davide suggested that we lag metric stability classes a release behind feature releases. The reason for this is that they don't actually start mandating metrics until a beta release, and even after a feature goes GA there's not enough information about usage; you don't get widespread usage until something is actually GA. So it doesn't make sense to GA a metric without knowing how the stuff is actually being used.
A: To what capacity, actually? I mean, if we go with the same naming as features, then people are going to have similar expectations for those names, right? But it sounds like we're expecting situations where we GA a feature with a beta metric, and we're going to have to remove or significantly change that metric, right?
A: That has the potential to confuse end users. They can expect the metric name to remain the same. Okay, if that's the guarantee that we're giving, I think that would be okay. I need to think about it a bit more, but this could be workable, at least as long as we have some pretty concrete rules for it. I think anything is really okay, as long as the expectation is not simply that it's exactly like features.
B: Well, the metric itself, so the metric with that name, will not disappear for N releases, even in beta. In beta, however, labels can be added or removed. Okay.
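The guarantees sketched above can be summarized in code. This is a minimal illustration of the proposal as discussed, not the actual API in `k8s.io/component-base/metrics`; the type and method names here are invented for the example.

```go
package main

import "fmt"

// StabilityClass models the proposed metric stability classes from the
// discussion. The guarantees encoded below follow the meeting's sketch:
// beta freezes the metric name but not the labels; stable freezes both.
type StabilityClass int

const (
	Internal StabilityClass = iota // developer-facing, no guarantees
	Alpha                          // may change or disappear at any time
	Beta                           // name kept for N releases; labels may change
	Stable                         // name and labels frozen until deprecation
)

// NameGuaranteed reports whether the metric name is guaranteed to persist
// across releases.
func (s StabilityClass) NameGuaranteed() bool { return s == Beta || s == Stable }

// LabelsGuaranteed reports whether the label set is also frozen.
func (s StabilityClass) LabelsGuaranteed() bool { return s == Stable }

func main() {
	for _, s := range []StabilityClass{Internal, Alpha, Beta, Stable} {
		fmt.Printf("class=%d name=%v labels=%v\n",
			s, s.NameGuaranteed(), s.LabelsGuaranteed())
	}
}
```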
A: So I think I'm okay with trying it exactly like this, for the one and only reason that people tend to already use beta features pretty heavily, and so I feel like by the time we get to that point the metrics are going to be exercised enough that there aren't, you know, wild leaks.
B: Also, you know, when people add metrics to a KEP, they don't have to be new metrics. They can use existing metrics as a way to measure their own feature, right? Like the apiserver request latencies, or duration seconds, or whatever, is a perfectly acceptable metric to measure your feature.

A: Are you talking about production readiness right now?

B: Yeah, production readiness mandates metrics for beta features, and production readiness is, in my mind, an extension of the KEP.
B: Yeah, I'm saying that we shouldn't necessarily expect a proliferation of metrics. People could be ahead of the curve: they could be using stable metrics for a beta feature, or stable metrics for an alpha feature, because they're using existing metrics.
B: I think at the outset they should be internal, because I'm not sure how people are going to use them. However, given the wide scope and nature of API Priority and Fairness, you're going to want to have some public, stable metrics for gauging how your requests are getting throttled or prioritized, or whatever. You're going to want some set of metrics for that, because people's clusters are heavily affected by that feature.
A: I don't see why this wouldn't be able to be connected back to request errors or latency.
B: It's going to affect latencies, definitely, like the duration-seconds distributions; they're going to be affected by it. But how they're being affected, you don't really know; you just know that something is happening. I mean, how do you tweak the settings you have? You have no data, really.

A: I agree, and...
B: That's not internal.
B: No, no, no, because the difference here is that internal, when we had discussed it, was more like a developer flow, someone who's developing Kubernetes. But in this case this feature affects end users and cluster admins, who would want it.
B: At a high level, that's the number you want, and you don't want it to be internal, because it's not an internal concept. It's literally what the thing is doing and how it's affecting your cluster, and you should be able to SLO off of this, because API Priority and Fairness, when it's being exercised, slowly degrades your cluster.
D: Yeah, I think what I was hoping to get at is that it seems like we're putting cluster operators in a bit of a weird position if we launch a GA feature without metrics that are stable. It almost feels like there should be some very small set of metrics that go stable at GA, and then lots of metrics that come afterwards, a lot of which are debug and will never go stable.
B: In this case you're going to want some set of metrics, I agree. People are going to be using request durations, which is a stable metric, to know what the latencies are. But the saturation metric is kind of a tricky one, and you're not going to know which dimensions it even makes sense to be measuring until people start using it. It's starting to be used now, and we're realizing a lot of weird things about it. So I think that's exactly why.
A: This is basically where we're describing a similar situation: right now you can use the proxy metric of request duration, but it would be good to figure out something that describes the situation more clearly.
A: I mean, you can still go through metric stability after a feature has gone GA. Yeah, for sure. That's what I'm saying; that's what you just described. Now we're seeing more widespread adoption of this feature, and we're realizing...
A: We need better insight into it, and so we create a new alpha metric, just as an example. Even if the feature went GA already, we could still experiment with these metrics. This is going to happen all the time: we're going to realize there are some more aspects that we would like to understand better, even if that's just an internal metric at first. Maybe one day we'll realize, okay, actually this is something that people want to SLO off of, so we shouldn't...
B: No, no, we have static analysis, so we can basically enforce beta metrics being promoted; we just have to decide the number of releases.
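The promotion rule being negotiated here (promote or deprecate after N releases, with deprecation buying one extra cycle) could be checked mechanically. A minimal sketch, assuming the two-release threshold floated later in the meeting; the function name and return strings are invented for illustration, not part of any real tooling:

```go
package main

import "fmt"

// requiredAction sketches the escape-hatch rule discussed in the meeting:
// once a beta metric has existed for maxBetaReleases, its owner must either
// promote it to stable or deprecate it, and deprecation buys exactly one
// more release before removal.
func requiredAction(releasesSinceBeta int, deprecated bool) string {
	const maxBetaReleases = 2 // the number floated in the discussion
	switch {
	case releasesSinceBeta < maxBetaReleases:
		return "keep as beta"
	case !deprecated:
		return "promote to stable or deprecate"
	case releasesSinceBeta == maxBetaReleases:
		return "deprecated: one more cycle"
	default:
		return "remove"
	}
}

func main() {
	fmt.Println(requiredAction(1, false)) // keep as beta
	fmt.Println(requiredAction(2, false)) // promote to stable or deprecate
	fmt.Println(requiredAction(2, true))  // deprecated: one more cycle
	fmt.Println(requiredAction(3, true))  // remove
}
```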
A: That's then up to Production Readiness to decide, right? Right, yeah. I think that's reasonable. It's really hard to put a finger on at what point people really start using a metric.
A: I want to say, in my experience, after four releases people definitely start using it, but I would say two releases is probably a reasonable number.
A: Three, maybe. Damien may have some idea about this as well, but with the kubernetes-mixin, I want to say we start using the metrics pretty much right away, when they're beta features, and then people slowly start adopting that. So yeah, I think two releases is a reasonable number for promotion to stable, because at that point the kubernetes-mixin will certainly have exercised them.
B: We can provide an escape hatch. We can say that after two releases you either have to deprecate your metric or you have to promote it, and deprecation just buys you another cycle.
B: I don't think so; I think it depends on the feature. I could see API Priority and Fairness probably using it, but a lot of other things, like scheduler stuff, are not super contentious: these are the metrics that we want for some of those features. I think some of the auth stuff is pretty simple too. You will know after two releases what metrics make sense.
A: Yeah, I think that sounds reasonable. Are there any other points that we had open?
B: Yeah, one more thing: we're going to be able to run static analysis against everything except the custom collectors, because those are dynamic, and there's a set of other metrics which are dynamic. So those automatically have to be internal.
A: Runtime resource metrics are going to move to the CRI stats endpoint anyway, and then we practically have a stable list there as well. Those are the only collector metrics I would be worried about, but it seems like we already have a solution on the horizon for resource metrics. So to me this sounds fairly reasonable, actually.
B: Yeah, it kind of sucks; you kind of want the list. I want to auto-generate all of the metrics. I would like to do it statically, but I think you have to do it at runtime because of the dynamic nature of some of the metrics.
B: You know, at registration time; I'm not talking about scraping the endpoint. We basically have a wrapper around the registry, so we know what gets registered.
A: A description, right? Yes, but there's a collect-descriptions method; I think it's called Describe or something, and it will return descriptions. But there's no contract that it has to return the exhaustive list of what it potentially knows, and there are collectors that dynamically return the descriptions.
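The problem described here can be sketched with a stripped-down stand-in for the Prometheus collector pattern. This is not the real `prometheus.Collector` interface (which sends `*prometheus.Desc` over a channel from `Describe` and pairs it with a `Collect` method); the simplified `Desc` and collector types below are invented for illustration:

```go
package main

import "fmt"

// Desc is a stripped-down stand-in for a metric description: just a name.
type Desc struct{ Name string }

// Collector mirrors the shape of a Describe-style interface: it sends the
// descriptions a collector *may* emit. As noted in the discussion, there is
// no contract that the list is exhaustive, which is why fully dynamic
// collectors defeat registration-time auditing.
type Collector interface {
	Describe(ch chan<- *Desc)
}

// staticCollector knows its metrics up front, so Describe is complete and
// the metrics can be audited for stability at registration time.
type staticCollector struct{ descs []*Desc }

func (c *staticCollector) Describe(ch chan<- *Desc) {
	for _, d := range c.descs {
		ch <- d
	}
}

// dynamicCollector discovers metric names only at collect time (for example
// from an external source), so Describe legitimately sends nothing.
type dynamicCollector struct{}

func (c *dynamicCollector) Describe(ch chan<- *Desc) {}

// auditable gathers whatever a collector declares at registration time.
func auditable(c Collector) []*Desc {
	ch := make(chan *Desc)
	go func() { c.Describe(ch); close(ch) }()
	var out []*Desc
	for d := range ch {
		out = append(out, d)
	}
	return out
}

func main() {
	s := &staticCollector{descs: []*Desc{{Name: "apiserver_request_total"}}}
	fmt.Println(len(auditable(s)), len(auditable(&dynamicCollector{})))
}
```

The auditor sees one description from the static collector and none from the dynamic one, which is why the latter category has to default to internal.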
B: Right, custom collectors. For those we're not going to be able to do anything about it, but for the set that is registered, this is basically as comprehensive as you can get.
C: Okay, but shouldn't the custom collectors implement the interface that we have, with the stability API and everything?
C: I was assuming they implement not the normal register interface, but the one you implemented in component-base. So I think for that one you can actually get the number of...
B: There's a loop or something; yeah, it's terrible. So those, I mean, will be permanently internal. One should not do that, and we will audit them in the future and make sure it doesn't happen again, but yeah, they exist, and we should probably try to deprecate those particular metrics.
B: The set of metrics that we can't determine at registration time, we should try to deprecate all of them, basically.
B: Yeah, cool. Then I guess we have agreement on most aspects of this. Well, that was a productive session, agreed. Cool. Okay, then I guess we can work offline on the KEP.