From YouTube: SIG Instrumentation 20200107
Description
SIG Instrumentation January 7th 2020
A: Recording started. Welcome, everyone; today is January 7th, 2021. This is the SIG Instrumentation community meeting. We have a couple of items on the agenda. Elena, do you want to kick it off?
B: Okay, the first one on here. Oh great, now my internet connection's unstable; hopefully you can still hear me. Okay, so for this one I just wanted to mention: this is not the first sort of thing like this that I've seen, but basically a critical metrics regression happened. SIG Node got tagged; they didn't notice.

I guess David very kindly tried to approve the fixes and whatnot, but they just didn't go through, and they sat for like three months. We've had two releases where all of the machine metrics from cAdvisor have been totally broken, and no one tagged SIG Instrumentation.

And it's not really a priority for SIG Node, so, anyway, I just wanted to give people a heads-up. As soon as I noticed that was a thing, I reviewed and dealt with the PR posthaste, and I backported it to both of the affected versions. So it should be on its way to being... I think it's fixed already; it'll probably get released in the next patch releases for 1.19 and 1.20.
A: So I'm not sure that the thing that you described was incorrect. Actually, that sounds like it makes sense, right? Because component owners own their own metrics. For us, our duty is more about the shape and form, and about conforming to certain things. But if someone deletes the HTTP route to their metrics endpoint, yeah, their metrics are going to disappear.
B: But we're not doing anything like that right now. So I don't know if this is kind of a segue into: we really need to pick stable metrics and actually do some sort of testing on them to make sure that we are catching those regressions. But that's kind of the segue into the other topic that I have on... yeah.
C: I think this makes perfect sense. It's just that we don't have any stable metrics, so yeah, we'll...
A: I actually have somebody, I wanted to ping them right now, who wants to do it for API machinery.
B: So do we need to go and tell SIGs, like, "yo, you need to pick your stable metrics," or are we actually... well, this is kind of segueing into the next topic. So before we do that, just to wrap up on this one: it's fixed, yay. I think that we need some sort of strategy, because, fundamentally, when metrics break, people aren't like, "oh, you know, SIG Node is incompetent"; they're like, "Kubernetes metrics suck." So I think it...

It's definitely a concern of SIG Instrumentation, and I think that if people had tagged instrumentation, it probably would have been noticed at a triage meeting and dealt with posthaste, that kind of thing.
A: There's a couple of nuanced things about the node one specifically which make it a little bit harder to tie in with the stable things, mostly because they use the custom collectors, which are dynamic in nature, and which we explicitly excluded from being able to be marked as stable, because things like what probably happened here are more apt to happen.
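To make the distinction concrete, here is a minimal sketch (illustrative Python, not the actual Kubernetes or client_golang code) of why custom collectors resist stability guarantees: a declared metric's name and labels are fixed at registration time and can be checked statically, while a custom collector computes its descriptors at scrape time, so what it exposes can silently change or vanish.

```python
# Sketch only: a "declared" metric vs. a dynamic custom collector.
# All class and metric names here are illustrative assumptions.

class DeclaredGauge:
    """Name and labels are known up front and never change."""
    def __init__(self, name, labels):
        self.name = name
        self.labels = labels

    def describe(self):
        # Always the same descriptor -- tooling can verify this statically.
        return [(self.name, tuple(self.labels))]

class CustomCollector:
    """Descriptors depend on whatever the source reports at scrape time."""
    def __init__(self, discover):
        self.discover = discover  # callable returning {metric_name: labels}

    def describe(self):
        # Output can differ between scrapes; nothing is checkable before
        # the collector actually runs.
        return [(name, tuple(labels)) for name, labels in self.discover().items()]

declared = DeclaredGauge("node_cpu_usage_seconds_total", ["cpu"])

machine_state = {"machine_cpu_cores": ["node"]}
dynamic = CustomCollector(lambda: machine_state)

print(declared.describe())  # stable: always the same descriptor
print(dynamic.describe())   # whatever the source exposes right now

# If the source stops reporting (e.g. a registration bug upstream),
# the metrics simply vanish, with no static check to catch it:
machine_state.clear()
print(dynamic.describe())   # → []
```

This is the failure mode described above: the cAdvisor metrics disappeared without any declaration-time check being able to notice.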
C: We did, last time we discussed it, though, we said that there are not... so, the metrics that are there...

I don't think we said we necessarily need parity, but that what is there is not sufficient; that's what we said. I don't think we ever necessarily defined what completion would look like. I guess that would be the best action point. If we can have parity, amazing; I'm not sure if that's necessarily the goal.
B: I mean, the good news is that, since I switched jobs within the past month at Red Hat, I am now going to be significantly more involved in SIG Node and can help with that. Mostly I was just trying to suss out... Han, I think I agree with your assessment.

This is one of those specialized, weird cases where the metrics are totally critical and they're super special-cased, and we really need to figure out a strategy for what to do with them, but it's not necessarily a generalizable one. I think I agree with that, but I think nonetheless it will be good and valuable to talk about what we want to do for stable metrics, and I guess for KEPs in general. I didn't put a KEP review item on this agenda.

I figured we'd wait until the next meeting, but yeah, that's all I had for that one. I don't know if anyone has anything to add.
A: This is the 31st... so we should probably try to get a KEP for this by the 31st. It almost sounds like Elena is volunteering to write a KEP.
D: I did want to point out one thing, just if people are curious about the details of why the regression slipped through: cAdvisor has its own end-to-end testing that it runs in its presubmits, which checks things like the Prometheus metrics and all of the random JSON endpoints it has, but Kubernetes doesn't have tests for cAdvisor's specific endpoints, because, at least thus far, it's assumed that those have already been tested.

As part of the cAdvisor release process, it does have tests for the Summary API, which is usually enough to exercise the "is cAdvisor working in Kubernetes" question. But this case was just a failure to register a bunch of metrics, and so that was why it wasn't caught in any of our presubmits.
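The gap described here, a registration failure that no existing test exercises, could be caught by a check that scrapes the text-format metrics output and asserts the expected metric families are present. A hedged sketch follows (the metric names and the idea of such a presubmit are illustrative, not an actual Kubernetes test):

```python
# Sketch of a presubmit-style check: scrape Prometheus text-format output
# and flag expected metric families that are missing. The metric names
# below are examples, not a definitive list of what Kubernetes verifies.

def metric_families(exposition_text):
    """Return the set of metric family names in a text-format scrape."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        # the family name is everything before the first '{' or space
        names.add(line.split("{")[0].split(" ")[0])
    return names

def check_expected(exposition_text, expected):
    """Return the expected families missing from the scrape, sorted."""
    return sorted(set(expected) - metric_families(exposition_text))

scrape = """\
# HELP machine_cpu_cores Number of logical CPU cores.
# TYPE machine_cpu_cores gauge
machine_cpu_cores 8
container_cpu_usage_seconds_total{container="app"} 12.5
"""

# A non-empty result means a family silently disappeared -- exactly the
# failure mode that slipped past the Summary API tests.
print(check_expected(scrape, ["machine_cpu_cores", "machine_memory_bytes"]))
# → ['machine_memory_bytes']
```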
A: I mean, this is also why I kind of wanted that thing where... again, I'm going back to the thing that I argued with Frederic about, with the metric and the descriptors. Because, basically, if we could have a listing of all of the metrics that were registered, then it would be easy to just dump it into something and generate: "hey, look, these are the things, here's the metadata." I mean, you could also expose this through a debug endpoint or whatever.
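The idea being floated here can be sketched very simply: if every registered metric carried its descriptor metadata, a debug handler could just serialize the registry, and diffing that dump between releases would surface disappearing metrics immediately. This is an illustrative sketch of the proposal, not an existing Kubernetes endpoint; the endpoint path and field names are assumptions.

```python
import json

# Sketch: a registry that records metadata at registration time, plus a
# dump function standing in for a hypothetical debug endpoint.

REGISTRY = {}

def register(name, help_text, metric_type, stability):
    # In real code this bookkeeping would live inside the
    # instrumentation wrappers that components register through.
    REGISTRY[name] = {
        "help": help_text,
        "type": metric_type,
        "stability": stability,
    }

def debug_dump():
    """What a hypothetical /debug/metrics-metadata handler could return."""
    return json.dumps(REGISTRY, indent=2, sort_keys=True)

register("machine_cpu_cores", "Number of logical CPU cores.", "gauge", "alpha")
register("apiserver_request_total", "Counter of apiserver requests.", "counter", "stable")

print(debug_dump())
```

Comparing two such dumps (old release vs. new) reduces "did any metrics vanish?" to a set difference over the JSON keys.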
B: For Han: you mentioned a date for the enhancements freeze. I just checked everything in SIG Release; they have not released any dates yet, so...
A: Okay, I don't know, because I just got pinged by David Eads in the morning to look at something, and I asked him when he needed me to look at it, and he said maybe...

He said, "just look at it before KEP freeze," and I said, "what's KEP freeze?" and he said January 31st or so. Oh, he says he's guessing January 31st or so. So, maybe. Okay, okay, I was wrong, I just...
B: I was just curious, because I was looking to update the dates in our doc for the next release, and they haven't been published yet; otherwise I would have done it.
C: To wrap up this item, it sounds to me like we should finish up that KEP to get whatever we call feature parity in the kubelet metrics endpoint with the kubelet resource metrics endpoint. Then I think we can discuss, maybe, for these particular metrics...

We can even have a special strategy, if we can find something generalizable, to mark these as stable, because I think these do make sense to have stable. And then I think we can talk about other stable metrics for 1.21, potentially.
D: So we do test the resource metrics endpoint; it's just that... the thing to keep in mind is that the resource metrics endpoint is a very, very, very small subset of the metrics cAdvisor provides, right? Very basic CPU and memory. The metrics that we're missing are machine metrics, which include topology information and a really wide variety of stuff that probably doesn't belong in the resource metrics endpoint.
B: That's in their enhancements repo; the one that's pre-stability is the metrics overhaul.
C: But what Han said, I think, is interesting to at least think about. Maybe we can talk about that once we've fleshed out that KEP a little bit more, but I think I could agree to marking the ones that we have today as stable in the... yeah, right, we're going this way.
B: So, and I think... let's see, does the KEP metadata say this is... no, it's "implementable," not "implemented." Well, this is on my list of node KEPs that need attention, so... all 50 of them, or whatever. Okay, that makes sense to me. I think that we have a reasonable path forward there. So then my next question is: what do we do about stable metrics in 1.21? Are we picking them? Are we telling SIGs to pick them? SIGs have to... what's the approach here?
B: Okay, so this process is in the KEP; we need to tell them about it, and we probably need to send out some sort of communication to the project. Yes.
C: And I would expect, if we haven't already, I would expect this to be a document in the community repo under the developer docs.
A: They will require a second instrumentation approval because of the auto-generated stable metrics thing (and thank you, Merrick, for that). So yeah, so it will, yeah. Definitely we will have to approve.
C: On a similar note, I think we started a conversation at the end of last year about a couple of candidates that we feel should probably be proposed, or at least be reviewed to be proposed, eventually. I believe Han wanted to introduce a new, generic one that exposes storage object count, so that we don't have one that is specific to etcd. I think we could already add this; I mean, I don't think this needs a KEP or anything. Yeah.

Because a lot of people use that etcd one, I would add the new one in parallel, I would say. Yeah, yeah. Of course, no, I'm not... I'm...
A: I'm not going to delete the counts, yeah; that's not happening.
C: But then the other ones are, obviously, the apiserver metrics, and I would even say the scheduler metrics are probably a good candidate as well, like scheduling latency.

That's fine. As I said, I primarily want to pick some to review, so that we can make the changes that we would want to make, so that then, in the following release, we could mark them stable.
E: I just wanted to mention that there was also an idea bounced around about a Kubernetes reliability working group that would work on and introduce more SLOs. So, basically, as a comparison: mainly Kubernetes, but in particular the scalability team, proposed and maintained some official Kubernetes SLOs, and looking at their definitions, they're pretty... not precise in the definition of the metric, because they don't define the metric. And based on my personal contact with scalability, I guess this is defined like... they have some Prometheus instance that they run and define them on, but it's not public.
C: That makes sense. We actually have the kubernetes-mixin, and that already has a bunch of common Prometheus SLOs. So I think it would only make sense to have all of these definitions in a central place, and this is already one that is widely used in the community.
B: So, yeah.

I have been trying to look through to actually find the thing that talks about how to make a metric stable. It's split into three or four pieces, and the only one that talks about metric stability, which also talks about the static analysis, that one is marked as "implemented." So... but we didn't know.
C: I would say: let's introduce a process after we've done this a couple of times. Yeah, it seems a little premature at this point.
A: Let's... okay, let me just... I can create the storage object one pretty easily, so let me just create a storage object metric, and then I'll submit a PR, and then, in a separate PR, I will just upgrade it to stable.

Yeah, sure, sure, well, yeah, I will test the metric, but it will... I'm literally going to define it in exactly the same way that the etcd object counts metric is defined, and it will then...
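The plan above, a generic storage object count gauge added alongside the etcd-specific one and defined the same way, can be sketched as follows. This is an illustrative sketch, not the upstream implementation: the `GaugeVec` stand-in, the generic metric name, and the label name are all assumptions.

```python
# Sketch of adding a generic storage object count metric in parallel with
# the existing etcd-specific one, rather than deleting the old metric.

class GaugeVec:
    """Minimal gauge-with-labels stand-in for the real client library."""
    def __init__(self, name, help_text, label):
        self.name, self.help, self.label = name, help_text, label
        self.values = {}

    def set(self, label_value, value):
        self.values[label_value] = value

# existing etcd-specific metric, kept for compatibility
etcd_object_counts = GaugeVec(
    "etcd_object_counts",
    "Number of stored objects per resource.",
    "resource")

# new generic one, defined in exactly the same way but not etcd-specific
# (the name below is an assumed placeholder, not the final upstream name)
storage_object_counts = GaugeVec(
    "apiserver_storage_object_counts",
    "Number of stored objects per resource.",
    "resource")

# during the migration, both gauges are updated in parallel
for gauge in (etcd_object_counts, storage_object_counts):
    gauge.set("pods", 1342)

print(storage_object_counts.values)  # → {'pods': 1342}
```

Keeping both metrics in parallel matches the point made earlier in the discussion: a lot of people depend on the etcd one, so it stays, and only the new generic metric is a candidate for promotion to stable in a later PR.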
B: So, Han, can I ask you, since you're going to do the testing of this: once you finish your testing, would you send the project-wide communication to k-dev, saying, "we're doing this this release, please participate, here's how"?
A: About... oh, it's just... oh yeah, that we're setting it! No, no, I'm going to talk about it with API machinery. If the leads there are not okay with it, then I...

So, okay, then, yeah, okay, I will just... I will, yeah, talk to some people, make sure that they're okay with it, and then I will add the metric.
D: All right, I think that's everything that was on our agenda for today. Happy New Year, everyone, and you get three minutes back.