From YouTube: SIG Instrumentation 20201112
Description
SIG Instrumentation November 12th 2020
A
All right, welcome everybody. Today is November 12th. This is SIG Instrumentation, and we've got a couple of items on the agenda today. The first one is David's, so take it away.
B
Cool. So basically, someone came to SIG Node recently talking about the conformance tests that they're trying to add for node-related APIs.
B
One of those APIs is the node proxy API. What the conformance folks would like to do is either be able to test all of the APIs or start removing the ones that are not going to be included in conformance. During the meeting, most of the SIG Node folks didn't have any node-related use cases for the proxy API, but I believe the Prometheus server makes fairly heavy use of it.
B
So
the
first
question
is:
do
we
want
to
keep
this
api
around
and
I
suspect,
unless
I'm
mistaken,
I
suspect
the
answer
is
yes,
because
a
lot
of
folks
depend
on
it,
but
we
could
also
explore
like
what
it
would
mean
if
we
didn't
have
that
api
and
what
we'd
have
to
do
instead
and
then.
The
second
question
is
okay:
how
do
we
actually
conformance
test
this
endpoint,
and
the
issue
is
that
kubernetes
conformance
considers
almost
all
of
the
node
the
stuff
that
we
run
on
nodes
to
be
in
implementation
detail.
A
Yeah, so I am fairly certain that a lot of people in the Prometheus ecosystem do make use of this, for exactly what you just said. I think basically people use it so that they don't have to discover nodes in some other way, or maybe it has to do with RBAC. Essentially, what they do is use this API to query every single node in the cluster for its metrics.
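To make that pattern concrete, here is a minimal Go sketch (an editor's illustration, not something shown in the meeting) of reading a node's kubelet metrics through the apiserver's node proxy sub-resource, the endpoint under discussion. The kubeconfig path and node name are assumptions.

```go
package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load a kubeconfig from the default location (assumes running outside the cluster).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// GET /api/v1/nodes/<node>/proxy/metrics via the apiserver.
	// Swapping "metrics" for "stats/summary" would hit the Summary API instead.
	raw, err := client.CoreV1().RESTClient().
		Get().
		Resource("nodes").
		Name("node-1"). // hypothetical node name
		SubResource("proxy").
		Suffix("metrics").
		DoRaw(context.TODO())
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
}
```

This is the same proxying that lets a Prometheus server scrape every node through the apiserver instead of reaching each kubelet directly.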
C
I can discuss that more, but I'm more interested not in the Prometheus use of this endpoint, but in the node proxy being the main way that most debugging is done for, I think, the Summary API. I think every case I've seen of someone having a problem with the Summary API...
C
You have an easy way, a one-liner, to always check from any point whether the metrics exposed by the kubelet are correct and whether the kubelet is doing the correct things. I think it also applies to version checking: if you do any debugging from outside the cluster and one node is not working, you can check via the Summary API what version of kubelet is running.
D
I don't understand why this is a SIG Instrumentation thing. This is more than SIG Instrumentation, right?
A
I tend to agree. As I said, I've always discouraged this for the Prometheus use case, but people still make use of it widely anyway, even though it is discouraged in various places. I don't think we should remove it, but I don't think we should necessarily add it to conformance either. But I guess that was kind of the point of all of this, right?
C
Proxy endpoints, compared to normal kubelet access like reading metrics, are considered an administrative activity, and they're basically an attack surface you can easily escalate from. So usually this kind of API should only be used for debugging by an administrator, when everything else fails or the administrator cannot access the node directly.
C
So
us
there
are
a
couple
of
architectures
or
distributions
that
use
this
architecture
like
gke,
so
debugging
via
this
api
is
very
like
useful,
but
also
it
should
never
be
used
by
case
that
is
popular
by
kubernetes
community
sorry
premium
community,
where
they
use
it
as
a
is
like
network
without
skipping
need
to
set
up
proper
networking
or
proxy
because,
like
I
said
before,
the
the
this.
C
Of
so
possibly
it's
expected
behavior.
That
administrator
can
use
that,
but
it's
that's
the
the
sketchy
thing
that
this
is
like.
If
you
we
consider
what
are
the
administrator
use
cases
then?
Yes,
if
we
are
not
considering
like
as
a
standard
debugging
tools
like
ap
api,
that
are
needed
for
debugging
as
part
of
standard
conformance,
then
I
agree
with
with
those
use
cases
not
being
really.
D
It's
directly
from
sorry
I
was
just
reading.
I'm
sorry,
america,
I
didn't
mean
to
interrupt
those.
B
Okay,
so
the
second
question,
then,
is
that
if
signate
is
interested
in
deprecating,
these
endpoints
are
there
use
cases
we
still
want
to
support
through
them?
Are
we
okay,
deprecating
and
removing
them?
Should
we
reach
out
to
any
of
the
like?
If
there's
any
representative
groups
that
are
making
heavy
use
of
them,.
B
Okay, well, my main objective here was to request feedback, so if anyone does have use cases, you can leave that on the issue I linked.
A
It could be pretty much exactly this, except, yeah, I guess it would be a new sub-resource of nodes, which would be metrics, and it just proxies through to the metrics endpoint. I believe that is exactly what people use it for. Not that I think this is a good idea, but if there are too many people using this, it could be a middle ground.
A
Okay, then the second item we have is Elena's.
E
Yeah, so I got a ping from, I think, SIG ContribEx about developer documentation updates; that's the linked issue. Initially I sort of misunderstood the request. Basically they are going through all of the developer documentation, so this is not tied to a release; this is general Kubernetes development documentation. They're basically asking that every SIG own their documentation and try to get it updated, so we don't necessarily have to go and update everything.
E
But if there's anything that is obviously terribly wrong and out of date, like Docker from the stone age or something like that, they have requested that we try to update it so that it's not totally wrong. For the most part, I took a look through things and I think it's mostly up to date.
E
We probably want to add some things, and this is maybe going to be a question for David. We currently have some documentation for metrics and instrumentation, for logging, for how to write events, and then, I think, also the structured logging migration. That's all great, but we might also want to add some for traces, and we might also want to ensure that somebody is assigned to review all of these so they stay up to date.
E
This is not urgent or anything like that, but it would be good. We've definitely gotten a lot of questions like: is this the only documentation there is for the style guide, and what do we do about this metrics stuff? And I know with some of the current KEPs for adding more metrics to various components...
E
We've had some discussions about naming policies and whatnot, so it might also be that it's not just a matter of documenting things; there are also some decisions that we need to make as a SIG and then write documentation for. But I just wanted to bring this up as a general topic, see what people had to say, and see if anybody wanted to jump in on specific tasks.
A
I
think
there
there
was
one
specific
one
that
is
super
actionable
already,
that,
like
anybody
who
wants
to
could
immediately
take,
which
is
that
was
like
linked
out
to
which
is
wrappers
like
for
the
for
the
metrics
framework,
that
we
have
that
we
actually
document
those
in
the
instrumentation
guide,
which
we
don't
today.
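As a rough sketch of what such documentation could cover (the metric names here are hypothetical, and the exact API may differ by Kubernetes version), registering a metric through the k8s.io/component-base/metrics wrappers, rather than the upstream Prometheus client directly, looks roughly like this:

```go
package main

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// demoReconcilesTotal is a hypothetical counter. The wrappers add
// Kubernetes-specific fields such as StabilityLevel, which is the kind of
// thing an instrumentation guide would need to explain.
var demoReconcilesTotal = metrics.NewCounter(&metrics.CounterOpts{
	Subsystem:      "demo_controller",
	Name:           "reconciles_total",
	Help:           "Number of reconcile loops run by the demo controller.",
	StabilityLevel: metrics.ALPHA,
})

func main() {
	// Components today typically register against the legacy (global) registry.
	legacyregistry.MustRegister(demoReconcilesTotal)
	demoReconcilesTotal.Inc()
}
```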
A
It's here. Oh, actually, yeah, Lili also mentioned something in the chat about the global metrics registry. I think this has been something that we've wanted to add for a long time. I think the biggest problem with adding docs about the global metrics registry right now is that everything is still using the global metrics registry, so we would be recommending something that nobody is doing right now in Kubernetes. Well, no.
D
Cubelet
cubelet
uses
non-global
registries
because
they
have
multiple
metrics
endpoints.
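For context, the non-global pattern being referred to looks roughly like the following sketch using the upstream Prometheus client directly; the metric and endpoint names are made up, and the kubelet's actual wiring differs.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// One registry per metrics endpoint, instead of prometheus.DefaultRegisterer.
	coreReg := prometheus.NewRegistry()
	resourceReg := prometheus.NewRegistry()

	demoRestarts := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "demo_restarts_total", // hypothetical metric
		Help: "Example counter registered against a non-global registry.",
	})
	coreReg.MustRegister(demoRestarts)

	// Each endpoint serves only what was registered with its own registry.
	http.Handle("/metrics", promhttp.HandlerFor(coreReg, promhttp.HandlerOpts{}))
	http.Handle("/metrics/resource", promhttp.HandlerFor(resourceReg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```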
A
Yeah, given what Han said, I think you're actually right. We should, since there is precedent already in the code base, and I think it's a really great suggestion.
E
So I guess: how do we make this actionable? We've talked about a bunch of possible different things that we could be doing here. Are there specific things that people want to volunteer for? Do you want to document that against the larger tracking issue, to say that you're doing it? How do we want to proceed with this? Because definitely we need to do all this other stuff, and this is great, but the ask that initially came in was just: please review your existing documentation and make sure it's not, you know, totally out there. I don't know that we've had anybody volunteer for that.
E
I'll put a comment on the overall community issue saying that you're going to do the initial first-pass review. If you need to make any changes, feel free to ping me or other people for review.
E
On those things, I mean, I think you're an approver, so you don't need anybody to approve for you. And then, if anybody else wants to pick up any of those other things, I tried to make a list in the notes in the agenda, so feel free to stick your name beside something, file an issue, work on it, and poke people for reviews. That's totally fine.
A
I think we have a way forward here. Should we go on to the next point on the agenda? Marek, go ahead.
C
Yeah, so Elena suggested bringing this discussion about a recent doc that I wrote regarding metrics-server to the meeting. Overall, I created a document that collects all my thoughts about metrics-server: the problems with metrics-server, how it compares to other solutions like prometheus-adapter, how metrics-server is used mainly for autoscaling and how that compares to custom metrics autoscaling, and, at the end, at least some ideas about what should be improved.
A
I mean, in general, I think Kubernetes as a product has run surveys a couple of times where we needed user feedback. Is that kind of what you're looking for?
A
Yeah, I think this would make sense. Was there anything in particular that you wanted feedback on in terms of this issue? I didn't have time to read through the whole thing.
C
So
yeah,
maybe
that
I
think
that
two
goals
like
for
first
bring
this
document
to
over
elected
to
them
to
to
wider
audience
us
both
maintainers
of
metric
server
agree
that
this
looks
like
a
over
a
good
direction,
and
I
I
think
we
agree,
so
that's
done
and
second,
I
think
we
agreed
that
we
will
reach
out
to
contributex.
I
don't
know
alana
would
be.
How
would
be,
would
you
be
able
to
help
with
that.
E
Yeah
I
can
help
someone.
Let
me
just
I'll
I'll,
assign
myself
to
this
and
then
I'll
just
put
a
note
on
here
that
we
want
to
reach
out
to
contrib
x
to
try
to
run
a
user.
A
Okay,
great,
let's
see,
I
think
we
had
a
last
minute
edition.
E
I think there might be two. Is this the scheduler one and the other one? Okay, let me try to link the PRs, yeah.
B
Okay, so the scheduler one is based on the KEP that Clayton wrote, which I think a lot of you already reviewed. It adds a metric to the scheduler to describe the requests of pods that the scheduler is aware of, and the KEP proposed a naming convention starting with kube_pod_resource_requests, basically to mimic what the kube-state-metrics pod metrics look like. As well, it proposed, not relabeling, but adding rules to the example...
B
At the same time, there's been some interest in the SIG Node community in adding direct pod-level metrics, because if you're running with a sandbox like Kata Containers or something, then you'll have significant overhead for your pod that isn't captured as part of the containers. SIG Node has been talking about that and agreed that it was a good idea to add a metric for that as well. So the question is, one: are we okay with the kube_pod name for the scheduler metrics, and two: for the kubelet's endpoint?
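For reference, the metric being proposed would look roughly like the sketch below. The name follows the convention described here, but the exact label set and registration details are assumptions, not the actual scheduler code.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// podResourceRequests mirrors the kube-state-metrics style of pod metrics:
// a kube_-prefixed gauge describing requests of pods the scheduler knows about.
var podResourceRequests = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "kube_pod_resource_requests",
		Help: "Resource requests of pods known to the scheduler, by resource.",
	},
	[]string{"namespace", "pod", "node", "resource", "unit"}, // assumed label set
)

func main() {
	prometheus.MustRegister(podResourceRequests)
	// e.g. a pod requesting 500m CPU would be recorded roughly like this:
	podResourceRequests.WithLabelValues("default", "web-0", "node-1", "cpu", "cores").Set(0.5)
}
```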
A
So I had a discussion with a couple of you about this yesterday already, but I want to reiterate here: I feel like we have two classes of metrics that we generally tend to expose in Kubernetes, one being about the component itself, and one being about something that we kind of invented within Kubernetes, some entity within Kubernetes, so pod, deployment, et cetera.
A
And
I
think
it
would
actually
be
nice
if
we
had
for
everything
that
was
kind
of
meta,
that
we
had
this
cube
prefix
to
kind
of
signify.
A
This
is
something
within
kubernetes,
but
not
necessarily
about
this
component
right
now,
so
that
that's
at
least
how
I've
always
viewed
all
of
these
metrics
I
mean
I
primarily
only
interacted
with
them
from
from
kubesave
metrics,
because
there
we
actually
totally
consistently
apply
this
pattern,
but
I
think
it
would
be
useful
to
have
this
everywhere,
which
brings
me
to
the
like
one
of
the
previous
points.
If
we
decide
we
want
to
do
this,
this
should
probably
be
in
the
instrumentation
guidelines.
E
Yeah, so I was, I guess, possibly the lone dissenter, I'm not really sure. My concern with this was, well, there were two things. I come at these metrics from an operator's perspective, and when I'm using metrics I need to be able to understand certain things very quickly about them, like where the metric is from, because if the metric is misbehaving, I'm very worried about which component is misbehaving.
E
So if that's not clear from the metric name, that was a concern for me. There was a proposal to use kube_pod for these metrics coming out of the scheduler, but the other problem is that when I see a kube_ metric, traditionally those all come from kube-state-metrics. So as a cluster operator, if I see a kube_pod metric, I'm going to go look at the documentation in kube-state-metrics to find out about that metric.
E
I'm
not
going
to
find
it
there
because
it's
not
actually
coming
from
cube
state
metrics.
So
I
was
just
I'm
confused
about
or
I'm
concerned
that,
like
you
know,
this
might
cause
confusion
that
this
stuff
is
not
really
inherent
to
the
name
anymore,
and
I
think
that
you
know
a
lot
of
these
things
in
terms
of
like
making
like
sort
of
global
naming
like,
as
you
know,
these
global
concepts
and
whatnot
like
they.
D
I had this argument with Frederic a while ago, but I think we can just add a descriptors endpoint to it, so that people can just curl an endpoint and get all of the registered metric descriptors and labels. I think that would make sense.
A
Right,
like
I'm,
not
sure
that
necessarily
solves
the
problem
at
hand.
I
think
the
the
endpoint
could
make
sense,
although,
as
we
discussed
yesterday,
I'm
not
a
hundred
percent
convinced
that
it's
actually
that
much
better
than
the
metrics
great,
but,
like
I'm,
I'm
happy
to
be
proven
wrong
based
on
data,
but
the
the
point
of
like
I
I
feel
like.
We
should
be
consistent
in
one
way
or
another
right
and
right
now.
A
We
definitely
do
not
have
a
consistent
rule
for
this,
and
I
I
would
prefer,
if
we
did.
A
I'm going to pause y'all, because we're at time and we've had at least one person drop. I think we should continue this conversation in our next meeting, and, if it's okay, I really want to quickly make some announcements before people drop, because we've got dates and meeting cancellations and stuff upcoming. Is that cool? Okay, cool. So, dates and stuff: today is code freeze, as mentioned at the beginning of the meeting, so anything we want to get in for 1.20, we've got to do it now.
E
Freeze
is
upcoming
on
november
30th
and
I
think
people
mostly
have
placeholder
doc
pr's
open,
but
make
sure
you
know
to
get
that
stuff
done
and
the
other
announcement
I
wanted
to
make
is.
It
is
the
end
of
the
year
season
where
we
cancel
a
lot
of
meetings.
So
I'm
trying
to
make
sure
that
everything
is
reflected
on
like
the
agenda
doc,
so
you
can
see
which
meetings
have
been
cancelled
and
the
calendar
has
also
been
updated,
but
next
week,
kubecon
no
triage
week
after
us
thanksgiving
no
meeting.
E
The next triage would be, I think, the 16th of December, and then our next normal meeting would be the 24th, which I think is Christmas Eve, so that's probably not going to happen. I'm assuming we'd cancel the meetings on the 24th and the 30th of December.
A
In
regards
to
code,
freeze
I'll,
like
I
guess,
clayton
is
gonna-
try
want
to
try
to
get
this
in
right.
So
are
we
okay
with
saying
just
to
get
this
into
into
the
release?
He
can
keep
the
name
he
wants
to
for
now,
and
there
needs
to.
There
will
still
be
a
decision
made
on
a
more
general
basis,
or
how
do
you
feel
about
that.
F
Personally, I commented, and from at least a kube-state-metrics and an operator perspective, I don't see a problem with using kube_pod. There are always the target labels, like the job label, which describes where the metric originates, so that's how I differentiate between them. So I really don't see a problem with that.
F
And
from
keepsake
metrics
perspective,
we
were
thinking
at
some
point
of
renaming
them
all
with
ksm
in
the
beginning,
with
a
prefix
instead
of
cube.
I
think
we
then
at
some
point
dropped
that,
because
we
already
have
that
convention
and
yeah
people
are
used
to
them,
and
these
metrics
are
supposed
to
be
a
drop-in
replacement
for
the
ones
we
already
have.
So
I
think
it's
a
good
one.
F
Yeah, it's too late for that. We were thinking about it for 2.0, but that's not out yet, and it would be too many breaking changes, as we already had testers. We were considering it, but at some point we dropped it, as there would be way too many alerting and recording rules to rename.
A
Okay,
let's
make
sure
we
have
that
and
then
I
would
say
we
call
it
a
day
and
everybody
have
a
wonderful
local
time.
All
right
see
everyone.