From YouTube: SIG Instrumentation 20200430
Description
SIG Instrumentation Meeting - April 30th 2020
KEPs:
1) Mechanism for warning API clients about deprecated API use
2) Accurate Pod resource request/limit metric
3) Dynamic logs sanitization
A
B
A
A
B
FYI, I just updated it today because there are later changes, but yeah, the release cycle was published. So the enhancement freeze is pushed two weeks to May 19th, and the code freeze is also pushed two weeks, to June 25. Just an FYI for people writing KEPs that they have an additional two weeks to work on them. That's all.
C
D
C
C
A
D
C
A little bit, a little bit. All right, so this KEP has a few aspects. The motivation is that Kubernetes has a lot of things going on, and one of those things is: APIs graduate from beta to GA, and we deprecate and remove the pre-release versions over a fairly long period of time, like multiple releases, sometimes a year. Those deprecations are announced in release notes and in API docs and sometimes blog posts, but there's no great way for users or admins to have visibility into the fact that they're actually using these deprecated APIs.
B
C
Release notes document and keep track of the timelines and everything, when the APIs actually go away. This KEP is proposing a way to surface that to users, and a way to give admins visibility into deprecated API use in their clusters. So this is a demo of what it could look like in practice.
C
They work under some circumstances. This hooks in at the client level, which means that no matter what operations I do when I talk to that endpoint, if I get a warning header back, that header gets processed. It even works for raw requests, where we point kubectl directly at an endpoint and don't do any other sort of client-side processing: the client still sees that warning. So from a user perspective, this is great: if I am using these deprecated things, I get told frequently that this is deprecated.
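Kubernetes ended up carrying these deprecation warnings in HTTP `Warning` response headers (warn-code 299, agent `-`). Purely as an illustration of the client-side processing being described, not the actual client-go implementation, a minimal parser for a single warning-value might look like:

```python
import re

# RFC 7234 warning-value: warn-code SP warn-agent SP quoted warn-text.
# Kubernetes deprecation warnings use code 299 and agent "-".
WARNING_RE = re.compile(r'^(\d{3})\s+(\S+)\s+"((?:[^"\\]|\\.)*)"')

def parse_warning_header(value: str):
    """Parse one warning-value into (code, agent, text), or None if malformed."""
    m = WARNING_RE.match(value.strip())
    if m is None:
        return None
    code, agent, text = m.groups()
    # Undo backslash escaping inside the quoted warn-text.
    text = re.sub(r'\\(.)', r'\1', text)
    return int(code), agent, text
```

For example, `parse_warning_header('299 - "extensions/v1beta1 Ingress is deprecated"')` yields `(299, '-', 'extensions/v1beta1 Ingress is deprecated')`, which a client like kubectl can then print to stderr regardless of what operation produced the response.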
C
This is when it was deprecated, this is when it's going away, this is what you should use instead. So that's the first half of the proposal. What's perhaps more interesting to SIG Instrumentation is the cluster admin perspective: say we were running version 1.21, and I wanted to know if it was safe to upgrade this cluster to 1.22, where this deprecated API is going to go away.
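The admin-side question boils down to: does a requested-deprecated-APIs style counter show any nonzero usage of something removed in the target release? As a hypothetical sketch over already-scraped samples (the label names `group`, `version`, `resource`, `removed_release` are illustrative stand-ins, not necessarily the KEP's exact schema):

```python
def unsafe_apis_for_upgrade(samples, target_release):
    """Given metric samples as (labels_dict, value) pairs, return the set of
    group/version/resource strings that saw traffic but are removed in
    target_release (e.g. "1.22")."""
    unsafe = set()
    for labels, value in samples:
        if value > 0 and labels.get("removed_release") == target_release:
            unsafe.add("{}/{}/{}".format(
                labels.get("group", ""),
                labels.get("version", ""),
                labels.get("resource", "")))
    return unsafe
```

An empty result for the target release is the "safe to upgrade" signal the speaker describes; a nonempty one tells the admin which APIs still have callers.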
C
C
If APIs that are removed in that version were used over some period of time: the last day, the last week, the last month, whatever window they want. And then, if there is usage, the third part that the KEP proposes is adding an annotation to audit events. So if you see usage of these deprecated things, you could then go query your audit logs to track down which specific client or namespace or whatever you need to go fix, and then reset the metric, or watch for the metric no longer growing over time.
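The audit-annotation half can then be consumed by filtering audit logs. Assuming JSON-lines audit output and an annotation key like `k8s.io/deprecated` (the key name here is an assumption for illustration), a sketch of narrowing events down to the offending callers:

```python
import json

def deprecated_api_callers(audit_log_lines):
    """Collect (username, requestURI) pairs from audit events annotated as
    hitting a deprecated API. The annotation key is assumed, not confirmed."""
    callers = set()
    for line in audit_log_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        annotations = event.get("annotations") or {}
        if annotations.get("k8s.io/deprecated") == "true":
            user = event.get("user", {}).get("username", "<unknown>")
            callers.add((user, event.get("requestURI", "")))
    return callers
```

This is the "go track down which specific client or namespace" step: the metric says deprecated APIs are in use, and the audit log says by whom.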
C
That's my demo, that's my spiel. Yeah, one of the things I was talking with Han about: I know we had cardinality issues with some of the other metrics around requests, with lots of dimensions, and so we tried to cut this down to just the version, the resource, the removed-in major.minor version, and the verb. So we dropped some of the really problematic things like the component and the content type and the dry-run flag; we tried to drop all of the things that wouldn't actually make sense and would blow up our cardinality. Yeah.
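The cardinality concern here is simple arithmetic: the worst-case number of time series is the product of the distinct-value counts of each label, so every label dropped shrinks the space multiplicatively. A toy illustration with made-up counts (the numbers are hypothetical, chosen only to show the effect):

```python
from functools import reduce

def worst_case_series(label_cardinalities):
    """Worst-case series count = product of distinct values per label."""
    return reduce(lambda a, b: a * b, label_cardinalities.values(), 1)

# Hypothetical distinct-value counts, just to show the effect of trimming.
full = {"group_version": 40, "resource": 150, "verb": 8,
        "component": 10, "content_type": 4, "dry_run": 2}
trimmed = {"group_version": 40, "resource": 150, "verb": 8}
```

With these made-up numbers, the trimmed label set caps out at 48,000 series while the full set caps out at 3,840,000, an 80x difference from three extra labels.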
C
A
A
A
F
I think it reads better, and also it's more consistent with the way that we've been expressing versions in Kubernetes, right? I mean, we'd have to split that version string and use some library to parse the thing, but on the server side, in metrics ingestion, you can actually just, you know, keep them separate. You could do both if you really want, okay.
C
The issue is, if you use those, then it's on the admin, basically, to go figure out which APIs they should be looking at for a given release, and that's the thing that we have historically had a lot of trouble with. Like: didn't you read the release notes nine months ago? It was on page 17 that we said this random thing was going to be removed in this random release. And sometimes those targets even change from what we said in the release notes.
C
C
G
C
F
G
G
D
G
Reporting two bits of data that are the same data in two different series: you can report all of that data in the existing series, because that series, for a given version of code, is either deprecated or not. There are other things to it, like it makes that metric more complex, which is itself an argument. But would we come up with other ones? The resource-tracking KEP actually had some implications of this, or the scheduling one. So that's a good thing too.
A
Yeah, I think I see the point. Actually, it would almost have the opposite effect: by introducing these metrics we would be adding more metrics and more series, whereas what Clayton was saying is that we could actually expose the same information with the same number of series. Yeah.
C
C
G
It would complicate upgrades a little bit, but for upgrades you already have to take that into account for failover anyway, because you're going to end up with multiple series for the different servers and the different versions, and so aggregate queries already have to take that into account. It's more of a...
G
If we decorate: is there actually more of a use case in other places for us adding more constant labels to series that convey operational information that we previously did not consider adding, like aggregated APIs? I'm not advocating for this at all, but as a theory: at the aggregator proxy, we track all of those API request calls as they pass through the API server. When those are groups that are hosted on an aggregated API, is it potentially valuable for us to track which back-end aggregated API those were going to? I'm not advocating for that.
G
B
F
G
C
A
F
So it is, yes, this is the counterintuitive aspect of it. You have to think of it like a database table: if you create a database entry with a field that doesn't exist, then basically that's going to error out, right? So if you define this additional dimension and that dimension doesn't exist in your back-end... and depending on your back-end, some back-ends are more resilient to it.
A
F
That's true, but if we're going to have this on the request counter metric, I would prefer that we create a good one, like a really good one, without all of this stuff that's in it right now, and then we just do it properly. Haven't we already done that twice? And no, the second time was basically almost exactly the same as the first time.
G
Yep, which was a request count we blew up like fourfold: it just predated SIG Instrumentation, or predated the rest. Yeah, like this one we revved on a bunch, both cardinality explosions and changed labels, but I mean, we have a process now for having stable metrics APIs, so we should practice those. Yeah.
D
D
G
To get consensus. So, just at a real high level: I had spent some time asking a bunch of fundamental questions, like, can I actually figure out what's scheduled on a cluster? What the capacity usage is? We've added node allocatable, which allows us to understand what the capacity of the cluster is, and we have a few metrics in kube-state-metrics that show parts of the lifecycle. I had actually come in from the perspective of reviewing how SIGs were properly reflecting that.
G
That found a couple of bugs along the way, and I got to the point of saying, you know, this is actually hard with the metrics we have today; what can we improve? The proposal kind of unifies this, attempting to look at it from the holistic perspective that there's a set of standard questions that most admins would want to ask about capacity planning, capacity usage, and available capacity over the lifetime of their cluster. It's a little specific to the mindset that someone's got an instrumented
G
monitoring solution in use with the cluster, which I think is a fair assumption. It's effectively proposing (we've kind of gotten at this piecemeal so far) at least a metric we haven't had for the node capacity side, for our various set of resources, and proposing the equivalent for the pod side, in a way that captures the specific complexity of the model. Frederick had done an investigation of whether we could add more metrics to kube-state-metrics that would allow us to build up the metrics individually.
G
As this other dimension, there's a bunch of smaller issues prototyped. I think I'm basically at the point that, if there's no disagreement, this would be something that I would want to propose as a KEP and convert from the Google Doc. But I wanted to get feedback from this group, and I've had some feedback from users that I've tried to capture (the gist, but not quotes), and from some of the other SIGs, though not in the form of "we're going to go endorse this publicly."
G
It was just like, "hey, we really like this too, let us know if there's anything we can do to help." Some people were interested, but I think the feedback phase of using the metrics would be something that will go into the KEP in the proposal form: how do we get people to try these, use them, and offer feedback about whether this actually helps them understand capacity planning and allocation?
D
G
G
G
We had, in kube-state-metrics, something that's O(pods), or sometimes O(4x pods) or whatever, and this could be O(pods). That, for most people most of the time, would allow admins to get pod-, namespace-, and node-level constraints, which is eighty percent of the questions people are going to ask, and it would leave the door open for future subdivision if we needed it, I think.
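The "eighty percent of the questions" (how much is requested versus allocatable, per node) can be sketched with made-up metric shapes; the real kube-state-metrics and scheduler metric names and label sets differ, so this is only an illustration of the calculation admins want:

```python
def node_headroom(allocatable, pod_requests):
    """allocatable: {node: {resource: quantity}};
    pod_requests: list of (node, {resource: quantity}) for scheduled pods.
    Returns per-node remaining capacity: allocatable minus summed requests."""
    headroom = {node: dict(res) for node, res in allocatable.items()}
    for node, requests in pod_requests:
        if node not in headroom:
            continue  # pod on an unknown node; skip for this sketch
        for resource, qty in requests.items():
            headroom[node][resource] = headroom[node].get(resource, 0) - qty
    return headroom
```

Quantities here are plain numbers (e.g. millicores, MiB); a real implementation has to deal with Kubernetes resource-quantity parsing, init containers, and overhead, which is exactly the "specific complexity of the model" mentioned above.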
G
A
What I liked about this, compared to the investigation that I did about using the kube-state-metrics metrics, or potentially adding more kube-state-metrics metrics to do these calculations, is that as things change within the scheduler, that can be reflected in those calculations, and so we don't need to keep up by also writing fairly complex PromQL queries. Or actually, even if you have a separate backend that's not Prometheus, it would not work with those, right?
B
D
We have one more KEP on the agenda, and Pavel has been waiting very patiently, so perhaps let's table this discussion so he can go. I haven't seen any links or any proposed KEPs, so maybe you can speak about that super briefly, with the three minutes that we have left, and we can come back to this discussion. Clayton, it might be good to actually submit that as a Google Doc, as a draft KEP, so more folks can see and comment on it.
B
H
So, I just wanted to join the meeting to give you a heads-up when it comes to the KEP. We are drafting it together with the security team, and the problem which we want to address is the problem of some incidents where we had sensitive data, like tokens, keys, or passwords, leaking into the logs of the controller manager or some other control-plane component. This was pointed out as one of the high-priority things to be fixed in the Kubernetes security audit, which was done like a year ago or so.
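Purely to illustrate the class of leak being described, not the KEP's actual design (which marks sensitive struct fields at the type level rather than pattern-matching formatted output), a naive value-redaction pass over log lines might look like:

```python
import re

# Naive illustration only: blank out the value in common sensitive
# key=value / key: value patterns in an already-formatted log line.
SENSITIVE = re.compile(
    r'(?i)\b(token|password|secret|key)(["\s]*[:=]["\s]*)([^\s",}]+)')

def sanitize(line: str) -> str:
    """Replace values of sensitive-looking keys with REDACTED."""
    return SENSITIVE.sub(lambda m: m.group(1) + m.group(2) + "REDACTED", line)
```

The weakness of this regex approach (false negatives for unanticipated formats, false positives on harmless words) is part of why a tag-based, field-level mechanism in the logging path is attractive.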