From YouTube: SIG Instrumentation 20200528
Description: SIG Instrumentation Meeting May 28th, 2020
A: I think we have one for metrics-server, but we don't have one for, you know, generic stuff or prototyping things, and I thought it might be nice.
B: Do we have precedent for this with API Machinery or something?
A: They're very liberal about creating things like that, I know.
A: The storage migrator stuff was in a SIG repo. Mostly, I think people are trying to get away from putting stuff into the main Kubernetes repo if it's not strictly necessary, and I think that's a desirable pattern — we shouldn't be constrained by kubernetes/kubernetes; we don't have to be.
B: Yeah, no, I agree, especially if this is not something that's intended to ship with everything and everyone — then it doesn't make sense there. I think it would be okay to create a repo. At the very least, we should have a clear purpose for any repo that we create.
B: I think it's okay if we say this is for tooling that we want to prototype and stuff like that, but at the very least it needs to be that, right? Yeah, I think that would be okay. Cool. I do wonder — I think we have repo creation as part of our governance. We do, if I remember correctly, so I think it just needs to —
B: I think there's literally a section in our governance about creating new projects for the SIG, so we do need to follow that, but other than that I don't really see an issue.
A: Awesome. Oh yeah, okay, so the next issue is also mine, which is — Elena assigned a TODO for me, I think. Hold on, let me find it. There's a GitHub issue, but I don't see it on the agenda — actually, it is on the agenda. I'm on my personal computer, because Zoom doesn't work on my other computer, so it's on the previous week's notes.
B: Linked it, I believe: fixing the apiserver_request_total metric before promoting it to stable.
A: Wasn't that the agenda item, or was it — was it stability? Okay, it was stability-to-GA that we wanted to talk about, or there was this other one. I conflated the two, I think.
B: I guess I'm marked for this one, but I think so, then. The next item that we have on the agenda for today is "revisit metric stability KEP graduation criteria" — is that the one you're talking about?
C: I'll add it. Han, let's skip that one until the end.
B: Okay, yeah, so let's go through this one then. Unfortunately, the link and —
B: Yeah, I guess this is actually Han's issue, so this is "moving API server metrics to stable."
A: So yeah, the API server request metrics are arguably the most important ones for cluster administrators to establish SLOs, and there are two metrics specifically which everyone establishes SLOs against: apiserver_request_total and the request duration histogram, which is the new one.
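
(For reference, a rough sketch of the kind of SLI queries built on these two metrics — the metric names are real, but the label choices and percentile target here are illustrative, not the SIG's agreed definitions:)

    # Availability SLI: fraction of requests that did not return a 5xx code.
    sum(rate(apiserver_request_total{code!~"5.."}[5m]))
      / sum(rate(apiserver_request_total[5m]))

    # Latency SLI: p99 request duration for single-resource reads.
    histogram_quantile(0.99, sum by (le) (
      rate(apiserver_request_duration_seconds_bucket{verb="GET"}[5m])))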
A: And there are perpetual issues with these metrics — cardinality issues — and it's also really easy for people to want to just add things to them, because everything hinges off of a request, right? So this thing has like 11 dimensions or something, and a lot of them don't make sense. We've had cardinality issues, we just recently had the security issue with it, and before we turn it stable, I think maybe we should strip it down to its bare essence — what is required for the SLO — so that, one, when we reduce it —
A: — we reduce the probability that it will become a security issue and we have to do something to it. And two, I have this link here where we, as the open source community, have defined SLO —
A: — like SLO-ish type things for Kubernetes, in terms of, like — are you talking —
A: So we actually have no way right now to surface whether or not we're even meeting these things. And I think if we go down this route and make the effort to turn these into true canonical Kubernetes SLO/SLI metrics, then, one, we should reduce the clutter, and two, we should be able to actually guarantee the SLOs that we're saying we will meet.
B: I mean, I think the latter part is SIG Scalability's responsibility, but I agree. The fact that these should be stable metrics is basically already proven by the fact that every time we mess with them, we first check with SIG Scalability, right? Because they will immediately feel the impact the moment we merge it. So I agree with that. There's one thing that I wanted to call out.
B: I don't know if we have Matthias on the call, but exactly based on these SLOs, Matthias has already created a couple of Prometheus rules that follow the Google SRE books' best practices on multi-window, multi-burn-rate error alerts and stuff like that. So I encourage folks to check those out. Can you —
B: That's actually part of one of our SIG projects, which is the kubernetes-mixin.
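
(A minimal sketch of what such a multi-window burn-rate rule looks like; the 99% availability target, thresholds, and windows follow the SRE Workbook's fast-burn example and are illustrative, not the mixin's actual values:)

    groups:
    - name: apiserver-slo
      rules:
      - alert: KubeAPIErrorBudgetBurn
        # Fires when the error rate burns a 99% SLO's budget 14.4x faster
        # than sustainable, confirmed over a long and a short window.
        expr: |
          (
            sum(rate(apiserver_request_total{code=~"5.."}[1h]))
              / sum(rate(apiserver_request_total[1h])) > (14.4 * 0.01)
          )
          and
          (
            sum(rate(apiserver_request_total{code=~"5.."}[5m]))
              / sum(rate(apiserver_request_total[5m])) > (14.4 * 0.01)
          )
        for: 2m
        labels:
          severity: critical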
B: Yeah, thank you, but yeah, I totally agree; we should definitely do this. One potentially controversial thing that I'm going to throw into the room: what if we got rid of the counter metric entirely and only used histograms? Because at the end of the day, they're actually duplicates — histograms already count all the requests, right, and you're going to need the histograms anyway. But you only really care about histograms for succeeding requests — or we would strip histograms to only report succeeding requests.
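
(The duplication is visible in PromQL — a histogram's implicit _count series is itself a request counter. One caveat, and part of the tension above: the label sets differ in practice; the response code lives on the counter and, historically, not on the duration histogram:)

    # Total request rate derived from the histogram's _count series...
    sum by (verb) (rate(apiserver_request_duration_seconds_count[5m]))

    # ...duplicates the dedicated counter, modulo differing label sets:
    sum by (verb) (rate(apiserver_request_total[5m]))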
B: Content type — I guess, yeah, what I'm trying to say is — or you already included that in here — that we should review all of those metrics. I was kind of focused on the request total metric, but yeah.
B: I do wonder — it does still account for the cumulative latency, right? So I'm not sure I would actually exclude it, but I would definitely make sure that we have appropriate metrics that show the webhook latency. And for kind of our control group — SIG Scalability obviously can't have non-standard, or I guess they shouldn't have any webhooks at all.
A: Yes, I think we should definitely have both, because you want the cumulative latency as well as the webhook latencies. And we need to do our best to reduce the cardinality, because if we introduce multiple metrics for these, then obviously you're just doubling it. It's really not that bad, though, considering the number of dimensions that we have — we can easily double these —
A: — if we get rid of a few things, right. But moving forward, we need a set of prescriptions for what to tell people to do if they want to add something that hinges off of this, right? People are just throwing stuff into this one because there is no set of prescriptions.
B: I'm not sure I'm following — are you referring to what you mentioned earlier, that people tend to want to throw anything that refers to a request into this metric?
A: Yes. So if we have guidance that says: if you have core request-related metrics, this is how you should do it — you should create your own metric with your own dimensions, and here is a Prometheus query that you can use to join the two metrics. That way, your metric —
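
(A sketch of what such a join can look like. kubernetes_build_info is a real info-style metric whose value is 1; using it to attach a git_version dimension to the request rate, rather than adding that label to apiserver_request_total itself, is the illustrative part:)

    # Pull an extra dimension in from a separate info-style metric
    # instead of baking it into apiserver_request_total:
    sum by (verb, instance) (rate(apiserver_request_total[5m]))
      * on (instance) group_left (git_version)
        kubernetes_build_info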
F: — point, because in the kubernetes-mixin, what I try to do is actually the count-based availability measurement, where we essentially sum up our requests by verb and count the codes over 28 days, and that basically blows up on almost every other deployment out there, because the cardinality is as high as you just said. So that would be awesome.
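
(The count-based measurement described here, roughly — a sketch; the 28-day window is from the discussion, the exact label matching is illustrative:)

    # Count-based availability per verb over 28 days:
    # the share of requests that did not return a 5xx code.
    sum by (verb) (increase(apiserver_request_total{code!~"5.."}[28d]))
      / sum by (verb) (increase(apiserver_request_total[28d]))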
B: Yeah, I think it's kind of our failure to educate the rest of the Kubernetes engineers on some of these topics. Some people are just new to this world, right, and that's fair and that's fine. We just need to — I guess we need to educate folks a bit more, yeah.
G: So my point is: should we promote — if I think about how I would understand the need for the user client label — there are some cases where people need those things, to have an aggregate or to be able to search which clients or which versions are in use, and I think this is more for debugging, more like structured logging, which gives you debug information.
G: What happened in my system, what events are in my system, what user clients spoke to my system — logs are about that, while metrics are more about defining state. So we should think about this. Some time ago I discussed with David making it easier for a Kubernetes developer to decide if they should write a log or a metric, and I think this falls under our responsibility.
G: We should, yeah, think about writing a big guidance document, because I'm currently writing the migration instructions for structured logging.
G: I looked into the current developer documentation, and I can quickly say that it's pretty lacking — there's mostly no info; metrics are described, but the logging docs are much, much smaller. I think we should think about building a vision for instrumenting Kubernetes: what should be a log, what should be a metric, and where people should look for it when they want to solve some problem —
A: — examples of things that should be in logs, right, because you do want these things, and you can break out log-based metrics or whatever with the right backend and do your counts that way. But there's stuff like that, and there are also things that you might want metrics on which are bounded, right? Like, for instance, we have this dry-run dimension, which is kind of an odd thing to be in an SLI metric, right?
B: Agreed. I think we have a couple more topics on today's agenda, if I'm not completely mistaken, so I think maybe we can each give this a little bit of thought and try to comment on Han's issue here — I would love to move this conversation forward.
B: Sorry, I guess I should have continued to share my screen. We have one more topic, which is housekeeping: wrapping up the metrics overhaul. Ken?
C: Yeah, I just added that to the agenda when I was looking through my GitHub notifications. We've gotten some pings from the enhancements team in terms of what's going on with this in 1.19, whether this is done, and I just wanted to check in with folks on where we are with that.
B: Sorry — I think you recently got the promtool linting stuff to work, right? So I think that's kind of a huge achievement. I think that was a pretty massive rabbit hole. I don't know if everybody followed this, but essentially he had to go up to the Prometheus project and extract things into a library and everything — it was pretty massive, but we finally made it, so yeah, huge props for that. That's awesome.
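
(For context, the linting referred to here can be run against any component's /metrics endpoint; the endpoint address below is a placeholder:)

    # Lint exposed metrics against Prometheus naming best practices;
    # promtool reads the metrics text format from stdin.
    curl -s http://localhost:8080/metrics | promtool check metrics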
B: Thank you, thank you, yeah. So I think we're one step further, but maybe I can share my screen one more time. We do have six minutes, and we can have a look at what else we had —
A: — left. Oh yeah, by the way, I have volunteered RainbowMango for maintainer at KubeCon China — awesome. I think that makes a lot of sense, given their contributions, which we really appreciate.
B: Does anybody know what we still have to do for this one?
H: I think we should add — attach the plan.
C: Down there, there are some comments from the folks managing enhancements.
A: Okay, yeah — if it doesn't break, then I think that what we have is okay.
C: I mean, this KEP is so old, and the scope of this KEP was not really something that a formal test plan might necessarily apply to. So I think we just need to go and do the paperwork and the housekeeping to say: look, this has all already been implemented, it's all working, it didn't break anything; we don't really have a formal test plan, because this was mostly metrics usability stuff. What else do we need signed off on?
B: I mean, we have tests for everything that we implemented, so I don't know exactly what they would be asking for. So, are there any volunteers to get the paperwork done with SIG Testing, then?
C: — following up on this one: RainbowMango, you've been running with this, so if you want to pull me in — if there's anything that you have on your agenda here to do for these folks, let me know, but otherwise I can just go back to them and say: I don't think we need this.
A: It has had regressions — they've had to change stuff — but yeah, I think they've made the changes, and everything works now.
B: Yes, we broke metrics, so they had to adapt things, so in that sense you're absolutely right. But to me it sounds like we're actually now at this step, so I guess we're literally at the point where we should be rediscussing the API server metrics, and that's even outside of the scope of this KEP. So once we get the okay from the other SIGs, I feel like we're done with this.
B: I guess we should not just talk to SIG Testing, but also cycle back one more time with — I think it's SIG Release, or whoever commented on this — just to kind of close the loop with everyone. But yeah, that would conclude what may have been the biggest work item we've ever done as this group, which I think everybody can be a bit proud of. Yay.
A: That's awesome, yeah. Oh, one other thing: we have a lot of things in the pipeline, and Marek just started the logs directory in component-base. So I was wondering if we wanted a top-level directory, mostly because — right now we're flat at the component-base directory level, so we have metrics and we have logs, but if we have an instrumentation directory, we basically don't have that problem anymore. And we're going to have tracing stuff, right — that's for sure, yeah.
B: So I guess, if I'm hearing this correctly, you're suggesting we create an instrumentation directory at the top level and then the three — and then —
G: So, to give context: component-base was proposed in a KEP as a way to refactor all the components, so it's managed by a working group, WG Component Standard, and I would leave the structure of the repository up to them, because they own it. There are lots of SIGs that will be owning different subdirectories, but currently it's up to them to agree, and I will just check, yeah.
A: For sure, we should definitely talk to — like, you know, Stefan — Stefan and Michael, I think, are — yeah, Michael.
B
Yeah
are
the
two
two
main
component
people
I
mean.
We
obviously
want
to
check
back
with
them,
but
it
doesn't
from
the
sick,
instrumentation
side.
I
don't
think
it's
controversial.