From YouTube: SIG Instrumentation 20210624
Description
SIG Instrumentation Bi-Weekly Meeting June 24th 2021
B
All right, I think the recording's on. Welcome, everyone. This is the SIG Instrumentation meeting; today is June 24th, and we have a couple of items on the agenda. Elena, do you want to take the first one, or do you just want me to make the announcement?
C
I threw this on the agenda, and it's mostly to, I guess, pester David, because we only have two KEPs targeted for this release. One of them is the structured logging stuff, and that's being handled by the Structured Logging working group, so we're just deferring to them. The other is tracing; that's our problem. And I saw that David's PR, I think, is ready for review.
C
So I just wanted to make sure we had it on the agenda: that we've got reviewers and approvers assigned from the instrumentation side, and that if we need them from other SIGs, we know who they are, so we can go pester them. Because I know this has slipped, like, three releases or something like that now, and that doesn't feel great.
D
I think we definitely have good coverage from the instrumentation side. We have Lili's review, Frederic has reviewed it, and Han has also taken a couple of passes. Elana, did you review it as well?
D
Okay, with Liggitt's review.
E
Yeah, yeah, I think he's happy to have anyone review it that's not himself.
D
So, one, we need to add the interceptors; it's sort of like the gRPC version of adding an HTTP handler. And then, because we don't want to use globals in the API server, we need to plumb the tracer provider all the way through, so there will be a bunch of functions that it gets added to. And then we're just going to use the standard gRPC interceptors.
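
For readers following along, here is a minimal sketch of what that plumbing looks like, assuming the 2021-era interceptor API from go.opentelemetry.io/contrib's otelgrpc package; the NewAPIServerConn wrapper is purely illustrative, not the actual kube-apiserver code.

package tracing

import (
	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"go.opentelemetry.io/otel/trace"
	"google.golang.org/grpc"
)

// NewAPIServerConn is an illustrative stand-in for the functions the PR
// threads the provider through: the TracerProvider arrives as an explicit
// parameter instead of being read from OpenTelemetry's global, and the
// standard otelgrpc interceptors attach it to every outgoing call.
func NewAPIServerConn(target string, tp trace.TracerProvider) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		grpc.WithInsecure(), // illustrative only; real connections use TLS
		grpc.WithUnaryInterceptor(
			otelgrpc.UnaryClientInterceptor(otelgrpc.WithTracerProvider(tp))),
		grpc.WithStreamInterceptor(
			otelgrpc.StreamClientInterceptor(otelgrpc.WithTracerProvider(tp))),
	)
}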
D
Yeah, so look for those. I hope they will be less contentious than the first one has been, since they're doing fairly straightforward things and there aren't as many things to have opinions about, I think. So yeah, I'll send those around. Yep, that's what I've got. I also signed up to write a blog post and, if you'd like to help... I know I've been doing most of this, so I'm going to assume that I'll write the whole thing, but if anyone's interested and wants to collaborate, I'm happy to work together.
B
Cool, okay, cool. Anything else on that note? If not, then I think this is Lili's point, but I think this is a continuation of our discussion from last week.
A
Yeah, I just remembered we started a discussion, I think in triage, and then we were like, let's park it for this one. I forget which metric it was, but it was...
G
Let me find it, hold on. It's just so hard to find stuff on GitHub, I guess.
B
Found it through triage, I think.
B
So, I like this. What I would ask for is: if we promote a metric like this, could we have some sort of signal of these kinds of requests failing? Because, ultimately, that's what we... like, what is it that we want to detect with this metric, right? And I guess it's something to the extent of them being interrupted for some reason, or... right, like, what is the...
B
So we would need a metric for, like, watches started, I guess, or something like that, because that is the thing that you're worried about, right? If there's a high increase in watches being opened, or the total length of a watch being short.
G
Let me actually look at the metric. What does the metric actually look like?
C
So maybe to summarize, there were a bunch of issues that we saw. Issue one is that we have the duplicated metrics, where one is kind of a subset of the other. And then another is that the name of the one that is the superset has "gauge" in it, so it doesn't comply with the metrics naming policy and we can't make it stable. Was there anything else that was a problem here?
C
We don't have anything that says "we want something like this" and a plan to promote it to stable, I would say. And I guess there's also sort of issue four, which is that we don't own this metric; it's owned by API Machinery.
C
So I guess, given this issue list, does this give us a clear roadmap? Like, we know that we have one that's a subset of the other, so we want to get rid of the subset metric. We know that there's a violation of the naming policy, so we want to fix that. And then we want to make it stable. Is that enough of a proposal?
A
I think one thing that I'm a bit confused by is that we haven't really fully agreed on what makes it stable, at least judging by the previous discussion, right? What makes a metric stable? Because last time I think we discussed that if it's an alert metric or a dashboard metric, then we should make it stable, but we haven't really fully decided, have we?
E
I work backwards. You know, I work from SLOs: metrics that are used in SLOs, or ones that are commonly used in alerts.
A
Yeah, maybe we can just document that somewhere, yeah, yeah.
C
Sorry, sorry, Lili. For this one, I would say it's probably a fair thing to promote to a stable metric, in the sense that if we ask the question of how we use this to quantify an actual symptom that an operator can respond to, we can say: oh look, the tail latency is really high, that's going to be user-impacting, and so they're going to see that. So that would be a problem.
B
Yeah, I think we all agree on the intention. I'm not sure this metric, either of those metrics, actually gives us that today, and so I think that's the only gap that we haven't actually figured out. But we all agree on the intention, so I think that's just the thing that we need to figure out.
E
I do have a plan. My plan is to make it almost parallel to the request metrics, of which there are two: there's the total and there's the seconds. Though in that case we probably didn't even need the total, because the count is already in the histogram. But in this case it's a gauge, so we actually do need it, yeah.
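
As an aside on the count-is-in-the-histogram point: a Prometheus histogram already exports a _count series per label set, so a parallel *_total counter adds nothing, whereas a gauge carries no count at all. A minimal client_golang sketch (the metric name is made up for illustration):

package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// One histogram yields three series per label set: _bucket, _sum, and
// _count. The _count series is itself a monotonic counter of
// observations, so a separate "..._total" counter would be redundant.
var requestDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "example_request_duration_seconds",
		Help:    "Duration of example requests.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"verb"},
)

func main() {
	prometheus.MustRegister(requestDuration)
	go func() {
		for {
			// Each observation also increments
			// example_request_duration_seconds_count{verb="GET"}.
			requestDuration.WithLabelValues("GET").Observe(0.042)
			time.Sleep(time.Second)
		}
	}()
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}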
C
An action, then, to put together a proposal to drive this: first, coming up with a replacement metric, and then deprecating the other two.
B
In hindsight, I'm not sure alpha was ever a good name, because many metrics are just never going to be promoted, right? And alpha seems wrong: giving it that name sounds like it will be promoted at one point or another, yeah.
C
I will say, from the enhancement planning perspective, it's a problem right now that we don't have beta. Because if people add metrics as part of an alpha implementation and they want to graduate their feature to beta, there's no corresponding change promoting those metrics. So then, all of a sudden, they get promoted to stable at the last minute, and there was no instrumentation review in between, and at that point it might be too late for us to give feedback on a thing.
B
So I think what etcd did is very similar to how I think about a lot of the metrics. I think etcd, and correct me if I'm wrong, has a specific debug prefix, where they say: these are metrics that are bound to the code surrounding them, and so when this code inevitably changes, these metrics will probably go away or change significantly. And I would like to have something like this to signify that there is no promotion process for this metric.
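
For context, this is where such a class would plug in: every Kubernetes metric declares a StabilityLevel when it is defined via k8s.io/component-base/metrics, and at the time of this meeting only ALPHA and STABLE existed; a debug or internal level would presumably be one more constant here. The metric itself is invented for illustration:

package metricsexample

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

// StabilityLevel is the hook the discussion is about: ALPHA metrics may
// change or vanish at any time, STABLE ones follow a deprecation policy.
// A hypothetical INTERNAL/DEBUG level would signal "never to be promoted".
var activeWatches = metrics.NewGaugeVec(
	&metrics.GaugeOpts{
		Name:           "example_active_watches",
		Help:           "Number of currently open example watches.",
		StabilityLevel: metrics.ALPHA,
	},
	[]string{"resource"},
)

func init() {
	legacyregistry.MustRegister(activeWatches)
}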
B
I mean, I like debug metrics, but that could just be my predisposition from having worked with etcd for, like, five years, right? But I guess what I was also trying to say, on the same note: I think the majority of metrics that we have are actually of that kind, and the minority are actually the SLO ones, and those should absolutely be on their path towards graduation.
B
Internal? I'm fine with internal as well; it kind of reflects the same notion. And then the other thing was beta, which I guess we still need to figure out.
E
Yeah, well, we would be changing the existing behavior, so yeah, we would have to have a flag that disables debug metrics. Otherwise we could break people.
C
We're keeping an eye on time, because we have one more thing on the agenda. Do we have somebody to own driving, putting together, a proposal for what we want to see here? Because I don't have a great idea of what we want or who's going to make that happen.
B
I'm happy to flesh out the details of each of the steps, because the more I think about it now: if we introduce a debug type, then I'm not sure a beta type is actually still needed, because then the alpha metric is actually the type that we are intending to eventually graduate, which was the thing that we wanted.
C
This is why I think we should keep alpha as alpha and promote all the alpha stuff to beta. But I think we just need a proposal, so I'm going to say in the notes that Han and Frederic can work on that.
H
So let me talk about what I brought up last week. Han said that he really likes globals, so... I am trying to get rid of one global. I don't know if you heard about this.
H
Yeah, it really is not like a one-line global; it became pretty big. So in short, to really graduate alternative logging formats, we need a way to configure and manage them, and to provide users...
H
...give users assurance that this alternative format has the same features, works the same, that it will basically not break you. So for that, we need to have comparable features provided by klog and, here mostly, the JSON format, as we are developing it further.
H
So with 1.22 we implemented, I think, one feature that was missing, which was caller information, so the source code location. Now you get the caller information in your logs, which is great, because there were many complaints, so we know that it's important. But if we want to graduate it, we need to make sure that klog and the JSON format don't go in different directions: that we have core functionality that is delivered by both solutions, and people can add more solutions.
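
To make the parity point concrete, here is roughly what a single structured klog call yields under the two formats; the flag name and the exact output shapes below are from memory of the 1.22-era implementation, so treat them as approximate rather than quoted:

package main

import (
	"flag"

	"k8s.io/klog/v2"
)

func main() {
	klog.InitFlags(nil)
	flag.Parse()
	defer klog.Flush()

	// One structured call, two renderings depending on the component's
	// selected format (approximate shapes, not quoted output):
	//
	// text (klog default, caller information built in):
	//   I0624 10:00:00.000000  1234 main.go:19] "Pod updated" pod="kube-system/coredns" ready=true
	//
	// --logging-format=json (the zap-based backend; caller info was the
	// feature added for parity):
	//   {"ts":1624528800000.0,"caller":"main.go:19","msg":"Pod updated","v":0,"pod":"kube-system/coredns","ready":true}
	klog.InfoS("Pod updated", "pod", "kube-system/coredns", "ready", true)
}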
H
We don't want to over-restrict, so we can add more alternative implementations of logging. One example: someone asked about journald integration for the kubelet. Instead of having text in your journal, since most people run the kubelet under systemd, it would be great if you just got structured logging from the kubelet natively. So yeah.
H
The proposal, basically, is for now maybe overly aggressive: let's deprecate everything that doesn't serve the original purpose of logging, but only deprecate it within Kubernetes' components and leave klog users to their own devices; it's those components we care about. I wanted to first go through SIG Instrumentation for your feedback, and then possibly SIG Arch. Have you talked to Jordan about this? I thought I could, but for now, no.
H
Is there any reason that we could not treat those flags as any other flag? We are not breaking klog users, which is apparently a huge community of people. We just want to deprecate, in Kubernetes, in the core components, those flags that we think are not needed, and we already have a process for deprecation, so we can comply with it. And I guess the first question is: are we okay with dropping flags from klog, or changing the flags for klog?
H
We already did change the flags at some point. Basically, Kubernetes went through standardizing the format of how flags use hyphens versus underscores, and which we dropped. People like...
E
What this talks about is that, basically, klog uses very commonly used flag names that collide with other loggers, and so you can't test two loggers simultaneously, because you have flag collisions.
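
The usual workaround, sketched below, is to register klog's flags on a dedicated FlagSet rather than the global flag.CommandLine, so another logger can claim names like -v without a duplicate-flag panic; this is a minimal illustration, not code from the issue being discussed:

package main

import (
	"flag"
	"fmt"

	"k8s.io/klog/v2"
)

func main() {
	// klog registers flags such as -v, -logtostderr, and -log_dir.
	// Putting them on their own FlagSet keeps them from colliding with
	// another logging library that wants the same names.
	klogFlags := flag.NewFlagSet("klog", flag.ExitOnError)
	klog.InitFlags(klogFlags)

	// A second logger can now safely define its own -v on the global
	// flag.CommandLine without a duplicate-flag panic.
	otherVerbosity := flag.Int("v", 0, "verbosity for the other logger")
	flag.Parse()

	// klog's own settings are forwarded separately.
	_ = klogFlags.Parse([]string{"-v=2"})

	fmt.Println("other logger verbosity:", *otherVerbosity)
	klog.V(2).Info("klog running at verbosity 2")
	klog.Flush()
}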
C
A quick question, because I'm looking at the flags that are being proposed for deprecation, and it's not clear to me why they should be proposed for deprecation for JSON. Looking at things like the log file max size, for log rotation: for JSON logs on disk, you know, you're going to have a fully serializable log line, and then that's the point where you would truncate it; it wouldn't truncate halfway through the line.
C
So I'm not sure why that would be considered not applicable. Similarly for writing to standard error versus standard out: this is a different file descriptor. Why does it matter where we're writing the JSON?
H
So, okay, maybe starting from files: at the beginning of the proposal I said that we should comply with some basic idea, or have some set of rules that we want to comply with, and having the application care this much about logging gives people so much leeway to use it and abuse it, to add more features, like writing to files or changing the frequency of flushing and syncing to disk. I only recently read that apparently klog syncs every five seconds; Kubernetes calls a full sync on all your disks, and a sync on ext3, if you have Linux, syncs all the disks; it doesn't work for single files. So this is basically just to comply with...
H
Those flags would be a nightmare to implement, and we would need to adjust the core. If we just standardize those three flags, it should be enough for people. I understand it would be useful for, I don't know, log rotation, but there are already so many tools that can give you the same experience, and from what I know, at some point...
C
So I have a request, because that's not documented in the issue. My request would be: can we make sure that that, and any motivating information for why we want to deprecate each flag, gets written down? Because we may need to make a case-by-case call on them. If we could include sort of a table or list, with "we don't want this because it's hard for these reasons" and so on, I think that would be helpful in trying to make a decision, because it sounds like we're potentially going to break a lot of backwards compatibility. I know I've seen lots of bugs in terms of, you know, the kubelet possibly not supporting one of these flags properly. So I think people care.
B
We are also two minutes over time, so I think it's a good idea to take that offline. We are two, three minutes over time: is there anything else that people want to say as a last word, or should we adjourn and see y'all next time?
I don't like globals. I was joking; I just wanted to say that to you in the recording, yeah. Just for the record.
B
All right, in that case, have a wonderful local time, everyone. Bye.