From YouTube: SIG Instrumentation 20221013
Description
SIG Instrumentation Bi-Weekly Meeting Oct 13th 2022
Welcome, everyone, to the SIG Instrumentation bi-weekly; today is October 13th. We have a few items on the agenda today. Up first, I don't remember the name behind the username, but someone is asking about the release of metrics-server.
A: Yeah, he would know. We can move this on to the next meeting, and I will ping [the maintainer] and ask him to attend. Cool.
E: Guess I'm up. So we've been working, my team's been working, on metrics, specifically related to usage, for a bit now, and we thought it might be a good fit for the community as a subproject. Katrina and I have especially been talking about metrics related to, you know, utilization, and some of these things that help tune workloads, and she was also interested in it.
E: We thought that the Kubernetes project might be a good place for us to collaborate on these, and I have a doc link that outlines why not just, you know, kube-state-metrics, or why not any of the various other solutions that already exist, which I can kind of walk through here. But the intent is to ask the SIG whether this is something there is any interest in having developed within the community.
B: If there are folks that are new to the meeting, while we're waiting for Phil to come back, I just wanted to let you know that if you feel comfortable, feel free to turn on your camera and say hi. We love having new contributors in SIG Instrumentation, and it's great to see all your faces. Paul, you have a hand up.
C: I would just like to say hello. Hi, everybody; bye, everybody. I did also want to say that I have a pull request to add some basic reconciliation metrics to KCM, which I simply need to reopen and rebase, having been starved for time with numerous distractions, and I am hoping to do that soon. Anyway, nice speaking to you all.
F: ...the stuff he's about to show, so hopefully I can answer some questions. And hey, Katrina, good to see you.

C: Good to see you too. I can introduce myself as well.
C: I worked with Phil and Eva, and I now work at Shopify. I've still been talking to Phil about some of the stuff he's about to present, and Shopify is pretty intrigued by it as well, so I'm here to support the proposal. I mentioned it as well to some of my colleagues who work on platform efficiency and observability, and I see a few of them came here today as well, if they'd like to introduce themselves.
C: Yeah, so my name is Pedro. I'm also currently working at Shopify, on the observability team, and yeah, I'm here because I'm intrigued by this proposal and quite interested in this SIG in general. [inaudible]
C: I am Tai, from Shopify as well, on the platform efficiency team. I don't think you'll be seeing me as a regular here, but I'm also interested in the proposal.
C: And I'll introduce myself to you: I'm Danny. I work with Paul, Eva, and Del, and I worked with Katrina for a bit as well. Hi, Katrina.
B: It's all good. Risha, were you about to introduce yourself?

Yes...
E: Makes sense. All right, I'm Phil Wittrock. I work with, as you know, Danny and Paul, and I've spent time in the community, oftentimes in SIG CLI, also some time in SIG Release. So it's great coming over to this territory and getting to see the folks over here.
E: Okay, so the TL;DR, right, is that we now have a solution that we think solves a problem; it certainly solves a problem for us, and we think it would solve a problem for others. After several rewrites, it's designed from the ground up to be generic and extensible, to really avoid hard-coding any of our specific use cases or metadata, and to have all of that side-loaded through configuration instead.
E: Here's a TL;DR example of something that is relatively difficult to do today and that we're trying to make easy, right: for all the containers in a workload, take a one-second sample interval, and then show me results per container. Or rather, if a pod has multiple containers, say you have your app container and then your logging container, for each one of those...
E: So for the app containers, show me the 95th-percentile CPU utilization over the last five minutes, over all those one-second samples, across all the pods, and then the same thing for the logging container. And the use case for this, right, is: are my requests and my limits set to good values? How might I want to change them?
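To make that concrete, here is a minimal Go sketch of the kind of aggregation being described: a nearest-rank p95 over one-second CPU samples, grouped by container name across all the pods of a workload. The types, names, and numbers are illustrative, not taken from the project under discussion.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is one one-second CPU reading from a container (illustrative type).
type Sample struct {
	Container string  // e.g. "app" or "logging"
	CPU       float64 // cores used during the one-second interval
}

// p95ByContainer groups samples by container name and returns the
// 95th-percentile CPU usage for each group.
func p95ByContainer(samples []Sample) map[string]float64 {
	groups := map[string][]float64{}
	for _, s := range samples {
		groups[s.Container] = append(groups[s.Container], s.CPU)
	}
	out := map[string]float64{}
	for name, vals := range groups {
		sort.Float64s(vals)
		// Nearest-rank percentile: index ceil(0.95*n) - 1.
		idx := (len(vals)*95+99)/100 - 1
		out[name] = vals[idx]
	}
	return out
}

func main() {
	samples := []Sample{
		{"app", 0.42}, {"app", 0.51}, {"app", 1.90}, // the app container spikes
		{"logging", 0.05}, {"logging", 0.07}, {"logging", 0.06},
	}
	fmt.Println(p95ByContainer(samples))
}
```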
E: Getting this at a one-second sampling interval is a much finer grain than, say, the 15- or 30-second sampling interval you might otherwise get. So there are a couple of things here that this provides that are challenging to get today. One is that finer-grained sampling interval; another is resolving those containers up into their workloads, rather than having to join against...
E: ...you know, what is the owner, and then what the owner of that is, from ReplicaSet to Deployment and these sorts of things, all through PromQL queries, which is actually really quite challenging to do, and actually very expensive. If you have a sufficiently large number of containers, you don't want to be doing these massive joins. And the sampling interval is not one second, right; Prometheus is not the right storage mechanism, in my opinion, for storing time series at a one-second granularity.
E: That would definitely blow up Prometheus. So instead, have it scraped beforehand at a one-second interval, and then export to Prometheus, over, say, a five-minute window, what the 95th percentile is, or the max, or the mean, or the median, these sorts of things. That's kind of how we describe it, again.
E: This gets exported to Prometheus, so it's kind of a preprocessor for Prometheus, to improve performance and offer a number of things. The configuration says: okay, grab me CPU and memory, I want container utilization, and here's what I'm aggregating against, so keep the container. These are the labels I want to keep; there are a bunch of labels attached to this, like the node, for instance, and the priority class, and all these sorts of things. Drop all of those, keep these labels instead, and then give me the p95 for everything that has the same set of labels. So that's the quick TL;DR example of how you would express, in our configuration, how to get this metric, and then that metric gets exported to Prometheus.
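The actual configuration format isn't shown in the meeting, but going by this description, its shape might look roughly like the following Go struct. Every field name here is a guess, not the project's real schema.

```go
package main

import "fmt"

// MetricConfig mirrors the configuration described above: which resources
// to collect, at what level, which labels to keep (everything else, such
// as node or priority class, is dropped), and which aggregations to
// compute for every distinct set of kept labels. All names hypothetical.
type MetricConfig struct {
	Resources    []string // e.g. "cpu", "memory"
	Level        string   // e.g. "container"
	KeepLabels   []string // labels retained for grouping
	Aggregations []string // e.g. "p95", "max", "mean", "median"
}

func main() {
	cfg := MetricConfig{
		Resources:    []string{"cpu", "memory"},
		Level:        "container",
		KeepLabels:   []string{"workload_name", "workload_kind", "container"},
		Aggregations: []string{"p95"},
	}
	fmt.Printf("%+v\n", cfg)
}
```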
E: And if you want to do additional PromQL on that, that's very possible, like an average over time over a window of a day or a week, or something like that.
E: And that is a configuration option. The way this is implemented is that there's a scraper on each node that scrapes every second, stores the samples in a ring buffer, and then pushes them to a central service over an interval. The scrape interval is a configuration option: is it every second, every tenth of a second, every two seconds? The size of the ring buffer is a configuration option.
E: Do you want to store ten minutes' worth of samples, five minutes' worth, one minute's worth in that ring buffer before you start expiring them? And then how frequently you push them to the main collector is a third configuration option. So you could, for instance, store five minutes' worth of samples but push them every minute, so that while the collector has a five-minute window, it also has the most up-to-date set of metrics when it does get scraped; they're not five minutes old, for instance.
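A minimal sketch of that per-node loop, under the stated assumptions: a one-second scrape, a five-minute (300-sample) ring buffer, and a one-minute push interval. readCPUSample and send are hypothetical stand-ins for the node-local read and the push to the central service.

```go
package main

import (
	"fmt"
	"time"
)

// ring is a fixed-size buffer; once full, the oldest sample is
// overwritten, which is how old samples "expire".
type ring struct {
	buf  []float64
	next int
	full bool
}

func newRing(n int) *ring { return &ring{buf: make([]float64, n)} }

func (r *ring) push(v float64) {
	r.buf[r.next] = v
	r.next = (r.next + 1) % len(r.buf)
	if r.next == 0 {
		r.full = true
	}
}

// window returns the currently buffered samples, oldest first.
func (r *ring) window() []float64 {
	if !r.full {
		return append([]float64(nil), r.buf[:r.next]...)
	}
	return append(append([]float64(nil), r.buf[r.next:]...), r.buf[:r.next]...)
}

func main() {
	const (
		scrapeEvery = time.Second      // configurable scrape interval
		keep        = 300              // five minutes of one-second samples
		pushEvery   = 60 * time.Second // flush to the collector every minute
	)
	r := newRing(keep)
	scrape := time.NewTicker(scrapeEvery)
	push := time.NewTicker(pushEvery)
	defer scrape.Stop()
	defer push.Stop()
	for {
		select {
		case <-scrape.C:
			r.push(readCPUSample()) // hypothetical node-local read
		case <-push.C:
			send(r.window()) // hypothetical push to the central service
		}
	}
}

func readCPUSample() float64 { return 0 } // stub
func send(w []float64)       { fmt.Printf("pushed %d samples\n", len(w)) }
```

Because the buffered window (five minutes) is larger than the push interval (one minute), the collector keeps full coverage while still seeing fresh samples, which is the trade-off described above.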
E: So all of those are configuration options. We chose these particular numbers based off, well, the one-second scrape I think is from the Google Autopilot paper.
E: I believe that was the interval chosen there, and so we said, well, we don't have a better interval that we have a strong feeling for, so we chose that. I think the five minutes was chosen because it matched the scrape interval we were using for Prometheus at the time; having the window match made sure that each scrape had all of the samples in it. That's how we chose those values, but they're not fixed.
E: I mean, no, I don't see it as being part of either one of those. It's possible, but I think it has a pretty different set of goals. kube-state-metrics publishes a metric for everything, and it allows you to do a lot of really cool stuff.
E
But
to
do
simple
like
to
do
this,
for
instance,
you
would
need
to
do
a
lot
of
joins
and
still
can't
get
at
the
second
interval,
and
so
it'd
be
a
pretty
big
I
think
architecture
shift
for
either
one
of
those
and
introduce
maybe
a
lot
of
unnecessary
complexity
and
those
called
other
problems
this
doesn't
solve.
This
is
really
just
focused
on
capacity
and
usage
metrics,
and
so
introducing
that
complexity
and
Coop
State
metrics
I.
Don't
think
necessarily
provides
a
lot
of
benefit.
E: It's just not in this target, right. You can use kube-state-metrics to get not this, but something else: you can get it over a longer interval, without the workload metadata attached. I think maybe that's where... and that's actually how we started out. Initially we started out with kube-state-metrics, and we spent some time doing the PromQL joins to try and make that work, before eventually settling on this solution.
B: Yeah, he also said metrics-server, which makes sense. There's another question in the chat: where is the data coming from? The kubelet, cAdvisor, containerd, or directly from the process?
E: Fantastic question. So today... this is not, I don't think, something that's fixed in time, and we'd love to get feedback on better ways of doing it. We explored a number of different options. I think you're talking about the utilization data and the one-second scrapes specifically, is that right?
E: We tried different options, and it turns out that walking the cgroup file system isn't particularly complex, and using some of those other tools is not particularly simple. This worked out very well: it was a small amount of code to do, and it doesn't need a lot of permissions. Mounting, say, the containerd socket would give us access to things we don't necessarily want to be able to do, whereas mounting the cgroup file system read-only really limits our capacity to do malicious things.
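For illustration, here is roughly what a read-only cgroup walk can look like on a cgroup v2 host: read the cumulative usage_usec counter from a cgroup's cpu.stat twice, one second apart, and the delta gives cores used. The cgroup path is an example and varies by node setup; this is a sketch, not the project's actual collector.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// usageUsec returns the cumulative usage_usec counter from a cgroup v2
// cpu.stat file (total CPU time consumed, in microseconds).
func usageUsec(cgroupPath string) (uint64, error) {
	f, err := os.Open(cgroupPath + "/cpu.stat")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 && fields[0] == "usage_usec" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("usage_usec not found in %s/cpu.stat", cgroupPath)
}

func main() {
	const cg = "/sys/fs/cgroup/kubepods.slice" // example path; varies by distro
	before, err := usageUsec(cg)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	time.Sleep(time.Second)
	after, err := usageUsec(cg)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	// Microseconds of CPU consumed over a one-second wall interval equals
	// cores used during that second.
	cores := float64(after-before) / 1e6
	fmt.Printf("%s used %.3f cores over the last second\n", cg, cores)
}
```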
E
So
that's
that's
what
we
ended
up
doing
that,
but
we're
not
we're
not
stuck
on
this,
and
it
doesn't
work
well
for
some
things.
Like
micro
VMS,
we,
you
know
you
can't
get
at
the
container
metrics
and
micro,
VM
runtime,
so
I
I
think
we're
going
to
continue
it
to
evolve.
How
do
we
get
these
metrics
and
maybe
have
different
options
available
where
you
select
in
the
config?
Where
do
you
want
to
get
the
metrics
from.
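That "select the source in the config" idea could be modeled as a small interface with swappable backends. This is purely a sketch of the design choice; none of these types come from the project, and the backend bodies are stubs.

```go
package main

import "context"

// Sample is one utilization reading for a cgroup or container.
type Sample struct {
	Target string  // cgroup path or container ID
	CPU    float64 // cores used during the sample interval
}

// Source abstracts where samples come from, so a cgroupfs walker, a CRI
// client, or a microVM-specific backend can be swapped via configuration.
type Source interface {
	Sample(ctx context.Context) ([]Sample, error)
}

type cgroupfsSource struct{ root string }  // walks /sys/fs/cgroup read-only
type criSource struct{ endpoint string }   // queries the CRI stats API

func (cgroupfsSource) Sample(ctx context.Context) ([]Sample, error) { return nil, nil }
func (criSource) Sample(ctx context.Context) ([]Sample, error)      { return nil, nil }

// newSource picks a backend from configuration.
func newSource(kind string) Source {
	if kind == "cri" {
		return criSource{endpoint: "/run/containerd/containerd.sock"}
	}
	return cgroupfsSource{root: "/sys/fs/cgroup"}
}

func main() { _ = newSource("cgroupfs") }
```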
F
Yeah
I'd
be
very
interested
in
it
and
also
like
getting
in
touch
we've
been
kind
of
exploring
in
GK
metrics.
If
we
can,
we
get
these
metrics
in
high
resolution
and
where
we'd
get
them
from,
and
that
was
also
what
we
looked
at
like
skipping
coupon
State
advisor
directly
looking
at
the
file
system
and
watching
that
as
well
to
get
what's
the
most
resolution
possible.
E
One
nice
thing
too,
is
that
this
example
is
just
for
containers,
but
we
also
have
the
C
group.
We
have.
We
scrape
the
c
groups
on
at
a
node
level.
So
there's
two
Dimensions
there's
like
the
workload
Dimension,
which
is
spread
across
a
bunch
of
notes,
right
and
then
there's
another
dimension,
which
is
like
the
the
no
Dimension
and
So
within
the
node
Dimension
you
make
Halo.
How
much
is
the
system
Slice
versus
the
coup,
pods
C
group,
using
right
within
the
system?
Slice
do
I.
E
Have
some
process
like
that
kicks
up,
you
know,
is
it
really
the
kublet?
That's
the
most
or
do
I
have
other
processes
in
there
and
what
do
they
Spike
at
and
we've
seen
some
interest,
so
we
scrape
those.
We
also
export
those
p95s
Maxes,
those
sorts
of
things,
and
we
have
seen
interesting
results
where
you
know.
There's
certain
processes
running.
E
You
know
in
the
system
reserved
that
on
average
don't
take
a
lot.
But
if
you
look
at
a
one
second
granularity,
they
they
Spike,
like
they
actually
take
a
ton
at
that
very
small
interval.
There's
others
that
you
know
just
on
average,
take
a
significant
amount,
but
never
Spike
and
having
an
understanding
of
that
I
think
is
is
definitely
interesting.
F
One
other
thing
you
might
be
that
might
be
interesting
to
look
at
that
we've
been
at
least
discussing
if
it
would
be
possible,
is
looking
at
it's
some
way.
Looking
at
om
killer
events,
because
that's
like
the
final
point
you
get
like,
how
much
did
you
use
because
before
it
got
killed
and
that's
what
we
usually
Miss
100.
E
Umkiller
CPU
throttling
for
C
group
C2
various
like
memory
pressure
events,
all
that
stuff
I
think
we
want
to
look
at
and
some
of
that
stuff
is
like
partially
wired
in,
but
not
just
not
complete,
but
but
we're
certainly
working
towards
getting
all
that
sort
of
stuff.
And
then
you
can
build
like
you're
saying
a
collective
picture
of
like
you
know,
maybe
maybe
my
P95
usage
was
well
below
my
request,
but
I'm
still
getting
throttled
right
and
possible
thing.
A: We have five minutes left. So you're proposing, basically, to turn this into a SIG project?
A: Okay. Does anyone have any objections to this being a part of the SIG?
D: I do have an objection, to be honest, because essentially I feel like it's trying to replace the existing ways we have to collect usage, which just aren't well optimized yet. We've seen that it doesn't scale well to get the metrics from the kubelet directly, because cAdvisor is not optimized, and I feel like this would just be a shortcut to not optimizing cAdvisor.
C: There is a related SIG Node enhancement that they're working on, to get all the cAdvisor metrics through the CRI.
B: The CRI stats one. I think this is complementary, because this is only going to be, I think, a subset of those stats, and at a much more scalable level. With the CRI stats... well, first of all, the CRI stats aren't everything from cAdvisor either, that's also a subset, but I think this is looking at much more granular data than the CRI stats proposal currently is.
B: Yeah, I did see that the KEP had been updated, but it wasn't clear to me whether they were planning on replacing all of them or not.
E: I want to be clear: the value here isn't that we have another way of getting the metrics from the node. That is not the value; that's just the "how." The proposal is: how do we get one-second-granularity metrics for containers within a workload, for instance, or aggregated by cgroup? I want to see a histogram of the cgroups within a node, across a cluster, and I'm not...
A: ...so this wouldn't go into KSM in the near future, realistically?
D: And even then, I don't think that would fit the purpose of KSM. My point, personally, is that we already have this kind of granularity through the resource metrics endpoints: we already have a way to scrape pod usage and container usage more often. The only limit today is on the kubelet side; it was shown that it's impossible to get down to one second because cAdvisor is not optimized.
D: But from the other side, I'm also thinking that, even with a one-second scrape interval, in my opinion, Prometheus should be more optimized than whatever solution we come up with, because even if the scrapes are done more often, the data is only written to disk after a certain number of scrapes, and Prometheus is optimized for searching the data and then applying the actual mathematics.
F: The Summary API is still not really a proper public API; there are no clients for it. People don't have an API for that which we actually tell them to use. We just use it ourselves, to my understanding, for the resource metrics.
E: I have a question. Let's say we switched to whatever preferred way of getting the metrics on the node, and I have no opinion about what that is: the container runtime interface, or a cAdvisor we resurrected from the dead, whatever it is. If we just switch to that, does that address the "now we have two ways of doing this thing" concern?
A: I think the question is: does it make sense to roll some of this into the kubelet directly?
E: It's going to be hard to roll in the aggregation aspect; I didn't really cover all the use cases, but there's the aspect of collecting all the samples from all the nodes and then sorting them. So let's say you have 300 samples per container, and you have 100 pod replicas for a deployment, spread across, say, two ReplicaSets because it's mid-rollout. By collecting those 300 samples per container times 100 pods, you've got 30,000 samples there.
A: Oh, by the way, we're having a face-to-face at KubeCon on Monday, so...
A: I think 10:15 or 10:30 in the morning, on the Monday of the week of KubeCon. And yeah, we should definitely talk about this there, I think.
A: I am personally okay with rolling this into the SIG, I think.
B: We clearly have a lot of interest, and I think a lot of people who are looking to contribute, and as an out-of-tree thing I don't really have any concerns about putting it in the SIG's projects and seeing what happens. We would just need a list of people for approvers, reviewers, repo admins, that kind of thing, and we can POC it. And if it turns out that this isn't the best fit, then we can deal with that once we can actually, you know, see and work with the code, yeah.
E: Yeah, why don't we do that. My suggestion would be, I'd be fine with, say, a three-month evaluation period; it's harder to evaluate when you can't see the code or you can't run it.
E: And being here, you can Slack me, pwittrock on the Kubernetes Slack, if you're interested, or add to the agenda, and then I'll do my best to make sure we reach out to everyone that has interest. And yeah, I'll get you a list of names. Great.
C: I think that's it for the agenda. So thanks, everyone, for joining, sorry we ran long, and see everyone, or at least some of you, at KubeCon.