From YouTube: SIG Instrumentation - usage-metrics-collector deep dive
B
Recording to the cloud. Welcome, everyone, to today's edition of the usage-metrics-collector subproject meeting, not the regular SIG Instrumentation meeting. It is Wednesday, February 8, 2023, and I believe we're going to kick it off with a very exciting demo, then follow up with a presentation, and we'll have opportunities for people to ask questions. Today is going to be a deep dive on the usage-metrics-collector subproject, so Phil, please take it away with the demo.
C
Sure, I'll just start with the demo here. I'm going to start out with the end result, which in this case is this Prometheus instance. I've got a cluster running over here; it's just a random GKE cluster I created, nothing too special about it, except that it's using cgroups v1. I deployed our application with kubectl apply, port-forwarded, and I've got this Prometheus instance running as part of it, so this should be pretty familiar to everyone here.
C
What's interesting, I think, is that these are some of the new metrics published by the collector, and these are the utilization metrics, which are the more interesting of them, because you don't get quite the same thing when you're using kube-state-metrics or cAdvisor.
So, for instance, we have the max utilization for a workload as one of the metrics here. What this one is: it's at a one-second sampling, so you sample the CPU every second, and in this case you get five minutes' worth of samples, so that's 300 samples.
C
What was the max sample over that five-minute period? So in this case there's the workload name here, and you'll see it has the container, right? In this case there's just one container, and then various other labels associated with this particular workload. It's running as a DaemonSet.
C
This is the name. So across all the containers, or rather all the instances: there are the 300 samples per five minutes, and then there are the three replicas, one for each node, in which case you have 900 or so samples every five minutes.
C
What was the max sample that we saw? Okay, so that's about 0.1 CPU, 0.09 CPU, right? But then you also have these other ones, like: what was the P95 sample of those thousand or so, what was the median of those thousand or so, and what was the average of those thousand or so?
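The window statistics described here (max, P95, median, and average over the buffered one-second samples) amount to something like the following sketch. The function name and the nearest-rank percentile choice are illustrative, not the collector's actual API.

```go
package main

import (
	"fmt"
	"sort"
)

// windowStats computes the aggregates described above over a buffer of
// one-second CPU samples (e.g. 300 samples for a five-minute window).
func windowStats(samples []float64) (max, p95, median, avg float64) {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	n := len(sorted)
	max = sorted[n-1]
	// Nearest-rank percentile; a real implementation may interpolate.
	p95 = sorted[(n*95+99)/100-1]
	median = sorted[n/2]
	var sum float64
	for _, s := range samples {
		sum += s
	}
	avg = sum / float64(n)
	return
}

func main() {
	// 300 samples: mostly idle, with one short spike to 0.09 CPU.
	samples := make([]float64, 300)
	for i := range samples {
		samples[i] = 0.01
	}
	samples[150] = 0.09
	max, p95, median, avg := windowStats(samples)
	fmt.Println(max, p95, median, avg)
}
```

With a single one-second spike, the max reflects it while the P95, median, and average barely move, which is exactly why the different aggregates are published side by side.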
C
So, you know, it automatically does the walking from the pods into the workloads to resolve all the replicas and such. That's, I think, one of the more interesting examples of what this can do. One thing it's doing as well: I just wrote this, and this code isn't checked in yet, but I wrote it just for the demo.
C
You
know,
searching
for
workloads
that
are
maybe
poorly
tuned
or
what
are
your
workloads
that
are
having
the
highest
requests
or
what
are
workloads
that
are
using?
You
know
really
requiring
on
their
limits
a
lot
more
than
than
maybe
you
want,
and
so
in
this
case,
I
just
did
a
pivot
and
so,
and
it
just
has
a
sidecar
that
dumps
that
data
into
bigquery
I
also
have
it
dumping
into.
C
The
cloud
storage
over
here,
so
if
you
want,
for
instance,
to
do
other
sorts
of
analytics
I,
think
a
lot
of
what
we're
trying
to
do
is
making
it
like.
How
do
we
make
it
simple
to
to
do
sophisticated
stuff?
With
this
that
may
be
prom
ql
queries
are
just
you
know
not
well
situated
to
do
oftentimes.
You
know
you
get.
C
If
you
want
to
look
at
maybe
30
days
worth
of
data
prom
doing
something
sophisticated
in
prom
ql
is,
it
starts
to
hit
its
limitations,
and
so
that's
why?
Having
archives
of
this
data
or
or
putting
them
in.
C
Sources
gives
you
capabilities
that
you
wouldn't
otherwise
have
and
then
yeah
that's
that's
kind
of
the
main
bits.
I
can
maybe
just
walk
through
quickly.
C
So we've got a couple of samplers here; these are the things sampling on each of the nodes. Then we've got this collector, which does the aggregation, and our Grafana and Prometheus instances. If you want to just debug and see what's going on, you can, of course, just curl that collector instance and it'll dump all the stuff out there.
B
So
Phil
just
to
avoid
defining
things
by
like
their
name,
so
the
the
node
sampler
is
the
thing
on
each
of
the
nodes
that
is
looking
at
all
of
the
workloads
and
collecting
utilization
metrics.
And
then
the
collector
is
the
thing
scraping.
Those
Samplers
and
doing
cool.
B
I,
don't
know
if
that
that
helps
anyone,
but
I
heard
like
the
sampler
is
sampling
and
I'm.
Like
yes,
I
know
exactly
what
that
means.
You
might.
B
Let me quickly just read the question in chat: how is metrics data exported into BigQuery? Does it upload a metrics data file into BigQuery?
C
Yeah-
and
that
was
done
like
as
as
like
what
would
be
cool
to
do
as
part
of
the
demo
right
and
and
so
I
wrote
a
script
that
just
uses
the
go:
laying
bigquery
there's
a
load
API
that
loads
Json
and
the
trick
there
was
that
you
have
to
update
the
schema.
So
one
thing
with
bigquery
is,
if
you
add
a
label
right
to
your
metrics,
and
then
you
try
and
push
that
into
bigquery.
It's
gonna
this.
The
query
has
to
have
those
additional
labels,
and
so
it
basically
just
reads
the
so.
C
The
collector
dumps
there's
an
option
in
the
config
for
The
Collector
to
dump
out
a
Json
file
each
time
it
does
a
collection
right
into
that.
Json
file
is
actually
a
format
that.
C
Can
read:
Norm
natively,
it's
just
new
line,
delimited
Json
Blobs
of
the
of
the
metrics,
and
so
what
we
do
is
just
or
what
that
that
sidecar
does
is
just
reads
those
files.
You
know,
com
list,
the
directory
every
minute
or
so
reads
those
files
looks
at
what
all
the
labels
are
infers
the
schema
and
then
sends
a
load
command
to
bigquery
with
that
schema,
and
then
it
will
automatically.
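As described, the sidecar's key step is inferring the schema from the newline-delimited JSON dump before issuing the load job. The BigQuery call itself needs the cloud client library, but the schema-inference step can be sketched with the standard library alone; the function and field names here are illustrative, not the sidecar's actual code.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// inferFields scans newline-delimited JSON metric rows and returns the
// union of keys seen, i.e. the columns a load job's schema must contain.
// A new label on the metrics shows up here as a new column.
func inferFields(ndjson string) []string {
	seen := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(ndjson))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var row map[string]any
		if err := json.Unmarshal([]byte(line), &row); err != nil {
			continue // skip malformed lines
		}
		for k := range row {
			seen[k] = true
		}
	}
	fields := make([]string, 0, len(seen))
	for k := range seen {
		fields = append(fields, k)
	}
	sort.Strings(fields)
	return fields
}

func main() {
	dump := `{"workload":"ds1","container":"app","value":0.09}
{"workload":"ds1","container":"app","node_pool":"default","value":0.01}`
	fmt.Println(inferFields(dump))
}
```

The second row introduces a `node_pool` label, so the inferred field list grows accordingly, which is the schema-update behavior described above.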
E
Yeah, I was just wondering what kind of cardinality you have tested, because workload names are unbounded and you can theoretically have huge clusters, right, with many...
C
So,
where
we're
like,
if
you,
if
you
look
at
what
we're
trying
to
replace
what
we're
trying
to
replace,
is
doing
a
c
advisor
lookup,
all
the
pods
joining
the
utilization
of
that
with
the
Pod
info
with
KSM,
then
with
the
Pod
labels
from
KSM,
then
with
the
node
info
from
KSM,
then
with
the
node
labels
from
KSM
right
and
the
namespace
labels
from
KSM
right
and
you're,
basically
doing
a
massive
join
of
all
the
pods.
C
You
know,
and
so,
if
you
look
at
I'd,
say
we're
looking
at
more
at
the
cardinality
that
we're
reducing
right,
which
is
you
know,
orders
of
magnitude,
Maybe
100x,.
C
I'm
gonna
I'm
gonna
refrain
from
from
saying
the
exact
cardinality
just
because
I
can't
remember
what
we've
we're
pretty
clear
to
talk
about
in
terms
of
what
we're
doing.
B
A
question
in
chat,
which
I
think
was
just
kind
of
covered.
Maybe
this
will
be
covered
later,
but
to
what
degree
is
this
intended
to
replace
the
advisor
metric
collection?
It
doesn't
need
to
entirely
replace
it,
but
it
can
be
used
as
an
alternative
mechanism
to
see
advisor.
It
looks
at
the
same
things
that
c
advisor
does,
but
is
a
little
bit
less
CPU
intensive,
and
only
because
it
only
looks
at
a
small
subset
of
what
C
advisor
can
collect
it's
just
looking
at
CPU
and
memory.
C
It's meant to replace the thing people really hate doing with cAdvisor, which is those really complicated joins across really high-cardinality data, to get something like the workload metric I demonstrated. How do you figure out the workload? Well, now you're joining with the owner reference, but the owner reference from KSM isn't enough; walking from pod to ReplicaSet to Deployment, and maybe to something higher, is actually a really, really complicated query to do. So I'd say the scope is limited to replacing the pieces that cAdvisor really isn't trying to do. Would I use this to replace the one thing cAdvisor does, where I want to debug a pod, because there's one pod that has too much CPU, just one pod, not the workload, not broader patterns?
C
We
we
categorically
have
said,
that's
really
not
what
we're
trying
to
do,
and
so
for
for
that
use
case.
We
we
I've
advised
people
yeah
continue
to
use
the
advisor
for
that.
Don't
come
looking
at
our
things,
for
it.
B
So
we
do
have
one
more
question
in
the
chat,
but
I
think
that
one
might
get
answered
in
the
presentation.
So
why
don't
we
go
and
do
the
presentation
and
then
we
can
come
back
to
q.
A
I
also
wanted
to.
Let
folks
know
like
I'm,
happy
to
read
your
questions
and
chat,
but
also
like
feel
free
to
put
your
hand
up,
and
you
can
unmute
and
ask
the
question
as
well,
but
without
further
Ado
About
You
want
to
take
us
away
with
your
presentation.
B
D
Sweet. So, just a little bit about the design of the collector: the main piece is the actual collector itself. Let me find my mouse here; I can't find it, oops.
D
Hi
this
is
my
mouse
disappear,
oops,
well,
I
guess:
I
can
go
into
presentation
mode,
because
I
can't
see
my
mouse
anyways,
the
so
the
main
pieces
of
The
Collector
itself
it
takes
in
a
config
and
I'm
gonna,
develop
a
lot
deeper
into
the
config,
because
it's
how
it
sets
up
all
the
aggregations
that
Phil
showed
so,
for
example,
you
would
set
the
resources,
the
extension
labels,
the
aggregations
themselves,
the
local
samples
that
Phil
was
pushing
to
bigquery,
sidecar,
metrics
and
so
on.
D
And
the
second
piece
is
the
node
sampler:
that's
what
actually
gets
utilization
metrics
from
the
nodes,
be
it
using
container
D
or
crawling
the
C
group,
the
C
group
slot
system,
and
also
you
have
a
controller
piece.
That's
what
gets
your
resources
from
the
API
server
to
get,
maybe
requests
allocated
or
your
limits
and
quota
and
so
on,
and
this
collectors
also
a
Prometheus
exporter,
because
you
you
want
to
expose
all
the
metrics
you
aggregated
via
a
metrics
endpoint.
D
The sampler pushes metrics to the collector, so you have metrics like CPU, memory, CPU throttling, and OOM kills as well. The first set of metrics it pushes would be the container metrics. You can get those by walking the sysfs cgroups, but it also supports getting metrics through the containerd socket, for pod runtimes where you can't get them by walking through the cgroups.
D
So
then
you
also
have
node
metrics
by
c
groups,
so
you
can
kind
of
glob
at
different
levels
and
get
metrics
for
those
as
well,
and
this
is
also
this
is
a
config.
You
can
pass
this.
The
Globs
by
config
I'll
mention
that
a
little
bit
in
the
next
slide.
So
this
takes
some
configurations.
The
buffer
size,
for
example,
Phil,
was
showing
one
second
sampling
and
buffer
size
of
five
minutes.
So
300
samples,
that's
also
configurable.
D
How
often
you
push
to
The
Collector,
it's
configurable
as
well,
and
also
the
no
double
Globs
I
mentioned
in
the
previous
point,
all
right,
so
the
metrics
collector
itself.
So
we
have
that
config
I
talked
about
here.
You
would
specify
what
resources
you're
interested
in
what
are
the
sources
of
the
metrics,
the
aggregations
and
the
extensions
I'll
go
deeper
into
that
in
the
in
a
slight
ahead.
D
It's
also
the
aggregation
engine
that
actually
performs
the
aggregations
and
exposes
those
metrics
as
well,
and
then
you
have
the
controller
that
would
list
the
resource
quota
to
get
other
sources
from
that
and
pods
nodes,
bbcs
or
namespaces.
D
And
then
you
have
the
exported
piece
that
actually
exports
exposes
the
metrics
endpoint
that
you
could
scrape
all
right.
So
the
config,
where
all
the
magic
happens.
D
So
the
first
thing
in
the
config
is
the
actual
resources
you're
interested
in.
So,
for
example,
you
have
CPU
memory,
storage
and
there's
a
resource
type
called
items,
for
example,
if
I'm
interested
in
it
interested
in
some
a
metric
on
maybe
an
IP
class
on
a
pod
and
I
want
to
count
that
on
the
namespace.
So
how
many
IP
classes
are
being
used
in
some
namespace
I
can
expose
this
as
an
item.
It's
just
essentially
a
count
all
right,
so
you
also
have
extension
labels.
D
So
these
are
labels
that
are
not
built
in,
for
example,
container
name,
pod
name,
those
are
built-in
labels,
but
what,
if
I
want
a
label
that
is
Maybe
a
Prometheus
label
that
is
an
annotation
and
a
label
on
a
pod
or
a
namespace
or
a
node,
or
even
a
note
taint
so
for
a
container
running
on
a
node
that
has
a
taint
of
no
schedule.
D
I
want
to
have
that
label
in
that
container
metric,
so
I
would
use
these
extension
labels
to
get
that,
and
then
you
have
aggregations
which
Define
the
actual
sources,
where
you
get
the
metrics
and
how
they
get
aggregated
and
how
the
metric
that
actually
gets
emitted
for
those,
and
then
you
also
have
something
called
sidecar
metrics.
So
if
you
have
metrics,
you
want
to
expose
that
are
not
part
of
the
collector.
You
can
also
use
a
sidecar
to
write
those
metrics
to
disk
and
rotate
them,
and
then
you
just
pass
the
folder
to
The.
D
The collector will then expose those metrics as well. And then there's the part that relates to the BigQuery piece, or GCS or whatever object store you want: you can also save local samples of the metrics to disk, in either proto form or JSON.
D
Okay,
that's
it
about
config.
Let's
talk
a
little
bit
about
aggregation,
so
sources,
so
your
sources
are
just
where
your
metrics
are
actually
coming
from.
So
you
have
different
types
here.
You
have
containers
C
group
spot
no
quota
and
for
each
type
you
have
different
sources
in
that.
So,
for
example,
here
for
quota
I
might
be
interested
in
resource
code
of
heart
or
quota
use
for
container
I
might
be
interested
in
the
allocations.
Or
limits
all
right,
so
then
you
have
for
the
aggregations.
You
also
oops
oops.
What
happened
there?
D
So a mask is just a set of labels and a level name that determines what metrics are going to get aggregated for those label values. For that you have the built-in labels, and then this is where you pull in those extension labels that I mentioned before. Then you have an operation: you have these masks that specify what labels you want to aggregate over, and you have the actual operation that's going to get applied to the samples that get aggregated. Average, sum, median, mean, and P95 are currently supported.
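Conceptually, a mask plus an operation amounts to grouping samples by the retained label values and folding each group with the operation. A rough stand-alone sketch, not the collector's actual types:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// A sample carries a value plus free-form labels.
type sample struct {
	labels map[string]string
	value  float64
}

// aggregate groups samples by the label values named in mask and sums
// each group; other operations (avg, median, P95) would fold the same
// groups differently.
func aggregate(samples []sample, mask []string) map[string]float64 {
	out := map[string]float64{}
	for _, s := range samples {
		parts := make([]string, len(mask))
		for i, l := range mask {
			parts[i] = l + "=" + s.labels[l]
		}
		out[strings.Join(parts, ",")] += s.value
	}
	return out
}

func main() {
	samples := []sample{
		{map[string]string{"pod": "p1", "container": "app"}, 0.05},
		{map[string]string{"pod": "p1", "container": "sidecar"}, 0.02},
		{map[string]string{"pod": "p2", "container": "app"}, 0.04},
	}
	// Pod-level mask: the container mask minus the container name, so the
	// two containers of p1 collapse into one series.
	agg := aggregate(samples, []string{"pod"})
	keys := make([]string, 0, len(agg))
	for k := range agg {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s %.2f\n", k, agg[k])
	}
}
```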
D
You
can
also
do
histograms
and
for
histograms
you
have
to
specify
kind
of
buckets
so
that
these
your
samples
can
be
correctly
aggregated
into
these
histogram
packets
and
there's
a
no
export
this
controls.
If
that
level
is
actually
exposed
asymmetric
as
Phil
alluded
to.
Maybe
you
don't
want
to
expose
container
or
pod
level
metrics,
but
you
want
to
do
those
aggregations,
so
you
would
just
specify
the
aggregation
and
then
add
the
no
export
true
and
that
does
not
get
exported.
D
D
All right, I have an example here. Here's an example of an aggregation on a container type, and I am interested in utilization and requests allocated for that container. For my masks, I want my first aggregation to be on the container. Here are the built-in labels that I'm interested in: maybe the container name, namespace, the pod name, priority class, and the node that that container is running on. Then for extensions:
D
Maybe
I
want
to
know
about
the
the
the
the
taints
on
the
note
that
that
container
is
running
on.
Maybe
if
it's
no
schedule
I
want
to
know
about
that,
and
also,
if
I
have
an
annotation
on
the
Node
that
specifies
a
node
pool.
I
also
want
to
include
that
here
and
then
I
have
no
export.
True
I
mean
at
operation
is
average
by
the
way.
D
So
it
averages
all
the
samples
that
match
that
matches
the
label
values
and
then
I
don't
want
to
export
this
metric
and
then
the
next
aggregation
level
will
be
the
Pod.
It's
just
the
same
thing.
It's
just
missing
that
container
name
and
then
the
operation
here
is
the
sum
so
I
want
to
sum
all
the
containers
that
belong
to
that
pot.
D
The
mask
name
I
had
here
so
just
say
spot,
and
this
is
that
operation
I
use
so
sum-
and
this
is
the
actual
Source
itself
so
regress
allocated,
and
this
is
the
resource
I'm
measuring
so
that'll
be
CPU
course
and
those
other
labels
that
are
pulled
from
the
built-in
labels
and
the
extension
labels.
The
same
thing
for
utilization
as
well,
and
that's
it
any
questions.
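Putting the example together, a config along these lines would express the two-level aggregation just described. This is paraphrased from the talk, so the exact field names may differ from the project's actual schema; the label key under `nodeLabels` is a made-up placeholder.

```yaml
# Illustrative only; check the project's docs for the exact schema.
resources:
  cpu: cpu_cores
extensions:
  nodeTaints:
    - name: node_taint          # emitted metric label
      key: dedicated            # hypothetical taint key
  nodeLabels:
    - name: node_pool
      label: example.com/node-pool   # hypothetical node label key
aggregations:
  - sources:
      type: container
      container: [utilization, requests_allocated]
    levels:
      - mask:
          name: container
          builtIn: [container, pod, namespace, priority_class, node]
          extensions: [node_taint, node_pool]
        operation: avg          # average the samples per container
        noExport: true          # aggregate, but don't emit this level
      - mask:
          name: pod             # same mask minus the container name
          builtIn: [pod, namespace, priority_class, node]
          extensions: [node_taint, node_pool]
        operation: sum          # sum the containers belonging to the pod
```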
B
Okay,
so
just
coming
back
quickly
to
how
use
question,
did
the
presentation
answer
your
question.
B
And
Ty
I
think:
did
you
want
to
unmute
and
ask
your
question.
A
Sure
I'm
just
wondering
a
bit
more
about
the
extensions.
My
the
current
way,
I'm
I'm
understanding
it
which
might
be
wrong,
is
you
know
you
mentioned
like
labels
and
annotations
can
be
looked
up
so
under
extensions.
Whatever
key
you
specify,
there
is
going
to
be
it's
gonna
be
looking
for
a
matching
key
in
annotations
or
labels
is.
Is
that
correct
and
then
I'm
also
wondering
like
if
you
would
have
an
annotation
and
a
label
with
the
same
key
is?
Is
there
an
order
to
kind
of
which
gets
checked?
First.
C
These are what get produced by these sorts of things, and you can see them building on one another here, and some of these are histograms and whatnot. If you look up at the top, this is where the extensions are. These are all commented out; I don't have any extensions enabled, but let's say I wanted to.
C
So
that's
like
you
know
the
Pod
metrics,
if
you
have
them,
but
like
workload,
metrics
app,
metrics,
namespace
metrics,
like
for
the
sum
of
utilization
of
a
namespace
clusterometrics,
like
all
these
sorts
of
things,
I
would
say:
okay
whenever
I
am
building
that
metric
go
look
for
this
annotation
or
this
label
on
the
pot
right,
and
so,
if
it
says
annotation
here,
I'm
looking
for
an
annotation,
if
it
says
label
here,
I'm
looking
for
a
label
on
the
Pod
right
and
then
I
say:
okay
create
the
label
on
the
metric
right.
C
So
not
the
label
on
the
Pod,
but
like
the
metric
label
named
this
right
and
then
copy
the
value
from
the
either
pod
label
orientation
label.
If
it
exists
into
this
new
label
right
her
namespace
labels,
it
would
look
like
this,
and
so
this
applies
to
any
namespaced
object
right.
So
if
I
apply,
let's
say
I
put
in
a
label
on
a
namespace
that
calls
it.
You
know,
project
cool
or
something
like
that.
C
Right
and
I
want
to
track
all
project
cool
namespaces
together
in
some
way
right
then
I'd,
say
Okay,
go
find
the
label
on
a
namespace
project
cool
and
when
I'm
doing
pod
level,
metrics
I
look
up
the
namespace
of
the
Pod.
So
we
do
the
join
here
right
in
memory
through
kind
of
the
informers
cast,
get
the
Pod
get
its
namespace
look
for
the
namespace
label.
Does
it
have
project
cool
if
it
does
copy
that
over
to
this
label
same
thing
with
node
labels
right
so
get
the
Pod
get
the
nodes
on?
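The in-memory join just described (pod, then its namespace in the informer cache, then the configured label copied onto the metric) can be sketched with plain maps; the cache here is a stand-in for the real informers, and the names are illustrative.

```go
package main

import "fmt"

// extensionLabel copies a label from a pod's namespace onto the metric's
// label set, under a new metric-label name. nsLabels stands in for the
// informer cache of namespace objects.
func extensionLabel(metricLabels map[string]string, podNamespace, srcLabel, dstLabel string,
	nsLabels map[string]map[string]string) {
	if v, ok := nsLabels[podNamespace][srcLabel]; ok {
		metricLabels[dstLabel] = v
	}
}

func main() {
	// Namespace "team-a" carries the project label we want to propagate.
	nsLabels := map[string]map[string]string{
		"team-a": {"project": "cool"},
	}
	m := map[string]string{"pod": "p1", "namespace": "team-a"}
	// Join: pod -> namespace -> label "project", emitted as metric label "project".
	extensionLabel(m, "team-a", "project", "project", nsLabels)
	fmt.Println(m["project"])
}
```

The same shape works for node labels and, with a list lookup instead of a map lookup, for node taints.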
C
Maybe
I
want
to
find
certain
nodes.
I
have
different
node
pools
in
my
cluster
and
so
I
want
to.
You
know,
get
metrics
you
know
oriented
around
which
node
pools
are
full
versus
empty
or
something
like
that
or
what's
the
utilization
of
different
node
pools
or
which
node
pools
tend
to
be
high,
more
highly,
have
higher
requests
or
those
sorts
of
things
I'd
go
get
the
Pod
get
the
note.
It's
on
look
for
the
label,
pull
its
label
and
then,
when
I
aggregate
those
pod
level
metrics
up
into
other
dimensions.
C
I
have
these
labels
here
right
and
same
thing
with
the
note
teams,
the
chains
are
done
a
little
differently
because
taints
are
repeated
and
not
just
a
map,
but
you
can
do
the
same
thing.
Looking
for
taints
those
labels,
then
I,
don't
think
I
have
any
examples
here,
but
you
could
see
a
built
in
when
you
want
to
use
them.
There'd
be
something
like
this.
They
appear
in
here
right
and
so
say:
I
want
to
aggregate
doing
a
sum.
C
This
is
probably
the
wrong
level
but
say
you
know
here:
I
want
to
aggregate
on
one
of
those
labels
right,
project,
cool,
true
right
and
so
by
setting
project
cool
equals.
True
here,
when
I
do
the
sum
I'm
going
to
retain
that
label
right
and
so
I'll
now
have
two
distinct
groups
potentially
for
this
workload
portion
of
the
workload
that's
project
cool
in
the
portion
of
the
workload.
C
That's
not
now
for
this
workload
since
they're
all
pods
in
the
same
workload,
they're,
probably
all
going
to
have
the
same
value
for
project
cool,
but
maybe
for
the
you
know,
node
pool
right.
They
might
not
right
and
so
I
have
half
my
workload
on
a
node
pool.
You
know
experimental
and
half
my
you
know,
or
next
version.
C
Maybe
I
want
to
know
what
version
of
the
node
kubernetes
couplet
right,
for
instance,
they're
running
on
and
partition
my
data
that
way
so
this
is
this
is
like
the
extensibility
that,
like
we
know
that
we
haven't
thought
of
all
the
ways
people
are
going
to
need
to
partition
and
annotate
their
data,
and
so
that's
why
all
these
are
built
into
the
rules
of.
Do
you
want
to
do
some?
Do
you
want
to
do
average?
How
do
you
want
to
kind
of
build
up
your
data?
A
B
Great demo, great presentation. It doesn't look like we have any more questions in the chat. I know we promised on our agenda for today, which I didn't link in the chat but will do right now, a technical deep dive and Q&A; we've done both of those things, but also a roadmap. Do we want to talk a little bit about the roadmap?
F
We didn't prepare something; just like in the cartoons, we painted it internally on a boulder, and the plan is that the Road Runner is going to think that it's a real tunnel.
C
I'd
say
I
mean
there's
some
things
I
think
maturity.
The
big
one
is
going
to
be
around
just
maturity.
How
do
we
make
sure
that
that
that
we
don't
lose
data?
How
do
we
keep
the
collector
as
resilient
as
possible?
I
know
that
AJ
has
been
a
challenge
to
get
right
and
just
looking
at
different
different
factors.
There
I
think
the
container
d
is
support
versus
the
c
groups
path
walking.
C
We want to build more support for that. I mean, it's out there and it's usable; I think it's ready for use, but it doesn't have quite as many miles on it. I think a big one is cgroups v2: we currently just walk the cgroups v1 paths. Maybe for containerd it doesn't matter; if you're using containerd, it doesn't matter whether you're using cgroups v1 or v2, but it does for the node-level metrics.
Yeah, an example: you might be wondering why you would want to do that, and you don't actually have to do it through this collector; you could spin up your own service that exports whatever metrics you want. But one pattern we found is maybe metadata about the cluster itself that you don't get from the cluster API, for instance, or just other sources of information that you really want exported.
C
At
the
same
time,
you
want
to
make
sure
that
you,
it's
not
another
separate
service
to
monitor
like
one
service
is
down
and
the
other,
isn't
you
start
not
having
all
the
data
so
the
we
found
the
transactional
nature
of
being
like
I,
gotta
scrape.
Then
I
probably
have
all
the
data
at
that
scrape
at
that
point
in
time
versus
versus
a
distributed,
scraping
a
bunch
of
services,
and
maybe
at
a
moment
in
time
you
don't
have
data
from
one
one
particular
thing,
but
you
do
for
another.
Having
that
is.
Is
nice!
G
So
how
does
it
behave
right
now,
if,
like
the
Informer
connection,
is
not
ready
or
the
watch
is
stuck
or
reconnects?
Is
there
no
metrics,
then,
for
this
period
of
time,.
C
I think it will potentially read from the cache as well; it depends on how stale it is. In practice I think that's been less of a problem; it seems to re-establish itself relatively successfully. It's actually the leases: the big issue we found is losing the lease connection, and then that kills you; you're like, I guess I'm not the leader anymore, and then you die, but maybe the other replica is also not able to get a lease connection. And so you were fine, your informers were fine, but because you couldn't get a lease, you decided that you're going to be unhealthy.
C
Actually, that's not quite true: we tend to run it as one instance, because of what I just described; we found more issues with not getting the lease than with not getting the informers. You have to really know what you're doing to run it in HA, because you have all these node samplers,
C
This DaemonSet of one per node is pushing into the centralized collector, and what you want to have happen is that the Prometheus instance, for instance, only scrapes the collector that's getting the metrics from those nodes; if it starts scraping the other one, all of a sudden you're going to have this weird flapping of utilization to zero.
C
If
you
look
at
the
can,
the
config
has
like
all
these
Readiness
and
health
checks
in
it,
and
we
make
sure
that
when
you
run
an
mode
The
Collector
that
isn't
getting
the
node
samples
is
marks.
Itself
is
not
ready,
and
so
that
way
it
doesn't
get
any
collections.
But
it's
it's
complicated
to
do
are.
G
You
like
right
now,
you
can
only
run
as
AJ.
You
can't
like
Shard
input
from
different
nodes
to
separate
collectors.
C
I
think
like
it's
possible
to
do
something
like
that.
The
sharp,
because,
like
that,
the
way
the
aggregation
works
right,
like
a
big
portion
of
this,
is
doing
all
the
joins
right
and
doing
all
the
aggregation
and
that's
possible
because
you
have
like
one
unified
view
of
everything,
and
so,
if
you're
sharding
it,
for
instance,
you
couldn't
have
the
aggregation
run
across
you
now
you
have
to
do
either
a
hierarchy.
Well,
you
have
partial
aggregations
and
then
filtering
up
or
so
the
answer
is
no.
G
Would
it
make
sense
for
the
all
the
cube,
API
data
you
get
from
the
informal
reflector
that
is
like
pods
and
yeah,
actually
pods
and
nodes
to
collect
that
data
inside
the
node
worker?
Because
you
then
have
at
least
not
sure
you
don't
have
one
collector
that
needs
to
get
all
pods
from
a
single
cluster
because
it
can
get
quite
big.
C
I
mean
we
still
need
to.
It
depends
on
what
the
goals
I'd
say.
This
has
not
been
a
problem
for
us,
I'd,
say
Prometheus
like
it's
actually
really
efficient,
so
it's
surprisingly
efficient
and
Prometheus
is
has
a
much
harder
time
with
the
cardinality
right,
because
it's
only
storing
the
data
right
now
right
and
then
Prometheus
stores.
If
you
wanted
to
store
30
days,
okay,
well,
that's
actually
a
lot
being
a
lot
more
data.
So
for
our
Focus!
That's
that's
the
bottleneck.
C
If
you
wanted
to
Shard
it
like
again,
the
the
when
you
do
the
roll-up
from
pod
to
workload
like
how
do
you
make
sure
that
you
can
whatever's
doing
that,
roll
up
has
all
the
pods
in
that
workload
right
and
then,
let's
say,
you're
also
doing
a
separate
roll-up
from
like
pods
to
node
pools,
okay,
well,
that
one
has
to
have
all
the
node
pools.
Well,
what?
C
If,
like
the
intersection
of
having
all
the
Pod
the
workloads
and
having
all
the
node
pools
right
and
then
and
then
oftentimes,
we
really
want
to
roll
up
to
the
full
cluster
level.
How
what
is
you
know?
How
allocated
is
this
cluster?
Is
it
full?
Is
there
a
lot
of
emptiness
right,
and
so
something
has
to
have
all
that
so
you'd
have
to
do
tiering.
G
Yeah, and the reason I ask is because we've seen metrics-server, which is also this one single thing, start to struggle at certain cluster sizes, with a lot of mostly memory utilization, to a point where sometimes it doesn't fit on a node anymore, and I wonder if that might be a similar problem. But I think, with the aggregation you're doing, it's probably a lot cheaper, because you don't have to hold that many massive objects in memory.
C
Yeah
and
we
can
do
there's
actually
some
off.
We
have
a
lot
of
room
for
optimization,
because
right
now,
the
way
we
do
the
listing
the
way
the
Pod
list
API
for
controller
runtime
works.
Is
it
basically
copies
the
entire
pod
memory
right,
and
so
we
could
probably
cut
our
memory
in
half
if
we
really
needed
to
by
by
just
reading
directly
out
of
the
Informer
cache
without
making
copies
of
the
objects.
For
instance,.
G
Yeah,
so
you
you
might
would
help
in
a
lot
of
also
operators.
I've
worked
on
like
right
now
is
switching
away
from
Informer.
If
you
don't
need
the
whole
object
and
use
reflect
and
have
your
own
cache
or
there's
the
transforming
Informer.
C
One thing we've done: I put in an optimization in the reflector, a transformer to just start clearing parts of the object, and we turned that on thinking it was going to be awesome, we were going to save half the memory, and then it saved like two percent; we were like, just turn it off, who cares.
G
And one last question: the node part is pushing metrics in proto, right? So in theory, if someone would come around and say, hey, I want those in a different format, I want you to do the aggregation somewhere else, they could just implement the same API and tell all the nodes to push there?
C
In theory, yes, but that's not the way I'd start out, because I'd ask: do you really want to push 300 samples, that level of granularity? And the answer is almost certainly no, so then you're doing the aggregations anyway. I'd say maybe the more interesting thing I could see you doing is wanting less granularity, because five minutes is too much, and so maybe you do something where you're doing averages over 20 minutes, maxes over 20 minutes, those sorts of things.
C
One
one
thing
I
didn't
demo
that
we
also
found
was
really
interesting,
is
just
getting
an
idea
of
holistically
like
on
the
Node.
How
much
is
coupons
taking
versus
the
systems
lives?
Are
these
things
well
tuned,
seeing
like
when
you
run
on
large
clusters,
just
the
variance
of
like
turns
out.
It's
not
like
one
size
fits
all,
isn't
perfect
for
the
system,
slice
right
and
you're,
either
under
under
provisioned
or
over
provisioned,
but
almost
no
there's.
B
We've got 12 minutes left, just a time check, so we do actually have a decent amount of time. Feel free to keep the questions coming at us, or, Phil, I don't know if there's anything else you wanted to show.
C
My
favorite
part
of
this
project
has
nothing
to
do
with
the
project,
we're
seeing
instrumentation
or
anything
else.
That's
a
little
piece
of
infrastructure.
We
just
had
to
write
because
we
really
were
tired
of
writing
tests,
and
so
we
write
our
own
test
infrastructure
that
Paul,
Paul
and
I
have
really
patted
ours
and
enable
have
all
patted
ourselves
in
the
back
about.
C
I,
don't
know
I
know
if
people
are
interested
in
seeing
that
something
I
could
show
off,
and
that's
all
right.
So
the
theory
is
I
hate
writing
tests.
C
What
I
really
want
to
do
is
run
my
code
and
then
say:
what's
the
results
of
the
code
and
are
they
what
I
expect
right
and
I
don't
want
to
write
what
the
results
are
and
I
don't
want
to
write
the
test,
and
the
only
thing
I
want
to
write
is
the
input,
and
so
you
know
based
on
that
kind
of
theory,
we
said:
okay,
what
can
we
do
to
make
that
a
reality?
C
And
so
we
wrote
most
of
our
most
of
our
tests
are
just
functional
where
we
have
in
the
collector
here
this.
You
know
these
are
all
directories
and
you
just
create
a
directory
of
test
data,
and
so
we
say:
okay,
here's
a
test
case,
and
so
for
this
we
say:
here's
the
state
I
want
the
cluster
to
have
so
just
load
up
the
cluster
with
this
state.
C
At
this
this
replica
sets
and
then
our
infrastructure
looks
and
finds
like
something
called
input,
client
objects,
yaml
and
just
says:
yeah.
Okay,
I'll
create
the
in-memory
cluster
with
that
and
then
here's
the
spec
for
configuring
it.
So
here's
the
aggregation
rules
and
all
the
various
pieces,
and
then
maybe
here's
this
extra
one.
You
know
other
pieces
of
input
that
you
might
need
to
say.
C
Okay,
like
ignore
these
metrics,
because
you
know
latency
seconds,
obviously
changes
and
it's
not
going
to
be
stable
and
then
here's,
maybe
the
here's,
maybe
the
the
inputs
to
write
for
you
know
what
you
get
from
the
C
group
file
system
right
and
so
each
one
of
these
test
cases
is
just
like
here's.
The
set
of
input
state
that
I
expect
and
then
the
test
basically
just
runs
the
code.
C
You say: okay, go read some of these files, run the code, do a little bit of setup, and then it produces this expected file, which is the output from it, and it writes it. So I don't write this expected file. What I do is just run it and say: update the expected file, or fail if it doesn't match the expected file. That way, whenever we make changes to our code (let's say we want to add a label)...
C
You know, we're changing the expectations, we want to make this better, or we found a bug. Oftentimes we'll find a bug, right, and instead of going and having to fix every test case when we fix that bug, we just say: update this. You run it in update mode and then you just do a diff, right. And so it makes things really easy, where I just fixed the bug and I say: I expect all these files to have changed, show me they've changed. And so we just kind of stamped these things out with different...
C
You
know
it
is
a
lot
of
copy
and
paste,
but
it's
actually
you
know
copying
pasting
in
this
format.
I
think
has
worked
out
really
well
for
us,
where
we
just
kind
of
stamped
these
different
test
cases
out.
We
have
a
lot
of
them
because
they're
really
easy
to
write
you
just
kind
of
copy
it.
C
One
test
case
change
the
inputs
to
be
kind
of
what
you
expect
and
so
that,
for
instance,
did
the
test
for
these
things
and
they're,
not
they're,
not
free,
but
you
can
kind
of
see
they're,
not
they're,
not
terrible,
right.
C
So it works there, right. And so then, you can imagine, I changed the code in a way that doesn't match the test file. We...
C
Cowbell, right. And so maybe this isn't the greatest. I don't know, I don't love this. "Unable to save"; this is not what I expected. Oh, here it is, right. So then you can see. Is this the best thing in the world? It's not. Let me show you how we expect the workflow to really work. So I've added everything, right, so it thinks this is the way, and then, let's see, update... so maybe I'll do something like this.
C
So
pretend
that
the
test
the
test
case
doesn't
match,
doesn't
match
anymore
right,
I've
made
a
code
change
and
now
it
doesn't
match.
What's
going
to
be
produced,
I
can
say
like
just
update
this
stuff
and
then
I
run
it
it's
going
to
run
all
the
tests
and
then,
when
I
go
into
Source
control.
C
It's
interesting,
it'll
show
you
hey,
look
I
made
this
change
like
it
did
say
before
the
expected
result
was
you
know
app
to,
and
now
the
expected
result
is
F1,
and
so
typically,
when
we
write
a
test
case,
we
leave
these
things
empty
kind
of
a
bad
practice.
We
should
probably
we
should
probably
be
a
little
more
careful
about
making
sure
the
test
expects,
expect
expectations.
C
Yeah
I
mean,
and
actually
you
can
see
the
deaths
really
easy.
Then
you
copy
a
test
case.
You
change
an
input,
you,
you
add
all
the
stuff
to
get
run.
The
test
do
the
diff
and
then
you
can
see
yeah
when
I
test
it.
When
I
changed
this
input,
I
see
the
output
changes
in
a
corresponding
way.
That's
that's
good!
So
that's
that's
something!
Anyone
can
actually
borrow
from
this
repo.
If
you
happen
to
want
it
for
something
you're
doing.
D
I have a general question for you. Say there's an organization that thinks this is looking pretty cool and would like to try it out. What caveats would you have for them in terms of production readiness? Like: don't try it on clusters this big, don't try this configuration, or is it all great, just go for it?
C
I'd
say:
don't
don't
try
it,
so
the
the
size
of
the
cluster
is
probably
going
to
be
less
than
issue
than
the
cardinality
of
what
you're
exporting
in
terms
of
what
you're
going
to
blow
up
right.
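One quick way to gauge that before pointing the collector at a shared Prometheus is to count the series it adds. The `job` label value below is an assumption about your scrape config; substitute your own:

```promql
# Total number of time series scraped from the collector
count({job="usage-metrics-collector"})

# The heaviest metric names, by series count
topk(10, count by (__name__) ({job="usage-metrics-collector"}))
```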
C
Maybe
maybe
listing
like
this
thing's
going
to
list
all
your
pods
make
sure
you're
comfortable
with
that
sort
of
thing.
Obviously,
any
new
piece
of
software
don't
Don't,
just
run
it.
You
know
on
your
most
important
production
cluster
without
trying
it
like,
since
the
normal
sanity
applies.
I'd
say
be
careful
about
what
which
Prometheus
you
push
this
into
push
it
into
its
own
Prometheus
I'd,
say
even
not
just
starting
out.
You
probably
want
your
own
Prometheus
for
this
stuff.
You
don't
want
to
be
alerting.
C
You
don't
want
the
thing
that
tells
you
when
something's
terribly
wrong
and
alerts
you
to
be
also
the
fire
hose
for
your
utilization
metrics,
so
so
try
and
create
your
own
Prometheus
message.
Instance
for
is
probably
the
most
important
one.
F
I
would
I
would
probably
add,
add
to
that
like
run
it
side
by
side
with
whatever
you're
doing
now
and
like
carefully
examine
the
data,
because
you
might
be
surprised
once
you
start
getting
like
very
high
resolution
data.
What
what
you
may
not
have
seen
before.
C
Yeah,
that's
very
true:
the
yeah,
the
sampling.
It
may
look
a
little.
You
don't
want
it
to
look
wildly
different.
C
So
do
a
sanity
check
on
that
any
any
metrics
that
you're
going
to
make
decisions
off
of
I'd
I'd,
compare
to
see
advisor
and
just
do
the
like
you're
not
going
to
need
to
do
all
the
joins
with
the
advisor
just
to
get
an
estimate.
Like
sum
up
by
names
like
you,
can
do
it
at
a
namespace
level,
for
instance,
without
doing
any
joints
from
C
advisor
and
then
compare
like
the
namespace
level.
Metrics
make
sure
they're
the
same.
Maybe
do
some
individual
pod
metrics
kick
the
tires.
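As a concrete sketch of that cross-check, the cAdvisor side can be computed with no joins at namespace granularity. The collector-side metric name below is a placeholder, since the exact names depend on how the aggregation is configured:

```promql
# Namespace-level CPU usage from cAdvisor (no joins needed at this granularity)
sum by (namespace) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))

# Compare against the collector's namespace-level aggregate
# (placeholder metric name; substitute what your deployment exports)
sum by (namespace) (collector_namespace_cpu_usage_placeholder)
```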
C
One
thing
you're
almost
certainly
going
to
have
to
do
is
the
if
you're
not
using
containerdy,
if
you're
using
the
c
groups
file
path,
walking,
it
relies
on
having
certain
naming
conventions
right.
It
reads
the
Pod
uid
from
the
from
the
c
groups
path.
It
reads:
the
container
ID
from
the
c
groups
path
and
different
different
systems
are
set
up
differently
with
you
know
some
of
them
append
everything
with
DOT
slice.
Some
of
them
are
pre-pinned
of
you
know:
coup
pods
Dash.
First
of
all,
Dash
whatever
to
the
Pod
name.
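To make that concrete, here is a rough sketch (not the collector's actual parser) of recovering a pod UID from a cgroup directory name under the two common layouts: plain cgroupfs names like `pod<uid>`, and systemd-driver names where the UID's dashes become underscores and the directory carries a `.slice` suffix:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// podUIDFromCgroupDir tries to recover a pod UID from one cgroup
// directory name. Illustrative only: it strips an optional ".slice"
// suffix, takes everything after the last "pod" marker, and converts
// systemd-style underscores back into the UUID's dashes.
func podUIDFromCgroupDir(dir string) (string, bool) {
	dir = strings.TrimSuffix(dir, ".slice")
	i := strings.LastIndex(dir, "pod")
	if i < 0 {
		return "", false
	}
	uid := strings.ReplaceAll(dir[i+len("pod"):], "_", "-")
	// A pod UID is a standard 8-4-4-4-12 UUID.
	ok, _ := regexp.MatchString(`^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$`, uid)
	return uid, ok
}

func main() {
	// cgroupfs driver layout
	fmt.Println(podUIDFromCgroupDir("pod8f2f53f4-31ae-4f44-8bd2-dbb2a49f04a2"))
	// systemd driver layout
	fmt.Println(podUIDFromCgroupDir("kubepods-besteffort-pod8f2f53f4_31ae_4f44_8bd2_dbb2a49f04a2.slice"))
}
```

A real implementation also has to extract the container ID from the leaf directory, which varies by runtime as well; this only shows why the naming convention matters.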
C
So
if
you're,
if
it's
not
working
out
of
the
box,
you're
gonna
have
to
you,
you
you
might
even
either
either
have
to
understand
what
it's
doing
or
maybe
you
could
ping
us
on
slack
and
say:
hey
I
I'm,
not
getting
utilization.
Metrics
I
looked
at
the
c
groups.
This
is
what
I'm
seeing.
B
The
container
energy-based
stuff
should
avoid
that
problem,
so
you
could
try
that
out,
but
oh
I
guess
another
possible
thing
that
we
could
have
on
the
road
map
is
support
for
more
Cris,
a
sort
of
a
general
thing.
B
I
wanted
to
use
like
a
general
library
and
one
didn't
exist,
so
them
to
the
breaks,
but
that
might
help
out
too.
We
have
one
minute
left.
So
I
guess
any
last
questions.
B
If that sounds good to people, we can just include it as a regular agenda item in our Thursday meetings, an update from the subproject, but I imagine we can work pretty asynchronously. Does anybody have any particular feelings one way or the other, now that we've demoed things and done the deep dive?
B
I
think
we
can
always
have
another
ad
hoc
sub-project
meeting.
If
we
need
one
I,
don't
know
if
we'll
need
regular.
What
at
this
point
so
I
think
it's
totally
fine
to
just
use
the
main
Sig
instrumentation
meeting
for
that.