From YouTube: Kubernetes SIG Node 20220712
Description
SIG Node weekly meeting. Agenda and notes: https://docs.google.com/document/d/1Ne57gvidMEWXR70OxxnRkYquAoMpt56o75oZtg-OeBg/edit#heading=h.adoto8roitwq
GMT20220712-170344_Recording_3440x1440
A
Good morning, everyone. Today is July 12th and this is our weekly update. A few of us cannot join; I think it's because it's the vacation time, it's summer. So let's start with our first topic. Daniel, do you want to talk about the dynamic resource reservation?
B
Anyways
hi
all
I'm
Daniel
I
work
for
sap
I've
been
in
this
meeting
not
many
times
before,
and
today.
I
wanted
to
talk
about
a
a
enhancement
proposal
that
I
have
talked
about
already
a
couple
months
ago
and
I
can
always
show.
There's
also
some
comments
on
that
in
the
Google
sheet.
Let
me
try
to
share
now.
Okay
now
it
should
work
and
then.
B
The last time I talked about it is here; there were also some comments that I tried to address over time. Anyway, just to recap, because it was a while ago: the main idea behind it is targeting the kube-reserved and system-reserved settings in the kubelet. They're currently statically set, and it would be nice to be able to change them at runtime.

As for the motivation behind it: we're essentially operating Kubernetes as a service internally at SAP, so we have a bunch of clusters running, and we have seen on our landscapes that the resource reservations we are setting, with a heuristic based on max pods and these kinds of things, don't really reflect reality when set statically. For one, it obviously depends on how many pods you're running, and also on what kind of workload is actually running. So for the same number of pods deployed (we typically set max pods to something like 200 or 250, so the maximum possible) we are seeing vastly different numbers with regard to utilization, and also with regard to how much memory and CPU the container runtime is actually using, and the kubelet too. The kubelet, we have seen, typically scales with the number of pods deployed; or rather, the memory and CPU it needs do.

We see that we are not reserving enough, and then under some circumstances we have seen CPU starvation or system-level out-of-memory issues, because the cgroup limit on kubepods that is derived from the system- and kube-reserved settings is too high. That's why we see these problems, and without restarting the kubelet we have no option to adjust it; and restarting the kubelet is not really a good option.
B
So that would be one of the reasons why it would be nice to adjust this at runtime. Maybe just a second; let me reload this real quick. I thought it would be nice for you to see real quick what I'm even talking about.

Essentially, I just wanted to show you that we see vastly different numbers. This is just a random cluster here. If you look at this column here on the right, we see very different working set sizes, for instance for system.slice: numbers ranging from two to eight gigs, and that's already with 100 requests. And when we calculate how much memory we should reserve, it also varies, from one gig here to sometimes six or seven gigs. So it's vastly different, and that's the main motivation.
B
In the past, one of the reasons for this that I have checked, for CPU for instance, is that container logs go to stdout, then through a pipe, and then the shim decorates those logs and writes them under the kubelet's /var/log/pods and so on. So if we have a container that does a lot of logging, that influences the container runtime's CPU requirements.

That is one of the reasons why this can happen; and it scales with the number of pods because the per-container shim processes, for instance, are part of the container runtime cgroup anyway.
B
That is the main motivation why this would make sense from our point of view. The long-term goal would be, for one, of course, to prevent issues in our stakeholders' clusters: we don't control these clusters; it's maybe a bit like GKE, where you don't control the customers' clusters. So we would like to prevent issues in those clusters, and we also don't set their system- and kube-reserved settings individually; we just do it with the static formula.

For instance, on one occasion someone was relying on kubelet eviction, and they basically had a memory-leaking container. If the cgroup limits are not properly set, no eviction takes place, but a system-level OOM instead. That's just one of those instances where we have to go in there, explain, and try to find the right kube- and system-reserved settings, and it is causing effort on our side. The other topic is, of course, utilization: just being able to safely increase it.
B
That would save a lot of money, yeah. If you have any questions or anything like that, please just ask; if I'm not making sense, just tell me. Otherwise I would go to the two concrete proposals that would also be included in this draft KEP.
D
Okay, but with it being fixed: usually, for you, is it not sufficient, or is it too much?
B
For
the
majority
of
cases,
a
formula
like
gke
or
Azure
over
reserve
and
I
think
that's
the
main
purpose
behind
these
static
reservations
to
have
an
over
reservation,
but
we
also
see
that
in
certain
cases
we
do
not
Reserve
enough.
So
we
have
like
it's
never
really
correct
for
us.
That's
the
that's.
The
main
reason.
B
I've actually built a POC for this. Essentially, what I think you need is, on Linux systems for instance, to regularly check, in a reconciliation fashion, /proc/meminfo, just to check how much memory you really have available on the system, plus check some cgroup stats from the kubepods cgroup, for instance. And for CPU, for instance, measure the real free CPU time. I think cAdvisor already has that.
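A minimal sketch of the reconciliation loop just described, assuming cgroup v2 and the default systemd-driver path for the kubepods slice; this is illustrative, not the actual POC:

```go
// Read MemAvailable from /proc/meminfo and the current usage of the
// kubepods cgroup, the two raw inputs a recommender would reconcile on.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

func memAvailableKiB() (uint64, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		if strings.HasPrefix(s.Text(), "MemAvailable:") {
			return strconv.ParseUint(strings.Fields(s.Text())[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("MemAvailable not found")
}

func kubepodsUsageBytes() (uint64, error) {
	// Path assumes cgroup v2 with the systemd cgroup driver.
	b, err := os.ReadFile("/sys/fs/cgroup/kubepods.slice/memory.current")
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
}

func main() {
	for range time.Tick(30 * time.Second) {
		avail, err1 := memAvailableKiB()
		used, err2 := kubepodsUsageBytes()
		if err1 != nil || err2 != nil {
			continue
		}
		// A recommender would feed these into whatever heuristic it
		// chooses; here we just print the raw inputs.
		fmt.Printf("MemAvailable=%d KiB, kubepods usage=%d bytes\n", avail, used)
	}
}
```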
B
Yeah, so I think that is an implementation detail. But of course it matters how you get these metrics and how you then make a decision based on them; that definitely matters, and I think that ties well into the two proposals here.
B
So
the
first
proposal
is
actually
to
to
run
some
external
recommended
process,
whatever
that
is,
it's
that
wouldn't
wouldn't
be
part
of
core
kubernetes,
and
this
one
calculates,
however,
that
they
won't
do
that,
for
it
could
be
obviously
different
for
Windows
and
for
Linux
a
recommendation
and
the
cubelet
would
have
a
means
of
telling
it
what
system
reserved
and
Cube
preserved
should
have
so
that
it
wouldn't
be
only
through
the
config
file
like
here
on
the
left,
but
you
could
do
that
at
runtime,
and
one
idea
would
be
to
expose
like
a
grpc
server
that
would
be
similar
to
where
is
it
here
to
the
Pod
resources?
B
Essentially,
the
the
cubelet
could
also
create
a
Unix
socket
and
call
that
I
don't
know
under
Dynamic
resource
reservations,
and
it
has
an
API
where
you
can
post
and
get
the
current
reservations
from
so
you
can.
You
can
have
an
ability
to
reconcile
it.
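A minimal sketch of what such an endpoint could look like, modeled on the pod resources API. All names here (the service, socket path, and fields) are hypothetical illustrations, not the KEP's actual API:

```go
// Hypothetical kubelet-side API for runtime reservation updates. The
// kubelet would serve this over gRPC on a Unix socket, and an external
// recommender would Get the current values, recompute, and Update.
package dynamicreservation

import "context"

// Reservations mirrors the kubelet's kube-reserved / system-reserved
// settings as resource-name -> quantity strings, e.g. {"memory": "2Gi"}.
type Reservations struct {
	KubeReserved   map[string]string
	SystemReserved map[string]string
}

// ReservationService is what the kubelet would expose on a socket such as
// /var/lib/kubelet/dynamic-reservations.sock (path is an assumption).
type ReservationService interface {
	// Get returns the reservations currently applied by the kubelet.
	Get(ctx context.Context) (Reservations, error)
	// Update asks the kubelet to apply new reservations at runtime,
	// resizing the kubepods cgroup limits accordingly.
	Update(ctx context.Context, r Reservations) error
}
```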
E
One question about how frequently the kubelet will report this change to the allocatable to the scheduler on the node resource, because it's going to impact scheduling decisions going forward, right? We have seen issues like resources consumed by static pods being reflected on the scheduler side, the node object side, with latency, causing scheduling failures or the kubelet running out of resources for the pod to start on the node. How do we mitigate that with a changing allocatable resource here?
B
Yeah
I
think
that's
one
of
the
biggest
difficulties
and
that's
also
one
of
the
things
that
should
probably
be
talked
about
when
in
this
cap,
because
yes,
you're
you're,
changing
that's
essentially
like
changing
the
config
right
on
on
the
Node,
making
a
different
Cuban
system
reserve.
What
today
is
already
possible,
but
you
just
do
it
a
lot
more
often
and
so
yeah.
It
is
really.
You
will
have
to
see
what
impact
it
has
on
on
the
scheduler.
Exactly
yes,.
F
And I think one thing we'd have to consider is: how does it impact the running workloads as well? So if your system is consuming more resources and you decide to allocate it more, does that mean that we have to start evicting the existing pods that are running on the cluster? I think that would be very important as well.
B
I actually think that is what the kubelet would do; I haven't tested this yet, it was on my list. So if, for some reason, system-reserved grows, like your container runtime needing that much more memory, it could happen that your pods have more reserved than you actually have allocatable. So yes, there are certain things we have to take care of; that would obviously need to be thoroughly tested.
H
So all kinds of problems, like CPU hotplug or offline, or memory hotplug or offline, practically require restarting the kubelet. So instead of just the reservation part, can we maybe think about making the kubelet's discovery of available resources dynamic as a whole, the whole set of resources?
A
The
the
Alexandria-
this
is
the
fundamental
problems.
Scheduler
is
not
doing
the
usage
or
available
or
what
a
while
skydiving
right.
So
the
problem
it
is
when
we
report,
we
know
the
comeback
and
register,
and
we
report
back
see
here's
the
available
here.
It
is
the
machine
capacity.
Here's
the
available
resource
scheduler
just
take
that
one.
So
it's
not
really
consume
of,
so
they
don't
expect
those
changes
and
also
they
don't
expect
it.
There's
the
usage
you
take
into
consideration.
We
include
the
sky
learning
decision.
So
that's
why
we
end
up.
F
I think the allocatable is the value that is capacity minus reserved. It is not taking into consideration the resources occupied by the pods themselves.
H
Yeah,
let's
find
somebody
what
I'm
saying
is
what
like,
instead
of
just
trying
to
make
one
parameter
like
resorbent
resources
change
it
like
well,
let's
have
all
the
capacities
to
be
adjustable
during
the
lifetime
of
a
Google.
A
How? Because at any given time the resource usage, just think about the simple case, only the system usage: the kernel usage and the system daemons' usage. Maybe at time t0 the usage you measure peaks, and at t1 it drops. How do we normalize those things, right? That needs to be taken into consideration, and sometimes that usage goes up and down depending on the workload.

So first of all, the usage sometimes goes up and goes down, and when it goes up it may also be triggered by the system itself, right: the kernel, certain callers, going up and down. So how are we going to smooth those things out?
A
That's
the
more
fundamental
problem,
and
so
this
is
why,
because
in
the
past,
we
didn't
have
that
as
like
average
like
to
normalize
the
data,
and
we
didn't
report
back
to
the
scheduler.
So
this
is
why
we
take
the
approach
in
on
the
Node
side,
make
that
over
reserved
a
little
bit
over
reserved
and
and
also
like
to
try
to
be
the
most
static.
That's
the
fundamental
problem,
but
in
the
book
in
the
past
we
did
in
the
node.
A
We
do
do
those
normalization
we
based
on
the
giving
time
and
the
moving
average
and
report
to
scheduler
Sky
do
not
take
that
into
consideration
and
so
that
we
can
be
more
dynamic.
Unfortunately,
we
we
think,
oh
that's,
really
over
complicated
when
the
initial
design,
especially
with
the
scheduling
design
even
today,
I
think
that's
still
think
about
this
over
complicated,
so
so.
This
is
why
we
love
our
really
doing
those
moving
average
to
report
back.
So
that's.
A
That's why we ended up handling it on the node side: because we have to protect the node, right? The things you definitely want to prevent are out-of-disk and out-of-memory, especially those, because they could tear down the entire node; out-of-memory can even come with the cost of disk corruption, data corruption issues, and the whole node going down. So those are the kinds of things.

We could also always reject at admission time, right? So we don't report it (the scheduler stays oversimplified and we don't report), but when the scheduler places a pod here, we reject it based on the current allocatable. But that could cause a cascading problem: you keep rejecting, and the scheduler keeps scheduling onto the same node. So the scheduler tries to avoid that kind of thing; otherwise we'd need another level of ad-hoc fixes.
A
So
this
is
why
I
kind
of
like
the
at
the
early
time,
because
for
the
simplified
scheduling
and
I
decided
to
make
this
also
is
over
simplified,
but
to
satisfy
customers
so
I
made
that
is
the
flag
or
configure
So
based
on
your
working
environment.
Let's
find
some
customers
say:
oh
I
have
the
large
node,
but
even
though
the
manager
node
I
only
have
the
one
part
how
much
management
resource
you
are
needed
right.
So
you
could
food
and
know
the
come
up
time
and
the
admin
next
Twitter
can
see
here.
A
So
this
node
and
I
I
type
the
node
label
the
node
and
then
make
some
changes.
Small,
like
the
reservation,
much
smaller,
so
that's
kind
of
flexibility
back
then
the
decision
I
made
it
yeah,
so
so
I
just
want
to
explain
some
background
and
the
context
where
this
is
complicated.
But
at
that
time
we
opened
you
if
scheduler
open,
to
take
some
consideration
to
take
some
usage
or
wires
scheduling
change.
B
Yeah,
thanks
for
the
background
and
I
think
that
that
my
question
would
be
currently
we
already
allowed
changing
the
other
cable
right.
Like
my
like
someone
else
already
said
before,
sorry,
I
I
missed
the
name,
so
it
actually
would
just
update
the
other
cable
like
today,
right
and
sure
maybe
I.
If
I
understood
it
correctly,
you
were
saying
that
if
that
would
happen,
while
the
scheduling
decision
happens,
then
that
could
lead
to
to
problems
during
the
scheduling
process.
If
I
understood
that
correctly.
B
The
thing
is
this:
how
often
that
happens
right?
How
often
you
update
they
are
locatable
is
same.
It
would
be
similar
to
how
often
do
you
restart
at
currently
the
cubital
process,
and
how
often
do
you
change
the
the
configuration
in
the
cubelet
so.
H
Well,
our
applicable,
for
example,
can
be
changed
if
device
plugins
goes
up
and
down.
That
leads
to
update
with
an
old
resource
status.
C
So,
on
the
basis
of
number
of
probes
you
have
and
how
frequently
they
run,
so
we
have
a
fixed
part
over
it
right
now
right
and
that
would
also
require
running
your
monitoring
process
inside
the
Pod
C
group.
So
in
cryo
we
have
an
option
where
you
can
either
run
your
your
shim
either
in
the
system,
slice
or
you
can
run
it
within
the
Pod
slice.
B
Yeah, I'm just thinking, if you can make sure that that's the only thing that increases kubelet and container runtime usage, right? Yeah.
C
Mainly
it's
that,
based
on
our
observations
like
at
stable
State,
when
all
your
pods
are
running
like
exec
probes
is
the
biggest
offender
and
then
okay,
now
often
you
run
them
rest
of
it
is
mostly
linear.
C
Yeah, the kubelet itself would be dynamically calculating the pod overhead on the basis of the number of probes and how frequently they are run. I mean, it won't solve this entirely, but it'll solve part of it, right? You still need to worry about pods and containers that have a memory leak and stuff.
C
So this assumes that you're charging your shim processes to the pod slice, right? Then your system-reserved can be more accurate. It's not going to change much, because the overhead is charged to the pod itself; but right now that pod overhead is a constant.
I
Okay, then does it really solve this over-configuration of the system-reserved?
A
So, basically, the overhead we introduced is there to charge that usage more accurately, right: probe usage, all those kinds of things. So what Mrunal suggests is just to make sure it is charged to the right owner, the right user here.
A
What you propose actually charges it to the pod slice. When we first wanted to do pod overhead, we did think about moving those kinds of things into the per-pod cgroup, so that the charging is really... yeah.
J
I took a quick look at the KEP, and I think I'd support the external monitor idea there; various innovations can happen, and eBPF is probably a good way to do it. One thing is, we'd definitely want to run this by the SIG Scheduling folks to see what they feel about modifying the node allocatable, because it's definitely going to affect scheduling decisions. I've had to deal with explaining things to them just for the pod resize part of it.
I
This is, you know, introducing another variable for them. And then we'd want to get Dawn's thoughts about usage as well, whether something there can be used to make this better. Yeah.
A
My
understanding
is
the
scheduler
is
a
really
optimized
based
on
assumption,
but
those
are
stable
nights
so,
like
the
like,
the
the
powders
vertical
scheduling,
those
kind
of
things
actually
I
think
is
the
only
effect
of
the
Single
part,
but
not
against
the
entire
of
the
exact
part
in
the
node.
So
they
are
really
Keen
about
that.
Yes,.
I
Yeah
this
this
would
introducer
is
the
resizing.
A
single
pod
has
a
race
condition
possibility
and
we
use
the
max
to
mitigate
you
know
to
kind
of
prioritize
existing
pod
resizing.
Now
this
is
like
resizing
a
node
as
far
as
the
scheduler
is
concerned,
and
that
would
affect
all
pods
that
are
being
you
know,
being
evaluated
for
a
particular
node
fit
and
definitely
would
run
this
by
the
six
scheduling
folks
and
see
what
they
feel.
B
Do
that
that
makes
a
lot
of
sense.
Actually,
that
was
also
one
of
the
that's
one
of
the
major
unknowns.
For
me.
It's
also
hard
to
test,
but
yeah
makes
a
lot
of
sense.
Yeah.
You
have
looked
at
the
draft
PR
right
and
the
draft
and
cap
that
I
posted
yeah.
B
So
I
think
that's
where
I
included
this
picture.
This
one
is
an
alternative
one
that
is
more
similar
to
the
BPA
and
how
that
that
one
works.
So
that
would
be
an
alternative,
essentially
that
that
I
just
wanted
to
quickly
show
you
so
instead
of
the
cubelet
having
a
grpc
server,
we
could
also
just
say:
Okay.
B
We
don't
have
any
of
that.
The
the
qubits
are
still
only
watches.
Node
resources
doesn't
have
a
grpc
server
and
we
we
would
introduce
like
in
the
specifications
like
you
preserved
and
system
reserved
that
field,
and
if
later
you
want
to
have
also
eviction
hard
or
something
like
this,
these
kind
of
settings.
B
You
could
also
introduce
that
in
the
spec
and
say:
okay,
instead
of
running
through
the
qubit
configuration,
you
just
can
specify
these
settings
on
the
Node
itself,
and
then
you
would
have
metrics
you
that
can
be
used
by
an
external
control
plane
that
does
it
for
all
nodes
right,
similar
to
the
vpa.
That
gets
it
through
the
core,
metrics
Pipeline
and
then
can
update
According
to
some
some
algorithm
or
specification,
and
that.
B
Yeah, so maybe when I open that draft, or also go to SIG Scheduling, I would then show the different proposals, maybe these two. I'm currently kind of preferring this alternative implementation, just because it probably wouldn't even require running a DaemonSet or something like that: typically people already have node-exporter deployed, which already gets metrics from /proc/meminfo, and then, maybe through cAdvisor, you could also get the other metrics from the cgroups that would be required to calculate the reserved resources. Then it would be not very intrusive.
B
Yeah, currently the kubelet, correct me if I'm wrong, already watches nodes, right? Nothing would change, essentially: it just still watches the node resource, and as soon as something like our reserved-resources recommender control plane, if it's deployed, changes something, the kubelet would adjust the cgroup limits on kubepods based on that. So the main difference here is just that it doesn't have a gRPC server.
B
It
goes
through
the
kubernetes
API,
it's
a
little
bit
more
kubernetes
native
yeah,
but
then
you
have
a
little
bit
that
you
need
to
go
through
Prometheus
and
these
kind
of
things
like
DPA.
Does
it
no
okay?
That
was
wrong?
It
doesn't
go
through
promises
BPA,
but
yeah.
B
It's
just
a
little
bit
of
a
different
approach
and
I
think
these
the
details
can
be
discussed
also
in
when
I
open
the
draft
cap,
if
you
like,
if
it
goes
to
too
far
now
there
are
also
other
topics
and
I
wrote
down
some
dependencies,
for
instance
everywhere.
We
should
add
there,
of
course,
the
scheduling
dependencies
right.
What
happens
then?
What
happens
for
eviction
if
you
change
that
a
lot
and
also
in
the
qubit
code
itself?
What
happens?
B
Basically,
you
I
need
to
check
wherever
the
we,
we
assume
static,
resource
reservations
that
needs
to
be
looked
at,
for
instance,
and
as
far
as
I
remember
in
the
CPU
manager
in
some
places,
and
that's
the
case
and
also
with
this
memory.
Quality
of
service
feature
flag,
that's
currently
in
Alpha
that
and
and
the
memory
manager
as
well.
Yes,.
G
B
I think that would make sense when implementing this: to make sure that there's nothing going wrong or unexpected.
B
Sure
I
think
that
makes
sense.
I
don't
have
a
lot
of
experience.
How
to
do
that
so
I
would
I
would
have
to
reach
out
to
some
of
you,
but
I'm
certainly
would
that
would
make
sense,
I
think
foreign.
B
Yeah,
so
these
are
all
good
points,
sure
make
sense
I'm
coming
to
the
end
here.
I
don't
want
to
take
too
much
of
your
time,
so
I
think
the
main
questions
for
me
would
be.
B
Is
that
something
that
you
guys
think
is
valuable
and
and
if
so,
how
would
be
the
process
and
going
ahead
with
this?
Would
I
open
a
a
draft
cap
and
then
more
discussion
takes
place
there
or
how
How
would?
How
would
that
go.
B
So I'm just not quite sure about the process itself. From what I read in the documentation, you need to have some traction, right? So I'm not quite sure how to quantify that. I mean, the issue has been open for a while, and on our side we still think it makes sense to implement such a thing.
B
The biggest thing for us is actually not so much the number of pods deployed, because we typically have the max number of pods deployed. So for us, that part at least is pretty deterministic. What is not really deterministic for us is the amount and the kind of workload running. So that's the bigger issue from our side, and that's why it would be good to have that pod overhead, if that's what you're talking about. Yes.
A
Yeah, actually, what Mrunal suggests is not based only on the number of probes; it's also workload-specific, because a certain pod requests probes, does some logging, and other things that also consume resources. If we don't properly charge those to the pod overhead, then you will charge them to kube-/system-reserved, and that's what makes it unstable and, as a static value, not accurate.
A
So
if
you
could
to
try
what
he
suggested
here
and
then
we
actually,
you
could
because
one
of
the
another
set
of
problems
you
see
that
you
don't
want,
like
the
gke
over
charged
right,
so
no
over
reserved
based
on
the
nursing
size
and
the
production,
because
you
are
not
controlled
about
the
cluster
as
gke.
So
so
you
want
to
give
the
more
flexibility,
so
you
want
to
reserve
even
more
smaller.
So
after
that,
the
overcharge
to
the
powder
slice,
the
cable
part.
A
So
then,
even
like
the
particular
powder,
you
could
estimate
down
about
your
Cuba
system
reserve.
Here,
that's
I
think
you
can
solve
the
Europe.
Your
original
code
there
so
there's
another
problem,
Alexandra
earlier
mentioned-
that
to
make
this
a
whole
thing
more
Dynamic
and
that's
the
even
more
complicated
problem.
This
is
why
I
mentioned
that
the
scheduling-
oh
wow,
that's
the
more
genomic
solution,
but
at
this
moment,
based
on
your
problem,
your
production
I
do
think
about
the
word.
A
Menu
suggested,
maybe
doesn't
solve
your
problem,
so
maybe
we
could
start
from
there
and
say:
do
we
need
to
do
this
implementation?
Do
this
change?
We
could
come
back
this
way,
make
that
even
more
generic,
but
I
do
think
about
that
to
solve
your
product
and
issue.
When
you
have
the
really
great
suggestion
there,
that's
how
initially
we
started
the
part
overhead.
L
Just one comment: once you end up doing that, you'll probably have better node stability, in that these overheads will be constrained, because they're not really constrained today and they can kind of fight, so there is a noisy-neighbor issue. It would be constrained just to the pod and not affect the other workloads running. So that's the positive side. The negative side is that now you are going to expose this overhead, most likely, to the end user, whose quota will be consumed by it. In my experience, this is the drawback and the challenge of pod overhead.
L
That
will
have
to
be
understood
where
someone
says
I
started
a
50
ml,
CPU
pod
with
one
mag
of
memory,
and
you
have
this
overhead
in
place,
even
if
it
is
dynamic,
how
they
view
their
resource
quota
depending
how
sensitive
they
are
in
their
namespace.
B
I'm
currently
wondering
could
I
centrally
enforce
this
and
calculate
these
pod
overheads,
because
I
don't
have
I
assume
that's
possible
Right
to
do
that.
Currently,
you
cannot.
It's.
E
B
C
I think that's where we can start, right? Like, one way is looking at the exec probe frequency. If we have a way to add an overhead for that accurately, then that will solve most of it. So that's where I think we can start experimenting in this area: how can the kubelet look at the pod spec and adjust the pod overhead?
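A minimal sketch of that experiment: deriving a per-pod CPU overhead from the exec probes declared in the pod spec. The per-invocation cost constant is an assumption invented for illustration; finding a realistic value is exactly what the experiment would have to establish:

```go
// Estimate a sustained CPU overhead for a pod from its exec probes.
package probeoverhead

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Hypothetical cost of one exec probe invocation, in millicore-seconds.
// A probe firing every 10s would then cost 5 millicores sustained.
const execCostMilliCPUSeconds = 50

// execProbeOverhead walks the pod spec and sums an estimated CPU overhead
// for all exec probes, based on their configured periods.
func execProbeOverhead(pod *corev1.Pod) resource.Quantity {
	milli := int64(0)
	for _, c := range pod.Spec.Containers {
		for _, p := range []*corev1.Probe{c.LivenessProbe, c.ReadinessProbe, c.StartupProbe} {
			if p == nil || p.Exec == nil {
				continue
			}
			period := p.PeriodSeconds
			if period <= 0 {
				period = 10 // Kubernetes default probe period
			}
			milli += execCostMilliCPUSeconds / int64(period)
		}
	}
	return *resource.NewMilliQuantity(milli, resource.DecimalSI)
}
```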
A
Exactly, that's exactly it. There are two things I want to comment on. One is why we are using that: to estimate, and the feedback loop will then guide the customer, right? The other thing is the customer side, because that's the overhead the user should pay. Say the customer didn't estimate it correctly initially; we get that, it's a static value. So if the customer uses more than they are guaranteed today, they may be constrained by that pod-level limit and either be killed or slowed down.
A
So
then
they
they
could
make
that
burstable
right
first
to
make
that
versatile
and
did
the
measurement
again
even
without
the
the
verticals
Auto
skating,
and
this
still
could
make
that
burstable
and
didn't
measure
themselves
and
the
next
time
to
to
make
that
is
at
a
part
level
to
share.
So
they
still
have
the
way
to
do
these
things
to
move
forward,
but
I,
don't
think
about
that!
I
do
think
about
that
blind
charge
to
the
system.
That's
the
wrong!
We
did
that
initially,
because
the
we
that
time
is
too
earlier.
A
We
did
that
in
initially,
but
the
reason
we
introduced
the
Pod
overhead.
It
is
try
to
solve
this
problem,
especially
after
we
have
like
the
VM
workload
to
support
the
rest,
like
the
Kata
content
and
the
overhead.
It
is,
is
much
bigger
than
the
pulse
container,
and
so
that's
why
we
want
to
introduce
this
configurable
at
the
first
place.
Just.
B
On
top
of
my
head
trying
to
understand,
essentially,
then
the
cubelet
would
need
to
somehow
determine
how
much
CPU
and
memory.
If
we
just
talk
about
this
overhead,
it
causes
four
system,
slice
processes
right
so
I'm,
trying
to
understand
how
it
can
do
that
without
monitoring,
for
instance,
the
the
standard
out,
how
much
a
pot
is
actually
riding
at
what
frequency
to
to
stand
it
out.
How
would
it
be
able
to
to
practically
do
that?
That's
what
I'm
trying
to
understand.
B
Because
you
would
need
to
have
some
I,
don't
know
how
yeah
just
it
seems
to
be.
You
need
to
find
you
need
to
find
every
single
or
a
lot
of
the
single
reasons,
how
a
pop
a
container
process
can
influence
the
memory
and
CPU
usage,
so
you're
going
the
other
way
around
and
to
me
just
on
top
of
my
head.
That
sounds
difficult,
but
I'm
sure
you
have
a
lot
more
experience.
Doing
that
it
seems
to
me
more.
L
If
everything
is
in
the
appropriate
pod,
C
group
like
they
get
their
way
cryos
that's
up.
You
can
easily
always
know
what
the
overhead
is
in
that
it's,
this
sum
of
container
the
workload
has
their
own
unique,
C
group
as
a
child
in
there.
If
you
subtract
that
from
the
Pod
C
group
you'll
know
exactly
what
the
overhead
is,
the
problem
is:
what
do
you
do
with
this
information?
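A minimal sketch of that subtraction, assuming cgroup v2 and a layout where each container is a child cgroup of the pod cgroup; the example path is hypothetical:

```go
// Measured pod overhead = pod cgroup usage - sum of its child (container)
// cgroup usages.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

func currentBytes(cgroupDir string) (int64, error) {
	b, err := os.ReadFile(filepath.Join(cgroupDir, "memory.current"))
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
}

// podOverheadBytes subtracts the memory usage of every child (container)
// cgroup from the pod cgroup's own usage.
func podOverheadBytes(podCgroup string) (int64, error) {
	total, err := currentBytes(podCgroup)
	if err != nil {
		return 0, err
	}
	entries, err := os.ReadDir(podCgroup)
	if err != nil {
		return 0, err
	}
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		c, err := currentBytes(filepath.Join(podCgroup, e.Name()))
		if err != nil {
			continue // child may have vanished; skip it
		}
		total -= c
	}
	return total, nil
}

func main() {
	// Example path; the real pod cgroup path depends on QoS class and UID.
	oh, err := podOverheadBytes("/sys/fs/cgroup/kubepods.slice/kubepods-pod12345.slice")
	if err == nil {
		fmt.Printf("pod overhead: %d bytes\n", oh)
	}
}
```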
L
I
know
what
the
runtime
characteristics
are,
but
I
think
that
that's
a
whole
different,
challenging
problem,
and
should
that
be
done
at
the
node
or
the
scheduler,
you
know
today
with
it
being
static,
the
scheduler
knows
so
it
can
pick
a
node
appropriately,
knowing
that
the
summer
container
requests
and
the
overhead
is
what
it
needs
to
take
into
account.
If
we
don't
know
that
what
the
overhead
is.
I
Yeah, I was on vacation last week, so there's not much to say about it; the only thing I had to do was rebase it onto the latest code. The outstanding issues are tracked in my wiki, and before I went on vacation I think I got a buy-off from SIG Scheduling.
I
They were okay with it; Huang-Wei commented on the KEP saying the outstanding remaining issues can come in a separate PR. So we're at two out of three now, or two out of four, I think: Tim has LGTM'd and SIG Scheduling has LGTM'd. I'm wondering what all is left to get the node and CRI LGTMs. Thank you, everyone, for working on this and getting the guidance out.
I
One of the questions that's still outstanding is how we should handle a resize request, especially for memory. If we cache it and then do it later, I think I'm not comfortable with that. If there is a series of updates, we could, you know, order them from smallest to biggest, and then it's not really changing the limit while telling us "yes, we did it", or succeeding.
C
I think what this boils down to is that I need to make another small commit which updates the guidelines: a few lines, I would say, on the UpdateContainerResources API as to what the CRI expects; a lot more documentation of what we expect from the runtime, and a lot more detail in the documentation.
I
Okay, so hopefully, when Derek gets back from vacation, we'll do one final look and get this boulder up to the top of the hill.
K
Yeah, I just wanted to give an update: with the help of Mrunal and Mike, we did another round of reviews, and at this point I think everyone is happy with the checkpoint-container code pull request. I think now we only need an approval from Derek, if I understood correctly.
J
Can everyone hear me? Sorry... yes, yeah, working, good, thank you. I was not able to add that item to the agenda; I'm really sorry, I didn't have any privileges on that doc. I just wanted to follow up on the email that we sent, I think last Friday, so really not a long time ago, but we're looking for some feedback here on the proposed transition for the management of out-of-tree modules in Kubernetes. I'm just not quite sure how we should move forward with this.
A
Sorry, I didn't get what you asked or requested here. Maybe we could move the discussion to the SIG Node Slack channel? Yeah, I'm going to join, yeah.
C
I
I
guess
Clinton:
let's,
let's
move
it
to
signaled
and
like
the
channel
and
then
follow
up
next
week.
A
Thanks, thanks. And also, Alexandra, I will reply to you about your question. Yeah, see you.