From YouTube: Kubernetes SIG Node 20230509
Description
SIG Node weekly meeting. Agenda and notes: https://docs.google.com/document/d/1Ne57gvidMEWXR70OxxnRkYquAoMpt56o75oZtg-OeBg/edit#heading=h.adoto8roitwq
A
B
Yeah, so last week we had this device manager bug that was fixed, and I have the link added in the agenda doc. The aim was to ensure that the recovery flow is correct, and what we've added as part of that is extra checks to make sure that an application pod requesting a device is admitted only if the device plugin has registered itself to the kubelet and the underlying devices are healthy.
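The admission rule described above can be sketched as a small decision function. This is purely illustrative — the names and data structures are hypothetical, not the kubelet device manager's actual code:

```python
# Illustrative sketch of the admission rule discussed above: a pod that
# requests an extension resource is admitted only if the device plugin for
# that resource has registered with the kubelet AND at least one of its
# devices is reported healthy. Names here are hypothetical.

def admit_pod(requested_resource, registered_plugins, device_health):
    """Return (admitted, reason) for a pod requesting `requested_resource`.

    registered_plugins: set of resource names whose plugin has registered.
    device_health: {resource: {device_id: is_healthy}} as reported by plugins.
    """
    if requested_resource not in registered_plugins:
        return (False, f"device plugin for {requested_resource} not registered")
    healthy = [d for d, ok in device_health.get(requested_resource, {}).items() if ok]
    if not healthy:
        return (False, f"no healthy {requested_resource} devices")
    return (True, "admitted")
```

The point of the fix, as described, is the first branch: right after a node reboot, before the plugin re-registers, the pod is rejected at admission time rather than started against stale checkpoint state.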
B
The cherry-pick deadline is the end of this week, and Sergey suggested in one of the review comments that it's best that I have a discussion with the community members and gather feedback. So I'd really appreciate everyone's feedback: if there are any concerns about backporting this fix, or if everyone's okay with it, please let me know.
B
A
C
Yeah, so actually what you summarized — what happened — is much like the old bug that we addressed a long time back in open source. That's why, when I earlier saw this bug first reported against OpenShift, I thought it was an OpenShift integration issue instead; but it is an open source issue. So this morning, after I saw you ask for a cherry-pick, I did ask GKE internally, because I saw some new changes there early last year, and of course this regression.
C
So that's why I wonder — it is a little bit of a concern for me, backporting this to the existing GA releases. So I want to ask for just a little bit of time, until I get the report back from internal production. Basically, internal production is just using open source Kubernetes, right? So I want to understand why I never heard this reported from production, and then I can process this whole thing. Absolutely no problem at all.
B
No problem — just one more thing I'd like to point out. I think the reason that it's affecting us is primarily in single-node deployments. Typically, if you have multi-node deployments, you can drain the nodes; but the use case that we care about is specific to single-node Kubernetes/OpenShift deployments, and that's where we don't have the ability to drain the node, and that's where it's affecting us the most. And I remember as well, David mentioned that in the case of GPU plugins he noticed something like that.
D
B
Yeah, so in the scenario that was reported to us, it was that the underlying device wasn't provisioned properly. But in the end-to-end test case that we've provided as part of the PR itself, what we do is intentionally prevent the device plugin from registering itself to the kubelet; because of that, the underlying devices are essentially not healthy, and we should see the application pod that is requesting the device —
B
It should fail at admission time. And this is, of course, after the node has been rebooted and the kubelet has been restarted — that's when you would notice this. You know, in steady state, all your application containers and device plugins, everything is up and running; you reboot the node, the registration does not occur, and the application pod which is requesting the device should fail.
E
Yeah, that all totally makes sense. But just to be clear: the device needs to be unhealthy after reboot — is that accurate, or...?
D
B
Yes. So, you know, in steady state everything is up and running, everything is allocated; and when the node is rebooted, there could be a scenario where your underlying devices haven't been provisioned. It could be because, say with SR-IOV NICs, you need to ensure that the device driver is up and running; or, for whatever reason, the device plugin pod appears after your application pod — because essentially at recovery time, or node reboot time, you have no control over the ordering of pods.
C
Thanks also for pointing out the single-node case, because from the summary I didn't realize it was single node, and we definitely didn't fix the single-node issue in the past. This is why, when you described the kinds of things that we fixed — we tried to make sure the device plugin does not expose the capacity until everything's ready — but not for the single node.
C
That case, I think, we didn't handle — and not just single node, actually: single node, and also node reboot, where the kubelet just simply reads from the checkpoint and claims it has the capacity. Because we did try, after the node died or rebooted, to re-register and recreate after the node restart — so there was some prior work — but the single-node case I don't think we handled, so thanks for bringing that up. So, okay, I will approve that one after all. Right, thanks.
A
All right — thank you, Swati! So next on the agenda, we have Karthik with the dynamic node resize proposal.
F
Yeah, hi everyone. So recently we submitted an enhancement proposal, node resize. Basically, we are trying to solve a problem where we want to dynamically change the node compute capacity without restarting the kubelet. The current workaround is that if you increase or decrease the node capacity, you need to restart the kubelet.
F
So we want to avoid that and make it completely dynamic. The use cases we are trying to solve are a situation where adding a new node to a cluster takes considerably more time than updating existing machines, and also where a user cluster wants to manage a limited number of nodes. So there are a couple of use cases we are trying to handle with this proposal. We have done some observations, created a Google doc, and shared it across the SIG.
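To make the trade-off concrete, here is a toy model — not from the KEP — of what a dynamic resize would have to do: when observed capacity changes, re-derive node allocatable the way the kubelet does at startup (capacity minus reservations and eviction thresholds), instead of requiring a kubelet restart:

```python
# Toy model of recomputing node allocatable when capacity changes at
# runtime. The formula mirrors the usual allocatable derivation
# (capacity - kube-reserved - system-reserved - hard eviction threshold);
# function and field names are illustrative, not the kubelet's code.

def recompute_allocatable(capacity_mib, kube_reserved_mib,
                          system_reserved_mib, eviction_hard_mib):
    """allocatable = capacity - reservations - eviction threshold, floored at 0."""
    alloc = capacity_mib - kube_reserved_mib - system_reserved_mib - eviction_hard_mib
    return max(alloc, 0)

def on_capacity_change(new_capacity_mib, reserved, eviction_hard_mib):
    """What a dynamic-resize kubelet would publish in node status after an
    out-of-band capacity change, with no restart required."""
    return recompute_allocatable(new_capacity_mib,
                                 reserved["kube"], reserved["system"],
                                 eviction_hard_mib)
```

Today this recomputation only happens when the kubelet starts, which is why the current workaround is a kubelet restart.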
F
G
I had a quick question on the discussion doc that you sent. You enumerated some things as non-goals, and I'm curious if you could explain the motivations for them not being goals. Is there something unique about your environment where you're not using devices, or not looking at swap, or is your workload not needing any isolation or affinity around CPUs and memory? I'm just kind of curious.
B
F
Yeah, so initially, when we started with that, our goal was only the dynamic resize itself, and we considered those as non-goals. But later, through the discussion and the feedback, it became clear that we have to make them goals, and we have to improve the resource managers' initialization and reinitialization depending on the changes, as it would affect those components as well. So recently we have moved them to the goals in the enhancement.
G
Okay, and then just to follow up: is your goal to be able to increase any resource as well as decrease any resource? Or, given the removal of those from the non-goals, is there any constraint around the resizing you're wanting to do?
F
H
Yeah — pod admission is a big problem for this PR, this KEP, and maybe some other KEPs. Like one thing we discussed — I commented somewhere here — in the AppArmor case, for instance: we had the case where AppArmor was disabled and then the kubelet restarted. What do you do with pods that need AppArmor? That was another kind of early-admission situation that needs to be covered.
F
H
But besides admission — this KEP is also targeted at automatic resize, like: whatever cAdvisor tells you, you do. So the first comment I put in the document was about throttling this situation. If cAdvisor is flaky, or your resources are quite dynamic — I don't know, maybe some flapping is happening because of a missed error in cAdvisor — then we should at least threshold how we change the node size. But ideally, I —
H
— think we may also want approval-based resize: cAdvisor can detect the new size, but then you may need to go through some API call to actually do the resize, when possible. I think that would be better for some cases, especially when there is some virtual environment that actually has a bigger size, but somebody wants to control it from their side and make it smaller and bigger based on demand — and maybe they have information about the virtual hardware that cAdvisor doesn't know.
I
Well, there are a couple of corner cases in how cAdvisor, if I remember correctly, publishes information about a machine. It will not showcase the problem scenario when, say, one of the threads of a core goes offline; or when — say, with CXL memory — a new memory region appears on a node, dynamically attached; or, even worse, when something disappears. When we are talking about VMs, where we can do memory ballooning, it's easier; but once we start juggling cores with different numbers of threads, it will be a whole can of worms we're opening, I would say, with the current implementation of the set of managers we have.
F
I
G
Maybe one of the things that would help me — my information may be out of date, or I may have just forgotten this — given that the feedback loop is based on whatever cAdvisor has observed, and obviously the changes that increase or decrease resources on the node happen out of band: is there anything you could point to in your KEP that describes the state of the art, whether on Linux or Windows hosts, with respect to what resources can be dynamically added or removed without a reboot?
G
I think that would be helpful in just understanding the scope of restrictions — I don't have that from memory. And so maybe — Dawn, you have your hand raised; maybe you do, I don't know. I'm sorry if I got that wrong. Thank you.
C
I think, everyone — there is a lot of complexity here; I totally agree. That's why I can lower my hand here, but I do want to bring some memory back to some folks here. I do have concerns as well — maybe I'm more concerned about this one. There could be a lot of non-goals there, but the complexity is embedded in this. Without that kind of resource management, without pod admission, I don't think this can fly or be beneficial to many users.
C
I also worry about people, even if we make this work. I think a couple of years ago something like the same proposal was sent to Kubernetes — I think we worked together and rejected that one, because of the real potential harm to Kubernetes. A similar proposal is kind of the dynamic node, or maybe virtual node — the virtual kubelet. It's quite similar: the node size can be expanded without limit.
C
So I worry that once we start on this one, it goes in that direction. I'm okay with building something on top of Kubernetes, outside of Kubernetes, hiding those implementation details and treating the entire cluster as a virtual giant pool of resources — but not at the Kubernetes layer itself. I think that's the complexity.
C
You've heard many people mention the different complexities, right? So I don't want to repeat that — how many resources, and how you're going to manage the dynamics. But at the Kubernetes common layer, what we are doing is purely from the Kubernetes open source perspective, and so I really worry that we couldn't handle it very well — abusive usage could even harm the health of Kubernetes in general. So that's just my comment here. I want to say that I personally don't want to —
C
J
Yeah, I had two questions. One touches on what Dawn brought up, which is: is there an alternative that deals with this in a less invasive way than fixing it in the kubelet, by pushing it to whoever is managing the node, who wanted to do this more aggressively — like replacing a node with another node? And it's possible to orchestrate some of that today.
J
So some of those efforts — an implementation path like that might actually help in other ways and be simpler to implement, because it could be done on top and improved through testing basic changes. So that's one thing to consider; I'll make that comment in the proposal. And the second one was a concern about the dynamic resizing: a lot of the third-party plugins — device manager, or CPU — are already a little fuzzy and have bugs around dynamism.
J
F
Yeah, so we want to change those plugins to accommodate these changes as well, along with the dynamic resize. That's —
B
F
J
I mean, I'm generally in favor of improving how we deal with dynamism in the kubelet — not to say that I don't agree with Dawn's concerns — it's just that there's a lot of dynamism in the kubelet that's just broken today, and it's all subtle stuff. Anything that helps, anything that forces us to go and test it and be better, I think is a positive for the kubelet.
I
Yeah, so I just wanted to answer Dawn's comment about the use cases. Our main use case, besides virtual machines, is some changes in hardware that are coming to market, or already on the market: multiple hardware vendors have implemented a feature called CXL memory, and one of the properties of this thing is the possibility to dynamically attach or detach memory.
I
So what matters is: when you have a node which is under memory pressure, you can, say, ask a management infrastructure, "please attach another 10 gig of memory," and from a top-of-rack pool of memory it will be dynamically increased — and the same when we're decreasing. It's something that will probably be common in the next couple of years, but in order to make sure this feature is available, we need to start to work on it now.
I
Or another example — Swati or Francesca might also say something like this — there are a bunch of telco customers who say, "I want to run my workload on the full core, exclusively," and for them, shutting down the hyperthread of a core is, I wouldn't say a common situation, but a desired situation.
I
C
On top of those things — it's unlimited; there's never a boundary. But Alexander, your use cases are actually different here, right? Your use case is attached memory; and even a couple of years ago we had GCE — you also have the case of resizing a node. So the problem is: we could have a range, right? Up front, for those cases, you have to arrange how much can be dynamic — it's not an unlimited resource.
C
There is still a range, right? So we could handle it from there: when a node is reported, you basically know its minimum size — this is what I told the GKE team back then, but they never came back to me, unfortunately. That's what I thought: you do know your real minimum and maximum, you know your latency to grow from the current minimum, and what resources are available to grow to the maximum. If you give me that information, then we can handle it — it is more manageable.
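The bounded-range idea suggested here can be sketched as a small data structure — purely hypothetical, since no such API exists today: the node advertises a minimum, a maximum, and a latency to grow, and resizes are only accepted inside that envelope:

```python
from dataclasses import dataclass

# Hypothetical sketch of the bounded-resize envelope suggested above: a node
# advertises min/max capacity and how long a grow takes, so the scheduler and
# resource managers can treat the resource as predictable rather than unlimited.

@dataclass
class ResizeEnvelope:
    min_cpus: int          # current/minimum guaranteed capacity
    max_cpus: int          # hard upper bound the platform can attach
    grow_latency_s: float  # advertised time to grow from current to max

    def accepts(self, requested_cpus: int) -> bool:
        """A resize request is valid only inside the advertised range."""
        return self.min_cpus <= requested_cpus <= self.max_cpus
```

The design point being made is exactly the `accepts` check: with a declared envelope, dynamic resize stays predictable and schedulable, rather than an unbounded resource.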
C
It is predictable, so that we can give that signal to the scheduler and to the resource management; even on the node we can handle it better. I did share this with the internal GCE team, but they never came back to me, so I share it here, since we have more use cases — really common use cases — we can start from there to discuss. But I'm really concerned that we go down the path of the next unlimited resource, where everything is hidden and you basically treat Kubernetes as the scheduler, and I —
C
— don't think we can take on those big jobs we want to handle. We could handle them at another level: work with the kernel, which handles those scheduling jobs much better, to manage the resources or latencies and throughput — even throughput we cannot handle well. So we can only do part of the job, right? That's kind of my concern — abusive use could go crazy. This is why I try to rephrase it: if we have the real need, if the device vendors have the real potential, in the future —
C
— we could start talking about those kinds of things. But I want to start from the whole of Kubernetes: how Kubernetes can manage it, how an admin can manage those kinds of things. And on top of that, serverless or even crazier, nodeless kinds of things — I think those can be built on top of us, not in open source Kubernetes, not in the team here.
I
Just one comment about provisioning potential capacity: the problem is that the kubelet trusts what cAdvisor tells it, and cAdvisor doesn't know anything about potential capacity. It just says, "I see it how it is right now, and this is my node capacity." So for an admin there is no way to say, "I have some additional potential capacity" — and actually this KEP is really about retriggering cAdvisor.
A
Great — I think we can move on to the next topic.
J
Sure. So the previous two topics touch on some of the things I tried to summarize in the attached spreadsheet — I hope folks have seen it; I sent it around a while back. If it's your first time seeing it: there are a number of intersecting challenges in the kubelet about how the kubelet takes input, and then how different subcomponents react to that input. Admission, obviously, takes desired pods and turns them into pods that the kubelet has actually decided to run.
J
There are a couple of bugs on static pods that have some short-term implications, but a number of the things we talked about even today — like admission and resizing, as well as, in 1.27, the in-place resizing of pods — made it obvious that we have some challenges in describing and reasoning about what the correct way is to add some of these features. And as we get more of these dynamic features, it figures it's a good time. So for 1.28 I was going to focus — David Porter's been working on this —
J
— along with me. I was going to work on getting some clarity in the kubelet around the difference between what someone has asked the kubelet to run and what the kubelet is actually running — that's been worked on for a couple of releases, and it'll help the static pod stuff that Ryan had asked about — and then working with in-place resizing to improve kubelet admission to be a little bit less —
J
Are there any other key issues around admission and state management in the kubelet — any of the issues you might have seen in that deck — that are important, that either folks would like to work on with David and me, or that need additional capabilities in admission, for instance, that might justify coming up with a bigger design than just the KEPs themselves? It's kind of a general question to anyone who has changes coming in 1.28, or potentially in 1.29.
J
H
I don't think it's dynamic enough, but the pluggable resource management KEP may affect admission, and it will be pluggable. So it's not static enough — but yeah, I don't remember the last state of that KEP well enough to understand if it will affect your work at all.
J
Certainly. I would probably say anything around dynamic resizing of nodes would need to be fairly — we would want to make sure that we get the design of anything like that into the KEP. Are there any things folks are working on — in CPU manager, device manager, plugins — beyond what you mentioned, Sergey? Anybody know of anything that's going to go in that has some dynamism?
C
It depends on the reviewers — bandwidth, approval, all those kinds of dependencies. But we can definitely share with you our current planning — what people have committed to say they are going to be working on. There are several. Oh yeah — I'll share it with you manually first, so you can see there are a bunch of people working on that. Look at the 1.28 list: there are many things people say they might want to work on, but it's not clear which we will definitely deliver or lock down the resources for. Yeah.
J
Yeah — the one that, if I had caught it — if I had realized this earlier; David and I have been working on some better descriptions of this — if I had realized that in-place resizing was going into 1.27, we probably could have helped with some of the review. And so some of what I'm looking to do is improve the review process for complicated changes.
J
State changes to the kubelet, that is — by having better docs, better explanations, and better code structure. Because a lot of how the kubelet deals with state is "go ask someone who hasn't worked on the project in five years, or somebody who's very busy," and I think that's the thing I'd like us to get out of by the end of 1.29: you can go read a doc and understand what the kubelet is doing when you want to change how data flows around the kubelet.
K
D
E
One of the things — yeah, one of the things I was looking at, that we managed to tackle, is the issues around termination grace period: pre-stop hooks — blocking pre-stop hooks, basically — are not factored into the termination grace period. I think that's something nice to also help with.
J
That actually reminds me of one more. So, one of the things that David and I have noticed is that there are a lot of subtleties in the kubelet's behavior that aren't well captured by e2e tests that stress them. Along the lines of the termination thing: the context cancellation in the kubelet — you know, when a delete request comes in, or the pod exits —
J
— something that can come in and say, "hey, we've stopped this and we're moving to the next phase." One of the things that was pretty clear there is that it's really hard to explain how to test this, and so some of the contributions from folks need a lot of hand-holding just to get better tests in place.
J
So if there are other people who are looking to build better testing as part of their KEPs or their use cases, I think this is also a fruitful angle for us: how we can catch more issues with fewer tests, and catch them in things like the restart issues, instead of catching them one by one as each new component is added — putting it on the KEP author to run more aggressive testing around new features, testing that'll catch things like kubelet restart or early termination, stuff like that. So if any of those things interest you, come by.
H
Yeah, I wanted to note that we did a few improvements in testing for sidecars specifically — it's mostly about the lifecycle, less about restarts — but you can look at that KEP for what's mentioned here. I think it's a good transition to the next topic from Zach, another "dynamics of a change of behavior" proposal. Zach, do you want to speak? Sure.
L
Thanks — hey, I'm Zach. I have never presented here before, but I do know many of y'all. I have been trying to bug Sergey and David for some improvement in CrashLoopBackOff behavior, and they suggested I come here and present why I care about CrashLoopBackOff and what options we have, with the possibility of working towards a KEP, perhaps for 1.28 or later. So there's a doc linked in the agenda where I tried to summarize the issues. But CrashLoopBackOff, for those of you not familiar, has basically sat since the beginning of time in Kubernetes with a static policy: you know, a container crashes, we back off, and we restart it after 10 seconds; then comes the next crash.
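For reference, the kubelet's policy here is an exponential backoff — roughly a 10-second initial delay doubling per crash up to a 5-minute cap, with the counter reset after the container has run successfully for 10 minutes. A simplified model of that policy (approximate; the real logic lives in the kubelet's backoff helpers):

```python
# Simplified model of the kubelet's CrashLoopBackOff policy: ~10s initial
# delay, doubling per consecutive crash, capped at 5 minutes; the crash
# counter resets after 10 minutes of successful running. Approximate and
# for illustration only.

INITIAL_S = 10
CAP_S = 300          # 5-minute cap
RESET_AFTER_S = 600  # 10 minutes of successful running resets the counter

def backoff_delay(crash_count: int) -> int:
    """Delay before restart after the Nth consecutive crash (N >= 1)."""
    return min(INITIAL_S * 2 ** (crash_count - 1), CAP_S)

def next_crash_count(prev_count: int, ran_for_s: float) -> int:
    """The counter resets only if the container ran long enough before exiting."""
    return 1 if ran_for_s >= RESET_AFTER_S else prev_count + 1
```

The two-minute-match problem below falls directly out of `next_crash_count`: a container that always exits after 120 seconds never reaches `RESET_AFTER_S`, so every restart keeps escalating toward the 5-minute cap.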
L
This actually presents a challenge. I'm now in a role with the games team, which is mostly responsible for Agones — the thing that hosts live game servers. But previously I was in a role maintaining GKE Kubernetes control planes, which were heavy static pod users. In both of those roles I had issues with CrashLoopBackOff, and I'll —
L
— explain why. In the previous role — and this isn't unique to static pods at all, but any time you have a cold start with a long chain of dependencies — so you have C depends on B depends on A — if that cold start is a little slow —
L
— you know, A takes a little while to start up, and all of a sudden B gets backed off more and more, and then that can easily cascade to C getting backed off more and more. Especially in the Kubernetes control plane you had, you know, etcd; eventually, I think we actually introduced a dependency on something that etcd itself also depended on, yada yada, and then the kube-apiserver depends on that.
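The cascade described here (C depends on B depends on A) can be illustrated with a tiny simulation under the same exponential-backoff model: each dependent crashes immediately until its dependency is up, and the accumulated backoff pushes its own readiness further and further out. All numbers are illustrative:

```python
# Toy simulation of cascading CrashLoopBackOff in a dependency chain.
# Each service retries on a 10s-doubling backoff (capped at 5 min); every
# attempt made before its dependency is ready crashes instantly, and the
# first attempt after the dependency is up succeeds, taking `startup`
# seconds to come ready.

def ready_time(dep_ready_at: float, startup: float = 5.0) -> float:
    """Time at which a service becomes ready, given its dependency's ready time."""
    t, delay = 0.0, 10.0
    while t < dep_ready_at:          # this attempt crashes: dep not up yet
        t += delay                   # wait out the backoff, try again
        delay = min(delay * 2, 300.0)
    return t + startup               # first attempt after dep is up succeeds

a = 45.0             # A's cold start is a little slow: ready at t=45s
b = ready_time(a)    # B's retries at t=10,30 crash; the t=70 attempt works -> 75s
c = ready_time(b)    # C's retries at t=10,30,70 crash; the t=150 attempt works -> 155s
```

A 45-second slowdown in A turns into B being ready at 75 seconds and C at 155 — the amplification the speaker describes for etcd and the kube-apiserver.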
L
We actually saw reliability issues where our SLOs were getting impacted because the kube-apiserver wasn't coming up, in the kind of rare case where these initialization chains took a little too long. In a different part of the world, Agones has an interesting case where, in order to not put undue load on the Kubernetes control plane, we would like to be able to restart the container within the pod rapidly.
L
So, in the Agones case, each of these little live game servers is effectively a shared simulation between players: you're logging in, you have some, say, Street Fighter match, and you're both connecting to this thing — and then it recycles the entire pod, or potentially the container, depending on how they have it set up. Right now we have a kind of hacky workaround in Agones where you can restart within the process, and we allow that within the lifecycle in Agones; but there are a lot of people that don't want to restart within the process, because they're basically saying, "hey, we run a containerized workflow anyway —
L
— why am I restarting within the process when I could just bounce the container?" But the problem with bouncing the container is that it only works for games whose session length is longer than whatever it is — 10 minutes — the backoff reset timer. So again, coming back to, say, Street Fighter or something: if you're playing a two-minute match against someone, that container is going to stop at that two-minute mark and have no opportunity to be recycled without backoff.
L
So that's kind of the impetus — the doc talks about it a little more. I wanted to explore different options for changing this.
L
On the complicated end, I feel like we could introduce an API that is straight up: "for this pod, I want this exact backoff behavior." I think that might take a little bit of tuning to understand where the right balances are — like, you know, what's the minimum backoff the kubelet could support, or anything like that.
L
A super-simple approach might actually be ripping off some of the work done in the recent Jobs KEP. It might even be something like just having two static policies — or maybe one tunable policy, or something of that nature: one policy for containers that exited cleanly — strictly looking at just exit codes — and maybe another policy for containers that crashed or had some other infrastructure failure, or something of that nature.
L
Anyway, that's roughly it — I'm trying to think whether I presented any other options in the doc. Yeah, so I actually wanted to explore that a bit here and see if y'all had ideas, or other approaches that we should be considering.
C
So, the background for it — sorry, did any other people lower their hands? Okay, sorry. So I just want to say that the backoff window — the backoff algorithm — I think was initially just oversimplified. The backoff is just to protect the node, right? Because you worry about those — exactly; you were there, we discussed this before, and we all know that's the problem. We want to optimize and enhance those things. So there's one thing —
C
Actually, you are really triggering me to think: if, on purpose, there's a certain process — like the cascading workflow now — and the restart actually is a proper restart, maybe it shouldn't be punished by the backoff. We only really want to punish the crash loop of a container we don't trust, because we need to keep it from attacking our node, right?
C
So, if it's a proper restart — that's the way, yeah — you could propose something like, maybe, the first crash... but how are we going to track that this is the first time, right? We do have some restart counting, though not done properly, I have to say. But we could have some policy there to say: certain things maybe don't need backoff — like the first restart, the first crash — or maybe the exit code indicates a proper restart, and we shouldn't punish it with the backoff, those kinds of things.
C
J
And so I was in favor of this. There are a couple of other use cases — the DaemonJob proposal also got me thinking about this, which is: basically, if you want to do anything that applies policy on nodes... The DaemonJob proposal, for those who weren't here, was basically that a DaemonSet can be used to enforce policy on nodes, and in practice that's how a lot of people use it: you want to force-pull images to all the nodes —
J
— you want to go change some state, you want to deploy a network plugin. For a lot of those, there's maybe a separation between "you apply a policy to the node" and "you run something on the node," but a lot of the applying-policy-on-the-node variants —
J
I think we can dramatically improve what we do today. Dawn kind of triggered another thing, which is: there are things that we don't account for — like a crash-looping pod is using system resources that are sometimes poorly accounted to the pod.
J
It would be awesome, as we go through this, if we could come up with a justification quantifying the actual impact of restarts on the node — even at a really high level: what does it actually cost to restart a container, in terms of kubelet time and kube-apiserver time? What does the pod do, and what does the container runtime do? That's a good input to maybe just an operational parameter, which is how much — you know, how much — yeah, Derek's probably... I was thinking about —
J
— systemd as well: there is an administrative policy, which is how many resources you want to accept as overhead — and that's the real key reason CrashLoopBackOff exists, as the worst-case policy. Even moving up to another level of policy — just "how many resources per second do you want to fairly divide among workloads" — and leaving that up to the admin: if we can keep it below that, I think that's another option as well.
L
For sure. Sergey, I think you had a hand.
H
Yeah, I really like this angle of improving this behavior for all pods: better accounting of resources, and making it cheaper to restart, will definitely help everybody, I think. In this specific case —
H
— it also resonates with another KEP, where a container intentionally wants to — instead of saying "it's a benign failure, please restart me as fast as possible" — maybe also indicate "it's a very bad failure and you need to terminate the entire pod, because it's such a bad failure that I can't run here any longer." So I think, from an API perspective, there may be two features coming together: where a container is saying "there is this failure" —
H
— a specific exit code, or a specific type of failure, is benign — or it's very critical. So I think there is an opportunity for an API change that can be done with a reasonably narrow scope.
D
A
E
Yeah, one of the things I think we should think about here is: is it up to the admin to define what is a crash versus not, or is it something we observe? In the sense that, you know, I see two ways we can progress here. One is, say, we look at the exit code: if it's a zero exit code, maybe it was successful —
E
Maybe we don't limit it, or something like that. But there might be a different case where, you know, you're deploying some application where you don't control the exit code, and you want to force it not to be delayed by the crash loop backoff. So I think it might be useful to see what we're trying to solve, what scenario we're trying to solve, versus the assumption that we do have control over the actual application being run and can modify it.
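The exit-code idea being discussed could look roughly like the sketch below. The rule type and action names are hypothetical; this is not the actual Kubernetes API, just an illustration of routing restart behavior by exit code.

```go
package main

import "fmt"

// RestartRule is a hypothetical policy entry: specific exit codes map to a
// restart action, instead of every non-zero exit entering crash loop backoff.
type RestartRule struct {
	ExitCodes []int  // exit codes this rule matches
	Action    string // "restart-fast", "backoff", or "terminate-pod"
}

// decide returns the action for a container exit code, falling back to
// today's behavior (crash loop backoff) when no rule matches.
func decide(rules []RestartRule, exitCode int) string {
	for _, r := range rules {
		for _, c := range r.ExitCodes {
			if c == exitCode {
				return r.Action
			}
		}
	}
	return "backoff" // default: current crash loop backoff
}

func main() {
	rules := []RestartRule{
		{ExitCodes: []int{42}, Action: "restart-fast"}, // benign failure, retry quickly
		{ExitCodes: []int{1}, Action: "terminate-pod"}, // fatal, give up on this pod
	}
	fmt.Println(decide(rules, 42))
	fmt.Println(decide(rules, 0))
}
```

This also shows the second case raised above: an application whose exit codes you do not control simply gets no rules, and falls through to the default.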
J
Adding one more point: static pods should probably not be put into crash loop backoff at the rate they are today.
J
This is a bigger discussion, but it gets into some of that chain of dependencies. I'll be sure to add that comment when we get to it. But thinking about it now, there are very few cases where the full backoff makes sense for a static pod, because it's the administrator's responsibility to make sure that... the point of the static pod is to be running, and our full backoff...
L
In that scenario... sorry, I do see a hand up, David, but I'll briefly respond to Clayton here. That's an area where, if we had just...
C
J
Yeah, if we dramatically improve it, it may not even be necessary. I just keep finding ways that static pods don't do what people use them for, which is running critical system components that must come back, sometimes in order to make sure that the cluster comes back. So if we pick a good default... I'll just add that as, I think it'll be a goal on whatever KEP comes up: with static pods, you don't have to think about static pods doing the right thing, they just do the right thing by default.
H
Like, one option is to improve the situation for everything: just confirm that we don't waste too many resources on restarting, and just remove this five minutes for everything. And another option is to have an API that allows you to toggle how much it is, like whether it's benign or not.
L
H
B
C
I hope we address this properly instead of just simply getting rid of that five minutes, right? Thanks. That's... maybe, I think, if you want to do that, you don't need to come to the SIG, and that's one of the bad things, because you could do that while using, I don't know, the kubelet config. We could expose that one, right? So then you could handle that in the node config, but that's a knob that simply will apply to all the content, all the pods. I'm not sure.
C
L
C
So that's why I hope we address this properly and introduce a different policy. I think this is a good thing to start this conversation. We all agree on this part. This is not the first time we've heard those complaints, but it's the first time people have given us more detail about the use cases, the workflow, and the process. Let's leverage this situation and make this correct.
C
G
L
G
It's been a really useful, interesting conversation. One thing I was curious about: I did put it in the chat and I didn't speak up, but I was trying to think it through mentally.
G
If you had total control of your Linux host, what would you have wanted the behavior to be? So I just went down to the init system we're all commonly working on here, systemd, and asked what Zach's unit file would look like. And I kind of wonder, in your scenario, if you had the ability to specify a start limit burst budget for your pod, whether that would have been what you would have used in your use case. And either way, if you weren't familiar with that field, if you want to Google it and come back and say, "oh man, that's the missing thing"...
G
...I would have wanted. I'm just curious if that is the scenario that we're lacking in kube right now.
G
So basically, if you have a systemd unit with a restart policy of always, and you are failing on startup, systemd will try to rapidly restart you as fast as possible, over and over again, up to that burst value. And then, once that burst value has been reached, let's say it was 10, you then get put into the backoff period; you get put into that start limit interval-seconds wait. Whereas in kubelet we put you automatically into a backoff period, and it feels like what you're wanting is...
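The systemd behavior described here can be sketched as a unit file. The values and the `myapp` binary are illustrative stand-ins; the relevant directives are `Restart=`, `RestartSec=`, `StartLimitBurst=`, and `StartLimitIntervalSec=`.

```ini
# Illustrative unit: systemd restarts the service immediately (RestartSec=0)
# until StartLimitBurst failures occur within StartLimitIntervalSec, after
# which systemd stops attempting restarts and the unit enters a failed state.
[Unit]
Description=Example service with a restart burst budget
# Allow up to 10 rapid start attempts within a 60-second window.
StartLimitBurst=10
StartLimitIntervalSec=60

[Service]
ExecStart=/usr/local/bin/myapp
Restart=always
RestartSec=0

[Install]
WantedBy=multi-user.target
```

Note one difference from the spoken description: once the limit is hit, systemd does not resume on its own backoff schedule; it gives up until the interval window passes or an admin runs `systemctl reset-failed`, which is part of why the comparison with kubelet's always-on backoff is interesting.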
L
I think that's possible, but for the case where it's crashing, you know, every couple of minutes, it might... I'd have to look at how systemd handles that, because it's possible it's maybe that plus the tunable for when the timer resets, because if the timer doesn't reset and I crash every couple of minutes... sorry. But I think these are the kind of details we can flesh out. I actually...
H
J
L
Yeah, and I don't know if any of the cases we have would be particularly grumpy about the base, like the 10-second one. I mean, on the kcp side I certainly didn't like the 10 seconds when it came up, but it's better than the hot loop for sure, so I think there's some discussion there for sure.
A
Yeah, great. We are over time; we still have some topics, we can carry them forward to next week. Thanks for joining, folks.
H
And among the topics there are some linked cherry-pick PRs that need approval. If somebody has time, please go through them. Yeah, I'll...