From YouTube: Kubernetes SIG Node 20230905
Description
SIG Node weekly meeting. Agenda and notes: https://docs.google.com/document/d/1Ne57gvidMEWXR70OxxnRkYquAoMpt56o75oZtg-OeBg/edit#heading=h.adoto8roitwq
GMT20230905-170609_Recording_1920x1096.mp4
A
Hello, hello. It's September 5th, 2023, and this is the SIG Node weekly meeting. Welcome, everybody. I have a few items on the agenda. Let's jump right into it. Parker, are you here for the first item?
B
We have discussed this in the issue, and currently we think this is a feature request for the container-level CPU static policy design. The current design only covers guaranteed pods, but in this case, if the sidecar or the other containers in the pod are not guaranteed, but one of the containers is guaranteed, with an integer CPU limit and request, should we do the same thing for those containers?
B
The proposal is to add a new static CPU policy, like a container-level static policy.
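For illustration only, a pod of the kind being described might look like the sketch below: one container with equal, integer CPU requests and limits sitting next to a burstable sidecar, so the pod as a whole is not in the Guaranteed QoS class and the static policy ignores it today. All names here are hypothetical.

    apiVersion: v1
    kind: Pod
    metadata:
      name: mixed-qos-pod              # hypothetical example
    spec:
      containers:
      - name: pinned-worker            # integer CPU request == limit: the container
        image: example.com/worker      # that would get exclusive CPUs
        resources:
          requests:
            cpu: "2"
            memory: 1Gi
          limits:
            cpu: "2"
            memory: 1Gi
      - name: logging-sidecar          # no limits: makes the whole pod Burstable,
        image: example.com/logger      # so no container gets exclusive CPUs today
        resources:
          requests:
            cpu: 100m
            memory: 128Mi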
D
I just want to jump in for a little bit. Actually, Francesco and I were commenting on this issue, and I think one thing we can definitely evaluate is either a policy or a policy option for this, but I would like to understand why splitting these containers into separate pods is not an option. And I think the other concern I have is the implication around guarantees: if you have a pod with a guaranteed container and a non-guaranteed container, and you allocate exclusive CPUs to the guaranteed container, the expectation would be that it's a workload that's consuming those CPUs, a performance-sensitive workload or something really important that we want to keep for a long time. But because this particular pod is not going to have the Guaranteed quality of service, it could be evicted by another pod that has higher priority. So that's really the only concern I have, but I think in general this could be a useful feature.
E
I think this also matches a recent request we got, where you have a pod with two different kinds of processes: some are always pinned, and the other ones aren't running all the time. So if we assign CPUs to the ones that aren't running all the time, we are wasting resources. So we need a way where some processes are pinned and some processes aren't. I think this intersects that kind of use case.
D
Yeah, absolutely. Actually, one of our team members is working on that kind of case, and the use case is that one application is performance-sensitive and the other one is maybe logging or something that's doing stuff in the background. So it makes sense, but I'm still not able to visualize how we will solve the overall priority problem.
E
Maybe we keep the whole pod guaranteed, like in the Guaranteed class, even if a sidecar is not using full CPUs. That could be one way.
A
You mentioned that even the worker pod may not need the whole CPU reserved? Yes.
E
To adjust the definition: I think we have a class of pods where some things need full CPUs, but then they have helper processes that don't necessarily need full CPUs. So can the Guaranteed definition be expanded to include that? And we have some way... maybe they can be sidecars: even when they are not using full CPUs, you still stay guaranteed.
F
That's partially the reason why sidecars are used. If you want exclusive CPU allocation, usually it's a process which is doing, well, like a full busy loop, and those processes are usually not really happy if something else starts to consume the same CPU cores. So having sidecars on a separate, probably shared, CPU set is a better way.
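For context, the static policy and its options are configured on the kubelet. In the sketch below, cpuManagerPolicy and cpuManagerPolicyOptions are real KubeletConfiguration fields and full-pcpus-only is an existing option, while the container-level option named last is purely hypothetical, included only to show where such a switch could live if this landed as a policy option rather than a new policy.

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    cpuManagerPolicy: static
    cpuManagerPolicyOptions:
      full-pcpus-only: "true"                # existing option
      container-level-exclusive: "true"      # hypothetical option for this proposal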
A
Yeah, for me it's not one or the other. It's just: do we need an option to switch between two behaviors, or do we just need to pick one behavior and have it as the default, not configurable? Because the more we discuss exclusive CPUs and sidecars, the more I think it certainly makes sense to have sidecars, by default, not using the same resources.
A
Thank you for bringing it in. Kevin, you are next.
A
Thank you, Ryan.
C
Yeah, can you give me share access?
C
Personal... okay, good. Okay, so normal pod termination usually allows a 30-second grace period for the pod to terminate, and if the pod spec sets terminationGracePeriodSeconds, then that termination grace period is honored during pod termination. So the kubelet comes around, sees that a pod needs to be terminated, sends the initial SIGTERM to the pod, waits for that termination, and if the pod hasn't exited yet, sends the SIGKILL. That's the normal lifecycle for a pod. With graceful shutdown, we intend the pod to be gracefully terminated, and we want static pods, DaemonSets, and ReplicaSets on this node shutdown, via the systemd signal, to gracefully terminate the pod. I included a link to the documentation there for everyone who doesn't know about it.
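For reference, the grace period being described is the pod-level terminationGracePeriodSeconds field; a minimal sketch with placeholder names, using the hour-long value that comes up later in the discussion:

    apiVersion: v1
    kind: Pod
    metadata:
      name: slow-shutdown-pod              # placeholder name
    spec:
      terminationGracePeriodSeconds: 3600  # pod asks for up to an hour between
      containers:                          # SIGTERM and SIGKILL
      - name: firmware-updater             # e.g. the firmware-update case mentioned below
        image: example.com/updater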
C
OpenShift specifically runs CI with all of its configuration in place, and so we would like to set the node graceful shutdown to an hour, but then have the pod gracefully terminate correctly. And upstream CI is not running the end-to-end tests with graceful shutdown on, so the test pods running in CI would wait an hour to terminate, and we're getting timeouts on the CI side of things when doing the end-to-end tests, when arguably we probably should not. And so I proposed a fix here in this PR: that we do send a 30-second shutdown sequence to the pod.
C
This would allow pods to do their normal shutdown sequence, and in the worst case, we set the graceful termination to the node's configured shutdown grace period. This means that, with the configuration items that we have for graceful shutdown set to an hour, the pod would terminate in either 30 seconds, or, if the termination grace period was set on the pod, it would wait that period of time, or, in the worst of cases, it would wait the entire hour to terminate that pod and finish its reboot. And so this is somewhat different from how the feature was originally conceived, I think, but I believe it's correct for how we should go forward with it.
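For reference, the node-level knobs being discussed live in the kubelet configuration; a minimal sketch, assuming the hour-long setting described above (the split reserved for critical pods is illustrative):

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    shutdownGracePeriod: 1h                  # total time the node delays shutdown for pods
    shutdownGracePeriodCriticalPods: 10m     # portion of that reserved for critical pods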
C
I'd like to see this PR go in, and I think we need to improve the end-to-end tests, excuse me, in upstream Kubernetes to enable graceful shutdown, so that we can test end to end with it as well. Go ahead, yeah.
I
Yeah, yeah, I just wanted to mention... first of all, thank you so much for putting together the presentation and all the info; I think it's super helpful. Yeah, I think when we conceived the feature we were probably looking at, and kind of anticipating, use cases with much shorter grace periods, you know, like 30-second shutdown periods.
A
What type of value should you set, right? And I guess there's two options there: you either use the maximum assigned, which sounds like what we do today, or you just use 30 seconds, because it's the default for regular termination, so I think that makes sense. I think maybe the counterargument for using the longer one is: maybe you would expect the application to be well behaved and exit early, but I guess you can't assume that?
I
You know, it's application specific. So overall, yeah, I'm definitely supportive, I think, of using the 30 seconds if the pod doesn't specify anything. It sounds like the right call to align with normal termination.
C
Okay, that sounds good. Our use case is that we have some pods that customers are running for firmware updates, and so that's why they need such a long termination period.
H
Ryan, thanks for putting this together. Actually, this explains why I saw some production issues and customers complaining to me, saying: oh, we only have batch jobs, but I found that every time you upgrade, I have to wait for one hour at least, especially on the large clusters, so the upgrade takes forever from the customer's perspective. I kept suggesting to use the graceful shutdown grace period.
A
Graceful termination? Or you've seen it with some kind of upgrade? Because I think a similar issue may happen with any drain scenario where we set a graceful period.
C
Maybe somebody... maybe Peter's on the call and knows for sure, if CRI-O doesn't.
H
I think the kubelet decided on that 30 seconds; Kubernetes tried to be really conservative. That's why a lot of places in the old days even used something like four hours, and so then we introduced one hour. So for 30 seconds, I don't remember at all.
I
Yeah, the reason I'm asking, I'm just curious, because to answer the other question, whether it's covered for drain scenarios and so forth, right, I think that depends on where it's done, right? Because all the drains will probably go through the eviction API, which is on the API server side, right? So if that's done there, then I would imagine the defaulting would also happen. But if it's not, if it's purely via the kubelet, like graceful shutdown, then we don't have that defaulting, right? So yeah, it would be good to understand that, but we can look at that later.
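For reference, a drain goes through the eviction subresource on the API server, which is where grace-period defaulting can apply; a minimal sketch of such a request, with a placeholder pod name (when gracePeriodSeconds is omitted, the pod's own terminationGracePeriodSeconds is used):

    apiVersion: policy/v1
    kind: Eviction
    metadata:
      name: example-pod                # placeholder: the pod being evicted
      namespace: default
    deleteOptions:
      gracePeriodSeconds: 30           # optional override for this eviction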
C
Yeah, thanks for the review. I just wanted to bring this to everyone's attention and get some reviews on the PRs and the issue.
A
Thanks, Ryan.
C
Yeah, I can do that, and the code here is only in the node shutdown, well, the graceful shutdown files, but I'll take an action item on the other section. Thank you.
H
Actually, there's one more topic. Do you want to talk about that API thing? I kept seeing the item on the SIG Node agenda, but it never got scheduled.
J
Oh yeah, sorry. So, just real quickly: we have a new KEP that is a follow-up from what we discussed last week, and it covers a new API in the kubelet that would return pod information, specifically readiness information. Just a heads up that it has turned from a doc into a KEP, and we would appreciate any reviews.
A
Yeah, yeah. The last topic we have is a follow-up on in-place VPA plus core bindings, I'm not sure.
K
Yeah, Jackson's here. Yeah, right. I think two or three weeks ago our teammate shared the proposal with the community, and we split the problem into four sub-issues, and for some of them we have already created PRs and one KEP, and we will try to sync with the community, follow up with the community again on the next steps.
K
Do you think we should have some owners from SIG Node, or do we have some other process? What are the next steps?
K
You mean what kind of announcement? So right now we just moved the individual sub-issues into a PR or a KEP. I think only one currently needs a KEP, and for the rest we just sent out the PRs, and we wrote the descriptions and the problem statement in the PRs as well. Do you think we need, like, an additional announcement, talk, or something else? No?
H
We had that topic for discussion, if I remember correctly; one of those things, I think, we discussed last time, and we did explain there's interest in those usage, those use cases.
K
Yeah, okay! So, since there's a kind of important planning meeting next week, if I understand correctly, do you think it's better for us to participate in the discussion?
K
Information? Okay, no problem. So, for the KEP one, we filed the issue and the corresponding PR. Do you think that's enough, or do we need a separate doc?
K
I guess... right, right, yeah. I gave the issue links in the documentation, and they usually have the PR associated, yeah.
A
And I placed the agenda item for the next item: KEP planning. So let's do KEP planning next week. I don't think we will have time for many topics beyond the planning.
A
Yeah, if you own a KEP and you have a PR to update it to 1.29 or anything like that, try to flag it so we can track it. It will be best to have a lot of things cleaned up before the meeting, so we will go faster. If you own something, please send a PR early, so we can target it for the release.