From YouTube: Kubernetes SIG Node 20220621
Description
SIG Node weekly meeting. Agenda and notes: https://docs.google.com/document/d/1Ne57gvidMEWXR70OxxnRkYquAoMpt56o75oZtg-OeBg/edit#heading=h.adoto8roitwq
GMT20220621-170431_Recording_640x360
A
Good morning, everyone. Today is June 21st, and thanks for joining. This week we have like nine topics, so I know everyone's busy. So let's go over our agenda.
A
I saw the KEP and I did review that one. Kevin, do you want to talk about the socket alignment CPU Manager option?
A
Okay, so I just reviewed it before this meeting, and I will continue and finish the rest of the review, but I think the review that's already there is thorough, and I agree with it. So it looks like we are going to approve it after this meeting and move forward. This is an alpha feature, an alpha option for existing features, so yeah, sure, it makes sense. Do you want to talk about the next one?
B
So I put together a list of a few of them. I think the first, second, and fourth ones are like minor updates to the KEPs that need approval, like the milestone or the testing section. Derek is out this week, so we're going to ask you, if you can, to take a quick look and add an approve on those.
A
I think before this meeting I had already approved two of them. For the first one, the pod has network condition one, there is still a review going on, so please ping me once it's ready, since the due date for that is Friday.
C
Dynamic resource allocation is not in that list. That's the PR for KEP 3063.
C
So Tim really encouraged everyone, me as the author and Aldo as the core reviewer, to consider merging it as provisional and then taking it from there. We are fairly close to having something that we all agree on for the API and the scheduler, but we need further review, also from someone from SIG Node, to move it to implementable.
B
So I know Derek was reviewing it. I'm not sure if he got a chance to comment, but he's out this week. I can make a pass, but I'm not sure if I'll be able to approve it entirely from SIG Node's perspective.
C
Yeah, it's mostly about the new pod API and the scheduler, and then at the very end the kubelet consumes some of the new information. We intentionally simplified it compared to what Derek reviewed in 1.24. So now the kubelet, for example, doesn't need any additional permissions; it doesn't need to modify any additional objects as part of this proposal anymore, which should make it simpler for SIG Node to review and approve, because it's really simple in that regard.
A
We have a bunch of people, and Kevin also can approve, but the problem is that the goal is not just approval, right? We also want to make sure that it meets the requirements.
A
Thanks for volunteering to take another review. Unfortunately, Derek has the most context here, because we assigned this one to him to review, so I don't have it all. But Tim also talked to me. I understand Tim approved, but he also raised his concerns to me, and he hopes our SIG can hold the bar here. So that's why I want to make sure, as you said, that we hold the bar and make sure dynamic resource allocation is the feature we want.
A
I mean, the SIG has wanted to move this forward for the last couple of years, right, and every time it has been pushed back, but this time it looks more promising, so we do want to. As I mentioned, Tim had a lot of concerns before his approval, so he talked to me and I gave him my reasoning. I want to move this forward, but that doesn't mean we are going to sacrifice everything, right: the reliability and the maintainability, just to move it forward.
A
I
just
want
to
make
sure
we.
We
understand
that
there's
something
because
we
signal
the
community.
We
have
weeping
suffer
or
maybe
in
a
criticize
the
real
happiness
here
and
the
people
want
a
certain
feature
and
desperately,
but
the
but
the
in
the
reality.
We
have
also
because
those
once
a
while
we
because
the
velocity
we
treat
off
the
relapenator,
so
that's
also
better
us
a
lot
yeah
and
yeah
Patrick.
We
understand
this.
A
We
want
to
move
forward,
but
unfortunately,
Derek
have
the
full
entire
context
from
the
beginning,
and
it's
not
here
yeah.
We
also
have
like
the
19
different
couple
in
this
unique
cycle.
Lastly,
and
I
just
want
to
say
that,
and
and
and
there's
the
review
or
pandemic's
problem,
and
also
the
approve
of
enemies,
problem
I.
C
I checked with my colleagues. For example, Sasha agrees that class-based resources, which I think you discussed last week and which sounds similar based on the title, is actually doing something else, and that it is less important than dynamic resource allocation. So if you need to choose, from our side you are fine to pick dynamic resource allocation and postpone the other one, if you don't feel comfortable doing both.
B
Yeah, I think what we're saying is that we will try, but Derek has the most context, and we have a lot on our plate, and we want to make sure that we get it right. So we'll try, but it may slip.
G
Yeah, I did a quick update for the pod status conditions KEP, which was I guess the third entry. The quick update on that is that Derek did a full review on Thursday, and he also spoke to me and suggested a specific name that he thought better aligned with what the KEP is trying to achieve, which is PodHasNetwork, instead of surfacing sandbox-related concepts.
G
So the KEP has been updated to basically integrate all that feedback, but I think Derek didn't get a chance to look before he left. Is there any way we can get another review and try to kind of get it approved for this cycle?
H
Oh yeah, so I got a chance to go back over the recording from the kickoff we did for the reliability work, and I did a quick summary of basically where we ended up, which is to say we generally agreed that we have stuff to do. Obviously, part of that was trying to figure out what we mean when we say things are unreliable.
H
So as part of that, we also found a few areas where we want to improve things. Part of that is improving the contract.
H
Testing around CRI: I've started opening issues for that, so both testing CRI implementations, but also testing how the kubelet will respond to failures in the CRI or in gRPC. And then also, we want to use tests to document what exists today, to avoid shipping regressions without knowing about it and to actually be aware when we are shipping behavioral changes, because today it's fairly easy to make a change to the kubelet that will not break any tests but will change behavior, and that's quite scary, especially when it has spiraling issues with scheduler interactions and other stuff.
H
And then there's an extra area, which is clarifying where we're actually just lacking features for things that can cause failures. Some of that is like when disk accounting is really slow, or we run out of memory and then things spiral out of control. We should be better at documenting that proactively, rather than just having, you know, a bunch of unspecified things that can go wrong.
B
See, the I/O one is bad. Usually it results in timeouts between the kubelet and the container runtime when they start getting starved, so I think a good way to handle that one is maybe having some metrics, like in the node exporter or something, that can catch it fast enough.
B
If
there
are
like
folks
across
across
companies,
I
want
to
collaborate
on
something
that
we
can
do
that
it
will
help
because
I
know
that's
been
tricky
and
we
did
a
bunch
of
workarounds
in
cryo
to
better
handle
that,
but
basically
like
it
depends
on
how
you
can
configure
your
node
right.
Okay,
if
you
configure
it
with
cryo.
These
are
the
kind
of
Errors
you
expect
if
you've
configure
it,
but
containerdy
you'll
see
different
errors
and
so
on.
So
maybe
a
better
alert
thing
may
be
useful.
There.
D
On the GKE side, we've had a lot of issues also with just throttling and disk-related stuff, and yeah, it would be great to work more on those issues. I think there's also work, for example in NPD (node-problem-detector), that we can do to detect it and maybe surface it as a node condition, so users are clear that, you know, maybe the node should not get new pods scheduled on it.
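A minimal sketch, assuming the Kubernetes Go API types, of the kind of node condition such a detector could report; the "DiskIOPressure" type and the reason string are illustrative placeholders, not existing upstream or node-problem-detector conditions:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical condition an NPD-style detector could set on the Node object.
	// Reacting to it (for example keeping new pods off the node) would still need
	// a separate mechanism such as a taint; the condition only surfaces the state.
	cond := corev1.NodeCondition{
		Type:               corev1.NodeConditionType("DiskIOPressure"), // placeholder name
		Status:             corev1.ConditionTrue,
		Reason:             "SlowDiskAccounting",
		Message:            "disk I/O is heavily throttled; container operations may time out",
		LastTransitionTime: metav1.Now(),
	}
	fmt.Printf("%+v\n", cond)
}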
D
Etc. Yeah, I think one of the most challenging parts right now, in the SIG, like in the kubelet and so on, is that a lot of the contracts are not super well defined.
D
If we look at 1.22, we had the pod lifecycle refactor, for example, and it changed certain things around pod status updates, basically: when the pod status update is delivered, or, for example, whether there is an IP on the pod status when a pod is terminated, stuff like that. So that stuff is all a little bit ambiguous today, both what it is and what it should be, right?
D
Should there be an IP on a terminated pod or not, right? We're kind of relying on the existing behavior to capture that, but unfortunately there are a lot of controllers and other things in the ecosystem that have started to rely on these things, and changing them can cause downstream effects. So I think one of the big areas we need to work on is defining those contracts today, actually documenting what is important and what is not important, and making sure we have tests for them.
A
For what we expect on the CRI side, you want to have it clearly defined. Even where we did define it, it looks like we might not have sufficient test coverage, right, to check the pod status, the conditions, and all those kinds of things we are missing.
A
For example, you just mentioned pod termination. I saw tons of issues where the pod is terminating, but it could be that it has already died, or maybe the container is still running, and there's no clear way to explain that. And beyond what you proposed for the layer between the kubelet and the CRI, we also have another layer, which is basically between the API server, or the control plane, and the kubelet, on top of the pod API.
D
Exactly, yeah. I mean, just to take a concrete example: when a pod is terminating, in the pod status after the pod is terminated, do we expect an IP or not? That actually was a behavior change in 1.22, for example, and I don't think we ever really thought too much about whether it should have an IP or shouldn't have an IP.
D
I don't think the answer is too important, but the fact that we're consistent matters, because controllers and other things have started taking advantage of that. So when we broke that in 1.22, it caused downstream breakage, for example in the endpoints controller and stuff like that. So definitely I think there's work to be done at that layer, the pod API between the API server and the kubelet.
H
That part is not particularly well thought out yet. I'm not sure whether we want to just bundle it into the CI testing subgroup or if we want something more aligned with the main SIG stuff, so I haven't spent a lot of time thinking about that part of this; I'm just trying to document what we want.
A
Yeah, I think the next step is that we need to have an actionable plan, right. For that action plan, I think the original thought is that the CI testing project drives this one, because from the previous stage to the next stage we need more test coverage, right. We have kind of made some progress on deflaking the tests, and so now we want to continue.
A
We found that there is missing test coverage. Maybe some of it was there at some point, for example some stress tests that we talked about, but we don't run those tests anymore. So we should try to add those back and update them, because some of them do not reflect today's situation. We will try to re-add them, but we need a plan for how to do that, and it needs to be actionable so people can share the work, right. So we need a central place, a central doc.
A
Yes, please, yeah.
I
All right. So here, in SIG Apps and the Batch Working Group, we are suggesting to add some rules for terminating or continuing to retry Jobs, and the natural API for this seems to be exit codes, but that's not enough in all cases. For example, you could have a lot of failures due to infrastructure errors or system errors.
I
For example, if the node completely goes down, in that case there's no kubelet, so there wouldn't be an exit code in the pod status, and things like that. So, and this is why SIG Node is involved, part of the proposal is to standardize all of the infrastructure and system errors in Kubernetes into a single pod condition that every controller can write. For example, if the kubelet does an eviction...
I
...it would also write a condition saying this is a system-initiated termination, with status True. If kube-scheduler does a preemption, it would write the same condition. If the pod garbage collector detects that the node is gone and has to delete the pods that are orphaned, same thing.
I
They would all have this condition. That's basically the proposal. On the kubelet side, we have identified that the kubelet sometimes writes a status reason, in the cases where there is an OOM kill or where there is an eviction. So, in addition to adding a reason, we are looking to add this standardized pod condition.
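A rough sketch of what that standardized condition could look like, using the Kubernetes Go API types; the condition type and reason strings below are placeholders, since the final names were still being settled in the KEP at this point:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Placeholder names: one standardized condition type that the kubelet (eviction),
	// kube-scheduler (preemption), and the pod garbage collector would all write,
	// so a Job controller can tell system-initiated failures from application failures.
	cond := corev1.PodCondition{
		Type:               corev1.PodConditionType("TerminationBySystem"), // placeholder
		Status:             corev1.ConditionTrue,
		Reason:             "EvictionByKubelet", // placeholder; e.g. preemption, node gone
		Message:            "pod was terminated by the system, not by its own workload",
		LastTransitionTime: metav1.Now(),
	}
	fmt.Printf("%+v\n", cond)
}

A Job controller could then filter on this condition to decide whether a failed pod should count against its retry policy.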
I
So that's the proposal. The KEP is pretty much ready, and there are a few details to flesh out, but I wanted to bring this up to your attention to see if you think this approach seems correct or if you have any other suggestions.
D
One question I have, and maybe it's already answered there, but I wanted to just ask: how do you handle cases where pods are killed outside of, you know, the standard Kubernetes API, like an OOM kill, or maybe someone just SSHes to the node and deletes it directly from the CRI or something like that, right, stops the pod sandbox through the CRI? The idea is that the kubelet is going to reconcile that state, and it probably won't have an exit code at that point, right, because we won't have watched it.
I
So, for the case of an OOM kill, I think it's already handled by the kubelet, where the kernel kills the pod, so we just want to hook into that logic. I'm not sure what happens if a user SSHes in and removes the container manually, yeah.
I
Actually, in the case of a complete VM failure, there is the pod garbage collector in the kube-controller-manager, so we would add the logic for that scenario there. But I don't know about the CRI, what happens if you remove the container from the CRI.
I
Right, so it sounds like we should be able to tap into that logic and add the condition as well, right?
A
My understanding is that this is the problem we are actually trying to avoid. I mean, like what was said earlier, all of the controllers, and the kubelet is also one of the controllers here, right. So basically the higher-level controller could, based on that failure, determine whether it's going to retry or not retry, or maybe move the work to a different node. We could have more intelligence there.
I
The other side of the equation is: how do we make it standard in Kubernetes that for certain failures, you know, the pod didn't finish successfully, but it also didn't fail because of software bugs; it just failed because there was pressure, the node was gone, there was no more space, or kube-scheduler preempted the pod. All of those errors we want to standardize into a single pod condition that we can filter against.
A
This is a really useful feature. I'm just a little bit concerned because, I know, you pinged me last week. The problem is, as you heard earlier, we have so many KEPs already going on here, and a few people are already out of office, so my only concern is reviewer bandwidth and also approver bandwidth. It is an important feature, I can say that obviously, and the other concern is that this is also an API change and a standardization.
A
Yes, the only problem is that I have to look at the KEP. My concern is how Kubernetes already handles this: if you want to standardize, you either go with what Kubernetes already does, or, if you want the kubelet and the node to change to a new way, then we need to communicate that to existing users, because they may be relying on the previous handling for their jobs.

I
I don't want to remove whatever is there, right; I just want to add a new pod condition for this, for all of this.

A
Okay.
A
Last one, Adrian: do we want to talk about the container checkpointing one? Yeah.
F
Yes, yes. So I got a couple of reviews from different people, thank you very much everyone, and the question now is whether it needs any more reviews. I think Mrunal mentioned that he will look over it next week, so I just wanted to bring up the status of the feature here.
B
I
can
make
a
last
pass
Adrian,
and
then
we
can
get
it
up
to
how
much
yeah
okay
I.
D
Well, Adrian, just a quick question, maybe, on the runtime support: what's the status there? I'm pretty sure you've been working on the CRI-O side yourself, right, but on the containerd side, is there any support yet, or is that still pending? Yeah.
F
So I opened a pull request, basically to wire the new CRI call through to containerd, and so the pull request exists. The pull request has no unit tests yet, so I need to add those to the containerd pull request. The bigger problem on the containerd side is that I need to think about a format for storing the checkpoints and how to use it in containerd, because currently containerd has no support which fits what we thought about here, but it should be easy to come up with something. So basically, at this point, containerd is able to create a checkpoint; we just need to put it in the right file, and then it should be done. It looks like most of the infrastructure in containerd already exists, so it's not that big a chunk of work.
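For reference, a rough sketch of what calling the proposed CheckpointContainer CRI RPC from Go might look like; the socket path, container ID, and archive location are assumptions, and the request shape here (container ID, archive location, optional timeout) is my reading of the proposal rather than a confirmed final API:

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1"
)

func main() {
	// Assumption: the runtime listens on the usual containerd socket and already
	// implements the new CheckpointContainer RPC (which is what the PR wires up).
	conn, err := grpc.Dial("unix:///run/containerd/containerd.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	client := runtimeapi.NewRuntimeServiceClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	_, err = client.CheckpointContainer(ctx, &runtimeapi.CheckpointContainerRequest{
		ContainerId: "abc123",                          // hypothetical container ID
		Location:    "/var/lib/kubelet/checkpoint.tar", // hypothetical archive path
		Timeout:     0,                                 // 0 means use the runtime default
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("checkpoint archive written")
}

At the time of this discussion the containerd side of this call was still an open pull request, so a sketch like this would only work against a runtime that has merged the new RPC.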
D
Okay, cool. Yeah, we're happy to help, feel free to ping us on the containerd side; happy to test it out. Cool, thank you.
A
Thanks, Mike and David. And with that, that's everything on our agenda. Any other topics people want to bring up?