From YouTube: Kubernetes SIG Node 20220308
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A: All right, well, welcome everyone to the March 8th, 2022 edition of SIG Node. We'll kick off our meetings as we usually do here. Sergey, do you want to give an update on where we are with overall velocity for the past week?
B: Yeah, this week was very slow. We're still fighting some bugs; I think one of them is on the agenda today, and bugs create many PRs, so we're growing on PRs. Some of them are works in progress, so I'm not that concerned about it; we're merging on a regular basis. Maybe some of ours were closed. I looked through all of them, and nothing is getting lost as rotten or anything like that, so nothing is starving. So very good work.
B: Everybody, thank you for keeping up the pace. As I said, there are many work-in-progress PRs and PRs that need review, so if you have time, look at the ones that need review.
A: All right, very cool. I guess that can lead into our next topic, which is where we are on the release and what we think is relatively on track or not. I saw you had a list of items there; I don't know if you want to give any highlights on each, or if folks on the call are concerned that their item's not there.
B: Next one is the credential provider. I saw the PR that is there, but I think the KEP is calling for more work, especially for integration testing, and I don't think this is being worked on by anybody.
B: Last release we cut it out of the release because of a lack of activity. I wonder whether we need to do the same this release, because similarly nobody is working on it. [Name unclear] was working on it, but then she stopped, and Andrew... I'm not sure how different it is right now.
A
Yeah
I
had
worked
with
andrew
most
on
this
in
the
past
so
and
I'll
ping
him
and
see
if
he
has
cycles
or
not
but
yeah.
I
guess
it's
the
continuing
barrier
to
fixing
the
attitude
cloud
provider,
migration.
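For context, the kubelet credential provider being discussed is configured through a CredentialProviderConfig file handed to the kubelet via its --image-credential-provider-config and --image-credential-provider-bin-dir flags. A minimal sketch follows; the provider name, image pattern, and cache duration are illustrative assumptions, not values from the meeting.

```yaml
apiVersion: kubelet.config.k8s.io/v1alpha1
kind: CredentialProviderConfig
providers:
  # Hypothetical exec plugin the kubelet invokes for images matching the pattern.
  - name: ecr-credential-provider
    matchImages:
      - "*.dkr.ecr.*.amazonaws.com"
    defaultCacheDuration: "12h"
    apiVersion: credentialprovider.kubelet.k8s.io/v1alpha1
```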
B: Dockershim removal: done. We still have a few dockershim [presubmit?] PRs, but Mathias is working on those, things like CI, Windows, and storage, to remove those.
C: Yeah, I can represent this one. I'm not sure if Peter's on the call; we chatted about this one, and it's kind of in progress. I think the issue is we're thinking of making it beta, but to make it beta we kind of need to turn on the feature gate by default, and I don't think we're ready to do that fully, because some of the pieces of the KEP, like the alternative metrics in the CRI, are not fully implemented yet, like a new Prometheus endpoint.
C: So we might track this one as alpha for now and see how it goes, because we aren't really fully ready to turn on the feature gate. We had some offline discussion about that one; we'll probably update the KEP to reflect the latest status there.
B: Okay, I will fix the comment later. Thank you for the update. Swap: I think I talked to Elana before she went on vacation, and to other people who were interested in doing that work, so it seems that swap is not happening this release because of a lack of contributions.
B: Let's move on; I will comment on the KEP to cut it from the release. Priority-class-based graceful node shutdown: is David here?
C: Yes. So this one, I believe Mrunal and I are tracking. I believe the KEP has been merged, and I believe I saw the implementation and I reviewed it. I think it just needs some approvals, so yeah.
D: It's on my list; I'll get to it this week.
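For reference, the priority-class-based graceful node shutdown feature being tracked here is configured on the kubelet. A minimal sketch, assuming the shutdownGracePeriodByPodPriority fields from the KEP; the priority thresholds and grace periods below are made-up example values.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Assumes the GracefulNodeShutdownBasedOnPodPriority feature gate is enabled.
shutdownGracePeriodByPodPriority:
  - priority: 100000               # e.g. pods in a high-priority class
    shutdownGracePeriodSeconds: 20
  - priority: 0                    # everything else
    shutdownGracePeriodSeconds: 60
```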
B: So we're keeping it in the release. And the last one, gRPC probes: I sent the PR for that, and yeah, it's just waiting for API review. It's past some deadline, but I would suggest we keep it anyway; it's really straightforward.
B: And those parts can stay work in progress; whatever is left is a small PR, and it's doable. I think we can keep it in.
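For reference, the gRPC probe being discussed lets the kubelet call a container's standard gRPC health-checking service directly, instead of going through an exec or HTTP shim. A minimal sketch of a pod using it; the image, port, and timings are illustrative assumptions, not details from the meeting.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: grpc-probe-demo
spec:
  containers:
    - name: etcd
      image: k8s.gcr.io/etcd:3.5.1-0
      command:
        - /usr/local/bin/etcd
        - --data-dir=/var/lib/etcd
        - --listen-client-urls=http://0.0.0.0:2379
        - --advertise-client-urls=http://127.0.0.1:2379
      ports:
        - containerPort: 2379
      livenessProbe:
        grpc:
          port: 2379          # probes the gRPC health service on this port
        initialDelaySeconds: 10
        periodSeconds: 10
```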
B: In-place vertical scaling: I think Vinay commented here. He's not on the call, but he said that he has some other priorities; he'll start working on Derek's comments after this coming Friday, so there is still time until the 29th, so I think we can keep this one.
B: Container checkpointing: I saw his PRs, yes.
B: Perfect. And image pull secrets?
E: Yeah, sure. It's a complicated subject, so it probably needs an offline call to go into the details, but suffice it to say we split the feature up into two phases, and the second phase was going to include persisting the information. Unfortunately, to persist the information we need some changes, probably in the container runtime and in some other policies within Kubernetes for image pulling, because we can't just persist a list of, yeah, the images that you have successfully...
E: ...you know, pulled, and/or the references of those images, alongside a hash for the secrets, because that would make the secrets even more vulnerable than they are today. So we need a way to, you know, checkpoint it in a secure way, probably encrypted or something, because that code is probably going to be exposed. So we need a new policy; we don't usually store...
E: ...you know, checkpoint secure information, right? So just checkpointing it needs a KEP, basically, and that's why we're splitting it into two phases. The current phase only works if you, or at least, you know, with respect to the expectations of the current KEP and this PR, it only works if you require garbage collection to have occurred on the images.
E: If you haven't garbage collected those images, then we won't know, because we're not persisting whether or not the image was pulled with a secret, or was loaded into the container runtime directly, or in the past had been pulled with a secret, because we're not persisting the past information, right. So Jordan, you know, correctly pointed out that, you know, the bar here is doing persistence, and I agree, we need to do persistence. We just have to decide...
E: ...what we want: is phase one still viable? Are we okay with using an alpha feature gate and telling customers that, if they want to test this gate, we suggest that they, you know, in fact do garbage collection of the images, and otherwise not trust any images that are currently cached?
A: Mike, could you help me out for a second on that one? I remember reviewing the KEP and the design, and I'm glad that Jordan caught that topic. If we just proceed with phase one, what is net new in the kubelet with this capability?
E: Right. The net new would be: with the images, you know, garbage collected, you can turn this feature gate on and start testing the performance, and, you know, making sure that everything works right, with actually doing, you know, in-memory persistence for the life of that kubelet, right, be it a day, be it a week, where you could actually know that you can use pull-if-not-present and not pull-always.
E: You know, with a controller to force pull-always, which allows us to go on to the next performance problem. What we're trying to do for fast startup is get down to sub-second pod initialization, and if you have to pull always, you're not gonna find all the issues that we need to fix for fast starts of the pods, right. So we were really just trying to make progress.
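For context, the pull policies being contrasted here are set per container in the pod spec. A minimal sketch of the difference; the image name and secret name are made-up placeholders, not anything referenced in the meeting.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pull-policy-demo
spec:
  imagePullSecrets:
    - name: regcred                          # hypothetical registry credential Secret
  containers:
    - name: app
      image: registry.example.com/team/app:1.0
      # IfNotPresent reuses an image already cached on the node, which is exactly
      # why "who pulled it, and with which secret" matters; Always goes back to
      # the registry (and re-authenticates) every time the container starts.
      imagePullPolicy: IfNotPresent
```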
E: ...knowing that this isn't the end game, the end game being some, you know, long-running image cache persistence hosted by the container runtimes, or however we end up, you know, solving this problem. We've got a bunch of problems to fix, and we're just trying to make it so we can test fast startup, right.
E: The expectation is that there's some magical image cache policy being applied to the images based on the pull policy that's been used, and that is not the case, right. So there are some security issues, and thus everybody's just either using never-pull or pull-always, and there are a lot of security issues even with never-pull, because you don't know where that image came from, right.
E: So I agree, Mrunal, the end game might actually be to define the image cache policies in your pod spec, bring that down into the, you know, extended runtime cache manager, and actually have the runtimes implement the image policy instead. What would happen then is, in the kubelet, when it does ensure secret pulled images, it would just pass the pod spec with the policy for caching down to the container runtime, along with the secrets...
E: ...you know, the current creds that we have. And then what we would do is reply back from the container runtime that, yeah, kubelet, we have pulled the image with this particular cred, or a hash of that cred if you like, and then we've handled it, and now you can run your pod, because we have all the images pulled, right. But we're not doing it that way today; it would require a re-architecture, yeah.
A: I guess so. Thanks, Mike, for describing that; that's what I was trying to get at, I feel like, with the current approach, absent fixing those things. At least that's what I'm hearing when I hear this. So I'm wondering: do we want to just revisit this and go back and explore the runtime route? That seems interesting, but I'm wondering about the risk: if we go ahead with the option as it is now, does that minimize our ability to evolve that feature gate going forward? Will we have to unwind something if we ultimately think it's going to go into the runtime? Or do you feel an urgency?
A: Help me understand that, I'm sorry, and I don't want to take too much time, there's...
A: I understand that. So you're saying you're still keeping the list of images that required authentication on pull as a checkpoint file on the node, so you at least will know that a cred was used to pull that image. I guess what I'm wondering is: do you really know that? Because you don't know if the image...
E: ...in a cloud environment, right: you create a VM, and when you create it, it is wiped. Usually when you're wiping the kubelet, you're wiping the node; you're starting it up all over again. In that case it has value. Okay, oh yeah, I shouldn't have... I didn't mean to say it's only...
A: ...a performance gain, yeah.
E: That's the value in that case, in the wiping, and it's also the value, you know, for when we're using this as an evaluation set of code. But Mrunal's right, you know; maybe the right answer here is just to start over with a new set of policies and implement the policy in a different place. Instead of the image GC manager in the kubelet, we could handle that down in the container runtimes, albeit this...
E: That would be, you know, supported if we then also added, on top of who pulled it, the additional information of whether it was pulled with a secret or not, right.
A: Yeah, the other part of me is trying to think: what if the kubelet did it? Is there a useful future audit event we could emit? Yeah, I guess in my opinion I could see there being some benefit as it is now, if I reboot a host, and people might find that practical and get some experience with it. But I guess ultimately I'd defer to you and Mrunal on whether you think it's worth keeping it around or not.
E: Yes. Is there more value in the end game? Yes. I guess the other question would be, Derek, and for this group: does this break us away from other possible solutions in the future? None come to my mind, but Liggitt's point is that there may be some expectation that this problem is fixed.
A: Yeah, basically, even in alpha we'd have to enumerate all of the security gotchas in that documentation and...
A: If not, or if you have one later, please reach out on Slack, I guess. Going to the rest of the items: I guess we talked about Vinay's in-place update already, and then, Bobby, do you want to talk about the out-of-CPU issue you've been exploring?
C: Yeah, sure. So basically this is just an update from last week. The context is that there is a kind of regression in 1.22, from part of the pod lifecycle refactoring, that basically introduced an issue where, during termination, the kubelet sends an update to the API server saying that the pod is terminated, but it's actually not fully terminated yet. That causes new pods, when they're scheduled, to sometimes be rejected by the kubelet locally because they don't have enough resources.
C: So we talked about this last week. There was a PR up by Clayton, who is attempting to fix the issue, so I'm kind of working with Clayton on that right now. It's turning out to be a kind of complex fix, because we want to make a minimal fix, since I think we'll need to cherry-pick it back all the way to 1.22, but it requires some kind of deeper changes in the pod lifecycle stuff, and so there's kind of a discussion...
C: ...about whether we have enough testing around it and so forth. I kind of looked at the latest code, and I think it makes sense to me: basically, the latest change reports the terminal phase to the API server only when the pod is fully, actually terminal and all the containers are not running anymore. So I think the fix looks good; the only question is whether the testing is sufficient and whether we have enough coverage and so forth. Like, I did another test yesterday and I actually found a regression.
C: I think in the PR I posted it in my GitHub comments, just from doing some manual tests. So that's kind of the bigger concern, whether we have enough testing. So if anyone wants to take a closer look and has some ideas on other test cases and things we should look at, I think that would be valuable.
C: So that's kind of the update for that issue. Any questions on that one? I have a slightly related topic I want to go to, but maybe there are some questions on that.
B: First, yeah, I'm curious: did we close on the discussion of reverting the fix that caused the regression? Because I agree with your assessment; if a new fix will cause a new regression, it may be an even worse situation, and we'll be chasing bugs all the time.
C: Yeah, yeah, so that's another option. I guess the issue is that these changes were all introduced in kind of a really big PR, the pod lifecycle refactor PR, back in 1.22; I think it's almost a year old at this point, so I'm not sure how feasible it is to fully revert it. I guess we haven't really explored that option, but maybe it's something worth considering. But I think the issue there was...
C: Yeah, I think the problem is that that PR did fix a lot of issues, but it did introduce some other ones, so it's kind of a fix-one-thing, break-another-thing type of thing. Yeah, so anyway, that's in progress. And then another issue that I found actually just yesterday, I haven't opened up a GitHub issue yet, but it's kind of related, actually, to the pod lifecycle refactor stuff.
C: One of the changes as a result of the pod lifecycle refactoring is about when pods are terminated, for example by the kubelet during eviction or graceful shutdown or any other case where the kubelet basically manually kills a pod. Before, pre-1.22, before the pod lifecycle refactor stuff, the status of the pod and all of the conditions associated with it, like the ready condition and the container statuses et cetera, were not actually reported by the kubelet. And after 1.22, after the pod lifecycle refactor...
C: ...that actually changed, and now the kubelet does report container statuses and ready conditions and so forth after pods are terminated. The problem is that it kind of reports them in two phases: one is that it puts the pod into a terminal phase, so it changes the phase to, like, Succeeded or Failed, and then in a second update it actually updates the conditions. And so there's an issue we found, kind of internally, where basically the kubelet might not send that second update.
C: For example, if the node is shut down, or the node is terminated for whatever reason, it can basically leave pods in a very weird state where the pod is in a terminal phase, like the Failed phase, but the ready condition is still reporting true. And that kind of creates a problem, because Kubernetes services and endpoints and stuff like that use the ready condition to detect whether the pod is ready or not, and so this can result in traffic being sent to pods that are not actually, like, ready.
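To illustrate the state being described, a pod stuck like this would report a status roughly like the following (a hypothetical excerpt, not captured from a real cluster), where the phase is already terminal but the stale Ready condition is what service and endpoint controllers keep acting on:

```yaml
# Hypothetical `kubectl get pod -o yaml` excerpt after the node went away
# before the kubelet's second status update was sent.
status:
  phase: Failed          # first update: terminal phase already recorded
  conditions:
    - type: Ready
      status: "True"     # stale: the second update that would flip this never
                         # arrived, so endpoints may still route traffic here
```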
C: So that's kind of an issue, and I'll open up a GitHub issue to talk about this more; I just kind of thought about this yesterday.
C: It doesn't really matter; I think the case I was looking at was graceful shutdown, but it's any case where the kubelet initiates an eviction, basically, because prior to this, no conditions, no details were reported, basically, if the pod was in Failed. So because the update happens in two steps, basically, to the API server...
C: ...if, for whatever reason, the second update is not sent, or maybe it is sent, but basically the first update that is sent is just updating the phase, and a lot of other controllers and external things, like the service controller, endpoint controller, etc., don't actually look at the phase; they look at the ready condition. So one thought I have here on how to fix this, potentially...
C
Is
you
know
if
we
already
know
the
pod
is
going
to
be
in
terminal
phase
right
like
failed
or
succeeded,
we
should
probably
not
report
a
ready
condition
of
true,
because
ready
condition
is
kind
of
used
by
you
know,
services
to
detect
if
they
can
send
traffic
and
if
we're
changing
the
terminal
phase.
That
means
we're
already
shutting
down
this
pod.
So
that's
one
idea
but
probably
needs
a
little
bit
more
discussion.
I
think
so.
A: All right, any other topics that people want to raise today? If not, we can give back a half hour.
A: Thanks to everyone who participated, and we'll talk to you next week. Bye, everyone, bye, thanks.