From YouTube: Kubernetes SIG Node 20210831
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A: Welcome everyone to the August 31st node meeting. The meeting is recorded and will be uploaded to YouTube shortly afterwards. Light agenda today. I think the first item Kevin had put up was around getting an item tracked in the KEPs, and I know Dawn and I got it labeled with the right milestone, so I think that's settled.
B: Yeah, sure. I've been doing profiling and CPU and memory usage analysis on the node side, and one thing I noticed is that in CRI-O we use a parallel gzip library to speed up image pulling and extraction. That library ends up allocating bigger buffers to speed things up, so you get roughly a 30% improvement in wall-clock time compared to using the default 32 KB buffers, but the problem we run into with something like that is:
B: If you don't limit the number of concurrent image pulls, you end up with huge spikes in memory. So I'm trying to find the right balance here. One thing we considered was switching away from the library, since on some devices we don't care if it takes longer to pull images. But even if, say, I want to use that library to speed up my image pulls while also staying under a certain memory limit, it might make sense to limit the number of concurrent image pulls that I perform.
B: So we have a serialize option in the kubelet, and then we have one that says don't serialize, which I'm assuming does everything concurrently. I want to bring up whether we ever discussed having a limit on the number of image pulls we can do concurrently in the kubelet, and whether it makes sense to add one, so we can put a cap on it and keep our memory usage predictable.
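A minimal sketch of the cap being discussed here, a tunable sitting between fully serialized and fully parallel pulls, using a plain buffered-channel semaphore. It is illustrative only, not the kubelet's actual image manager, and the names maxParallelPulls and pullImage are made up for the example:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pullImage stands in for a real CRI ImageService pull call.
func pullImage(name string) {
	time.Sleep(100 * time.Millisecond) // simulate network and extraction work
	fmt.Println("pulled", name)
}

func main() {
	images := []string{"app:v1", "sidecar:v2", "init:v3", "base:v4", "tool:v5"}

	// maxParallelPulls is the hypothetical tunable being discussed:
	// 1 behaves like --serialize-image-pulls, a large value approaches
	// today's unbounded parallel puller, and anything in between caps
	// the memory spike from concurrent decompression buffers.
	const maxParallelPulls = 2
	sem := make(chan struct{}, maxParallelPulls)

	var wg sync.WaitGroup
	for _, img := range images {
		wg.Add(1)
		go func(img string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a pull slot
			defer func() { <-sem }() // release it when the pull finishes
			pullImage(img)
		}(img)
	}
	wg.Wait()
}
```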
B: Yeah, CRI-O can do it, but the problem there is that we'd then have to deal with the kubelet asking CRI-O to pull an image, CRI-O saying yes and then not getting to it, and just additional back and forth between the kubelet and CRI-O. I feel it might be cleaner to do it on the kubelet side, because the kubelet already controls this and we already have a serialize flag. So on top of that we could say: do five pulls, or do six pulls.
B: In any case, you're going to saturate your network beyond a certain number and you won't be able to pull any more images, so having such a tunable may be helpful, and especially useful in low-memory, edge-node kinds of scenarios.
A: If I recall, on the size of the queue that could build up for the parallel image puller, I was trying to refresh my own memory on this, and I don't think we had anything that capped that queue.
E: You talked about the memory usage being higher, right? If we add some limit, are you talking about just limiting the concurrent number, or are you proposing to add some resource limit so that the concurrent image pulls cannot use more than a certain amount of memory?
B: Yeah, so my goal ultimately is that I should be able to come up with a number: hey, I am doing this many image pulls, I am running this many pods and this many containers, and for a particular version of Go my memory usage should be predictable. I think with the changes going into the Go runtime it may be hard to do that with just a memory limit, but we can have these knobs and then, for each version...
B: ...we test and publish, and at some point it will stabilize when the Go VM doesn't change. Recently I saw some numbers that Go 1.17 reduces RSS usage drastically. We haven't tested with that yet, but it may help. The problem I have right now is that we don't have a cap, and because of that, any kind of system reservation or any alerts we set up in that area are meaningless: if you have a spike in the number of image pulls, that number could go up, and then your Go VM is going to...
B: ...hang on to that memory for five or ten minutes, and the customer is going to get an alert. So I want to put a cap on that, so that at least I know from my testing that CRI-O will never exceed this number, or I can go and change the code because 1 MB is too big and I want my buffers to be half or a quarter of that size. Then I can trade off how long my image pulls take versus how much memory and CPU they use.
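For reference, the widely used parallel gzip package in the Go ecosystem is github.com/klauspost/pgzip; the transcript does not name the library CRI-O uses, so treating it as pgzip is an assumption. A small sketch of the block-size versus parallelism trade-off described above (the numbers are illustrative, not CRI-O's actual settings):

```go
package main

import (
	"io"
	"log"
	"os"

	"github.com/klauspost/pgzip"
)

func main() {
	f, err := os.Open("layer.tar.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// NewReaderN lets the caller pick the per-block buffer size and how
	// many blocks are decoded ahead. 256 KB x 4 keeps peak memory per
	// stream around 1 MB instead of the library's larger defaults; the
	// exact numbers here are illustrative, chosen to show the trade-off
	// between pull/extraction speed and memory.
	zr, err := pgzip.NewReaderN(f, 256<<10, 4)
	if err != nil {
		log.Fatal(err)
	}
	defer zr.Close()

	// Real extraction of the tar stream would happen here.
	if _, err := io.Copy(io.Discard, zr); err != nil {
		log.Fatal(err)
	}
}
```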
E: I see, yeah. I think it makes sense to me, but I just want to mention one thing: I think in containerd there's also another concurrency configuration, which is, for each image, how many layers.
A: I mean, I have no conceptual issue with it. It's just the size of a channel that we're filling in the current parallel puller, so we could.
A: Awesome, thanks, Ronald. Next item here; oh, the agenda's growing.
H: So, just to provide a little bit of context on the issue and what the idea behind it was:
H: The original ephemeral containers KEP mentioned that, so cluster administrators could identify pods that had ephemeral containers created, a new pod condition would be added for that pod, and through that condition it would be recognized that the pod had an ephemeral container created. That wasn't implemented so far, but hopefully in 1.23...
H: ...we're hoping to try and implement that. What I wanted to get feedback on: if you go through the issue, there is discussion around two major themes. One is whether we should add the pod condition only to pods that have ephemeral containers created, or whether the pod condition should also apply to pods that have had kubectl exec run on them. So what I basically wanted feedback on is: under one pod condition...
H
Do
we
cover
both
of
these
cases
or
do
we
not
need
a
port
conditioner
for
keyboard
legs?
I
can
just
work
on
small
containers
on
this
new,
like
added
port
condition
or
like
along
along
those
lines
along
those
lines.
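For context, the condition described above would be an entry in the pod's status.conditions list. Since, as noted later in the meeting, the KEP had not actually named the condition, the condition type used below is purely hypothetical:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Hypothetical condition type: the KEP discussed in the meeting had
	// not settled on a name, so "EphemeralContainersActive" is made up here.
	cond := corev1.PodCondition{
		Type:               corev1.PodConditionType("EphemeralContainersActive"),
		Status:             corev1.ConditionTrue,
		LastTransitionTime: metav1.Now(),
		Reason:             "EphemeralContainerStarted",
		Message:            "an ephemeral container was created in this pod",
	}
	fmt.Printf("%+v\n", cond)
}
```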
A: Yeah, I was just bringing up the ephemeral containers KEP to refresh my memory.
A: I know, at least speaking from our experience at Red Hat dealing with users of Kubernetes, that we have a number of users that...
A: ...want to proactively disable the usage of exec. Those same users would probably also proactively disable the usage of ephemeral containers, for their security posture.
A: The one thing that's giving me pause, maybe speaking exactly to your question, is the other KEP that's put forward right now to handle container notifications.
A: Okay, just looking at the KEP: we didn't actually name the condition in the related part of the KEP.
A: Okay, if we look back on the...
A: ...notification API KEP, which I'll paste in the chat. Maybe we would also want to think about whether there would be a condition tied to that, because that's basically a way of doing execs without the end user initiating the exec.
A: All right, and then it looks like we have one other item on the agenda.
F: Yeah, so we were experimenting on Windows with running the sandbox container as different users, to match Linux functionality, and we broke something. We know what we broke, but we saw an interesting behavior come out of it that we wanted to double-check here: if the sandbox container fails to start, the pod stays in a Pending state, and you can look and see the errors.
A: The kubelet should still destroy that pod, but I think pod phases generally were a tough thing to think through here, in the sense that there isn't a great phase that a pod in that state could go to, and it was kind of a fixed state machine. So, Mark, when you were looking at this, was there something you thought made more sense for it to be in?
A: I would expect it to stay Pending forever. We'd have the same issue if the CNI hadn't yet been deployed on that node, where the sandbox also wouldn't create.
A: What was the reason, though; what were you expecting in the situation you're exploring? That the sandbox would eventually successfully create?
F
No,
there
was
an
issue
in
the
way
that
we
were
trying
to
to
start
the
sandbox
where
it
would
never
create-
and
we
also
do
see
this
occasionally
so
with
on
on
windows.
Each
like
the
container
image
needs
to
be
paired
to
the
container
like
to
the
os
that
it's
running
on.
So
each
new
version
of
windows
that
comes
out.
F
We
need
to
add
a
new
image
to
the
new
container
image,
to
the
pause
image
that
we
publish
or
that
we
build
and
release
out
of
kubernetes,
and
we
have
seen
in
the
past
that,
when,
if
users
update
the
os
versions
and
don't
take
a
new
pod
or
don't
take
a
new
pause
image
that
contains
a
container
image
that
matches
that
the
the
sandbox
image
won't
start.
And
then
you
get
into
the
same
state.
A: Yeah, so the behavior, I think, is as intended, and maybe this is a reminder of it. I think pod phase is probably identified as one of the mistakes of Kubernetes versus the use of conditions generally: we can't say that it's running, we can't add a new state, and we can't destroy the pods. So I think, unfortunately, where you're at right now is that our hope would be that the sandbox gets created successfully, or, if something else had to go and reap these pods...
A: ...they would have to do that themselves.
F: Okay, yeah, we weren't necessarily looking for a change of behavior; we were just wondering if there was intention behind it. It sounds like there is.
I: This is actually not a new topic; my question ties to the previous one. Would it make sense to also combine that with the problem where the runtime is not able to create a sandbox, so the pod needs to be evicted from the node and rescheduled somewhere else? Or, in this particular case, would rescheduling onto another node not help?
I: My question is about the scenario where sandbox creation failed: would it help if this pod were evicted from the node and rescheduled somewhere else?
A: Alex, I mean, I think generally there's no shortage of higher-order things that could go and write a controller to reap that pod and hope it gets rescheduled elsewhere.
A: Yeah, I just don't know if the decision the kubelet would choose to make would always be right. Take the CNI case: let's say you're running a single-node Kubernetes, and when you do lifecycle maintenance on that single node you don't drain your workload, because there is no other node for it to go to; you just restart your box. The behavior right now for the kubelet is that it will just try to restart all pods that had previously been scheduled to it.
A
That
said,
they
expected
a
cni,
and
if
we
have
a
situation
where
we
try
to
start
a
pod
that
the
cni
wasn't
present
yet
because
the
cni
itself
hadn't
had
its
statements
that
launched
like
you,
wouldn't
want
to.
I
Yeah,
I
understand,
with
cni
case
I'm
more
worried
about
this
infrastructure
containers.
What
mark
mentioned
is
which
is
a
bit.
A
More
problematic,
but
like
the
usage
of
a
pause
container
itself,
for
example,
is
not
uniform
across
all
run
times,
so
cryo
itself
doesn't
always
start
a
pause
container
depending
on
what
the
the
the
pod
needed.
So
it's
kind
of
opaque
to
the
keyboard.
At
that
point,.
I
Yeah
yeah
and
what's
was
the
background
of
my
question
like,
should
we
have
a
scenario
with
runtime
returns
where
which
says
like
regardless
how
much
you
try
it?
I
can't
run
that
spot
on
this
note.
I: Well, maybe a hypothetical example: let's say we have a VM-based runtime, and a pod is scheduled with a RuntimeClass set for this VM-based runtime, but virtualization is not enabled on the node. So regardless of how many times you try, the hypervisor will say: sorry, VT-x is not enabled on this node, I cannot start it properly.
A
So
renault
you
had
looked
in
the
past
on
maybe
enriching
error
handling
responses
from
cri
to
cubelet,
maybe
alex.
If
you
had
a
few
examples
we
could
we
could
look
at
trying
to
enrich
that
api.
That's
so
that
the
cubicle
could
make
a
decision
that
says
the
runtime's
telling
us.
You
know,
there's
just
no
hope.
Okay,
yeah
I'll
I'll
check
this
one
yeah.
J: Yeah, I just wanted to call out that we see similar cases with the kubelet volume manager as well, where the CSI plugin might try to attach or mount a volume, and the kubelet just basically continues to retry through the kubelet volume manager, even though it might be a terminal condition where the CSI plugin may never succeed. So it just continues trying to attach and mount the volume; kind of a similar scenario.
K: I wanted to add one example as well. Similarly, with the mismatches on Windows, and not only with the pause image: if a customer is using an image that mismatches the host, we end up in a similar scenario, where we keep retrying to recreate the sandbox, but it will never run on this host.
K
So
I
agree
with
the
point
of
alex
if
we
can
filter
some
types
of
errors
that
yeah
really
runtime
can
run
this
pod
anyway,
on
this
node
and
then
in
this
case
we
should
try
elsewhere
or
stop
trying
to
run
it.
A
The
other
thing
that
we
could
think
about
in
this
is
this
sounds
like
a
just
like
a
startup
problem,
so
we
have
deadline
seconds
which
basically
says
how
long
this
pod
can
run
on
this.
This
node
before
the
cubelet
proactively
reaps
it.
Maybe
we
could
think
about
use
cases
of
like
startup
periods.
Let's
say
if
this
pod
doesn't
start
up
in
period
x,
then
the
qubit
also
should
proactively
read
that.
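The existing knob referred to here is the pod spec's activeDeadlineSeconds field; the proposed startup period would be a new, analogous deadline and does not exist. A minimal illustration of the existing field, assuming the k8s.io/api types:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// activeDeadlineSeconds already exists: once the pod has been active
	// this long, the kubelet fails it. The "startup period" idea from the
	// meeting would be a separate, hypothetical deadline covering only
	// sandbox and container startup.
	deadline := int64(300)
	pod := corev1.Pod{
		Spec: corev1.PodSpec{
			ActiveDeadlineSeconds: &deadline,
			Containers: []corev1.Container{
				{Name: "app", Image: "registry.example/app:v1"},
			},
		},
	}
	fmt.Println(*pod.Spec.ActiveDeadlineSeconds)
}
```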
I
My
also
second
reason
for
the
question
was
like:
yes,
we
with
all
these
deadlines.
We
can
have
a
scenario.
What
port
pod
will
be
market
has
failed,
but
maybe
we
should
have
a
scenario
to
report
to
scheduler
what
we
need
to
find
another
place
for
this
spot.
I
Well,
for
a
good
example
of
this
csi
plug-in,
so
what?
For
some
reason,
the
volume
is
not
possible
to
attach
on
this
node
so
find
another
node,
where
this
volume
also
will
be
available.
A
So
I'm
sure
there's
like
specific
scenarios.
We
could
work
through
so
yeah,
maybe
alex
if
you
had
a
few
like,
I
said
if
we
had
a
way
from
the
cri
to
the
cubelet
to
advertise.
This
is
this
is
has
no
more
hope.
Then.
Maybe
we
could
think
about
proactively
terminating
that
pod.
I'm
not
I'd
have
to
think
more
on
the
volume
scenarios,
but.
I
Yeah,
so
that's
practically
applicable
to
any
of
extension
to
a
couplet.
So
when
we're
well,
I
wouldn't
bring
with
device
plugins,
but
it's
also
a
possible
scenario.
What's
like?
Yes,
you
can
try
to
allocate
the
device,
but
for
some
reason
I
can't
satisfy
this
device
request
and
it
needs
to
be
more
without
from
an
old.
I
So
it's
probably
a
generic
thing
to
say
what
this
port,
for
some
reason,
is
not
runnable
on
this
node
and
needs
to
find
a
new
place,
and
we
need
to
have
that
from
storage
interface
from
runtime
from
device
plugins
or
whatever
else,
extending
mechanism
we
might
have.
A
Yeah
I
agree
alex,
but
just
for
the
case
of
the
device
plug-in
though
I
thought
that
the
expectation
would
be
that
the
device
plug-in
dynamically
updates
the
allocatable
number
of
devices
on
that
node,
so
that,
if
it
the
device
was
unhealthy.
But
it's
it's
not
going
to
be
counted
by
the
scheduler.
A
Anyway,
I
think
each
one
of
these
is
complicated
to
reason
through.
So
if,
if
there
were
things
just
from
cri
flows,
maybe
we
could
focus
on
one
call
in
particular,
like
start
pod
sandbox
and
find
out
if
there
were
well
understood,
terminal
cases
that
we
can
respond
to.
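What is being suggested here, letting the runtime signal through the CRI (for example on the RunPodSandbox call) that a sandbox failure is terminal so the kubelet stops retrying on that node, might look roughly like the error classification sketched below. This is purely illustrative; no such signal existed in the CRI at the time of this meeting:

```go
package main

import (
	"errors"
	"fmt"
)

// terminalError marks a sandbox-creation failure the runtime knows will
// never succeed on this node (e.g. VT-x disabled for a VM-based runtime,
// or a Windows pause image that doesn't match the host OS build).
// This is a hypothetical shape, not part of the actual CRI.
type terminalError struct{ msg string }

func (e *terminalError) Error() string { return e.msg }

// runPodSandbox stands in for the runtime's sandbox-creation path.
func runPodSandbox(runtimeClass string) error {
	if runtimeClass == "kata" {
		return &terminalError{msg: "virtualization (VT-x) not enabled on this node"}
	}
	return nil
}

func main() {
	err := runPodSandbox("kata")

	var term *terminalError
	if errors.As(err, &term) {
		// A kubelet that could see this distinction might stop retrying
		// and surface the failure so a higher-level controller reschedules.
		fmt.Println("terminal sandbox failure:", term)
		return
	}
	if err != nil {
		fmt.Println("transient failure, will retry:", err)
	}
}
```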
A
All
right:
well,
I
think
that
is
today's
agenda.
Just
a
reminder,
I
think
the
dates
for
caps
are
september
9th,
so
I
hope
to
get
through
a
lot
of
them
this
week
and
I'm
sure
the
other
reviewers
will
will
do
their
best
as
well.
So
we
will
meet
again
next
week.
There's.
D: ...also a soft deadline from the production readiness team: if your KEP doesn't have the PRR questionnaire filled out and ready by, I think, this Thursday, one week before the deadline, we may not get to yours. So please ensure that, if you want to get your PRR approved, your KEP is up and has the questionnaire filled out by this Thursday.
A: Awesome, thanks a lot, and we'll see you all next week.