From YouTube: Kubernetes SIG Node 20210112
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A
All right, welcome everyone to the January 12th Kubernetes SIG Node meeting. Just a reminder that the meetings are recorded and I upload them all to YouTube for those who can't be here at this time, or who forget all the great things we say, but we'll treat everyone well in the meetings together. So, a number of topics on today's agenda.

A
The first topic was an item that I'd put on there with Elana. We want to give some general awareness to maybe some unintended consequences that we saw with how graceful termination works with liveness probes. Elana, do you want to talk through the issue and maybe give some awareness of the KEP you were proposing?
B
Yeah, so I threw up sort of a straw-man KEP for discussion. I didn't want to, you know, fill out a whole full-fledged KEP if people are not interested in this approach. But basically, the summary of the issue is that when we use a liveness probe on a pod and the liveness probe fails, the kubelet will use the pod's terminationGracePeriodSeconds to wait to terminate that thing.

B
And so, if terminationGracePeriodSeconds is set super long, say an hour, because you want the pod to be able to gracefully drain connections and that kind of thing, then if your liveness probe fails it will potentially take up to an hour for the liveness probe to restart the pod. That is not the intended behavior, not desired, and could potentially result in some outages. So there are a number of different approaches we could take to fix this, but probably the most backwards-compatible one would be to say:
B
Well, we want to be able to configure this directly without changing the existing behavior, because it appears that a number of people have documented, and are sort of relying on, the fact that they expect it will take the terminationGracePeriodSeconds to terminate a pod, or a container in a pod, on a liveness probe failure. So that is what I put up in the kubernetes/enhancements repo.
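To make the interaction concrete, here is a minimal sketch of a pod spec like the one being discussed, written with the core/v1 Go types roughly as they looked in the 1.20 era (where Probe still embeds Handler; later releases rename it ProbeHandler). All names and values are illustrative, and the probe-level override mentioned in the comment is only the straw-man proposal, not an existing field.

```go
// Sketch of the graceful-termination / liveness-probe interaction described above.
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func examplePod() corev1.Pod {
	grace := int64(3600) // long pod-level grace period so connections can drain on normal shutdown

	return corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "drainer"},
		Spec: corev1.PodSpec{
			// On a liveness failure, the kubelet kills the container using this
			// same pod-level value, so a failed container can linger up to an hour.
			TerminationGracePeriodSeconds: &grace,
			Containers: []corev1.Container{{
				Name:  "web",
				Image: "example.com/web:latest",
				LivenessProbe: &corev1.Probe{
					Handler: corev1.Handler{
						HTTPGet: &corev1.HTTPGetAction{Path: "/healthz", Port: intstr.FromInt(8080)},
					},
					PeriodSeconds:    10,
					FailureThreshold: 3,
					// The straw-man KEP proposes a probe-level override here
					// (a shorter grace period used only when this probe fails);
					// that field is hypothetical at the time of this discussion.
				},
			}},
		},
	}
}

func main() { _ = examplePod() }
```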
B
It's just a straw proposal. I think I put the link in the notes already; sorry, Firefox restarted, so all of my tabs have not loaded. Yes, so I linked the PR in the agenda, and I don't know if anybody has any specific comments outside of what's already been put on the KEP PR. But, you know, do people think that this is a reasonable approach? There are some other possibilities that we could go for that would not require a KEP and an API change.

B
The API change that I'm proposing is backwards compatible. I know there were some concerns about, well, if we add this field to, you know, the liveness probe, it's going to be on all of the probes, so are all of the probes going to use it, or is it just going to be for the one? And some concerns about maybe having a field for all probes but it only being used by one. So, happy to hear your feedback and comments. It doesn't have to be now; you can also comment on the PR.
A
Yeah, thanks for the summary and the follow-up on that, Elana. I know personally we at Red Hat were just hit by this issue, and so we bear some of the scars of teams feeling an unintended consequence of this type of issue. So I saw, Dawn, you had commented that you had seen similar, and that it might have had some issues in API review. Did you have any particular references to it? Because I hadn't recalled any.
C
I tried to find it before this meeting and I couldn't. I think this is why people started to work around that issue. This particular thing actually hurt GKE a while back, and I also saw people come to the SIG discussing it, so we did suggest supporting the configuration, but without all the detail. Obviously, from the comments and the link that was shared, I can see people are abusing it, using it however they can.

C
Obviously what we also see is that it is being abused. The proposal didn't go through the enhancement process; I think it was a PR directly. If I remember correctly, the PR tried to fix this problem directly, and it was basically just rejected because of the complexity.

C
No, we actually don't, and if the pod in such a scenario hits a liveness probe failure with this config, there's no good way to handle it. So we have to proactively tell the customer, don't do that. But if it is by way of the node shutting down, because we need to drain, we basically set a maximum value and cap it at something like five minutes, which is kind of reasonable.
A
Yeah, so I get that that can happen on node maintenance, where you might have encountered this, but these issues at least struck us when no node maintenance was happening, right? It was like, you have a one-hour graceful termination period to allow requests to drain, or even if it was five minutes, right, you basically double your outage time. So I guess my bias would be: I'd love for us to find a way to solve this broadly in the community, so, like the KEP, Elana, yeah.

A
I'd like us to keep pushing this, and I guess if folks have other anecdotes or operational experience, or workarounds they might have done, let's get them on the KEP. But hopefully we can overcome any API review challenges in just a KEP discussion.
B
I see Jack Francis linked a comment in the chat where I guess this issue was reported as well in December. The issue that I'm working on was a super old one that ended up getting reopened. So, Jack, do you have any thoughts you want to add?
D
Hey Elana and folks, I'm really happy people are talking about this. So basically, my involvement, just to give a little bit of a backstory, was observing the change that went in, I think late December, for 1.20,

D
to fix a long-standing edge case where the liveness probe, and I can't remember if it was for exec or for HTTP, but one of the two, was never working. So that was actually fixed, and it then absorbed the current one-second default, which had the practical side effect of suddenly introducing a one-second timeout default to all the liveness probes out there, because, since it hadn't been working, no one had been declaring that timeout value.
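For context, this is the sort of explicit timeout an exec probe would need once the default is enforced; it reuses the corev1 types from the sketch above, and the command and the 15-second value are made up for illustration.

```go
// An exec probe that declares its timeout explicitly, so the one-second
// default enforced by the 1.20 change discussed above does not apply.
probe := &corev1.Probe{
	Handler: corev1.Handler{
		Exec: &corev1.ExecAction{Command: []string{"/bin/sh", "-c", "/usr/local/bin/check-health"}},
	},
	TimeoutSeconds:   15, // previously ignored for exec probes; now enforced
	PeriodSeconds:    10,
	FailureThreshold: 3,
}
_ = probe
```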
D
I mean, folks who were doing it permissively... I'm sure there are folks who were trying to do it and didn't know that it didn't work, but anyway. So I'd observed this, and basically my point was: let's not push this change, let's roll this back, because this is going to surprise people. And so I've been thinking about ways to fix this. I mean, it's super unfortunate that there's this behavior that everyone assumed was working for several years and was discovered not to be working, which has these side effects.

D
So I'm wondering if we can just increase the... In that issue, the conclusions right now are to perhaps marshal a discussion around just increasing the timeout to something like 30 seconds, because it sounds like what you folks have observed is that people noticed this new timeout and said: wait a minute, I don't want to time out, so I'm going to just throw in some totally crazy high value, which has...
B
There are a couple of things I think possibly getting a little bit conflated here. So, looking through sort of the tree of things, your link there points to the issue, so that was the exec probe. I think in that case exec probe timeouts had just, like, never worked, so you could set them;

B
they just wouldn't be enforced in any way. And I guess I'm not sure how exactly that was implemented, so I don't know if that's like a default or configurable or whatever. In this case, this is the non-exec-probe sort of liveness probes, like HTTP, that kind of thing, where if these fail, they don't have anything configurable at all. They default to using the terminationGracePeriodSeconds that's set at the pod level, and that could be super high for, you know, normal operations.
A
If we determine that you're not live, we shouldn't... The way the graceful termination period was originally defined was basically saying how we shut down something that was healthy, with enough time, and we haven't really approached how to give the right amount of time to shut down something that was already known to be unhealthy. So, yeah, I think that's a good summary, Elana. I guess, maybe as a concrete action item:

A
if folks could review Elana's KEP, and maybe we could see if we can reach some type of consensus in this release window on a plan, I think that would be good, but...
D
I'll definitely, happily support Elana's effort, for sure. Thank you so much.
F
I was wanting to just chime in and say that this kind of goes back to: what does a liveness probe failure mean? Because if it means the application is not running, then we want to kill it as soon as possible, right, like zero termination grace period seconds, right?

F
We want to kill it and start something that will work. So it kind of goes back to what a liveness probe failure means. If it means that the application is not serving requests or not operating properly, then we should kill it immediately and there shouldn't be any timeout; the timeout should always be zero.
B
I think that's not currently defined as one of the alternatives in the KEP, so I could add that. And specifically, I think maybe the way that we would do that is we'd need a feature flag for that behavior, for sure. But if we have the feature flag turned on, it would be like: kill the thing right away,

B
no waiting, ignore the grace period; but in the other case, maintain the default behavior. I guess my question would be, if we feature-flag that: long term, do we want to phase that in as the default, or do we want to keep the two options? That would be my question.
C
To simplify the problem, I kind of really agree with that, and I just want to point out, since I really agree with you and want to emphasize this: I've noticed that people abuse the liveness probe this way. A lot of the time people are using it for things that aren't really liveness, and we saw that, right? That's why we added some other probes, because some people were using liveness as readiness, or using it for some other purpose.

C
So I guess the concern is simply that this change maybe is not backward compatible, but I totally agree with you from the design point of view: if we had designed this correctly initially, then we shouldn't have this problem. We should have fixed this earlier, because I remember this was probably reported a long time back, before we had so much structure and the CRI architecture. We shouldn't allow such a long time, like the terminationGracePeriodSeconds, for a liveness probe failure.
A
So I think, yeah, if we get the option added, maybe we can think of creative ways to make the good choice the safe choice that people opt into. Personally, if I could have opted in on the pod spec to say liveness failure means, you know, true death, that would have been great for our use case. I'm sure anyone else who's running ingress had similar scenarios.
A
So thanks, Elana, for helping push this forward, and we'll try to continue discussion on the KEP. So, in the interest of time, maybe on to the next item. A few of us have been discussing on the Red Hat side what we were interested in pursuing in 1.21, and we want to open discussion to the broader community, you know, on what we wanted to do as a group in 1.21 and try to start shepherding planning. So, Mrunal, you had a doc you wanted to share? Sure.
H
All right, can you see my screen?

H
Okay, so the first item in the list is CRI graduation. While doing this work, we realized that we first have to update the runtimes, like containerd and CRI-O, to support both v1 and v1alpha2, so we can bring it back into the CI and then drop v1alpha2. So we have the KEP merged, and CRI-O is updated. Mike, do you know what's the status on the containerd side? ("Yeah, we're updated as well.") Okay, awesome, then.

H
I think we are ready for the next step, to bring the runtimes back into CI and then drop v1alpha2, so that can proceed, no blockers over there. Any questions or comments on that item?
H
Okay, so the next one is RunAsGroup, and I think here Mike opened a PR and then it just didn't go further. So maybe we can just continue moving that work forward, because it looks like it shouldn't be a lot of work beyond addressing the comments in the PR, I think.

H
Okay, so the next one on the list is node graceful shutdown. David's PR for that feature got merged in 1.20, and the next step there is to test it out, gather feedback, and also have a plan for some kind of an end-to-end test in the CI, so we can move it to beta.
H
All right, so we spoke yesterday, and what we thought is: I don't know if we can do a complete restart in the e2e test, but at least we can do half of it. We can fake the signal that the kubelet gets to do the drain, and then have a special pod that waits till the end of its graceful period to write out a file, and then make sure that file gets written.
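A rough sketch of what that "special pod" could look like, reusing the corev1/metav1 imports from the earlier sketch. The image, paths, and durations are invented; the part that fakes the shutdown signal to the kubelet is assumed to happen elsewhere in the test.

```go
// The pod traps SIGTERM, waits for most of its grace period, then writes a
// marker file to a hostPath volume that the test can check after the (faked)
// node shutdown. Everything here is illustrative, not from an actual e2e test.
grace := int64(30)
hostPathType := corev1.HostPathDirectoryOrCreate

markerPod := corev1.Pod{
	ObjectMeta: metav1.ObjectMeta{Name: "graceful-shutdown-marker"},
	Spec: corev1.PodSpec{
		TerminationGracePeriodSeconds: &grace,
		Volumes: []corev1.Volume{{
			Name: "out",
			VolumeSource: corev1.VolumeSource{
				HostPath: &corev1.HostPathVolumeSource{Path: "/tmp/graceful-shutdown-test", Type: &hostPathType},
			},
		}},
		Containers: []corev1.Container{{
			Name:  "waiter",
			Image: "busybox",
			Command: []string{"/bin/sh", "-c",
				// On SIGTERM, use most of the grace period before recording success.
				"trap 'sleep 25; touch /out/drained; exit 0' TERM; sleep 3600 & wait"},
			VolumeMounts: []corev1.VolumeMount{{Name: "out", MountPath: "/out"}},
		}},
	},
}
// The test would create markerPod, fake the shutdown signal to the kubelet,
// then assert that /tmp/graceful-shutdown-test/drained exists on the node.
_ = markerPod
```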
H
I think that makes sense; having end-to-end tests of the shutdown is important before moving into beta and so forth. Okay.

H
We do have things like memory.low and memory.high that don't map to cgroup v1, so we can play with those in the runtimes first, for example using annotations or something, and then figure out how we can expose those features, or how they map to the pod settings that are there today in Kubernetes, to take this forward. Then the second thing that we can do is see how we can configure the QoS slices to take advantage of these new features, and the third step would be how to use a userspace OOM killer like oomd or something.
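As a minimal illustration of the knobs being discussed (not of any agreed-upon kubelet design), memory.high and memory.low are just files under a cgroup v2 directory, so the open question is purely how pod and QoS settings should map onto writes like this; the path and value below are invented for illustration.

```go
package main

import (
	"os"
	"path/filepath"
	"strconv"
)

// setMemoryHigh writes the cgroup v2 memory.high throttling threshold for a
// given cgroup directory. memory.low (best-effort protection) works the same
// way with a different filename.
func setMemoryHigh(cgroupDir string, limitBytes int64) error {
	return os.WriteFile(
		filepath.Join(cgroupDir, "memory.high"),
		[]byte(strconv.FormatInt(limitBytes, 10)),
		0o644,
	)
}

func main() {
	// e.g. throttle a burstable QoS slice at 512 MiB (hypothetical path).
	_ = setMemoryHigh("/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice", 512<<20)
}
```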
H
Yeah, so I see a lot of complaints from customers around the OOM killer really not behaving well, and the end game really is to get us to a position where we can use custom userland OOM daemons, and also get better behavior from the kernel with the fixes going into cgroup v2.
C
So, Mrunal, I want to ask you: when we talk about cgroup v2, we are not talking about all the resource management, right? Like the new enhancements for memory, the new enhancements for CPU, and the new enhancements on disk, all those kinds of things. We just want to have enough so that the node can support both v1 and v2, and then, based on that, we want to evolve and utilize some v2 functionality. So that's a separate feature request, right?

A
Yeah, okay, because, I'm sorry, I think on the KEP we said the existing KEP covered parity with v1 capability, so anytime we bring in a new v2 capability, I read that as a separate activity.
C
Yes, yes, so I just want to make it clear, because people start talking about too many fancy features, but actually those need to be carefully designed, and the whole memory management, even the QoS, might be better thought about as a whole when we design it. This is just only for the capability, so we can start to support it.

H
Right, so the next one on the list is user namespaces. Is Mauricio or Rodrigo on the call?
H
Okay, so, I mean, there's a KEP open, and there's been a fair bit of review going on; Tim Hockin and others have also chimed in over there. So if we get some agreement, then we can probably start working on phase one, if we reach agreement before the enhancements freeze.

C
Yes, and if I remember correctly, and I need to refresh my memory on this one, for this one there are some small changes required on the CRI API.
H
Yes, yes, yes, those changes.

H
Okay, so the next one is the one covered by Elana, the liveness probe timeout. Then, Dawn, you proposed the ephemeral containers, so I put down Lee as the owner for moving it to beta, and, yes, can you be the approver, Dawn? ("Definitely, yeah.") Okay, okay, awesome.
H
Okay, so the next one on the list is swap. I mean, it's early days; Karen and Elana talked about it, I think, last week, so the best we can do is maybe target a KEP.

C
Are we going to target this one for cgroup v2 only, or do both v1 and v2 have to be supported? Because I remember there's some enhancement there, and the original idea was to try it with v2 and that kind of support. But I want to know more: do we want to just support one, cgroup v2, or maybe both? I think this also needs thought, because I believe the implementation and design would be different. If I remember, the last time I checked it was cgroup v2 only.
H
So the next one on the list is enabling seccomp by default, and we spoke about this just before the break. Sascha and I are working together on a KEP; we will get one ready before next week's meeting, so we have something to discuss, and then we can see. I mean, if we reach agreement, we feel it may not be too much of a stretch to actually target an alpha for this.

H
So the next one I'm gonna hold, because we have a separate topic on the agenda for this item after this one. Then I think there's a couple more items. So there's the sysctls move, graduating it. We don't have an owner for that yet, so if anyone on the call who isn't already signed up for something has the time and wants to be an owner, just let us know, and we can look into moving that forward.
A
Yeah, so on this one, Mrunal, it wasn't until there was some discussion on the Slack channel, either at the end of last year or earlier this year, around sysctls. I would have thought, given the state of sysctls now, I would have been fine just moving it right to GA, but there was some feedback around the user experience when using unsafe sysctls.

A
That, okay, I may want to refresh my memory on, and I don't know if the person who gave that feedback might be on the call or not, but it was more saying that maybe the node should be able to be tainted appropriately based on whether it had unsafe sysctls enabled or not. But we can go back and... okay, okay. Even that, I think, could be handled maybe post-GA of the existing sysctl support. Okay, all right.
H
There, I've got that captured in the notes; we can follow up.

H
So the next one is CRI container log rotation and graduating that, and Urvashi is happy to work on that. I think this was captured in Derek's feature list and we just didn't get to it in 1.20.
C
So now, if we want to enable this by default, I want to make sure other products are also compatible. Back then there was no compatibility issue on GKE, but because of the compatibility with OpenShift, we delayed it. But I know that over time products move forward, not just GKE and not just OpenShift, and there are many products. So I guess this one will have to do some due diligence to figure out that there's no compatibility issue.
B
Yeah, one of the things I have on the agenda for later is a KEP triage thing from last meeting, where I went and did a spreadsheet of all the KEPs, and one thing that I've noticed is people reaching out to me asking, where's my KEP, and it's because there was no issue tracking it. So anybody who has something in that situation, something that they want done but that does not have an issue in the k/enhancements repo, I'm hoping to get that addressed in this release.
H
So the next one on the list is the memory manager, and I think it's awaiting review from Clayton.

H
Thanks, awesome. And the last one on the list is pod resources concrete assignments. Derek, I don't remember enough detail here, if you want to talk to it.
A
Yeah, so this was... the KEP was already approved in 1.20. The pod resources endpoint lets you do things like, for GPUs, understand what devices got assigned to what pods, and the KEP had extended that to understand what CPU sets got assigned to particular pods. I think it wasn't landed in the last release; we had the KEP, but then other engineering distractions prevented us from getting the implementation done.

A
So some of the goal here, I think, was trying to support a longer-term goal of having topology-aware scheduling information fed back to the cluster scheduler, and this was one of the early prereqs. So, yeah, Mrunal had reviewed the KEP, and we'll look at the implementation.
C
The reason I opened this up is that auto-sizing system-reserved, without really handling pod overhead properly and charging it properly... I don't think that solves it; the potential for overload is still there, and so we may have to... sure.
A
The only... the other... so maybe what we could do is, and also a big thank you for helping get this list together and sharing it, maybe as a SIG we could set a goal of giving everyone a week to reflect, and then come back next week and see if there were tweaks or adjustments or gaps that we missed.

H
It makes sense. So today... maybe next week we come back and do a check on each of these and see what we actually want to commit to.
A
Clear, like authors versus approvers type stuff; maybe we can help flesh that out in the week ahead. Yep. But yeah, a big thank you for helping put this together. And the rights to edit this, it should be just anybody in the SIG Node group, right? I think so. Yeah, yeah, okay, all right. If it's okay, we can move on to the next topic.
K
Yeah, I'm filling in for Harshal. So at Red Hat we are seeing quite a few nodes going into NotReady, and on every single node type that we have,

K
we have to calculate the system reserved to try and protect the node from going into those memory-constrained scenarios where the kubelet's not running, sshd's not running, and systemd's not performing correctly. The way that the kubelet does this currently is that you set a system-reserved on it so that the cgroup gets set throughout the system, and currently you have to do that prior to the kubelet starting. So the proposal here is to adopt basically Google's plan, which is that they have an algorithm to calculate a system memory and CPU reservation, and we would like to put this in the kubelet itself.
K
And the way that we are proposing to do this currently is that in an alpha release we would have basically a default profile that contains the algorithm that Google uses. It uses a sliding scale, like four gigabytes of memory gets you 25% set up as system reserved. And then perhaps in a beta or GA release,

K
allow different profiles to be injected into the system, depending on whether there's more demand for this feature. But I think for Red Hat's purposes the scale that Google's using is probably sufficient, and so that's basically the feature that we're proposing here. We can do it outside the kubelet, but that means basically pushing out scripts to all the nodes to be able to calculate this in a uniform way, and we feel that the kubelet's probably the right spot for this.
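For illustration, here is a self-contained sketch of that kind of sliding-scale profile. The bands below follow the commonly cited GKE-style memory reservation (25% of the first 4 GiB, then progressively smaller percentages), but the exact numbers should be treated as an assumption rather than whatever profile would actually ship in the kubelet.

```go
package main

import "fmt"

// systemReservedMemoryBytes applies a decreasing percentage to successive
// bands of machine memory and returns the total to reserve for the system.
func systemReservedMemoryBytes(capacityBytes int64) int64 {
	const gib = int64(1) << 30
	bands := []struct {
		width    int64   // size of this slice of machine memory
		fraction float64 // share of the slice to reserve
	}{
		{4 * gib, 0.25},
		{4 * gib, 0.20},
		{8 * gib, 0.10},
		{112 * gib, 0.06},
		{1 << 62, 0.02}, // everything above
	}

	var reserved int64
	remaining := capacityBytes
	for _, b := range bands {
		if remaining <= 0 {
			break
		}
		n := remaining
		if n > b.width {
			n = b.width
		}
		reserved += int64(float64(n) * b.fraction)
		remaining -= n
	}
	return reserved
}

func main() {
	for _, gb := range []int64{4, 16, 64} {
		fmt.Printf("%3d GiB node -> reserve ~%d MiB\n", gb, systemReservedMemoryBytes(gb<<30)>>20)
	}
}
```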
C
If you trace back in the old issues, I did propose, and talked to the engineers behind it at Google, actually charging this to the node. Even when there's some disk I/O usage, we tried to charge it to the container, charging it to the pod, but that's too dramatic a change, and at the end we ended up with... So basically, that's why we have the proposed maximum pods per node, and that's what we apply today.

C
It is per node, based on the machine size; it's not optimized. Basically, we base it on our production, based on the kernel version. There's also the factor of the kernel: on different kernel versions the usage is actually different, which is unfortunate; the kernel threads you see are different. So I just want to state it correctly: we are not really auto-sizing based on demand. My original idea was basically, before I assign, I will know that this pod, this container, will go onto that node.
C
So how much to reserve would then be dynamic, very easily, because we would pass that node allocatable to the scheduler; you can see all of that in the design. And then scheduling would do its best, because with dynamic change you know the allocatable resources and all those kinds of things, and the system will say, can I accept this job, and do the best it can. We didn't do that in the end.

C
Actually, today it is based on the machine type. We monitor from our kernel version and monitor all those kinds of things, and also different products have different daemons running, beyond the scheduler. So that's why we didn't really need that level. I just wanted to make clear, you know, how complicated it is, and in the original design,
C
what I proposed even included the CRI and also node allocatable and many things. I didn't think about auto-sizing, and it's just hard in the community; even for GKE, a cloud, it's much harder to do. It might be doable, and for some folks using Kubernetes for their internal infrastructure maybe it's way easier, but things like GKE and OpenShift provide services for cloud users, which is much harder.

C
I just honestly share this with you because, for Borg, I needed to propose this many times, but the cost was too big; many years ago the cost was too big, every time you made a kernel change. So that's why, even for Borg, I backed off. I'm just letting you know here.
K
Maybe it would be more beneficial, perhaps, to work on the pod overhead instead of this. What's your thought on that?

C
I do think about the overhead. The reason pod overhead was also proposed a long time back, and we delayed that work, is that at that time we only had the one container runtime, right? We only had runc. The reason pod overhead later gained more urgency is just because of the Clear Containers boom, and of course it's not Clear Containers now, it's Kata, and then there are the device cases and many other kinds of containers.
C
We could enhance a number of things, because we could have changed our infra container a long time ago. When we worked on rkt, I actually did talk to the CoreOS folks, and we could have made our infra container more intelligent. We never got to do those things, because all those advanced features might not be the common use cases for everyone. So that's why we didn't move forward, but we do think about it. In the long run, you need an infra container that can do a lot of other jobs.
A
Yeah, so, Dawn, thank you for sharing the background, at least on Google's experience. When we had researched some of this, I think we looked at what every vendor is trying to do in their hosted space, and I think everyone has different tables to some degree.

A
I think the way I'm trying to think about this, though, and just to maybe softly push back on some of your comments, is that we have heuristics today in the kubelet, like pods per core, right? That is a good example: it's a very rough heuristic, but it's probably used everywhere.

A
Right now, as a rough default, you can punt: what, 10 pods per core or something is our default. And I think what I see is a lot of struggle on even setting it at all, setting system-reserved at all, or setting basically any reservation, so I felt like having some heuristic from the SIG was potentially valuable. Whether or not this particular mapping table is the right table is kind of a different topic.
C
The reason is: how are you going to measure that overhead? Are you basing it on what we introduced, maximum pods per node, to measure it, or do you basically just say, oh, I estimate the average number of pods on each node? How are you going to estimate that, and how are you going to say, oh, 30 pods maybe is my average, and for each pod they have two or three containers running, including those default infra containers?

C
Then you measure that, and you also end up, especially for OpenShift use cases, where the app maybe has some database or something like that running, and maybe they want a single one. They basically tell you: you have reserved too much for Kubernetes, which is only managing one StatefulSet here. And then are you going to override a lot of that? So this is kind of really... This is why I think, a long time back, I saw the easy way as auto-sizing, but auto-sizing, dynamic settings...
C
It's not auto-sizing, sorry, dynamic sizing, based on the assignment, based on the scheduling. But obviously I know I cannot move that over to the scheduler, because that's more change for them, because we really want a simple scheduler, right? For Kubernetes I don't want a more complex, enhanced, intelligent scheduler. But the auto-setting is really hard, because on the node you can only do local optimization, which is not cluster-level.
A
You support a percentage-based value, so you can say I reserve 10% of RAM or something, right? And the thing that strikes me when I look at the table here, and I look at what others have done, is that rather than just having a fixed percentage or a literal value, it's kind of like: if I could provide a function instead, right? Really, what this table is showing is just a plot, and if we could provide a syntax that allows somebody to provide that plot, I think it makes things a lot easier for how you manage nodes across a wide variety of footprints, right?
C
I agree with you. I do have, I think, a similar proposal captured in one of the really old issues, like what's the formula we can use. We can carry on this discussion. I just want to say that if we really want to make this work, you have to... if this is just guidance for the community, maybe that's fine.
A
On Ryan's point about maybe pod overhead versus this: I think that's a false choice. If we changed this, instead of saying auto system-reserved sizing, to say formula-based reservations, and still left it easy to supply the formula, I think we can get the same results, and people can then figure out the right formulas in their production environment. But it's unquestionable to me that this is a plot of a graph, and people figure out, for their sizes, how to best fit a line to that.

A
So I don't know; maybe we can take that feedback and adjust. The other feedback, and I don't know, Dawn, if you had experience on that you wanted to share, was whether we wanted to do reservations for other resources aside from memory. So I think Mrunal and I discussed PIDs, and I forget the other one... ephemeral disk, yeah, thank you. But just looking at this, at least, Ryan, I feel like having a function would go a long way. Okay.
A
Yeah, good point, Elana.

B
Yeah, he asked me to cover for him, because I think his kid is sick. I can cover that now, if we want to cover that next.

A
Yeah, if there's a 1.20 regression, now's the time, yeah, yeah.
B
It's a regression from 1.20. So the issue is linked there, and I think both Sergey and I have spent some time looking into this.

B
Basically, there was a change in 1.20 which was actually something that had previously been reverted because of performance issues, but we are finding that, with this introduced again, we're still having a lot of performance issues on pod deletion. This is causing quite a lot of test flaking, as well as, just generally, pods taking a lot longer to delete, and there are concerns that that might have production implications above and beyond the test flaking.
B
So I know that we've had, I think, Paco Xu suggest a possible patch to fix this. But I looked at it, and it adds a lot more complexity to the kubelet logic, and I'm not convinced it will actually fix it. So Sergey is suggesting we just revert it, and I think I also agree with that, but I wanted to get other folks' feedback on this. It's a really sticky thing that has to do with the sandbox cleanup.
A
It's always a recurring topic, whether we've discovered we're leaking something or we weren't. Seth, did you ever look at the original ones of these in the past, whether we were leaking sandboxes or not? Well, I'm thinking back to...

A
I don't know if David's here, but we had issues where we were wanting to make sure the pod is actually gone before we deleted the thing, and I know at least, and I could be conflating issues, Elana, I'm sorry, but last year both Seth and I spent a fair bit of time helping various engineers explore whether things were deleted or not deleted. My memory at the time was that there was some concern that sandboxes weren't always, for sure, being deleted or not, but my memory could be poor, so I'll have to track that down.
C
I don't know the original problem; I don't think we discussed it at the SIG meeting, or I saw it but didn't look into it. But I'm totally okay with the revert, to avoid the potential production regression, and then we can have more time to figure out how to properly fix the original issue.

B
Yeah, I'll sync with him, and then either he or I or someone else can submit the PR for that.

A
Okay, so we have one minute left, and apologies for the full agenda.
A
If it's okay, the items we didn't get to will move to our next agenda, and if some of them were items that we wanted to explore, it looks like checkpoint/restore, we should probably get them on the 1.21 plan doc. But a big thank you to everyone helping make the meeting so productive today. So I will talk to you all later.