Kubernetes SIG Node, 1 Jul 2021

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Kubernetes SIG Node20210629

Description

Meeting Agenda:

https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU

A

Thanks dawn, go ahead, this is 29th june signal. Weekly meeting um looks like we have a pretty packed agenda for today. um Let's kick it off with the pr initial update. I believe it is going to be sergey or alana.

B

Yeah, sorry, I didn't prepare this um yet uh I can go in the end of the meeting.

A

No problem sounds.

C

Good um from the bug perspective, we squashed a lot of bugs. I don't know if you want that update.

A

Yeah yeah feel free to go ahead.

C

uh I think we, what was the specific numbers? I posted them in the slack as well, but we definitely closed over 100 bugs when we squashed bugs last week. So thanks everybody who helped out with that. Let's see what are the exact numbers uh and sergey made some graphs too, which were really great. uh Yes, we closed 136 issues and we updated over 200. So I should say we updated over 200 that were still open. I don't think that counts all of the ones that we closed as well.

C

So uh we did a lot of work. So thanks everybody who participated in that.

A

That's amazing thanks alana for organizing this. This was, I think, a great event and we we saw a lot of new faces and a lot of participation that was great.

A

Okay and next one is you elana? Do you want to go ahead with swap update.

C

Oh, it's me. I thought that vinay was in front of me uh yeah. uh I think that I just need, like final approval, uh jordan finished up api review and I'm just pushing out some stuff to fix nits, uh but other than that, I think it's basically good to go. uh Seth reviewed the implementation a couple weeks ago. Now I think- and there was no- uh there were no really big changes.

C

uh There was a renaming thing uh that needed to happen, but uh the currently the pr end-to-end jobs for both ubuntu and fedora are working, and so I think it's good to go.

A

Cool sounds good. Is there anything else like any support or anything, any review requests you need from anyone.

C

uh I think it's just final approval at this point.

D

Yeah, so I think renault and myself were drafted as a purpose for this when we did the cap review so um we'll do that this week.

C

Yeah derek, if you could take a look as soon as uh you feasibly can. uh I know uh there's a lot of folks who are kind of chomping at the bit to be able to test this, uh and until it merges it's hard to test so um basically the the sooner we get it in for like alpha or beta releases for 122. The better.

D

Yeah, we'll we'll do.

A

Cool, so next one again, I think it's you alana um requesting no approver.

C

Oh wow, it's just the whole agenda is me today huh uh yeah, I have requested approver access. I hope that's! Okay with folks uh somebody pestered me saying you have reviewed something like 500 node prs. Why are you not an approver, so I said okay, I will apply. uh I think, I'm waiting on uh don and some of the folks from google to take a look at the request.

A

That sounds great, uh I think um other than the bug scrub, like. I think everyone knows about the bug scrub, but there's one more thing that I'd like to add which a lot of people might not know.

A

Iran has been leading and facilitating 1.22 leeds cohort training group which I've had the opportunity to experience kind of first hand, and that has been kind of a great great learning opportunity for me, uh as as a representative from sig note, and there are other representatives from other sig uh sigs as well, and this is part of the contributor ladder growth program. So if people want to sign up for like the following releases, it'll be kind of a great learning opportunity.

A

So thanks iran again for that.

C

Oh yeah, no problem.

D

um So I think one of the things we'd like to do alana out of this request is to more concretely write down.

D

Like the growth path for cygnode, others have expressed interest as well, and so one of the things I wanted to do was get the existing set of approvers together to write down, maybe their path to approver and maybe some common criteria that we can evaluate these things going forward. And then hopefully we can meet this week and then proceed from there. But.

D

Like swati, I think everybody else appreciates all your effort in the sig and um yeah. Hopefully we can uh find a path forward.

E

Yeah, so we don't have like the clear written uh requirement. uh Explicit site is being issued in the past, so this is why signaled and also work with some other thing, have some like the some written. But it's not the word clearing, because one issue we realized in the past. It is so some of the sick. Big sticks like sticking like the signal large enough and a poor means three different laws, and so there's no uh final, granular definition. So that's why um this is is missing.

E

So recently, elena and phil fox approach ask approver it's not just atlanta and some other people also ask for, and so then we realize there's the time for us to retain those kind of things make more defined by the existing of the sig prover.

E

And then we can come back because we think about the uh a fair fox includes avalanche on the past towards that one, and we should make that more clear, and so we have a constant standard for the community can follow and also it's written down publicly and then we, the people can follow and they can really just self exams. They are where they are.

E

So just yeah, so we should do this so derek and I just before the meeting say we should uh address this problem and hopefully um this week we can get together to start.

D

And then just in transparency, one of the concerns is that um public intersects, with uh storage and networking and to some degree scheduling for what it embeds so.

D

We just have to work our way through that uh sensitive to those six, so uh we'll look to get that uh written down and um um yeah. So uh thanks for coming stepping forward alone,.

C

Yeah uh one of the things that I I just linked in the chat, uh unfortunately there's just you know the there exists a contributor ladder and path and whatnot, but the requirements to become an approver according to the project are, I think, not aligned with what a lot of the larger cigs actually expect uh because they're like they list the requirements like oh yeah, if you've like reviewed, 30, pr's, you're good to go uh and if you've been a reviewer for three months, and I recognize that, uh for you know certain chunks of code- that's really not sufficient.

C

D

Yeah, um I think uh yeah we'll try to get something more granular um or appropriate, and uh actually would like everyone in the stake here to give feedback on it as well right.

D

um We have to find a way to balance these things so um for those approvers on today's call I'll look to set up uh some time and then um maybe in next week's sick meeting, we can review and then evaluate both alana's proposal and any others who want to come forward with that community standard and so thanks again alana for helping us to get this clarified.

A

Sure, okay, so let's move on to the next item, we have vinay. He wants to talk about in place sport vertical scaling cap, oh yeah,.

F

A

F

Yeah so um over the past week, uh I think uh the core implementation I got it uh mostly completed.

F

uh The only piece that remains in the core implementation is changing the scheduler and resource quota, accounting to use max of requests and resources allocated where previously, in the previous iteration of the design, we had it at resources allocated uh the new piece that came in uh besides the change to switching from uh spec dot resources allocated uh to status, dot resources allocated is uh the resize which I, uh which was easier than I thought, so it is working, and uh the good thing is that we have uh the e2hta switch is pretty comprehensive from the last iteration of the design uh that one, I think wong chen is adapting to uh cater to the updated api, and it looks like she's making great progress on that.

F

So we have a certain degree of confidence with the automated e2e tests uh from the previous iteration that you know we can count on rely.

G

F

Quality of the changes that we've made, that said, uh I'm planning to get the scheduler rq uh changes looked at closely because I don't want to just port this. I want to look at it and make sure that it makes sense uh with the max and uh was wondering if lantau I know ronald had a comment about uh cri.

F

uh I'm wondering I don't know if tim hawkin got a chance to look at the api changes so uh I'll ping him again, uh he mentioned that he's got me in the queue I just don't know how backed up he is. uh I'm wondering if uh lantau had a chance to look at the core implementation. Most of this is report from what david ashbal had already looked at pretty closely.

F

I kept it kept the code structure, the the key piece of how the cubelet processes changes and how it's plumbed down to the cri and how the container status is queried and then the the um the api status is generated for uh part status. They mostly remain the same. The new thing is the uh checkpointing code and I put that under status.

F

So the key thing to look at closely is: does that make sense? uh Does the code look good? Is this something that we could improve on and then the other thing is the resize, so just wanted to see where lantau is at. We have.

H

Yeah, I'm here yeah. I will take a look this week. Last week I checked it. I checked that I don't know whether it's ready for you, because you you don't do not merge work in progress in the the pr title. So I thought that it's not ready for review yet yeah.

F

I didn't want it to be, you know, considered for merge ready until it's pretty close to ready I'll, remove those labels now uh because, okay, I I want to remove it after I get done with the scheduler and resource quota changes, although it's pretty small, I want to make sure that I do this by hand as in manually to make sure that I don't miss anything or so. This was already done in the last iteration we switched from using uh requests because requests can now change.

F

Previously request was immutable, so when you create the part, it is what it is so now we have to do a max of requests and resources allocated. That's what we agreed we decided to do and I think it is um reasonable. uh It might even be simpler. um I'm planning to do that in a couple of days. uh Just I've been backed up with other things, so I haven't gotten to this. I know time is running short, but uh I think the rest of the code is ready for review.

H

Okay, yeah: I will take a look uh this week.

F

Okay, thank you very much.

F

uh That's all I had uh unless uh anybody has any questions for me or watching.

F

Oh one more thing we were wondering: should we uh create a separate direct pr for the e2e test from wang chen to kk, or should it go through my pr? uh What is the preference for signal usually.

C

You need to test that uh it actually is working before you can merge the pr. So you probably want the e to e's in your pr.

F

It was part of the pr or part of my main pr, then okay, we'll do that. I just want to make sure that her contribution to this doesn't just get merged with mine and it stands out. That's all she gets credit. I mean.

C

Yeah you can uh put the the person's name in the commits so uh like or do co-authoring.

F

C

It shouldn't be an issue: okay, we'll.

F

Do that co-authoring? Okay! Thank you.

A

Thanks next we have daniel. He wants to talk about auto system, cube, reserve system reserves. um Do we have daniel on the car.

I

Oh yeah, I am here um yeah. I think I would maybe need like 10 minutes or so explaining a little bit about the background. If there's time for that.

I

Let me quickly share my screen if it's.

I

Possible, I think the host test is able to participate screen sharing at the moment.

F

Yeah, I think she needs to write us goals.

I

Oh okay, so yeah! um Well, that's being done! I have opened a feature issue and it is about reserving cube the cube reserved and the system reserved memory, uh basically a new approach of doing that. Okay, now I think I can share.

I

Hopefully you can see my screen now: okay, I've! I have opened the issue here. I've scrapped a little bit about it, but I want to give you a background about um on it. um Yeah first off hi there um I'm daniel I work for sap.

I

um I work on gardner, so it's an open source um kubernetes as a service and yeah, because that's the first time I'm presenting something here. I want to say hi here: okay, let's jump right into it. Currently, setting cube and system reserved um is basically a static configuration right um in the configuration file, cubesurfed system reserved- and it's typically done prior to the cubelet, startup and updating cuban system. Reserves um requires a cubelet restart and how it is typically done for kubernetes um manage kubernetes offerings.

I

Is they calculate um the cuban system reserved based on the machine size? Gke is apparently doing it. Azure openshift has an enhancement, open and also gardner is doing that. um What we have observed is that calculating reserved resources prior to the cubelet process, starting is an approximation in the best case, and we have had some troubles because on particularly busy clusters, we have seen that the resource reservations were too low and for mostly idling clusters, not a lot of parts running on it, and the resource reservations were too high, based on the machine size, calculation.

D

um Can I ask one question: if that's okay, daniel um yeah.

I

Of course, always this.

D

Is derek, so is the machine size heuristic based on just the instant site, or is it also based on? uh Is that is the pods per node? um Is it just choosing the defaults from cubelet for that, or are you also tweaking, like your desired pod density.

I

um So I think there are different kinds of calculations out there. There's some that also considers max parts right, yeah.

C

I

But I'm currently in the process of of finding a better way to do that, at least for gardner and so yeah. That's what the issue is technically about, and I also wanted to ask of course: um how are your experiences using this heuristic based on the machine size or max parts? For instance, um yeah.

I

Did that answer your question.

D

uh Yeah, I guess so I was just trying to some some folks, you know, might tune for particular um workload characteristics I guess and so like. uh If you are trying to bin pack, your nodes or you're, trying to you know, plan your notes for 20, 40, 80 or just the pods per core calculator, I think, is what 10 pods per core right now is the default and cable. I just wasn't sure if those other knobs were going in the heuristic beyond either ram or cpu.

I

D

I

D

Fine uh thanks go ahead. I'm sorry.

I

Yeah, that makes sense um so just real quick. Why is that even important um consequences over reservation um over reservation leads to lower note allocatable you have that has um effects on scheduling. All of that. In the end, we increase your costs lower the utilization, as you were mentioning regarding bin packing, for instance, um the bigger problem that we have faced is under reservation um for a cpu, I'm just mentioning cpu and memory here, um but I think it's similar for other resource types um cpu here.

I

The q, ports c group, for instance, has too many cpu shares and in case there's a cpu contention, um there's a risk of starving the all the other process, basically in addition to the cubelet and the container runtime, and that has led in our cases to some node instabilities.

I

um Maybe something like plaque is unhealthy, so yeah it can surface in various ways and it's hard to to debug and then to know why um memory, the the major issue that we're seeing here is when you're under under reserve memory is that the the global om linux, om killer hits before the c group level om killer. um So, basically that the node is running out of physical memory before the q parts c group limit is hit, and that has some bad implications because then any almost any process on the host can can be terminated.

I

Yes, it's influenced by the quality of service class um yeah, but we have also seen here, problems also always deadlocks due to when, when this global out of memory happened. So, ideally, of course, the the c group limit on the coupon slice would hit first because before there's um a global out of memory happening.

I

So that's that's the goal for the memory um setting the right limit for the memory here, um yeah just to show it real, quick, um usually we under reserve, so the the c group memory limit is quite high. Actually almost like almost at capacity level. I've indicated that here it's a physical capacity level, and the only thing I want to show here is that the the memory limit on the q port c group is not at all reached.

I

Yet there's a lot of there's a lot of um capacity left technically, but the host itself is almost running out of memory. So, even though, by adding a little bit of of memory usage to, for instance, cue parts or any other um was process, the the limits on the cubot slides are not hit. But before that happens, the the whole um machine is running out of memory and that's typically, what we are seeing when we under reserve memory and that's problematic um one.

I

Second here yeah what drives required resource reservations, any process outside of the q party group, and we have seen that it's mainly the cubelet and the container runtime. I guess that makes sense, and also including the the container runtime processes when using containery, for instance,.

I

And now, then, the question is what drives the cubelet and container runtime resource usage. What we have observed is the number of parts running currently on the node. I think that also makes sense because the cubelet and the container runtime have to handle more parts, and there are also additional container runtime shim processes um for each part.

I

In addition, a big influence is the kind of work like running or deployed on the node currently, and that's something that's very difficult to predict or impossible to predict if you're running a managed service, because um yeah people can deploy any kind of workload on it and we are unsure exactly why. That is the case. I think that would need um more deep investigation to why exactly. um But we we see that across a lot of clusters um showing you here, the parts deployed um just from one to one 10 to 100 parts. You can.

I

You can see here that basically, the cpu and the memory um requirements for for docker the container runtime here and the cubelet are increasing um almost linearly.

I

So that's one thing and the second is the more the more significant one. Sorry, there's a lot going on in here, but I just want you to to check out um here on the side. So, technically, what we did is we we reproduced the issue.

I

We we created a node with exactly the same configuration the same operating system size, um but once with the real world applications running on it, some real workload and the other one, just 110 parts same amount of parts running some some sleep so doing nothing essentially and then comparing the the required cpu and memory for the system, slice and, in particular um the cubelet and the container runtime.

I

What you can see here is for in this um test. Cluster docker, for instance, is using 2.6 gi and the cubelet is using 120 and comparing that, with the real world usage that we are seeing there, it is almost more than double it's 6 gi and 300 mm mi for the cubelet. Even though we are running the same amount of parts, it's the same operating system, it's the same machine and size so yeah the workload has some influence on it.

D

I just had to cross even under a real world workload, was that like 110 pods, that were under, like churn like pod, stopping and starting or new pods getting scheduled, but just like a constant 110 pods at work or.

I

Sorry go ahead.

D

Or is it did you schedule 110 pods, who were actually doing function and then just kind of see what happened? I guess I'm curious. If any of this memory growth is a consequence of kind of entropy in the system from stopping and starting many other pods.

I

This was actually a long-running um note, so this is a node that hosts, um in this particular case, um control planes of other um clusters and okay.

D

So this is just gartner itself hosting the user clusters, cool yeah.

I

So this is long running stuff, um and these are basically the numbers we're seeing there on multiple clusters and yeah. So finally, to the to the proposal or the feature request.

I

Basically, the main idea would be to because we are seeing all these problems try to find a better way to or a different way, how to reserve resources, especially for memory and cpu, and that also takes into account the current usage on the system, because it's really hard to predict, at least for us, and what kind of workload is running on it, how many parts and then how much cpu and memory is going to be consumed by that and has to be reserved um yeah yeah. I.

F

Just want another question: to clarify sorry, the the real world system and the test system. They were uh the pod specs were the similar in terms of both the init container and the actual container spec, except that it's not doing any work, reserving the same amount of memory. But I I just want to make sure the init containers were not missed.

I

um No, I think they were missed. So in this particular case there were no inner containers, for instance, in there they were. This were really plain um load test parts but yeah. It's definitely not a very comprehensive investigation that I would and comfortably say everything is covered.

F

Okay, I'm not drawing any insights. I just want to make sure that we know we are aware of it. That's all.

I

Yeah good good point um so yeah.

J

I have a question uh for this uh non-port non-coupon c group processes. uh Is it so what you are trying to parse with c groups, hierarchy by yourself.

J

And that's from where you get the actual usage in your proposal.

I

You mean in this slice here and in this slide here.

J

No in your proposal so.

I

J

Correctly is what you are trying to adjust the values of system and cube reserved values inside kublad, based on information, which is right now not known to kublet. So my understanding is what you need to parse the rest of the c groups hierarchy inside kublet to adjust those values. Is it.

E

So, alexander, you are right yeah, so maybe we leave let the daniel finish the proposal, but you are absolute writer, so this is so we consider I daniel I post what we discussed in the past and but the giving today's the quantity of the services chase structure uni uh would propose to have to pass down to the c group on the on the disk on the file system system file system. So so that's make those things more complicated, but we we we can finish this and then there's some things that we need to consider.

E

I

Yeah definitely I mean there are a lot of things to consider and I had the first thing I definitely wanted to ask. First is: um what are your your thoughts on it? um Is that feasible? Do you have similar problems? That's the main thing um I mean I come to that in a second in the question part, but that's basically it regardless of the actual implementation, because I haven't detailed that out. I first wanted to ask um I'll just jump here or I'm already just.

D

A couple questions daniel: if it's okay, um you have the just trying to understand the bounding of maybe what you have explored or haven't explored like.

D

H

D

For gardener is the cubelet and container runtime under systems life or in a separate uh c group slice from system.

I

They are currently under they're, not following the recommended secret setup. At the moment they are both under the system slice, so.

D

Yeah, you can't.

I

D

And to be fair, I think that's perfectly fine, um uh but we uh and like here uh with red hat, we were doing the same right and while we, I think in the community, talk about ways of splitting them out. I think um very few have done it in practice. uh So that's good and then the other question I was curious about is, if um you've you've done any testing on what happens.

D

If you have the enforce node allocatable feature turned on to be enforcing on the system c group so like right now, there's nothing actually inducing pressure back on system slice to force reclaim because it's unbounded- and there is a knob that lets. You set a bounding limit, but it's kind of risky to set, because um you don't really know what the moon killer is going to do when it gets to pressure on system slice. But it would naturally point um so.

D

I was just curious like if you have explored uh either bounding the keyblade in run time in a in a separate c group where node allocatable is enforceable on that c group and seeing different results, because that was one question and the other question was um renault and I were chatting a little bit in the background. But one of the reasons we at red hat have been so motivated to look at secret speed. 2 is, we think we would see improvements in this space, and so I'm curious if sap has explored that at all.

I

um We are exploring first, I would say, going to the recommended c group setup um and then basically.

I

Enforcing the node allocatable on on the separate on the system, slice and then also on the runtime slice, but currently that's not being done yet because it's just as you as you see, the problem is just that the memory and the cpu requirements are so vastly different that at least I am not feeling very comfortable on on enabling that um because yeah stuff in system slice and might getting killed. As you were saying um yeah. So I have not done a full-blown investigation on on adding yeah.

D

Is that something we've ever evaluated? I can't recall if we've ever tried to parent or run time and cable in a separate c group and turn it on or not in our own preference. But um am I off that nothing's going to be putting pressure on that to to force reclaim that the numbers uh will just grow? um I don't know.

K

Yeah, I think maybe seth would know derek if we explore that what he's not on and yeah, but I agree with you.

E

So the number could be go and here based on your system level of those config and the trigger system. So then the system, what's the trigger the whole note the performance and also which one even we have the um score side right. Oh, it is the best effort, because uh when colonel look at the kernel wheel, because the kernel will also try to proven of the kernel dialogue, which is really bad after this room, so they will try to pick up which whatever process based on available of those kind of things and actually explicitly.

E

This is not unkilled even like the new kernel version. uh There's not a lot to like, like the mark, certain process, not unkilled, but you could still so then they will. Basically, it is just random or some process has some range of the um's uh circle range and then we'll think about which one have the most can we can reclaim the memory most and then they will cure. So you make this a whole system and predictable performance and predictable and at the same time will be rendered.

E

So all those like the kubernetes, the priority uh priority class is won't. Make sense here: uh what's the signal good part, it is right now when we cure we can detect in kubernetes we can detect and we will kill the entire depart.

E

We do have some signal, but occasionally, if you don't handle very well, you may be end up, kill certain process in that c group, but the whole holy work node is not. The company could be also don't get around to kill the entire path, so you may uh paper over the real problem, but the customer maybe suffer so so all kind of those complexity.

E

I think the executive complexity is being discussed even at the earlier kubernetes time, so so so hopefully, this is why, when the seekable version two come out, I hope we can do a better job on that one giveaway. We have more signal from kernel and we also they have the hub. They have the clubs and we can passing more doing dynamic during the time we could trigger um more.

E

Have the more control like the user space can based on those like the uh or priority class based on how important the job it is, then we can pass those values to the kernel.

E

I'm not sure version sql version 2 can do that. Also, there are one things because, based on our today's layout, no matter, you are using the slides or using the old way with them c group, so they always have the uh not reclaimable uh memory, so a q which is accumulated at the uh top level of the c group. So that's also make that node unlockable over time will be not represented at a given time. So there's the many complexities you just share here.

L

um You kind of detailed uh pretty well like what happens if you under reserve, I'm trying to understand the problem with over-reserving or maybe reserving for kind of uh what you would say is a bin-packed, node or fully utilized node, because I see I feel like if you adjust it you're, going to show that there's more resource allocatable, but then more pods are going to come in and then you're going to have to reduce the amount of allocatable, because system reserve would need to grow to grow.

L

So it's almost like you would be falsely advertising you have more, but if people do actually come on it, then you would start needing to take that away, and it would I'm just curious what the downside of planning for worst cases, because worst case is maybe what you would be been packing like what your pod max pods would be.

I

So I guess um for over reservation, um then less resources right would be advertised to the scheduler and less parts would be able to run on. On that note, and also the c group limit on the q parts right would be lower and basically you would be wasting some resources on the node because you're over reserving for the cubelet and the container runtime. If I understood your question right, I'm not sure.

L

Well, I I suppose that um in the scenario that your pod is your your note is is doing nothing. You have one pod on it. Let's say you're going to have too much space that you'll be the system reserved will be. I have a lot of headroom but nobody's there to use that headroom anyway.

L

I guess my point is that if you are sizing it based on having zero headroom like that you're a pod max with with a set activity level for io, logs and stuff that might spike. However much is used if you just plan for that.

L

I don't see what issue it have, because, because, yes, when it's empty it'll it'll, have you know extra headroom that could be consumed but nobody's there to consume it anyway. If the note is completely full, you already know, then that's how much system reserved you need for a full node and you couldn't take any more in your size correctly at that point.

L

So I'm just trying to point out that if you do reduce the amount of system reserved um you're leaving room for somebody who isn't there necessarily and if you just have it, set up for a full node correctly that that's kind of my question or a point.

I

Yeah, um I guess that makes sense to me how you said it yeah.

E

Daniel there are also have the one slides. I remember you demonstrate that there's the three node and the same configurations even same kernel version. If I remember you mentioned some system identical, but just different workload learning so you observe the memory usage is different.

E

That's also could be the cases earlier, like direct medicine on the new medicine earlier. There also have some other thing. uh I also mentioned another thing that uh could be not reclaimable usage accumulated. There could be one another thing: it is today's situation. uh If customer like the uncertain node, they have the running job and the user have to constantly send the uh kubercato execute and kubercutter logging, which is also that's not a chat too. Unfortunately, today is not all charged to the pod.

E

This is the accounting issue on the memory side and also our code implementation uh for the continuity and the docker. I believe crowd also same cases so still charge to the all node locatable located to the system, and so that's also to bring some variation there.

I

I I guess um that's another point of of why the the cube and system reserve is not really um determined to be able to be determined prior to the kubele startup. I guess, and so there are some variants there, what would make some sort of reconciliation or at least being able to update it at runtime um handy if I understood it, correct.

E

You are so right. This is so. This is why the nomad is the cubos uh uh reserve. Those kind of node equivalent reserve is the best practice um and also it is the best what we can uh share based on the production. So it's more like that. We generate those kind of things through the monitoring and as the average, and so this is also like earlier. We say that we cannot guarantee those things, because if we guarantee then we have to enforce you.

E

You are already mentioned that even we are enforced, you feel uncomfortable right, so we just yeah, because we due to the imitation and the viral cases, no matter is in the kernel or user space. So that's why, um after this many years, unfortunately, we still didn't enforce those things.

E

So if you say that I share something, and in the past we talked about, uh there's also have the formula uh we talked about in the past. Hopefully we could enforce, but until today we feel you're uncomfortable to enforce those things.

D

Yeah so at least for the two questions like this is a very interesting place for further study and learnings to the community, and at least you're, not alone in knowing that others in the community, whether it's myself speaking from red, hat's experience or probably dawn speaking from google's experience, this ability to properly reserve is is a community-wide problem.

D

um One thing I'm thinking about here is like if it. If, if this is an area, you'd want to do further study or how hard it is to maybe tweak your setup, I I would be curious to know if you did go and parent the cubelet and runtime in a different c group and did enforce node allocatable on the cube reserve c group only and not system slice.

D

um Do you see different memory usage characteristics, because the one thing I am thinking is that uh memory pressure on your c group will get reported at. I think what 70 or 80 threshold value? I can't remember and you'll start getting some reclaim, but, as don said, you might hit issues.

C

D

Things like exec and logging uh that might be variable in your real world environment, um but I'd be very interested to learn on that and then maybe renault do you want to talk about all about why we think secret v2 might be helpful here or maybe some things that we're we're holding.

K

Out before I think so, with v1, the io and memory aren't as well behaved in v2. They have proper accounting between the two and another issue. We have really is like uh when you get memory pressure like since we don't have swap your executables will start getting so the file pages will start getting swapped, leading to system instability and like during recent conversations with kernel engineers, like they really said. Oh, you cannot like totally pack it.

K

You need to leave some headroom and, depending on your, like workload, characteristics how much io they are performing, how much memory they are requesting. You'll always need some headroom, so I think it's like um you'll have to balance stability versus wasting some space for like for head room and to address some points like don mentioned about uh c groups, killing some process in the part. So in v2 we have a knob where we can say: oh okay, kill at the c group level, so the whole thing goes away. So what.

E

K

So that will help. The second thing is with uh psi and like ilana is working on swap with those things in place, we'll be able to get in something like um d and will be able to get earlier notifications and be able to take better decisions on what processes to kill. But then the own killer doesn't work well at all under load. It will end up like killing processes. It shouldn't. Do you think it it shouldn't be and only way to protect against.

K

It is like setting a minus thousand, uh which you pointed out, one more thing like I'm not sure like on the cryo side, what we are doing is our shim, which is called con mon is running under the pod slice, and so it gets charged to the workload, but it is also protected by a minus thousand, and we don't like. We wrote it in c, so it doesn't. You have a lot of overhead so that way we always get a notification and also we are not like charging it to the system slice.

K

So those are other things to play around with. If you have like similar knobs on container design and then use like the pod overhead feature, to give your shim additional definition,.

I

I think what I didn't quite understand is how better accounting and the c group version 2 or psi would um would help with that issue here, um where you would need to, or would it where would make sense to adjust the the c group limits, for instance, if the for the q parts lies at runtime, because um processes outside of it um like the container runtime, are using more suddenly?

I

How is that related to that? That's what I'm trying to understand.

K

So there are a couple of things: jobs in c groups. We do like the memory high and low and then where we are kind of trying to use the min or the system uh slice and then we'll see how it helps. The idea is like, as opposed to like the high and max the pressure starts happening based on overage and min, gives you some like a higher level of guarantee that okay, you won't get till. You really start it by a lot.

K

So there are some knobs and we don't have all the answers uh right now to be honest and it's more like okay, we, we got the basic c groups feature in place, which is a parity with v1, and now we are exploring the new knobs and then we'll do performance session to drive like what will uh end up being the recommendations at the node.

G

Yeah, I can add just uh in 122 cycle right now. There is an open cap for memory quest for secret v2, that's starting to explore um using the memory.hi things to um basically, you know induce memory, take back right under system pressure. So that's something actively being worked on right now.

E

um I also can attack why the memory accounting fighter can have this situation so another uh beyond the water menu and the david said earlier so just earlier. I also mentioned that this is actually the user space problem, but we do have the same problem for kernels right right, so kernel takes some behavior and to have certain part of workload container uh most.

E

It is attached to the container, but there still have some slab usage discharge to the kernel, so they will based on a different type of the workload, so all reserved for the system and it could be based on those work. Node could be not appropriate.

E

So this is why dynamic adjust only based on number of the part based on those kind of things is won't be accurate unless you have the better proper memory accounting. So this is from another perspective.

I

A

Sorry, I'm going to have to stop here now. We have just five minutes and we have two more items on the agenda. So I'm going to have to hand over to barthols. Do we have him.

M

A

Yep, do you want to just quickly go through your item.

M

Yeah hi everyone, so I just want to really quickly ask signal the provers to look at this pr, because it's uh it's ready like for a long time and it's got lgdm uh maybe months ago, or something like that so and so that's, basically the only code.

M

Pr that is related to to this feature, the enhancement pr is already approved. So hey.

C

Ed I had a quick question: uh did the uh so the test? Look green? uh I don't see a pr for promoting the test to conformance test. Did that happen.

M

uh There is a pr for that, but I actually asked conformance guys on the on the slack and they they say that this is not going to happen, because this test requires additional hardware configuration, namely enabling one gigabyte huge pages, and they are not going to allow that.

M

So my idea was to pro to promote it to node conformance, but I'm not. Actually. I don't.

C

Know if this is a node conformance thing, because then the node conformance tests are specifically for the cri, and I don't.

G

Know if this is a cri.

C

Specific thing, uh I don't think it uses a cri.

M

Okay, so that that that's seems uh that it's it's not going. I.

C

Think it just needs approval at this point.

M

D

Yeah, so it's gonna be hard to go from two meg to one gig huge pages and get that running in uh edie uh I'll I'll. Take a look at, but I I I don't think that should block this.

M

Okay, thank you. That's it.

A

Thanks ed, um we have vinayak, I believe he wasn't able to join us. um He has kind of his item pointed out there. People can have a look and uh look at the changes.

N

Yeah I have some contact with that. I can give a brief: introduce uh okay, cool yeah, yeah hi, everyone, uh I'm chipton, I I reviewed the uh the ax cap. um So it's about uh it's about the the security capability api we have in kubernetes and it has a people know it has a add-on drop, but the problem is uh the the ad doesn't work for non-root user uh in a straightforward way.

N

So that's what that was about the vaniac proposed to fix this by adding the capabilities in the ambient capability set, which is a linux concept.

N

So so in the original cap he proposed to change the kubernetes api by adding a new field and ambient and also add a new field in the cri correspondingly and but- and I talked with him- I I kind of uh hesitating of changing the kubernetes api in this way. uh So we uh think about the alternative that we reduce the current api but uh changing the behavior uh uh in a transparent way. That is the current uh current, the pl he mentioned in the in the in the notes, uh contented apl and uh cryopl.

N

uh So we just changed the container runtime implementation uh yeah. That's uh it's just a perfect introduction and the people can review the cab can review this uh uh revised prs, uh so the the I think he wants people thought on. If this is a battery or we want to change the api. Okay.

K

I saw I saw the pr and I left a couple of comments, so I guess like we can figure that out. Async like basically.

C

K

Intersection with psp plus and like a special case where a user does a drop all and then specifically adds back all the capabilities in the default set. So are we worried about things like that.

N

Sure, uh yes, yes, sure uh and uh I my personally, I think the we may uh want to clarify the the spec. The current spec is uh kind of, uh maybe not uh very clear, so we can uh clarify what will happen in different cases. Yeah.

H

Okay, I think both options were discussed before, but uh the the actual field was preferred just because of the backward compatibility right, given the part object. If we just change the meaning of add capability, it may cause some unexpected behavior like, for example, there is a part that they, a non-good part. They have at some point capability by by mistake, and we with this.

H

Change capabilities will be granted to the cause, some surprise to yielders, because the behavior changed underlying. Although the new behavior is expected, I mean it's the right behavior, but there's a behavior change to the existing part.

A

Okay, any other comments, questions in relation to this item.

A

Cool, I think we made it on time. I I did see there was a lot of interest in danielle's agenda item. So if, if there are follow-up items, we can follow up on slack or maybe bring it up in the next meeting just to follow up the discussion um thanks everyone for joining, um see you guys next week.

A

Thank you so much.

O

E

Thanks sweaty thanks thanks sweaty for hosting today's meeting too no problem. My pleasure don. Thank you.