From YouTube: Kubernetes SIG Node 20200707
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A
...the code, and they were reviewing it. Thanks for your effort in getting them involved, but it looks like they're reviewing some of the previously completed API decisions, including resources. Let me just share the screen so we can look at it.
A
Okay, great. So, yeah, Tim was asking whether we really need the resources allocated to be in the spec: why not do local checkpointing to record allocated resources, instead of the pod resource allocation admission plugin that we added? And a few other questions around runtimes. I hope Mike Brown is here today; he could share some insight. The particular question is whether a RuntimeClass should be allowed to say "I can't do in-place resize."
A
I think this question arose out of a scenario with VM runtimes: they may be able to resize CPU in millicore increments within an integral CPU bound. So if you're going from one CPU to 1.5, it can do it without restarting, but it might need to restart if it needs to go to two CPUs.
A
So, a situation like that. And in general, the way we have currently defined it is: "no restart" is okay as far as the kubelet is concerned, meaning we won't do anything to restart you. But for the end user that may not be true, and the user experience is not really well defined there. They might expect something that then doesn't happen, and that might, you know, confuse them, and they may not be happy about it.
B
Sorry, can I just say, because I don't want to waste anybody's time: I jumped on this KEP because it's particularly interesting to me. You know, I worked on Borg and Omega inside Google, so I know just how complicated this problem is, and actually it's more complicated than even I understand. So I'm very interested in it, and most of the concerns that I had, I think I've waived.
B
I have only a few points left that I think are worth discussing. Specifically: the spec-versus-status, checkpointing-versus-API thing; whether we want a subresource for resources; and conditions and events and signals in general. The last point was the semantics that you just described, but I like the words that you wrote in the KEP, so in my mind that one's actually resolved. So I'm down to three standing points.
A
Okay, so let's pick up the first one. The reasons I have mentioned, and I just responded on that, are that the benefit of having resources allocated in the spec is twofold. One is that it opens the door for what seems like a useful feature where a pod can say: hey, at minimum I need X amount of resources, say one CPU and one gig of RAM, but ideally I'd like two. Then we'd admit it at one and work towards two where possible.

So that way the pod would get scheduled and start running with its desired minimum, sorry, its required minimum, rather than staying pending because resources are not available. So that feels...
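(For readers following along: a minimal sketch of the hypothetical "minimum versus desired" shape being described here. Nothing like this exists in the Pod API; the `requestsMinimum` field below is purely illustrative.)

```yaml
# Hypothetical sketch only: no such field exists in the Pod API.
# Idea: schedule and admit at the minimum, then resize toward the
# full request where node resources allow.
apiVersion: v1
kind: Pod
metadata:
  name: resize-example
spec:
  containers:
  - name: app
    image: k8s.gcr.io/pause:3.2
    resources:
      requests:            # the ideal amount, worked toward over time
        cpu: "2"
        memory: 2Gi
      requestsMinimum:     # hypothetical: enough to admit and start
        cpu: "1"
        memory: 1Gi
```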
B
Like
I,
I
I
I
can't
think
of
how
I
would
use
that,
and
I
don't.
I
don't
want
to
design
to
hypotheticals
my
main
argument
for
context
for
anybody
who
hasn't
read
along
this
giant
pr
back
and
forth,
like
I
admit,
there's
a
lot
of
comments
going
on
there.
The
main
point
that
I
have
with
it
is
as
an
api
reviewer.
B
I
find
it
to
be
very
weird
to
be
putting
something
in
spec
that
the
owner
of
spec
isn't
allowed
to
write
to
and
the
need
for
this
admission
control,
which
is
using
identities
to
subset,
who
isn't
isn't
allowed
to
write
to
certain
fields,
is
really
unprecedented
and
is
a
is
a
strong
smell
to
me
that
something
isn't
working
right.
B
So
I
asked
why
why
not
status
and
vinay
helpfully
disavowed
me
of
the
idea
that
that
would
actually
work,
and
there
was
a
there's,
a
nice
corner
case
that
I
had
missed
in
my
head.
B
What
I
really
think,
though,
is
why,
should
this
be
part
of
the
public
api,
as
opposed
to
being
effectively
a
cubelet
problem?
It
feels
to
me
like
the
right
answer,
for
this
is
a
local
checkpoint
and
to
store
the
allocated
in
status
that
fits
with
all
of
the
precedence
around
the
api.
Both
you
know
the
api,
spec
and
status
blocks
being
owned
by
different
entities
and
all
the
general
patterns
without
needing
any
special
cases,
but
it
requires
a
cubelet
checkpoint.
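(A sketch of the two shapes under debate, with illustrative field names only: the KEP as written put `resourcesAllocated` in spec, writable only through a node-scoped subresource, while the alternative argued for here keeps the kubelet's decision in status.)

```yaml
# Illustrative only; neither field name is final API.
spec:
  containers:
  - name: app
    resources:
      requests:            # desired state, owned by the pod's author
        cpu: "2"
    # KEP as proposed: a spec field the spec owner may not write to,
    # set only via a node-scoped subresource:
    # resourcesAllocated:
    #   cpu: "1"
status:
  containerStatuses:
  - name: app
    resourcesAllocated:    # alternative: the kubelet's admitted decision
      cpu: "1"
```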
C
Hey
tim,
this
is
derek,
hey
derek,
so
I
want
to
apologize
to
banai,
because
I
know
that
this
has
been
a
topic.
That's
gone
on
discussion
for
feels
like
a
year
now,
and
so
I
think
your
point
on
this
being
hard
is
is
valid.
C
You have the initial resource requirements of that pod that were enumerated, and then you have a need to say "I want to bump that up or decrease it," and we had no way to express when we had reached a level state. So if you had just updated the resource requirements and the kubelet couldn't size you down, you were still using more resources than you had initially claimed, and there could be some inconsistency. So I thought it was that need to coordinate on...
C
Have
we
reached
a
level
state
that
needed
to
be
understood
by
scheduling,
quota
and
node
that
led
us
to
where
we
are?
If
my
memory
is,
is
incorrect,
maybe
that's
a
sign
that
our
initial
thoughts
on
it
yeah
yeah?
That's
that's
one
of
the
reasons
I
mentioned
yeah
and
so
like
to
me.
The
sniff
test
tim
is
like
if
having
not
read
this
kept
in
six
months.
C
D
B
On that note, right. So let me be a little bit clearer: I'm actually suggesting that you keep "allocated," but you move it to status. That changes the corner case from "I need to be able to signal that this thing happened" to "what happens if the node reboots and all of that information that's currently in memory is lost," and that's where the checkpoint comes in. That covers the corner case of what happens if the kubelet loses its memory, because, you know, we don't have a representation of things like memory requests in any durable place.
C
So
I
think
the
issue
with
that
and
again
I'm
I'm
happy
to
be
wrong-
was
that
at
the
time
I
thought
we
wanted
status
to
be
entirely
reconstructable
by
the
cubelet
on
a
restart,
and
so
if
the
cubelet
reported
status
for
what
it
actually
observed
was
being
used,
that
that
was
not
consistent
with
what
was
previously
requested.
And
so
you
still
had
say
the
quota
subsystem
potentially
being
gamed
or
confused,
because
it
would
not
be
aware
of
an
attempt
to
reclaim
resources.
C
So
I
thought
quota
is
updated
to
look
at
allocations
and
so
just
quota
wouldn't
be
status
aware,
and
so
that
was
one
of
the
other
design
tensions.
B
I
could
schedule
a
pod
that
requests
a
gigabyte
of
memory
and
then
once
it's
accepted
to
a
node,
reschedule
it
to
use
one
byte
of
memory
or
one
megabyte
of
memory,
and
my
quota
would
say
I'm
using
one
megabyte
when
in
fact
I
was
using
one
gigabyte
because
keyboard
wouldn't
be
able
to
shrink
me
because
I
had
active
pages
or.
C
C
E
Real quick, yeah, go ahead. So we actually don't need to use the max, the reason being that there are two kinds of changes we could make: we could either change the requests or we could change the limits. A limits change can be rejected if, for example, you're using all of your memory, but a downsize of requests will always be accepted by the kubelet, so we don't actually need to use the max. You can temporarily...
B
Right,
which
is
also
an
interesting
attack
right,
I
could
get
a
guaranteed
class
pod
that
I
then
shrink
request
on
or
like.
Are
we
going
to
enforce
that
if
you're
in
guaranteed
class,
you
have
to
shrink
them
in
sync.
E
I
believe
the
current
behavior
as
proposed
and
implemented
would
actually
you
would
accept
the
change
in
requests
and
that
would
take
effect
immediately.
So
if
you
had
eviction,
you
would
actually
evict
based
on
the
new
smaller
requests,
but
the
limits
would
the
qubit
would
attempt
to
reduce
the
limits
after
the
new
spec
has
been
accepted
with
quotes
right
and
that
could
be
rejected.
E
No, QoS...
C
Odd
validation,
though
again
my
memory
is,
is
weaker.
Here
I
thought
in
pod
validation.
We
actually
verify
that
you
can't
change
the
clause
class
of
a
pod
when
making
the
update
itself.
So,
yes,
I
think
eviction
is
kind
of
a
second
order
effect,
but,
like
I'm
not
aware
of
any
reason,
unless
I'm
mistaken
that
you,
you
could
change
clause
classes
with
the
current
proposal.
E
If you have a resize that is admittable, such as a downsize of a Guaranteed pod, but is not updatable by the runtime, you can end up, after admission, in a state where your requests and limits have been changed according to the kubelet and its in-memory record of that pod, but have not been changed as far as the cgroup files.

But what I am saying is that the important thing is that the admission would never be rejected at admission time, right, for a Guaranteed pod being downsized.

But you could have a state where it was unable to reduce the actual limit of the container, because that memory is in use.
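(To make that concrete, a small sketch with illustrative values: for a Guaranteed container, requests and limits stay equal through a downsize; the request change is always admittable, but the runtime may be unable to lower the memory limit below what is currently in use.)

```yaml
# Sketch: downsizing a Guaranteed container from 2Gi to 1Gi.
# Before the resize:
resources:
  requests: {cpu: "1", memory: 2Gi}
  limits:   {cpu: "1", memory: 2Gi}
---
# After admission (takes effect immediately for eviction accounting):
resources:
  requests: {cpu: "1", memory: 1Gi}
  limits:   {cpu: "1", memory: 1Gi}   # the cgroup limit reduction can
                                      # still fail if >1Gi is in active use
```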
B
So
the
the
question
here
is,
it
came
the
way
it
was
explained
to
me.
Vinay
helpfully
explained
this
corner
case
of
having
two
pods
and,
if
you
like,
if
cubelet
were
to
restart,
it
would
not
know
what
its
previous
decisions
had
been
right
if,
for
example,
status
got
lost
right,
which
is
always
sort
of
the
litmus
test
for
status
right.
B
What
if
I
erased
it,
would
it
come
back,
and
that
seems
like
a
real
corner
case,
but
can
be
worked
around
by
having
cubelet
acknowledge
its
own
previous
decisions
via
a
checkpoint
right,
like
I,
I
have
admitted
this
and
I
have
decided
it
made
sense
in
the
case
of
a
restart
either
I
will
either
I
have
the
checkpoint,
in
which
case
I
can
resume
where
I
left
off,
or
I
don't
have
the
checkpoint,
in
which
case
I
hadn't
really
made
the
decision
yet,
and
that
seems
okay,
but
it
comes
at
the
cost
of
making
this
facet
a
cubelet
problem
instead
of
an
api
server
problem.
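(A minimal sketch of what such a kubelet-local checkpoint could look like, assuming a small file under the kubelet's state directory; the path, schema, and values here are all hypothetical.)

```yaml
# Hypothetical content of /var/lib/kubelet/pod_resource_allocation_state
# (file name and schema illustrative only).
podAllocations:
  default/web-6b9c7f:        # pod namespace/name
    app:                     # container name
      cpu: 1500m
      memory: 2Gi
checksum: 1814136019         # guards against partial writes across restarts
```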
D
So I just want to follow up and reply to you on what the clearer use case is. It is what was explained earlier: when we first talked with the SIG Autoscaling team, one of the requirements was that the scheduler and the controller want to set "allocated" aggressively at first, and then adjust it based on historical data and real usage.

So they basically want to let Kubernetes make a lot of the decisions, but they want us to give more signal back, so they can retreat or adjust. That's the main reason originally. And thinking about it: basically they want to keep a tab describing the desired state at the beginning, but the one keeping it is us, not the user. So basically it is like the request.

There is one request from the initial user, who sets the initial request, and then there is the system automatically adjusting it for us; that's all in the required spec. I believe that's the original reason we thought about it. We also talked about status at the time, but status is not what all those controllers watch; they basically watch the spec. And at that time subresources also had limitations. So all those things added up together.
C
Like
I'm
imagining
a
thermometer
right
where
it's
like,
I
originally
set
my
temperature
at
x
degrees.
I
am
now
asking
that
between
the
hours
of
12
and
4,
I
want
to
dial
my
temperature
up
a
little
higher
and
I
thought
that
there
were
multiple
values
we
wanted
to
read.
One
was
what
you
were
saying
was:
did
the
vpa
ask
for
too
high
of
a
temperature
that
the
keyboard
itself
couldn't
satisfy?
C
The
cube
would
still
want
to
say
what
the
current
temperature
reading
was,
and
I
thought
there
was
some
desire
to
kind
of
like
find
a
a
a
a
way
of
approaching
when
we
got
to
a
level
between
what
was
desired,
what
was
met
and
how
strong
of
a
desire
was
being
made.
Maybe
that
made
sense
at
the
time
and
that's
not
coming
clear
to
tim.
B
I
think
I
understand,
I
think
I
understand
the
desire
for
this.
The
signal,
the
the
problem
that
I'm
having
with
it,
is
really
from
an
api
consistency.
Point
of
view
what
you're
describing
in
as
allocated
here
doesn't
belong
in
spec.
It
is
state
information
that
isn't
safe
for
the
owner
of
a
pod
to
set
on
their
own
right
and
that's
that's
the
smell.
If
you
look
at
other
apis
that
follow
similar
patterns
like
service
right,
the
user
can
set
their
cluster
ip
and
we
will
respect
their
cluster
ip
if
they
set
it.
B
C
Like
that,
sniff
test
is
the
same
sniff
test
that
that
made
me
feel
a
horrible
stench
and
probably
delayed
vinay
a
lot,
but
then
like
we,
we
do
have
that
right,
like
podspec.nodename
is
not
a
normal
name,
is
respected.
A
Yeah,
it's
the
same
here
too,
in
fact,
for
the
live
api
review
with
john
legit
and
kubecon
contributors,
san
diego.
That
was
the
first
slide.
I
think
I
had
where
we're
proposing
this.
The
precedent,
for
this
is
a
node
name,
and
we
had
we
had
resources
allocated
plus
a
subresource
for
setting
this
from
the
node
and
the
president
was
binding
and
scheduler
setting
the
node
name.
B
B
A
The
binding
endpoint
does,
if
you
call
it
again,
it's
not
it's
not
that
it's
called.
I
think
my
connection
is
going
on
bad,
so
the
bending
node
name
plus
binding
scheduler,
says
that
once
it
once
it
sees
that
it's
clear
and
if
it
is
not,
if
it's
already
set,
then
the
skiller
will
ignore
it,
and
then
it's
the
user
scheduling
the
part
to
a
particular
node.
You
know
that.
B
C
C
We have many actors that act on things in the system. Is the distinction you're raising meaningful? Like, I can change the container image name on a running pod. I guess: what makes that distinction meaningful?
B
Yeah,
the
distinction
is
that
there's
a
there's,
a
sort
of
philosophy
behind
the
api
like
the
spec,
is
what
you
intend
to
be
happening
and
the
status
is
what
system
is
actuating
and
if
you
say,
there's
a
thing
that
exists
in
spec
and
I'm
going
to
write
to
it,
but
you're
not
allowed
to
write
to
it.
Even
though
you
own
the
rest
of
spec
you're
sort
of
flying
in
the
face
of
all
of
the
existing
conventions
and
node
name
is
different,
because
you
can't
change
node
name
that
doesn't
cause
a
reschedule.
That's
not
a
thing.
B
If
you
look
at
something
like
I'm
trying
to
come
up
with
a
good
a
good
example,
so.
B
C
Like, if a pod has never been started, or it's just stuck waiting to get started after it's been bound to a node, and the kubelet doesn't report a status on it, or doesn't report a resource allocation on it, because a volume hasn't yet attached and the containers were never actually invoked: do we want to worry about those things or not? I guess I'm trying to think through when the kubelet says that this resource is now allocated, and then the kubelet still has to report that additional resource, and the kubelet would have been watching that additional resource, and it had scaling impacts.
B
Yep,
so
I
I
concur
with
all
of
your
assessment.
This
is
not
there's.
No
obviously
easy
good
answer,
given,
I
think,
actually
all
the
work
that's
gone
into
the
cap
is
really
good.
I'm.
What
I'm
trying
to
offer
is
what
I
I
hope
and
I'm
hoping
there's
no
corner
cases
that
I
missed.
I
hope
is
a
slightly
surgical
modification.
B
I'm
saying
take
allocated
with
the
semantics
that
you've
currently
assigned
it,
move
it
to
spec
or
move
it
to
status,
and
anybody
who's
looking
at
a
pod
needs
to
consider
sort
of
the
tuple
of
requested,
allocated
and
actual
and
make
decisions
that
are
sort
of
context
dependent
based
on
those
three
fields.
You
can't
just
look
at
any
one
and
to
cover
the
case
of
what
happens
if
cubelet
restarts
cubelet
has
to
save
that
information.
It
has
to
save
its
own
decisions
somewhere
and
the
api
is
not
the
right
place
for
that.
C
I
don't
want
to
portray
my
arguments
as
being
stronger
than
my
memory
allows
me
to
have
so
on
the
checkpointing
side.
I
know
we've
been
bit
by
checkpoint
in
the
cubelet,
so
there
would
be
some
polling,
at
least
on
a
cro,
to
see
what's
possible
there
empty
dirt
volume
usage
is
actually
really
hard
to
measure
appropriate
in
status,
and
so
that's
a
little
tricky.
C
You're right, I apologize about the usage point. It is just needing to introspect what the runtime had previously been told to write...
B
So
I
initially
thought
oh
well,
we
just
it's
because
c
groups
is
is
missing
information,
but
actually
I
don't
think
that's
sufficient
either
because
there's
still
a
case
of
what,
if
the
node
rebooted
and
came
back-
and
you
just
want
to
restart
all
your
pods
in
place,
all
the
c
group
information
is
going
to
be
lost.
So
I
think
if
this
is
a
case
of
kubelet,
is
making
a
decision,
and
it
needs
to
remember
that
decision
over
time.
C
No, no, but it's been admitted to the API server now, right? And so now we're bringing this onto the kubelet, and then the kubelet has to be like, "yeah, I can still fit this," or "I can still grow this." And the scheduler hasn't even been involved in the regrowing of the requests, right? It's not there. So it's the node fit check and... right.
A
We
were
looking
at
scheduler,
potentially
assisting
a
resize
like
let's
say
it's
a
high
priority
part
and
then
the
scheduler
notices
that
the
same
argument
that
I
have
for
vpa
here,
where
it
can
keep
a
running
average
of
what
the
expected
time
for
this
to
be
updated.
Once
it
sees
the
resources,
change,
scheduler
can
watch
that
same
metric
and
then
say
hey.
This
has
been
it
typically.
If
the
cubelet
could
resize
it,
it
takes
two
seconds
and
it's
been
five.
A
Maybe
I
need
to
kick
out
some
lower
priority
pods
so
to
help
this
guy
grow.
So
I
love.
C
So here's a question, Tim. I'm sorry, I'm still getting my thoughts back into my cache. I have a kubelet with a pod restart policy of Always, right, and the pod has been bound to node "derek," and the derek node has very few resources, because he's just too strapped: he's got, you know, three CPUs and a gig of memory or something, and the pod fit there just fine.
B
Well, yes. I mean, the field is published through pod status, but the kubelet has a file on disk or something that says "this is what I decided in the past," so in case status were to go away, I can always re-decide that. And the pod would be able to restart with the lower limit, and the kubelet would have a way to say: hey, this resize request, while formally, schematically, it is correct, practically I will never be able to do it.
D
Okay, I have to watch the time; we've already spent roughly 40 minutes on this one, holy moly. I could talk about this one for more than a year. And Tim, thanks: you raised all the concerns and refreshed a lot of my memory, and I believe Derek's too. So we can carry on with this one more, and please connect with the community, everyone: if you are interested in talking, please take a look.

This is a really good chance to revisit our decision, and I think about whether we also need to loop in the autoscaling team here, because a lot of the requirements came from them, and the autoscaling team understands a lot about how the autoscaling work goes and what the pain points are. I understand what you propose to checkpoint.
B
Through
the
yeah,
I
just
posted
in
the
chat
the
link
to
my
pr,
which
has
a
bunch
of
questions
against
the
cap,
I'm
happy
to
keep
the
conversation
going
there.
I
had
a
chat
with
jordan
this
morning
just
to
get
some
history
from
him
and
I
will
go
through
the
cap
when
I
get
a
chance
later
today
and
respond
to
vinay's
comments
and
and
update
it
with.
You
know
the
results
of
the
conversation
with
derek,
but
I
would
love
to
keep
the
conversation
going
on
that.
B
I'm
really
down
to
these
three
points
and
I
think
I'm
not
I'm
totally
willing
to
be
swayed,
but
I
really
need
to
understand
them,
because
I
think
that
there's
precedent
that
this
will
set,
which
worries
me.
D
Yeah, thank you, thank you. And the topic is welcome back next time, and we can also carry on the discussion and at least report back to the community what we decided, what we resolved, and what the new open questions are. So let's move to the next topic. Manu, do you want to talk about the /dev/fuse issue?
G
Yeah. So, to give a bit of background before Nalin jumps in: we are working on enabling unprivileged builds inside a pod, so basically not giving any privileges to the pod, but still being able to perform builds. This is done using fuse-overlayfs, which requires /dev/fuse in the container, and while doing that we ran into some issues, and Nalin can talk more, I guess.
B
Greetings
everyone,
so
the
main
problems
we
ran
into
well
in
this
case
was
dev
fuse
not
being
available
to
the
container
the
unprivileged
enchanter
I
should
say
what's
going
on
there
is
that
dev
use
is
not
available.
When
we
create
a
host
path.
Char
dev
volume
mount
to
make
it
available
from
the
node
to
the
unprivileged
container.
We
still
can't
access
it
because
it's
not
added
to
the
unprivileged
containers
device
list
that
gets
set
in
this
device
control
group.
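(For reference, a minimal sketch of the mount being described; the image is illustrative. The hostPath `CharDevice` type is real API, but, as described here, this alone only bind-mounts the device node; it does not get the device added to the container's device cgroup, so an unprivileged container still cannot open it.)

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: unprivileged-build
spec:
  containers:
  - name: builder
    image: quay.io/buildah/stable   # illustrative image
    securityContext:
      privileged: false
    volumeMounts:
    - name: fuse
      mountPath: /dev/fuse
  volumes:
  - name: fuse
    hostPath:
      path: /dev/fuse
      type: CharDevice
# The device node shows up in the container's filesystem, but the
# runtime was never told it is a device, so the device cgroup denies
# open() for the unprivileged container.
```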
B
We
can't
get
around
this
with
just
the
volume
mount.
It
doesn't
fit.
The
pattern
for
device
plug-ins,
which
expect
a
certain
limited
number
of
devices
and
that
fuse
can
just
show
up
anywhere.
You
want
to
be,
and
we
don't
have
a
solution
list
for
this
right
now,
so
we
went
looking
and
it
turns
out.
There
is
an
open
pr
that
changes
the
kubelet
so
that,
when
a
device
on
the
node
is
added
as
a
block
device
or
a
character
device
volume
mount
in
a
non-privileged
container.
B
The
cri
request
also
adds
that
device
node
to
the
list
of
devices
that
we
tell
the
runtime
to
expose
the
container
and
that
ends
up
causing
the
run
time
to
add
it
to
the
device
control
group,
which
would
be
perfect
for
this
case.
But
it's
been
open
for
about
a
year
and
I
haven't
seen
any
traffic
in
it
for
a
couple
months.
So
I
was
hoping
for
some
info
or
guidance.
D
If
I
know
correctly,
we
maybe
we
talked
about
this
one-
I
just
signaled,
or
maybe
it's
different
in
the
different
tastic
group.
Stick
meeting
me
here
from
the
or
maybe
sod
from
the
storage
actually
proposed
some
other
way
to
support
this
one.
So
so
I
don't
remember
the
detail,
and
so
so
so
that's.
Why
did
we
follow
on
that
one?
I
haven't
looked
at
this
request
yet
so.
B
Can
catch
up
with
it
looks
like
the
proposed
options
were
device,
plugins
controls,
sorry
csi
drivers
and
a
local
persistent
volume.
B
I
But a CSI driver may have a notion of a volume device. Of course, that was done for block devices, but it might also work for fuse. And in the case of device plugins, we had a similar scenario with our Intel integrated GPU devices, where the device node is actually the same, but the number of clients on it might be unlimited.
G
Is
there
a
third
part
possible
here?
I
guess
it's
for
dawn
and
direct
them
so
like
where
we
use
something
like
psp
or
scc
or
opa,
to
have
a
list
of
allowed
host
devices
that
the
pod
can
mount,
and
then
we
just
allow
it
to
the
cri
directly
without
creating
a
device
plugin
for
such
use.
Cases.
D
I'm
not
suggest
to
using
device
plugin.
Actually
here
I
was
more
like
the
when
we
talked
about
the
house
pass
and
and
and
I
thought
we
agreed
to
move
to
using
of
the
common
api,
which
is
the
csi
and
from
now
on,
so
not
like
the
ink
tray
and
or
hard
coded
of
the
wallet
management
what
we
did
in
the
past.
D
So
I
that's
the
kind
to
what
my
first
reaction
here
and
and
but
I
understand,
maybe
there's
some
different
special
cases
and
the
current
kind
to
the
csi
driver
didn't
support
the
api
thing
just
included
this
one,
but
we
can
discuss
so
so
we
we,
we
could
figure
out
what
we
are
missing
in
the
csi
api,
how
to
make
that
generic
and
also
but
in
generic
enough,
but
at
the
same
time,
can
support
the
particularly
use
cases
you
proposed
here.
Oh,
we
can
think
about
it.
D
Do
we
need
to
have
the
third
category
to
support
this
one
so
far,
based
on
the
alexandra
and
even
at
device
plugin,
we
could
do
something
it's
high
key,
but
I
I
try
to
not
reinvent
a
lot
of
whales
here
and
try
to
see
the
what's
the
existing
mechanism.
What
is
missing
so
then
to
understand
the
the
discrepancy
between
those
apis
and
then
we
can
say:
do
we
need
instant,
like
immediate
jumping,
to
introduce
new
api
or
new
plugins
right?
So
that's
that's
my
first
reaction.
D
Devices,
yes,
that's
the
initially
when
we
proposed
the
device
plugin
and
unfortunately,
until
today,
us
plugin
kinda
a
little
bit
specific
to
the
gpu,
and
maybe
this
is
the
chance
we
make
that
more
generic
start
a
sec
support
the
second
case,
so
I'm
not
proposing
to
really
using
or
not
using,
but
I
at
least
we
need
to
understand.
What's
the
discrepancy
here
and
then.
C
So I was trying to catch up with all of it in the background here, because I was curious what was bringing this to the top of attention, and maybe there are use cases that people are trying to meet that we're not appreciating. Nalin, do you want to give some detail on what the use case is?
B
Sure. Currently we're using privileged pods with the kernel-based overlay filesystem; fuse lets us do that for unprivileged users, outside of pods in the broader sense. What we're using these for is to do the copy-on-write layering for container image builds, so we can extract the differences more efficiently than with the much slower methods that we have available to us otherwise, which are kind of expensive in terms of time and...
C
So
when
I
read
the
issue,
the
issue
kind
of
just
discusses
like
because
of
container
runtime,
let
me
do
this:
kubernetes
must
do
the
same
type
of
level
of
discussion.
What
I'm
wondering
is
if
we
can
maybe
get
a
more
formal
design
authored
on
why
kubernetes
should
do
this
and
what
use
case
it
fulfills
and
if
we
can
scope
it.
C
I
guess-
and
you
know
at
red
hat,
I
know
we
do
obviously
container
image
builds
and
kubernetes,
and
I
know
other
communities
are
doing
similar
and
so
getting
getting
something
written
around
that
use
case
is,
I
think,
probably
very
clarifying
for
everybody
when
understanding
why
the
request
was
coming.
G
Sure
yeah,
we
just
saw
this
open
issue
and
we
wanted
to
see
if
there
were
any
blockers
around
that,
but
we
can
create
a
new
issue
or
an
enhancement.
I
guess.
D
Yeah
please
and
yeah.
I
agree
with
the
dark
and
clear
they
write
down
of
the
use
cases,
and
then
we
can
start
the
final
use
cases
and
to
talk
about
and
the
exam
of
the
current
solution,
and
then
we
are
open
to
any
suggestion,
and
but
we
need
to
think
about
this,
not
just.
We
also
have
to
examine
why
existing
of
the
solution
is
not
support.
These
cases
use
cases.
K
Sure, yeah. So the issue has been sitting there since 2015, and the use cases we have for setting ulimits per container mostly come from AI/ML and HPC workloads, where we need to set unlimited memlock and a specific stack size.

And currently the situation is that we are enabling those limits globally, per runtime. So, for example, for CRI-O we are creating a drop-in file for systemd and setting memlock, stack size, and other variables to unlimited. But we don't have to have those for all containers; we just want them for specific containers.

The other thing is, we could run a privileged container, which we don't want to do. We want to have all those settings per user, in a container, and I wanted to ask the group how we can move forward on this issue of ulimits in Kubernetes.
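(To make the ask concrete, a sketch of the per-container shape being requested. Nothing like this exists in the Pod API; the `ulimits` field below is purely hypothetical, loosely mirroring Docker's process-level `--ulimit` flag.)

```yaml
# Hypothetical only: no ulimits field exists in the Pod API.
apiVersion: v1
kind: Pod
metadata:
  name: rdma-trainer
spec:
  containers:
  - name: trainer
    image: example.com/hpc/trainer:latest   # illustrative image
    ulimits:                                # hypothetical field
    - name: memlock
      soft: unlimited
      hard: unlimited
    - name: stack
      soft: 64Mi
      hard: 64Mi
```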
C
So,
to
my
knowledge,
it's
been
there
since
2015,
because
nothing
has
changed
since
2015
on
like
a
primitive
we
could
use,
and
so
I
guess
as
funko
is
there
something
that
we
might
be
missing.
I
see
dawn's
original
comm
and
I
think,
still
holds
true
from
2015,
which
is
we
don't
really
have
a
good
knob
to
do
this
effectively
and
the
one
that
was
used
by
docker
in
the
past
wasn't
really
considered
appropriate.
I
But
derek
we
we
nowadays
we
have
a
second
profiles:
pattern
where
runtime
can
define
like
certain
profile
and
when
the
user
can
specify
like.
I
want
to
be
this
particular
profile,
so
something
similar
can
be
here
like
I
want
limit
profile
like
this
and
on
the
runtime
site.
It
will
be
just
expanded
to
set
of
parameters
for
injected
for
container
spec.
K
I'm just curious: if we are allowing all pods to have locked memory, could pods, you know, lock too much memory and starve other pods if they have it unlimited?
K
Let's
say
somebody
sets
core
unlimited
core
unlimit
memlock
unlimit
stack
size.
If
memory
can
be
exceeded
for
other
parts,
if
we
are
setting
it
globally
for
the
complete
node
they're,
like
in
our
dmac
use
case,
we
have
like
two
use
cases.
One
is
dedicated
where
we
let
have
only
one
user
and
one
workload
running
and
the
other
one
would
be
multi-tenancy
where
we
have
a
couple
of.
K
I
don't
know
several
tens
of
hundreds
or
containers
running
using
rdma
with
aiml
workloads
and
if
we
are
setting
this
globally
for
all
pods,
if
this
could
somehow
break
the.
C
Yeah, I don't have a clear answer on that. I'm curious whether others on the call have explored any other workarounds that we might want to raise, but I think, Zvonko, we would have to sit down and think about that carefully.
D
Actually, I want to ask: a while back there was the ulimit cgroup work going on fairly actively, or at least relatively actively, in the upstream kernel. Do you follow that one? Do you know what the status is? I forgot to follow it over the last two years.
C
And
I
don't
know
the
status,
which
is
why
I
was
wondering
if
there
was
a
very
real
chance
that
someone
likes
funko
vermin
would
have
helped
educate
me
on
something
that
may
have
changed,
that.
I
wasn't
aware
that
had
changed.
So,
okay.
C
D
I do think about the ulimit feature now and then. Almost five years ago, if I remember correctly, I made some comment, because people were asking to use the Docker ulimit feature that Docker and containerd had, and that's process-based. We already agreed that it's very easy, but it's not that user-friendly.

At that time, I remember the ulimit cgroup was being discussed, and that's why we hoped that, if the kernel made progress, we could enable it: the container runtime would add the knobs at the lower level so we could utilize them from Kubernetes, which is perfect, and much easier to apply. And if that doesn't make progress, we need to understand why it's so complicated; I can see some of the complexity there.

But we want to understand what it is, so that then maybe we can find another way to do something, or maybe we can help the community, I mean the upstream kernel, by pushing down some requirements from user space. I think the cgroup work has been so successful in the past just because a lot of use cases pushed their requirements down from user space to the kernel; that's why the second version of cgroups is so successful, especially for memory management.

So we are in a unique space to reach out to those kernel folks with expertise and with the use cases users care about. I just want to share that here and see what we can do, because the AI/ML thing has actually been raised many times.
G
Yeah, we can follow up on that for sure and see where the kernel is at, and we'll come back. Meanwhile, I wanted to give a more limited use case that could be enabled with caveats: yes, not all the processes would get the limits, but in most cases, if we can apply limits to the main container process, that may be enough to alleviate the issues that Zvonko is running into.
F
Yeah, hello. So I wanted to ask, or maybe give a quick update for Derek, who wasn't here last time. Last time I asked about the PR that I created to change, per what we talked about in the other SIG meeting, basically switching the sidecar KEP to the provisional state and mentioning that it depends on the kubelet graceful shutdown work that is still in progress.
C
Yeah, so, I appreciate the pain, and, like, you know, I wasn't here last week, and then the holiday weekend still has me catching up, but I have your PR queued up for this afternoon.
F
Okay,
I
I
didn't
know
about
it.
Sorry
cool!
So
if
I
understand
correctly
this
pr,
if
it's
fine
will
be
emerged
and
then
I'll
do
another
following
prs
with
the
callouts
that
we
discussed,
does
it
make
sense.
L
Yes, our question is related to the repository, but Swati will talk.
J
Thanks, Alexey. So we presented topology-aware scheduling in SIG Scheduling, and the scheduler plugin itself was approved to be placed in the scheduler repo itself. But we were hoping to get an opinion and get...
C
So
some
of
those
questions
I
think
you
may
even
just
answered
there,
and
so,
if
you
can
kind
of
update
on
that
that
be
great
and
then
my
my
understanding
for
the
next
discussion
is
basically
there's
a
request
that
says,
would
signal
want
to
sponsor
a
subproject
to
export
topology
information
to
the
scheduler
and
it
sounds
like
you're
saying
the
scheduler
would
support
a
built-in
plug-in
that
could
read
that
information.
C
My
ask
on
the
cap
was
what
why
do
we
need
to
have
more
than
one
node
fleecing
agent
in
the
community,
and
could
we
do
this
as
a
enhancement
to
node,
feature
discovery
or
evaluate
it
in
that
context?
And
so
maybe,
if
you
can
look
at
that
feedback
and
reach
out
to
the
sub
project
maintainers
there
to
see
what
their
perspective
is,
that
would
be
good
going
into
our
next
discussion.
J
Sure
yeah,
I
think
the
reason
we
didn't
go
ahead
with
the
node
feature.
Discovery
approach
was
because
it
populates
the
node
information,
the
form
of
labels,
and
that
wasn't
something
we
were
trying
to
do.
We
were
trying
to
create
crds
here
based
on
the
resource
topology.
So
that
was
the
initial
reason,
but
we
can
definitely
think
about
it.
A
bit
more
and.
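(A rough sketch of the kind of per-node CRD being discussed, with the group, kind, and fields all illustrative: one object per node, listing per-NUMA-zone capacity and allocatable amounts for a scheduler plugin to read.)

```yaml
# Illustrative shape only; not a final API.
apiVersion: topology.node.k8s.io/v1alpha1   # hypothetical group/version
kind: NodeResourceTopology
metadata:
  name: worker-1                 # one object per node
zones:
- name: numa-node-0
  type: Node
  resources:
  - name: cpu
    capacity: "16"
    allocatable: "14"
  - name: example.com/nic        # illustrative device resource
    capacity: "2"
    allocatable: "2"
```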
C
Yeah
yeah
so,
like
I
know
we're
over
time,
but
the
last
parting
concern
I
have
would
be
that
where
we
could
centralize
in
the
case
of
nfd,
like
node
labeling,
to
like
a
core
authority
rather
than
giving
highly
privileged
credentials
out
to
many
different
things.
That
was
a
desirable
trait,
and
so
I
know
the
nfd
community
had
to
tackle
that.
M
It's just not enough to simply provide a cloud-init file to it while the system comes up. So we are trying to get a simple Fedora image to use for node testing, but it turns out a Fedora image is not officially available, so we were wondering if there is a stopgap or intermediate solution where we can host a Fedora 32 image for our node tests and then use it.
D
I
don't
think
about
so
there's
the
people.
I
think
this
is
kind
of
a
mix
of
the
excess
right
so
to
end
to
a
project
and
also
mix
about
the
cncf
beginning,
all
those
kind
of
things.
So,
if
you
don't
mind,
can
we
have
like
a
follow-up
meeting
and
then
we
can
report
back
to
the
signal
here
and
see
what
status,
but
it's
harder
for
us
to
to
have
this
meeting
here
and
and
and
also
people
actually
in
charge
of
that
account
also
is
not
here
so.
D
So how about you schedule a meeting with me, and I'll involve the right parties here: you just add the involved parties from your side, I'll add more from my side, and then we can come back later. So let's take this offline into a separate meeting. Is that okay? We'll get everyone in.
M
I
will
continue
I'll
drop
you
an
email
with
the
same
thread
that
we
are
having
me
and
all,
and
then
we
can.
We
can
talk
over
there
about
suitable
time.
Okay,.
N
Yeah, sure, sure. Before the next meeting I will file the enhancement proposal in the repo.
D
Okay,
so
the
last
one
it
is,
I
think
I
asked
for
approval
and
I'm
going
to
take
a
look
and
I'm
on
that
one.
Okay,
just
I
think
the
x
will
ask
a
pool
and
I
will
take
a
look
at
another
one.
Thank
you
everyone
and
have
a
great
weekend
we're
a
week.
Sorry
about
it.