Kubernetes SIG Node, 30 Aug 2022

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Kubernetes SIG Node 20220830

Description

SIG Node weekly meeting. Agenda and notes: https://docs.google.com/document/d/1Ne57gvidMEWXR70OxxnRkYquAoMpt56o75oZtg-OeBg/edit#heading=h.adoto8roitwq

GMT20220830-170256_Recording_3838x2118

A

uh Do we have Kevin and.

A

Okay, uh do we have Kevin and Polly on the call I'm.

B

Back in here, yeah, oh okay, okay, I, don't see Kevin, but well. If he doesn't join, I can still talk a little bit about what he has done at the media.

A

Okay, uh Patrick, let's get started so first, one is an update on the dynamic resource allocation.

B

Okay, let me see what I can share my screen. Real quick I do have a host disabled participated, screen, sharing, I do have a few slides Unfortunately, they are not public connected and I get permission to present.

C

Yeah, let me uh hand that out one second.

B

If not I'll just start talking and it's it's not too much on the slides anyway. uh So, as you all know, we've presented in the past about this thing, but we start with Dynamic resource allocation, which is a mechanism in kubernetes and enhancement proposal that rethinks how devices are potentially handled by kubernetes, because there are quite a few limitations.

B

So I'm not sure what you can see right now,.

A

B

A

B

So that's pretty much. We motivated that's, that's that's.

B

What what's the starting slides container device, Integrity interface is something that has been going on also in parallel uh at the runtime level, and dynamic resource allocation is the level in kubernetes that integrates the whole thing and makes it possible to manage devices in different way, and the main difference is that there are custom parameters for devices, something that describes things that currently just can't be expressed in the kubernetes API parameters for initializing hardware, for example, or complex description of multiple parameters that go beyond the single counter.

B

That is supported by the device, plugin interface right now, um so we kind of went through that already so I'm not going to spend too much time on it.

B

We came up with an approach that is basically a bit like volume handling but gets a bit more complicated because the API is so customizable that we actually have to integrate a resource driver with the scheduler, because the scheduler just doesn't know anything about these claims and what the parameters means, or we have to have some kind of back and forth between scheduler and a custom vendor vendor provided driver for custom resources, but we have a solution. We wrote down a cap, we now I'm getting well like this is this is the overall diagram.

B

It shows that we basically need to touch a lot of the core components, but don't get scared, we do have it working I think, and we also discussed it a lot in the cap discussion with with key Architects. Well, let's, let's get to the actual update for today's calls. This is current status.

B

I've kep got merged as implementable for 125. That was a kind of a way already a big milestone for us, because it it really showed that kubernetes project is behind this, and that's it's not just some wild idea, but some some random guys have, but it's actually something that a project wants to do and I think that's, that's very valuable. That gave us a lot more visibility and also support for going forward.

B

On the other hand, this cat merge happened really close to the cap deadline, and then there wasn't that much time left, given that some people also including myself, went on vacation. So we then decided to give the actual implementation a little bit more time and push back to 126. So that's why you didn't see anything Landing in 125 there's.

B

There are also some open questions that I'm trying to discuss with Tim Hawkin about the Epi design, a teaspoon fairly absent recently, I think also because of a vacation, so that has been stuck a little bit, but there's still time to finish that for 126.

B

and we've been in the meantime again using the time to work on the Prototype that already existed for before we started with 125 and we've been updating bad prototype with some tentative API changes that will probably make to to the cap into the future.

B

We also that's where you guys come in again, try to run or enable end-to-end tests for this thing, so we do have end-to-end tests in that prototype and we can run them locally, for example with local cluster up. That's how I currently do my development I just compiled kubernetes bring up a one. Node cluster and I can run these and test against that cluster, but we certainly would like to have something that runs in brow, ideally with multiple nodes, and that's where last week, I found that there isn't really any existing job.

B

That does that all over node jobs that exist. They are just testing cubelet by bringing out basically a virtual machine and running a special node e to e-test, Binary and cubelet on that virtual machine, which is fine for testing grouplet. But we actually also need to exercise the scheduler.

B

We need multiple nodes to cover scenarios like network attached devices, which is one of the goals for for this feature and I think I, know kind of volunteered to to help a little bit because I'm out of my depth on how to actually run Cube tests so that it uses it brings up a cluster with cryo as runtime, which is the the goal because cryo to release version of it has the necessary CDI support.

B

Containerd will have it soon, but it hasn't done a formal release. So you would. We would have to pull the master plan of containerdy to get CDI support. So cryo is currently the the one that's most most usable for us, because we just need to use I think actually Solve IT some some projects that it was using. The right version already I just couldn't find a configuration where I get a full cluster.

B

D

Hey Patrick uh Patrick about running uh itoa tests with multiple nodes. I have done that with the resize using gke clusters, so you use Cuba to deploy a cluster of the size that you desire. One I use one worker and two two workers and one masternode and you use instead of local. You use I, can work with you on this one. You use skeleton as a type for running your e2e test. Oh, that should put it on that about cryo. That I don't know, but.

B

That's the crucial part, because we container runtime must support CDI.

D

So you may have to modify the kublet uh startup parameters to point to the cryo socket instead of the container D socket.

B

And then there has to be a cryo on that note, which I guess is not going.

D

B

You run on a managed yeah.

D

You can, if you do Cube up and install the cluster, then you can post installation for a small cluster. You can do it or you could modify the cube up script. It's some work uh to uh when it deploys cubelet I Believe by default. It uses container d and to switch that to cryo I, don't know if that is. If there is a flag for that, it's a feature that certainly could be added and it would be useful for others as well. So.

B

Whatever else I can catch there, which would use well, it's not not something that I'm currently actively working on, because it clearly turned out that this would be a major effort to just learn about all of these components and how we currently work, if others have more experience with that, it would would be easier for them.

A

I I think this might be a good topic to bring up in the weeklies uh testing meeting tomorrow. uh I think like Peter and some folks from cry also joined that meeting so.

B

We can I probably have a conflict tomorrow for on Wednesdays, but I can try.

E

To support a certain feature, so you cannot with the eyes.

B

We need we need support for container device interface, which is a spec and needs some code in the runtime to read Json files, that's how it works. The runtime is basically told to modify the container based on a Json file and- and that is the so-called CDI support and cryo. Has it in a released version. Containerd has it in the master Branch, but we are still waiting for formal release. So if there is a way to do a brow job with container D master, that should also work foreign.

F

Yeah just one one point: we do have a whole bunch of jobs on the Sig node container, D tab that use containerdy for master uh both for a cluster test and node tests. So there.

B

Should be a good if you have an example of that with continuity master, that may be a good good sure, yeah.

F

I can send you uh yeah.

B

B

um So, for I mentioned for Prototype a few times, it's basically is a private Pro. It's a pro a branch in my uh Poe kubernetes repo, but it just is an official pull request against kubernetes. So if you have any comments on the existing code, that would be the place to to commend um and I myself already left some comments. Where I have questions to Tim. We just need to touch a CP answers to those now, and this pull request also has some instructions in the description.

B

You can basically check out my Branch compile locally and try out the feature with local app cluster fairly easily.

B

We also have a test driver that doesn't do much because it's Hardware agnostic it just basically injects and variables in a custom way without relying on the the normal functionality for that or normal normal API, and we are using that drive also for end-to-end testing for tests that do failure. Injection, for example, a more realistic example drive or more aggressive, more realistic driver will be the one from Nvidia 4, where gpus so Nvidia is part.

B

Is our our fellow traveler here also motivated by limitations, but things that they can't do with a device plugin interface with four bedroom gpus right now? They have a possibility to put Nvidia Cards into a virtual mode, where you can split up the hardware fairly flexible in in so-called mix, where you can allocate virtual Hardware, basically a subset of Hardware, with certain amount of ram certain compute units they want to make that available to kubernetes workloads, because right now they basically have to be.

B

They have to do pre-partitioning into certain chunks and then hope that this resize resize chunks fit for workload, and this mikron mode will be more flexible and Kevin has already demoed what he has right now and it's fully working. uh He showed us a demo where he was actively on the console, uh creating claims, running ports and.

G

All of that actually actually used to my plan was to show something with that right, yeah and.

B

That case, in that case, I'll just hand it over to you, but we weren't sure whether you joined so I've. Had that said, all I had uh I know, there's one more slide. uh We do have a dra channel on the kubernetes slack, where we are now doing all of our all of our discussions. Actually, we haven't decided about doing regular meetings, yet we were doing some among the people who were interested running up to the cap, that cap submission and we keep doing that. But I think we should start doing regular, open meetings.

B

My questions was ignored would be. Can we use the sick, node Zoom account or what? What kind of Zoom credentials would we need for this, uh which yeah that that's basically the last question that I have and then I could hand over to Kevin.

C

Yeah I mean probably the best thing to do is figure out. If you do, you have a need for regular meetings, but we've given out the zoom information for similar things in the dance yeah.

B

People have asked uh some someone, but I, don't know how how popular that meeting will be right. Now we are fine with with just having a meeting between Nvidia and Intel, but it certainly wouldn't make sense to open it up. I, don't I, don't have a problem with that, except for the logistics of officially hosting that meeting.

B

Okay, well, we can, we can think about it, will if, if more people show up on on that channel and ask for it, we'll probably organize something thanks, yeah.

C

I think the channel is what's new to me. Is there a reason we needed like a specific scope, Channel.

B

um We didn't want to bother you guys on sick note, that's all because it's fairly Lively, we are doing in detail technical discussions there that probably are not of interest to most people on sick note. That was for rationale.

C

Yeah so like I, don't know, I would find it interesting I've um uh seems like there's. Sometimes this tension of um wanting to be involved in the sake and then being kind of Hidden Away from the sick and like the best thing I would say, is just make your involvement as open as possible and um yeah I guess: I didn't know that channel existed. So thanks for uh it's.

B

A new thing right now, but it basically got created I think last week, yeah and- and we can close it again if you think that we should all do all of the discussion on sick note, we can certainly also just shift to to sick note for yeah.

C

I think that's perfect, I! Don't think we need another thing: I guess because there's some government stuff associated with that, it's just more things to over oversee.

E

I also wrote a nice, the too much traffic and people cannot discuss some other signal. Topic. I would rather converging together because anyway, people might want to know the background. Also, it's not you don't need to repeat same thing right. So that's the discussing and other things over the place.

B

Yeah, okay, well, I, think I'll. Take that as a clear mandate to head over to signal and I can certainly tell the people who are who have joined the channel and and make sure that everyone goes to sick note instead, but that works for me.

B

Kevin, are you? Do you agree yeah.

G

That's fine good sounds.

B

A

All right, uh I guess we can move on to the next topic.

A

G

It's fine. um It says that I don't have the ability to share my screen now.

C

G

Let me give you a co-host.

C

C

All right, shut down.

G

uh Yep looks like it.

G

G

Sorry, it kicked me off Derek. Can you give me access again.

G

C

Maybe it'll, let you be a co-host for real this time.

B

I'm getting a message but I'm a co-host.

G

Can you guys see my screen.

H

G

Okay, great yeah, so I don't want to take up too much time, but um I just wanted to show um the Nvidia driver that we've ridden against um this new Dynamic resource allocation framework from kind of an end user's perspective, I'm not going to go into the details of how we built the driver, but I just wanted to show you. You know as someone that might deploy pods against um the Nvidia driver for Dra, you would actually ask for and consume gpus. So um you know just really quickly.

G

First thing I want to show is just uh Bare, Bones cluster, a single node cluster, running the the components that are necessary to interact with dra there's a controller piece of the driver and a demon set running a plug-in on each node in your cluster to actually do the underlying allocation as as necessary, and so in this cluster single node cluster I just have one instance of each of these.

G

um If you're familiar with Nvidia gpus, the next thing I'm going to show, you will be obvious, but for those of you that aren't I'm currently on a machine that has eight gpus on it, four of them currently have this Mode called Mig mode disabled on them, meaning that these gpus can be used in full mode to run workloads. The full GPU can be consumed to run workloads, and these last four in what's called Mig mode, which means that these These gpus are partitionable.

G

They can be partitioned into a smaller sizes and then those smaller kind of sub gpus can then run workloads on top of them, and this output that you're seeing down here is just showing that currently even those last four that are in this Mig mode, don't have any of these partitioned gpus configured for them. If, if and when I um I actually do a creative partition on it, you'll see some sub gpus embedded underneath these and right now, there's nothing there.

G

um So first thing I want to show, then is just um what the Pod specs look like for um or when you want to actually utilize this feature.

G

um If you're familiar, if you're familiar with um with persistent volumes, the syntax of this will look very similar, but the basic idea is that you create, what's called a resource, claim and then that resource claim gets embedded into a pod and event and subsequently associated with a container.

G

So you get access to the type of resource that you're asking for, and so in this first example, I'm, basically just showing how you would have feature parity with the device with the existing device plugin in terms of being able to ask for some number of gpus and then getting access to them right.

G

Obviously, this whole framework is much more powerful, but in a simple case it needs to at least be able to do what the existing device plugin does right and so in this example, and basically creating an instance of a GPU claim, I'm calling it one GPU and then saying that you know if anyone ever grabs reference to this claim, they'll be granted one GPU right and then I have two separate pods. That I've created that each have their own separate resource.

G

Claim that references is that GPU claim and then tells the container running inside of it to grab a hold of that, that resource claim and make sure that all the devices Associated get it with it get injected to it. So what we expect to see from this pod when it gets launched, is that or for this, uh when these two pods launch, we expect them each to get access to to a separate GPU, because there's two separate resource claims that are that are created from that one GPU claim spec that I.

G

That I showed above um contrast this with um the second deployment that I'm showing here, where I once again have a GPU claim that has a single GPU associated with it. But now I have a single pod.

G

That only creates one instance of this claim and then within my two containers of that pod I'm grabbing access to the same GPU, and so it ends up having the effect that you have shared access to this to to a GPU rather than having exclusive access for each of them, um and then the third one that I'm showing here um is just kind of taking this one one level further. Where, instead of having shared access across two containers within a pod, you could actually share access to a GPU across pods.

G

So once again, I have this Global GPU claim object that I create, but now I also create my resource claim in the global scope namespaced by by my test, but you know not embedded within a pod and then, um if I scroll down, you can see.

G

um Oops um I have two separate pods. Now that are both referencing, that one Global resource claim and then grabbing access to that shared GPU. um You know that's named as as I as I showed up here. um So if I then just go ahead and run each of these.

G

um First I apply it then I can um you can see that three of them are in the pending state? Two of them actually started, um and if we run this for a while, we should see them all start up, except, of course something went wrong and my controller crashed on me.

G

um So we're not going to be able to to see what I'm, what I'm trying to demonstrate, because um you know with the controller crash- we're obviously not going to get what we expect here um so yeah.

G

So it's probably not worth moving forward if, unless I restart everything and I don't want to take up everyone's time, but at least you got a a picture of what these um pod specs would look like, um as you start to use this and how you can get the advantage of you know getting access to Shared gpus um instead of just um you know, having to have exclusive access to them, and then so the last pod spec I wanted to show um is a much more complicated one.

G

This one also works so long as my driver's not crashing, um and this is one that actually gives me the ability to partition these gpus up um such that I have a subset of the GPU that I can get access to, and so once again I have this Global GPU claim that I create and then I also have these sub Mig device claims that I can create. That then reference, the GPU claim that I that I want access to, and so in a setup like this I can have.

G

You know a bunch of containers, each wanting access to a specific Mig device and then I can make sure that all of those Mig devices get created on a single shared Mig enabled GPU um rather than being potentially spread across them, and then there's you know, there's syntax to to make sure that that's that that's possible, um and then you know just like before I have some resource claims that I create that reference.

G

The GPU and Mig claims that I showed above in order to make sure that those that the containers ultimately have access to them once they once they get started um yeah. So.

C

Kevin uh one thing: maybe the follow-up on um is uh uh there's no way to can control costs right now like uh so what I was trying to think through is.

C

uh There was no discussion in the existing Cup on like how I would quota claims and if there was uh maybe we could strengthen it because I couldn't find it, um but uh just from a practical standpoint, a lot of folks quota access to gpus within many cheap deployments in the world now and I was trying to think through, like uh how I could get parity of that when using claims and like uh is that something you and Peter can maybe uh give some thought to um yeah.

G

Patrick yeah um Patrick.

C

G

Yeah, no definitely um this has come up a few times now. We don't have a good answer for it yet, but we have we've been thinking about it in the background. Let's put it that way: yeah.

C

So, like you can quote it by storage class, today, storage, so maybe that would be something that we can take inspiration from, um but uh yeah otherwise.

H

That's cool to see.

G

Yep yeah, unfortunately I'm not able to show it in action, I'm, not sure why the controller crashed. It's been working all day and I just tested it again right before this meeting started yeah. Inevitably, when it's live, it doesn't work.

E

Snack the shared TPU by two parts. Is there any explanation or we just assume, that's actually safe.

G

um The the safety comes from the fact that the claims have to be namespaced in the same namespace as the Pod that references them so.

E

If you don't have yeah good I'll show them, that's, we think that's the namespace, that's the uh single tendon. The end is supposedly access is the safe okay.

I

G

Yeah, the the under the the underlying implementation and forces that they have to be in the same namespace, there's no way to break out of the namespace and reference one in a different one.

E

So, besides sends off to you access safety and how we are also guarantee, uh so the two or four whatever to share that with how we guarantee that uh isolation and the resource level isolation. Can we guarantee uh certain things or.

G

For which one sorry say it again, so.

E

The uh work node multiple worker access right, so how we guarantee the quota and the shared all those kind.

G

Of things, yeah, that's what sorry that's what Derek was just asking about, how you can guarantee quota, and we don't have a good answer for that at the moment.

G

um But it's definitely something we need to look into before. This is something that would go. Ga or even beta, I would say.

B

Yeah it would be. The current thinking is that the device driver itself will have to have some kind of quarter system because for kubernetes the parameters are completely opaque. So it can't really know what a certain quota limit might mean in the context of a certain driver.

B

uh So this this logic would have to be configurable in an advice driver when it allocates it knows, for which namespace that is, and that namespace might have a configuration object. That says this many CPU gpus, for example, but then that would be specific to that particular driver. That knows what a GPU is.

C

And so I don't think we need to Hash it all right now, just I think I would View integration with quota as a beta criteria for this um or some story around how to control.

B

I think I think there is something in the capability, because this has come up. Yeah.

C

I did a quick pass through I, didn't see anything proposing changes to quota or adding new quoted resources. So um maybe it's in a separate dock but yeah I, guess Patrick and Kevin. Just uh maybe that's the next. When the demo is great, we can see a quote of concept with it too yeah, but this is really cool to see.

G

Yeah, so everything came back up, I'm, not sure what went wrong. It was just a transient thing, um so yeah so just to to wrap this up um as we expected for that for the first um um deployment that I showed the two different pods have access to separate gpus um for the second test that I showed the two containers have shared access to the same GPU and then likewise in that third test, the two separate pods have shared access to the same GPU yeah.

I

G

Okay, um let's I don't want to take up any more time. So let's pass it on to the to the next topic, thanks guys, yep sure yep.

I

um Question is basically so.

I

Are you able to do that with this model, or is this only one type of splitting.

G

Yeah I mean so this right here is this: these are all just full gpus, um the the one that actually does the splitting I can see. If I can apply it real, quick um make sure it doesn't crash.

I

At the very end, but.

G

Yeah, no, it's fine, um so you know the partitioning that can be done on Nvidia gpus. It has its own API and its own way of doing it.

G

You can so there's two levels of partitioning that can go on in in in nubidy gpus one is you create this box that contains a set of memory and some compute resources, and then, within that box you can create sub partitions where they share the access to the memory, but then have their own set of compute resources inside that and with this model, I'm I'm actively working on so in the existing resource model.

G

There's no way to really leverage that especially not dynamically, because there's no way to say Hey, you know here is a um you know. Here is just memory being partitioned and I'll kind of dynamically create some access to the subcompute.

G

That's inside of that right, but with this model you can do that we can create a um the the memory partition as some something that's shared across all the containers, in a pod, for example, and then each container can then create one of these sub partitions that has access to just a compute piece and I'm in the middle of implementing that right now, which could be part of why this this crashed in the background, because I was doing a bunch of code changes today, um but that's the level of partitioning that you can do.

G

There's there's not much more that you can do just because of the limitations of the interface to the hardware, at least for NVIDIA gpus.

I

G

Answer your question.

I

It does thank you.

G

Yep, okay yeah, so with this one actually I can then um so that one finished running and so now you can see that if I do an Nvidia SMI run here, you can see that um all of those different sub partitions inside of each of these gpus have been created. And then, if I grab the logs for each of those.

G

um You can see that, as we expected, all of the container zeros have access to this partition that has three compute units inside of it um container. One has access to all of the partitions that have two compute units inside of it, and so on with the with the different just, you know configured the way that I specified in the in the spec there cool.

J

I

One more quick question so.

J

I

All just to reiterate, they don't have to be in the same name space though you do not have multi-tenancy with this.

G

um You can't share a claim across namespaces Patrick, correct me if I'm wrong right, they all have to be in the same name space as.

B

The positive that is correct, yes, yeah.

G

Okay, thanks guys.

A

All right folks, uh we can move on to the next topic. uh Status of block, Mount type, feature, gate, uh DGL, sorry, I, don't know your full name.

K

That's me: David.

L

K

um Hi I'm I'm new here so yeah, um so I've been looking at this a bit in mostly in the context of the recent support for username spaces.

K

um This this feature becomes a bit more relevant because in the past, um enabling prop Mount type was probably a bit insecure, um but combining it with username spaces means that you can do things like run Docker and Docker, essentially more um well without as many issues um I asked about a few weeks ago.

K

I don't know how long ago, in slack and I didn't get any response, um and the the background of this feature was the instance about 2018 um and there's a few references of people using it on the internet, but not that much detail and there doesn't seem to be a cap I think it possibly slightly predates the kept process or you know, whatever, whatever isn't written, to kept so um kind of my my question is: what would be the process to getting this towards um beta I assume?

K

The answer is probably writer cap retrospectively, but um I just sort of wanted to bring it up. If there was any context, that's not written down that anyone remembers um or whether I should just sort of take it away and attempt to sort of back back right uh catch for it, I mean if anyone had any suggestions either way.

K

A

I, remember correctly, this was for, like, as you said, for nesting things and at some point there were even discussions of dropping this entirely. So I think your best bet is to, if you want to revive it is to like come up with a cap and then present here, why it makes sense and how it fits into the current username space efforts like what are the use cases that IT addresses and uh yeah I'm going to write a cap for it.

K

K

Yeah I mean I, don't really want to take up any more time. If there's not any detail, it's just. If there was anything that people remembered um it might you might be interested I um recently implemented supporting cryo for it, so it is now supported in cryo and containerdy, so um that might be of interest to someone but other than that I. Don't really have anything else to report right now.

E

uh But David thanks for for pick up this one. So there's the couple things: if we have the wallet uh use kisses and also, if uh so, the concern I can, if this uh uh work with the current Dynamic space and can improve the reduce those negative security concern and please propose those things and then we can move forward to the better if we couldn't there's no other use cases actually, because we do think on. The job also welcome to remove this and also allowance.

E

This one is just the end of this feature, end of the level so either way. Actually it's a hugely helpful for the community life.

K

K

A

K

A

Yeah uh next up, we have Peter with a follow-up on the inheritable capabilities regression.

M

Hey everyone uh yeah, so I finally got some time to uh poke around and um figure out exactly the conditions in which we, uh the behavior, changed uh when we dropped inheritable capabilities in continuity and cryo and uh basically I you know find out there does exist to work around, but you have to you know, change your Docker file and, or you know the binaries on in the uh image that you're using and I wanted to bring up a you know conversation asking about whether, like we're, basically introducing a new paper cut for people who use uh who use capabilities uh and process and container processes that are non-root, because they need to add a whole extra step um and I wanted to talk about.

M

Like you know, do we think that the um you know the changes that we made dropping the inheritable capabilities are really worth it uh for this kind of behavior change. um So I wanted to bring that to the larger group. They don't have any thoughts.

A

So Peter, if you want to like to summarize like before we made the change, we were able to get the right capabilities for exec processes, but now to get the same behavior we have to modify the binary.

M

Yeah- and this is only for uh non-root users, so an argument could be made that, like we shouldn't have like those users, shouldn't have really been able to get those capabilities in the first place. But basically, when you uh run a container with a non-root user and a capability added, the old Behavior was the inheritable capabilities would have been passed and that would have basically merged with permitted and bounding set, that the container is still given and that effectively makes it.

M

So that's the container process was given those capabilities, and now, after having dropped the inheritable capabilities, the uh those non-root users are not able to get those capabilities where they used to be um and to to work around. You have to change the binary that the process is running so.

C

Just make sure you change the binary you're changing the docker file I'm trying to look at the example here, like your document also changed to call set cap.

M

You can use the docker file to change the binary. It's like changing the capabilities on. You know for the file itself, so you know what I did was I used the docker file, because the you know default Alpine image doesn't have those capabilities by default, so you have to add them the inheritable cases or the permitted and effective capabilities to the binary, and you can do that through the docker file. It.

A

I feel like this is something that ambient capabilities was supposed to help solve and I think we stalled that some time ago. Maybe we revived that back again uh and see if, if it can work around this issue, because I don't think that the changing the capabilities of the file is a good work around here. If you can solve it with ambient, instead,.

E

Yeah, it's just huge regression for many customers right.

E

M

Yeah we've had a number of uh people report uh issues in cryo, both upstream and downstream, um complaining about this change.

M

In fact, we reverted we had.

M

We had back ported, this change to cryos, 1, 23 and 22 branches and we've since reverted it, because we made the change Midstream and we didn't want to regress people who were upgrading between a patch version, but we've kept it in 124, and it's now 125 as well, um because we had made them at the Godzilla release, but I'm I'm still unclear as to whether it's really uh is, if we're really adding that much value by the change, because we're adding this regression- and it's only like it's not like we're protecting the processes from actually getting the capability to someone who is smart enough can just add them or like who is aware of the Behavior.

M

Uh Can just add them in the you know by the docker file. So it's not preventing anyone. It's just making it more annoying for people who used to have this Behavior by default.

A

Should we make this configurable in the runtimes and then and then and meanwhile like explore ambient so that way, folks that want to retain the old Behavior have a way to do it.

A

I'm a bit wary of like reverting it entirely. If there is some some Edge case, CV thing that we are missing.

C

That's what I was thinking right now like if he had an option at the runtime, but you'd want that option probably available to the community broadly, since this is very run time.

K

C

Specific so I don't know.

C

That would be a reasonable choice for a runtime to make an idea.

M

Yeah, we could definitely investigate that I I'm sympathetic and that's why I didn't just regret it automatically is that you know it does technically solve a CV, the cve as far as I understand it basically was created because, like cryo and continuity were making like a not idiomatic Unix environment, by giving the inheritable capabilities, apparently process uh managers or processes like damage and stuff through the beginning, inheritable capabilities technically um so it I didn't really find anything where it was. This is like a direct security or like issue. It was a low cve.

M

Just basically saying hey you're doing this wrong. You should do it differently, um so I'm sympathetic to leaving the behavior, um but it does kind of it kind of feels weird to me that we cause this regression and it doesn't from what we know now. Materially change the security um posture.

E

If we couldn't find the runtime thing so the end of we have two people to look around this one and they have to run as the root right so yeah. So all they have to come. We compile generate the battery. The problem is a lot of users be using binary. Is not the control right see a lot of people using existing images, so the and the app will I just don't know how many have that is. It is a problem, so the risk analysis is based on no data at this moment.

E

So I just don't know how many already have that capability built into their binary. So then how many users will be impact so so then, if if they couldn't, if we couldn't find even we find the solution um if the solution is not um so anyway, so that's kind of the could be like a push. Everyone to say: okay, read your head as the root: that's even more problem right.

M

F

M

I was worried originally that that was the only work around I mean the existence of the doc. You know, Dr file change doesn't really change it for people who aren't capable of changing their image, but at least it does. It is a workaround that doesn't involve the you know: capability escalation, um but yeah I agree that it.

M

It also feels weird that we're making a change for uh supposed security flaw and the way that people work around, that is by either having full control of their image or like reducing the overall security posture of their pods themselves.

I

H

Have you given thought to having a configuration for you know how that sets up? In other words, instead of just you know, reverting do you want to make it configurable in the runtime.

M

Yeah and then uh yeah and and I hadn't up until this point and I think that that's a valid approach, I mean I, think you know the default we'd have to kind of think about like, but um it definitely is an option that exists for us, but.

H

M

It didn't it doesn't.

H

Sound like you can just revert and have a new one. It sounds like you. You almost need a a non-defaulted configuration to force um the users into making a pick.

H

um You could do it per pod, but that sounds odd too right, since it's not really helping.

M

Yeah so I mean I I can uh investigate adding the changing crowd. I'm, you know, and continuity can do the same with.

H

We'd implement the same thing: I I! Guess you could start. You could do some kind of an annotation if you really wanted to make it by pod as well.

M

The the one thing about an annotation is, you need a way for the admin to be able to configure like who can use that annotation, which is like crowd, has that capability. But it's cut it's not like the most idiomatic way that we go about it, um or else it basically is effectively like everyone just gets it anyway.

H

Yeah: okay, fair enough.

K

M

Okay, so it sounds like we want to keep this Behavior even, but we want to find a way to help people migrate to it rather than uh just you know, make the change and break a bunch of people, so I think that that's I think that's a good compromise and um I think that's good with me. As long as someone else has anything else to talk about there.

C

But that would mean a runtime if they were motivated, could support a knob.

M

Yeah, because this is a run, a decision that was made on the runtime level anyway, so yeah adding a knob on the runtime to uh to disable this. uh If they want.

C

The only thing that chase down was is there any kubernetes conformance tests that intersects with a capabilities check right now and I? Don't have that to memory, but.

M

um No it actually. This was a testing Gap in CRI test and our our Downstream, with my red hat on the downstream openshift test, did have a test testing this, which we changed, but they just tested the capabilities they didn't test. What the result of giving the capabilities to a pod was um so: okay, uh there's nothing Upstream that prevented us from making this change, because we did and nothing broke until people started. Opening the shoes.

E

Hopefully, we add that to me some tests, the coverage yeah.

C

I was trying to think if we want to come, if we had a conformance test, what we wanted to be then post mitigation, behavior um and I think that's probably the case, but then you wouldn't really want to um uh make people that were previously conformant, no longer conformant either. So it's just a tricky issue uh either way it seems like uh yeah Peter. It seems like we have a clear path on this cool.

M

A

All right, uh maybe we have time for one more uh M group, revisiting allowing node labels to be referenced by the downward API.

L

Hi, it's the first time I've been here as well. I'll try to keep it quick, uh so basically like uh downward API, has not allowed uh node levels and or like annotations historically uh and there's like an issue going back like six years about this and a PR trying to implement it. uh I guess, like my my general question, is: is this something that can be Revisited after six years?

L

um I think that a lot of users generally run their clusters in a way where, like uh information like labels and annotations from the nodes, is pretty uh like readable by anything. So, for example, like we don't like, prevent our users from being able to get that particular information or anything uh so I guess. My question is: uh is this something that could be Revisited and if so, what's the right way to go about that.

C

um So this is just I'm trying to think back on the history of this like there was a lot of sensitivity on.

C

The node label and then also the impact of being able to obviously a note to self-label yeah.

L

The uh the comment on the close PR is uh is mainly saying that.

N

L

A it's a cluster scope resource uh and we don't by default, Grant this access uh to to anything through like our back, so it makes sense not to automatically uh basically assume that everything is going to be able to read or that pods are going to be able to read this because pods are a namespace scope. Resource I believe is the justification.

C

um Probably the best thing to do would be: let's get a new document describing the motivating use case and.

L

Yeah, our primary motivation is for uh being able to expose like Zone information to the pods, so that we can enhance, like our ability to perform routing or testing between pods and different different, like AWS azs or whatever else.

C

Yeah, like Zone information, I think in 2017 wasn't even a well-defined label that was in a great key space at the time, so yep, um if you probably now, if we have a even a more narrow set of labels, such as the failure Zone, which is now I, think a well-defined label or even the OS label, we might be in a better position to have like a an allow list of uh of labels that could be injected.

C

So that would be my recommendation is to start with your minimal need, and then, let's see, if that is fine, I think there are speaking with my own red hat on. There are users I'm aware of who would treat labels as sensitive, because the notes out things are operating. I might be in particular, security zones that the workload can't be aware of so like I I do think going with an allowless type approach might be uh reasonable and labels have been more normalized since 2017. So we might have good success for your use case.

L

Okay, uh when you're saying I open a document, uh this would be uh like.

C

L

Okay gotcha. Thank you.

L

That answers my question.

A

Great uh thanks so I think next on the agenda. We had 125 retro, but we don't have enough time driven. So maybe we can do that.

H

Next week, yeah I will.

J

Move it to you.

H

Next week, Thursday next week, we do thanks all.

A

Right yeah uh sounds good thanks, uh qos class resources, cap rename Marcus. Is that a quick item or needs more discussion? Yeah.

N

Probably yeah probably give a like more.

F

N

Or bigger update next week or the week after, okay, what has been updates there, one just the heads up that I was uh advised and otherwise to write a blog post about about this stuff, like in the early summer, Springtime and I. Finally, deep deep, fried wanted kubernetes program now kubernetes developers or submitted to PR.

N

It was in June if I remember correctly, but now there are good review comments and comments from Peter Hunter's brother, probably uh the blog post. It doesn't make sense to uh kind of publish it. The blog post on the other until the cap has been kind of at least accepted, and there is clear, clear ways, so it will be supported in kubernetes if it will be so I.

N

Guess that's the kind of well or what do people think a kind of now now that now that I'm thinking it it makes sense to me so kind of give the blood flow on the review on hold until under the cap is kind of sorted out that it's accepting at least.

A

Yeah I think it makes sense to hold right till we.

N

uh Yeah but yeah, if people are interested in this, so just go and see the blog post, profit yeah, but anyway, I'll give you an a little bit wider update in the future next week or the week after.

A

All right all right, thanks Marcus, uh when I, we have only a couple of minutes remaining. If you want to quickly update and then we can call uh the sixth meeting yeah.

D

Oh I think this is going to be quick, so uh we uh for the for the pr itself. I think we've been. uh Thank you everyone for merging the continuity support.

D

It is in the master now I think what we want to do is get to get a sense of with the next continuity release will be and when that will get picked up uh once that is in I, think we can move forward with merging the rest of the pr uh testing and merging because we'll have the full loop on the CI and address any issues that are there. At this point, uh I saw your comments on C group B2, you did the review. Thank you very much. uh I think I'll. Add that unit test.

D

That was one of the pending items, but uh don't know if, if I didn't miss anything that you don't see any major issues, it looks okay to you.

D

A

Oh sorry, yeah no I'm still reviewing it when I started reviewing it today. So I took a couple of comments so far, but yeah. Okay,.

D

Please yeah yeah I'll, address them and then I'll do an update with the unit test and all that was one of the known items that was missing and finally, I. Think Marion lubber from uh gke team in Warsaw is uh offered to help with this effort uh and I. Think I can get us good test coverage in Alpha. He's got a very good use case. uh Please welcome him and to believe that covers it.

K

A

Thank things uh and Quentin had a couple of peers: uh Donald Derek, you might want to take a look uh I think if possibly is looking for some reviews. Yeah.

E

Thanks thanks man, yeah.

J

That's right and I if I have just one minute, I wanted to ask um someone from signotes for uh we'd like to spawn to sponsor one of the contributors of kmm um for membership into the kubernetes organization. So if someone wants to help us, please raise your hand. Thank you.

A

I I can help content.

J

Thank you, but we need someone from outside right now: okay,.

E

Oh I can, uh in this I, can help you yeah. Let.

H

D

Thank you, though, thank you on that note. uh Can someone uh sponsor me as well? It's a pain to ask someone to you know: do a keto test on things.

A

Yeah, we know that's overdue: yeah, okay, yeah.

D

This is the right time that I've been meaning to do for a while I just laziness. My laziness is wait. Okay, thank you very much. Bro.

A

All right, thanks for joining folks, uh see you all next week.

A