Kubernetes SIG Node, 30 May 2023

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Kubernetes SIG Node 20230530

Description

SIG Node weekly meeting. Agenda and notes: https://docs.google.com/document/d/1Ne57gvidMEWXR70OxxnRkYquAoMpt56o75oZtg-OeBg/edit#heading=h.adoto8roitwq

GMT20230530-170349_Recording_1920x1080.mp4

A

Hi, everyone welcome to May 30th 2023 signal weekly meeting. uh We have a few items on the agenda, so we can get started.

A

So first up Sergey has shared a list of caps and then we have a big list. I think 31 caps open. So we really need to finalize what we can really uh achieve during 128.

A

So all right, I think what we can do is maybe next week we can make a pass and create a final list, and this week we can try and go through and review and approve what we can sure I mean folks that want to drive something and we want to make sure something is in 128, make sure that you have uh approvers and reviewers who have the time to help you with your kit.

B

We need to make sure that if you own uh make sure that.

C

C

And stuff like that,.

A

All right, I think Sergey. uh Your video is Frozen.

D

A

Yep, okay, uh okay, so we can move on to the next item on the agenda. So harshal I know you want to talk about uh swap. uh Do you want to quickly give an update.

E

Yeah sure I'll just give a link here in the chat. uh So this is the enhancement we are trying to move forward the Swap, and uh we would like to take a cautious approach here, uh considering the impact of swap on the cluster, especially on the Node, so we are proposing uh to enable swap or or why only for the burstable parts.

E

So the idea behind is it is that if you allow user to set the sharp values like you do in case of memory in CPU, considering how much little we know about test drive, probably uh we may not end up in a right uh situation. So our idea is to enable only to the burst tables parts and calculate the values automatically, and this enhancement actually describes in detail how you can arrive at those values and once we get enough confidence in future.

E

Maybe we put it in the same category as today we find memory and CPU so I like to have uh people in this group. Take a look at that enhancement, share your opinions and let us know if we can move forward with that.

B

Yeah I think one more limitation we discussed before is uh whether we can limit everything to C group V2 only because in C group 52 we have way more control and uh security security sector. It will be much better to do it uh in controlled fashion, rather than just enabling whoever gets it will get potentially gets their secrets exposed and stuff like that.

E

Okay, so your suggestion is, we only do this for C group. We do. We can definitely add that you know in that.

A

Yeah I think it makes sense, and that will also encourage people to move to V2. You know as a way to get new features.

E

Yeah yeah, but overall, if I'm sensing it right, we are in support of this to move ahead and uh except just let me know, on the step whether you agree with this approach of cautiously moving forward. There's a we are Splinter beta into two phases, uh and so.

C

E

Know how? What do you think about that? On the comments uh on the on that we are there.

A

So okay you'll be able to make a pass because I've seen what national is.

C

A

A

All right, thank you. We can move on to the next topic. uh Peter and Marcus uh discover cubelet C group driver from CRI, so I just.

F

A

Called Peter, oh God,.

G

Yeah I saw your your note, uh so there is now announcing uh question that um you know I could make sense to bring to the larger team but yeah. We just wanted to illuminate that. We have the cap open and um you know want people to take a look at it. uh Independent of the open question. We decided to break up the runtime status field into Linux and windows, specific Fields, because the C group driver isn't really that relevant to Windows, um but other than that. This changes, as we discussed a couple months ago.

G

The open question is now you know: do we need this generic?

G

um You know way for the runtime to report some information up to the cubelet, especially considering, like you know at this point, the the only c-group driver that secret V2 support is system. D and you know I, don't know of really any plans to add C group confess like do we even need to do this, or do we just you know, begin the process of deprecating secret profess I.

G

Think that's like a slightly different conversation, but in terms of the question about like whether there's a you know, other things that we could use this for I think we had kind of started to talk about. You know part of the initial conversation that brought this up was beginning to think about. Like you know the who, the owner of some of these you know, runtime configurations would be like right now.

G

There's some I mean with the senior driver specifically there's like a split brain thing, going on um and but uh I I can't think of any specific fields that we currently have. That, like should be moved over, because this one's kind of the only one where the Cuban and the CRI need to be in sync but um I know that there was thought about. Maybe leveraging this for qos classes um to have CRI be able to illuminate like broadcast, which do us classes exist so that the qubit can inform the scheduler about it.

G

H

One more thing is what we can also automatically discover configure it. Runtime class handlers.

G

A

Should this be a separate call like? Are we overloading the runtime status? So that's one thing: should it be like get runtime features as a separate call? That's one thing, and one more thing, I want to add, is like when we were discussing username spaces with auth, uh the question came up. Is there a security issue like the runtimes? Are support are supposed to fail if they are not able to create a username space right now? Do we think we need something like this?

A

Where, when the cubelet is calling into the run time, it gets a list of supported, features or say a runtime is too old, it doesn't even know about username spaces, and it's not failing and it's it gets some pod configuration. It says: I I started it correctly. So do we want this additional runtime check to see? Cubelet gets a list of features, and if username spaces is not in the list, it will not even try to start a pod or fail in uh I'm, not sure uh if that makes sense or not.

G

Yeah I think I think having a separate call is fine with me: I think we just stuck it in here because it's you know this was a place where the keyboard was asking the CRI, like you know, hey what do you have for me uh in terms of writing the status but I I think you know a separate one makes sense, I um the but figuring out what the kind of like communication method would be like what the the schema would be might be a little bit tough, because the username space is more of like do you support username space, whereas the C group driver is like which C group driver?

G

Are you choosing, um but I I think that uh generally having a way for their Cuba to ask the CRI for information and having that be all wrapped up into one I, think that makes sense yeah.

A

I

I think um I would avoid I. Don't disagree with what you're saying uh Ronald I just think um if we keep this kept just focused on the minimizing uh misconfiguration issues between runtime and cubelet. I. Think that just language in this cup saying, um if the if the runtime reports, something the keyboard, will then steer to that rather than it's config I, think that that's uh fine I think to do the broader issues. There's still things.

I

Obviously, that Cuba is setting up that runtimes are not setting up and and whether that's rich with c groups are actually present on the system. uh All that gets into a bigger can of worms like um someone was asking this morning about pids uh and their enforcement, and that's still just the keyboard oriented feature so.

H

Anyway, uh personally,.

I

I would just keep this at the uh at the discoverability thing and not necessarily feel like. We need to solve all the problems right now.

A

I

Think that makes.

A

Sense: Derek yeah. We just start with this and see if it makes it still expand.

G

So I do like the idea. Oh someone else go ahead.

D

Sorry I forgot to where to release the head so so kind of night and it's a my question is of the runtime class. It's a service purpose.

D

The original runtime class is actually to answer this kind of the question.

I

So don I think in this case, though, the question is: what is the preferred C group management model of the host node wide, whereas the runtime classes, pod, specific and then um I, don't think we can actually have different managers of the unified C group hierarchy. So this is basically just saying who is that manager correct me if I'm wrong, Peter right.

G

Yeah yeah we're not we're not trying to accept, like continuity, has the option to have speaker drivers per runtime class, but we couldn't really think of a reason to integrate that um and uh so we're not including that so the the I think the runtime classes were brought up, because that is an option for the CRI to tell the qubit. This is what I'm you know. This is what I have for you like.

G

This is what I'm supporting go ahead, and you know cue like uh instead of having you know, to have to have the CRI configuration saying. Here's all the runtime handlers I have and then have the cubelet. You know have those be API objects created, Cube, API and have those need to agree that would come from the bottom up.

G

I think was the motivation of mentioning that um similar to the qos classes, where the the information is going to Bubble Up from the CRI CRI is the one that owns it and Bubbles it up to the qubit, which will then tell the schedule and everyone like here's. The state of the this node.

G

I, like the I I agree with you Derek that we should keep the scope of this cap to be just like. Have CRI tell cubelet what C group driver to use.

G

I also think that it is relevant to discuss, like you know the general pattern of CRI to pass up information to Cuba, to make sure that we have the API uh extensible in such a way where we could support these other options, because I agree that, as we go forward, I think that there will be other situations where we want to extend this to be like CRI, we'll be telling the qubit much more in the future.

G

So I think a separate CRI call makes sense to do this um so that the Cuba can ask it at different times, and it asks the runtime status, uh and that will give us more flexibility, but that, but for this cap we're just going to add this stuff, and then we can think about adding other stuff later or in a parallel cup.

J

Yeah I agree that we should well it's a good discussion that what yeah should we have a separate goal or what? What is? What is the kind of extensible method in the future? I'd add more more information from the wrong time from time to propagate, to keep bullets. I agree on that.

H

A

All right so I think we have a plan uh thanks for the discussion. Folks, move on to the next item. uh Memo will change power phase when containers exit with zero. This looks like a new issue that was just opened yesterday. uh Is anyone on the call that can speak to this one.

K

uh Yes, so it's raised by me, um it was discovered by uh cherry picking picks for um adding this episode targets to critical thoughts and while testing we realized that the face is changed between 126 and 127..

K

In this it seems like a government-based scenario, but I want to discuss because I think this wasn't discussed or realized during the review, and it's like side effect of um of that ER. That is mentioned there.

K

uh So the pr was to ensure that the terminal phase is assigned to All Parts, um but the focus was on deleted Parts, but as a side effect in a couple of scenarios, the device is now different than it was when the container all containers exit with exit code 0 and the restart policy is always so in a couple of scenarios that I listed in the table it was filed, but now it succeeded um so first of all, I would like to discuss like. Is it like a bug?

K

You would consider, or maybe it's a feature, it's a good change. I, don't know if it's a bug like we need to prioritize it thoroughly.

K

um But if it's not a bug, then we should probably clean up the code now because this components they like in the code, they said the face to failed, but it's ignored because the face is now computed, basically um from the container exit clouds. So it's it's misleading um any any views on that.

A

I, don't have anything of the top. uh Has anyone else looked at it closer to be able to give guidance or we want to take it asynchronous and focus can review and chime in there.

D

Whatever is the behavior, so we need that. Could we cause some problem for customer for you, sir okay, so uh so semantic connect is correct or not. We can decide it and we can discuss, but definitely it is the behavioral semantic changeable.

K

Yes, I looked at the I, think documentation, but I don't think it's like like the like. How should it be so I think I couldn't find like resources. That would say explicit how should what's expected.

D

So so the impact it is more like those controller people using right, so all those kind of controller how they are consumed that the face.

D

So basically right now it is if it's not a binary, it is still for terminated, triggered by eviction or preemption whatever. So the face is succeed and then associated with some reason, which is good thing, but the problem is, we just don't know existing controller, how they are people and also customer Implement, their controller, uh how they are going to consume those kind of things, that's kind of the semantic change for user I'm. Not this is wrong or right. Let's just say this is this: is the how the ecosystem consume.

D

These things yeah may cause some product issues which I don't know it's just based on the semantic we change this yeah.

L

That was my question initially like, so, if you have a job that creates a part and it gets, evicted is the job successful because the part was successful and it doesn't retry.

K

K

um No, the top controller would still consider such Parts as failed.

K

K

So the disruption Target was actually designed to to help with that, but actually um this might be even the the yeah. So here it's interesting I didn't think about this, but if the face is succeeded.

K

um The job controller would not attempt to match the deception, Target condition, and, and thus you may not be able to match um policy.

K

um So this might be something to to look at, although it's like rare for users to use jobs that would, um when sick time that they would exit with zero exit codes like this, this is sort of asking for tablets anyway, but yeah. It might be an issue I'm not.

C

K

If this answer your question, so if yeah the the the part, uh but yes so when you say addicted, it's addiction like due to node pressure, for example. Yes, not.

L

Delicious Mission, so I don't think any chart. That shows is Success. Now.

K

Yes, but the tricky part in job controller is the job controller.

K

No way, job controller will consider this as succeeded, so the job will be succeeded. So I went what about the different scenario. Yes, in this case, um the controller will just consider the spot as files and the entire job will succeed.

K

Which is probably okay, I mean if you handle sick time for the you exit with zero in the fall, you don't normally, you did like checkpointing or something, but um like no idea, if you end with 0 106 term and exit with zero, it means that you, you control the file that sort of other exit of the top.

D

There's the I think it should be okay if we can handle other controller right now, because anyway controller upgrade faster. They know the algorithm should controller, actually the. If we can change the controller I have this forward. Compatibility then, should be okay.

K

um Yeah so.com job controller I think is fine but yeah in the wild introduction. Customers may have different yeah.

D

Definitely there's some some customer controller, but I think that at least the out of box kubernetes controller. If we could uh scan through all those kind of controllers and figure out how the handle they will be good good right. So at least that's the minimum requirement. Otherwise this is character.

D

The regression issue for the for the users.

K

Okay, so anything this is It's not like an issue, and we can stay with the behavior.

K

um I would yet try to consult this with Clayton who participated in the in the in the pr um and uh and yeah I also asked him um and you solve them. I think we just clean up the the code right for the different sub components of cubelet to do not set the phase to fail, because that's misleading right.

K

A

So next on the agenda, we have the bpft presentation uh undo.

M

Yep yep I'll try to be short. Can you give me access to screen sharing please yeah.

A

Just a sec: okay,.

A

Sorry, are you able to give him host it's not showing up for me for some reason.

L

F

M

Oh yeah I got it now awesome.

M

Okay, can everyone see my screen all right, cool? Okay, uh let me just give a quick intro. My name is Andrew stoichus I work, mostly in Sig network uh in the network policy, API subgroup, but more recently, as part of my company's uh kind of Endeavors, we've been working well, I work for red hat in the office of emerging Tech and we've been working on a new project called vpfd I poked in my head last week, just to see if y'all would be interested and I didn't get any.

M

No so I'm here to do a quick presentation. I'm gonna try to keep it pretty short I'm, not going to do a live demo. Just in concerns with time, because I know we have a lot of the agenda but still feel free to raise your hand uh during the presentation and I can stop and try to answer any questions.

M

Okay, so I'm not going to dive deep into what evpf as a technology is um I. Think many of us have heard that buzzword because it's been used as a marketing marketing technique by many companies. People are really excited about it and it's kind of this new general, perfect purpose framework that allows users to I won't say easily, but allows users to run sandbox sandbox programs in the kernel without having to change parental code, it can be used for a ton of different purposes.

M

I come from a networking background, that's kind of where it started, but now it's been expanded to include use cases for monitoring, tracing and security, um and so because of this, we aren't just presenting our work to like Sig, Network or signode, or security. We're trying to go for all three, because we think BPF and kubernetes kind of has a lot of overlapping uses and it can affect the security of a cluster at large. So all three sigs are kind of important, so udpf and kubernetes itself has been like majorly On, The Rise.

M

Obviously, if you've been to a past couple coupons, it's come up more and more psyllium, being one of the big drivers of that um there's a bunch of other examples as well.

M

um Apart from the psyllium and Calico cnis, you have pixie, which is around observability, qbarn or armor, which is runtime enforcement Blix, which is a from Kong and being donated to kubernetes six, which is going to be a Gateway API L4 conformance implementation, another project I'm working on along with Shane uh from c network, and then we also have another example of net observe, which is a open source observability operator, they all kind of rely on evpf as their underlying technology.

M

So, although you know the proliference of of BPF and kubernetes is great, it leads to a lot of kind of interesting problems. One of the main ones is that BPF today, like always requires privileged pods, specifically um and, and it needs cap BPF to load and manipulate BPF programs and BPF Maps, that's at the very minimum. Most often in real deployments. You need even more as we can see. I did a sample of the net observe operator and we need all these caps lifted listed below.

M

uh Another big problem with running BPF and kubernetes is there's. No cooperation means right now. So if one operator tries to load a program on the same hook as another operator, things can get messy things can get undeterministic. There was actually a really good talk at the last kubecon, which I'll need to link here from data dog and psyllium on an explicit case of this happening, and it was almost impossible for them to figure it out, but a whole kubecon talk dedicated to figuring out what was broken and what was going on.

M

So this is obviously one of the biggest problems. It's also really hard today to debug these problems like I just mentioned, and then additionally like there's a ton of duplicated code code and functionality across applications which want to deploy BPF and kubernetes. So these are kind of the main challenges we see from a bpfd community and it's also some of the main challenges we've set out to solve with vpfd.

M

So, let's hop into what it actually is. um Obviously it's an open source project. It started in red, Hat's, emerging Tech networking group and it actually started in the red hat ET GitHub organization, but has now been moved to an unapinated, uh bpfd Dev organization. So fully open source project.

M

um We are also listed on the utf.io projects page and what it essentially is is a system Damien for managing BPF programs. uh This means that we're managing full life cycle, BPF programs, uh loading unloading and we're providing a privileged separation, so that only our Damien is privileged, while our users don't have to be, which is kind of really important for one of those for a lot of security use cases it for XDP and TC programs, which are both Network oriented programs.

M

We leverage the live XDP protocol in order to allow program cooperation so that slam and datadog program stomping problem. We saw in last kubecon we're aiming to try and fix that and in this today, we're using lib xtp protocol to do so, which we have a link to that in our Links Page.

M

But in the future, this functional functionality will one day be actually built into the kernel and when that happens, we'll kind of use that implementation instead and we're also going to be providing a lot of policy security and visibility Tools around it around loading BPF, so we'll give fine grained control over who what users can load BPF? uh What hook points they can load to and will also give a lot of visibility into what BPF programs are loaded onto a system um and if any, are malicious.

M

Based on our analysis, so some details, bpfd, is developed in Rust. It's built on top of a rust BPF, Library called Aya, and one of our main use cases is is focused on deploying BPF and kubernetes, which is why I'm here. So we also have a kubernetes operator which is all written and go so we're not asking anyone to learn rust, but it's. It has its perks for being a system Damian language and today, people who are already writing their evpf enabled applications for kubernetes using existing libraries such as psyllium Aya, um lib BPF Etc.

M

They can still use those, but they can then load and unload programs via bpfd.

M

So uh next I'm going to show two slides that just are kind of images. I, really love images. um The first one is showing like what BPF deployment and kubernetes specifically looks like today. As I said, we have a few of our um BPF enabled applications on the bottom. This is on a single node they're. All deploying Damien sets in order to load their BPF applications. These Damian sets are all privileged.

M

They all require cat BPF and they all use kind of their own stack, whether it's live, BPF enabled psyllium, EPF, enabled or I enabled to load and manage their BPF programs. So, as you can see, there's a lot of room here for duplicate functionality. There's not really any segmentation of capabilities.

M

Every application today has to do this, and we don't think this is the way things should be done in the future. We're hoping for it to look something more like this right. So now, you're BPF enabled applications fit in their own layer and they do not need cat vpf because they aren't the one ones, actually loading the BPF programs. Instead, they simply create their program.

M

Specific crd object, whether it's XDP program, crd TC program, crd or Trace Point program, crd and bpfd, which is the privileged entity in the system, will load their program and manage all their map pinpoints. Then what the applications do is use their existing Mac management library in order to interact with those programs on the host.

M

So the overhead to kind of integrating with bpfd stays somewhat small, but we still get the observability and Security benefits of using kind of a centralized Damien, and it also allows other kubernetes users to dynamically deploy BPF programs to the cluster. So if you have, you could have some core infrastructure operators in your distro that need to use BPF. They could also use bpfd, but then, if you wanted to open up the ability for customers to do so as well, you could, which is kind of a cool feature.

M

So we're all here for kubernetes, right and so I. Think one of the biggest one of our biggest focuses in the bpfd community is how we make this work on kubernetes and the value add we bring to kubernetes. So part of that was writing an operator. I already mentioned before it's written go using operator SDK. Many of you folks are really familiar with it. You can really easily test it from our project today, which is a simple make. Target make run on kind.

M

It includes a couple kubernetes apis, uh the first one being this is actually stale, because we just changed our API. We have dedicated program types instead of BPF program, configs, so TC program, xtp program and Trace Point programs are the types reported today, we're also hoping to expand to you probe and a few others in the near future.

M

um In addition to that, we have a BPF program crd, so this is used to store per node metadata, and it's also going to soon be used to enhance the observability of a cluster. Bpfd will be able to report back to admins. All of the BPF programs are loaded on their cluster, not just the ones that we loaded with bpfd, but everything that's running on your system, so an admin will be able to go. Okay list show me all the BPF programs around my cluster that aren't controlled by bpfd and what are they doing right?

M

What are they? You can't do that very easily today. The last thing we include is a config map. Instead of having a dedicated crd for configuring operator, we are just using a config map because there's not many things to configure the last really cool thing. To note here is our BPF bytecode image. Spec, we've written this in order to solve the problem of Distributing BPF bytecode. Today, the way it works is bytecode is often embedded into the binary of the user space application. That's loading it.

M

So, therefore, in order to release a new BPF program version, you have to release the new user space version too. What we've done is packaged BPF programs into bytecode. So now you can have fine gearing conversion and control over your BPF program and you get all of the benefits of a standard oci container image such as signing, so that's really cool and it allows us to kind of integrate with kubernetes a lot easier.

M

Okay, so I am not going to actually do this demo in front of you, because it's going to take some time, I'll just talk through it really quickly and on the slides or explicit instructions for doing so pretty easy. All you need is kind, so you can give it a try, but what this demo shows is two xtp programs being attached to the same interface on every node in your cluster. One is clobbering the other at first and then you change the priorities and now it's no longer clobbering.

M

So it's it's very, it's a very simple demo um and we Show collaborating versus not collaborating by just counting packets, so this is kind of a deeper uh image into that demo. So please go check it out.

M

If you want it's a lot easier to run now, we just got done with our 0.2.0 release, so everything will stay working great, so I, just kind of give you all a really fast overview, 10 minutes trying to not take all the else time of of BTF and bpfd, but we're more interested in why we're here is trying to figure out what this looks like for. Kubernetes, obviously, um we brought this up to six Network already around. Where should we continue the discussion and out of that?

M

We've actually created a new slack channel for bpfd and a slack channel for evpf in general um in kubernetes, so you can find us in either places we're also trying to figure out what role do the sigs network? No security want to play here right, like obviously signode, would care about this this technology, because a user could really easily break a cluster right. I mean you could tear everything apart with BPF today super easily, and that includes cubelet and everything around it right.

M

um So we really want to have multiple sigs involved and we're trying to figure out if that needs to be in a dedicated working group under a certain Sig or its own Sig, probably not, but just ideas were thrown around and then the last question we want to ask is like: could we see some of these apis being endorsed by multiple sigs meaning? Maybe everyone doesn't want to use bpfd, but we all agree here that you know having apis to control BPF in a kubernetes cluster is smart.

M

So why don't we put those in upstream and let various implementations flourish um and then the last thing short roadmap for bpfd? You can see a bunch of stuff we've gotten done super excited about that. um You can also see our tracking project for more. We have a bunch of cool features in the pipe around observability um and some other cool stuff like being able to attach XDP and TC programs to interfaces based on pod labels.

M

So if you want to attach a TC program to all pods with label X, you can do so so yeah I just want to open the floor up for any questions. The last slide is just links um thanks. So much for your time today and yeah, you can find us if we don't get to answer our questions in ready slack at bpfd or hashtag evupf. So thanks so much for your time today.

A

B

Yeah I have a small question follow-up question on that, like you said that uh signals will be, will care about it because of security and reliability. I also curious how much prob, how many problems do you experience with attribution, ebpf with specific ports and processes? I know pixie had a lot of like I've been talking to them and they had a lot of problems with attribution specific board to signals they receive from ebpf and.

M

I, don't really I, don't really follow your attribution for a specific quad. Can you elaborate on that a bit more.

B

So if you I mean, if you receive EBP applicants on a kernel level, you don't necessarily know which Port this belongs to, like you, you know like some event about the process, but you don't know which annotations so like image is uh process is built on like cover right architecture. Do you like and I understand that many providers may want this information and they all will need to hook up in some EBP of pro to start events, or something like that? So do you experience any problems with that? Like do you provide any solutions.

M

I, don't think we provide any solutions for that use case yet, but this is why we came to y'all like that sounds like a really great first use case. um If you could even just jot a note down under my agenda topic, I'm happy to make an issue and kind of see what we can do uh in the future.

C

Okay, thank you. Yeah.

H

And one of our projects, we were doing the same thing by uh inspection from ebpf program with C group passport process. It's not really trivial, but it's doable.

B

Yeah the problem of many vendors, who do the same simultaneously, they will have a problem because, right back to the same events,.

M

Yeah, if we can do the duplicate effort around you know even tracing logging and debuggability of the of BPR programs, like that's kind of what we want to do, um especially in our operator, and we can do that, because bpfd can provide any node specific level uh metadata that we would need hopefully to be able to report that back to the user. So.

A

I think another one is Andrew. How do we know what process loaded a particular BPF program, but if a rogue process came and loaded something and went away so maybe we will need that kind of tracing as well.

N

A

Yeah, if a hook was inserted, that is kind of you know snooping on data on the Node.

M

Yep and I think this. This brings up some good stuff, I. Think the next time I come back here. Someone from my community comes back here. If we can give you all a presentation on like a distinct short observability tool or use case, we've done. That would be really cool if we can show like what you're saying a BPF program loaded on a node, it's loaded by a process, we don't recognize like here we're giving that information back to the user.

A

Yeah Auto warning: okay: this is not something that that was loaded through bpfd. You may want to check it out. Yep.

M

100 percent.

M

Cool so then I I really appreciate y'all's time we are having a generic evpf meeting um for kubernetes coming up on June 5th. You can go on our Channel and find out more and we have a weekly bpfd meeting um every week. There's more information in our GitHub and our website please reach out with questions I'll keep in contact here like we want use cases coming from cignode, and we want to see how we can make this better, so really excited about it. Thank you.

M

A

Great, uh so the next item on the agenda is CRI, pull image with progress.

O

Yes, it's mine, hello, I'm, just bringing this up again. This was started last year. Unfortunately, I lost time to work on this since then, now I'm back in the game, so sort of so I did a bit of a profile concept for the cubelet implementation, how it will be used for those who missed it, or uh maybe don't remember, because it was so far, though briefly. This is about extending the CRI to support not just a.

C

O

Requests but also pull image with the progress so that the runtime will send back the information every now and then about at what stage the image pulling is at the moment, and the request is supposed to be parameterized and the runtime is supposed to uh act upon the parameters. For example, the image can be requested to be pulled with the progress reported, every one gigabyte or every 30 seconds, or maybe every five minutes, or maybe every 25 percent would be being downloaded and then backed.

O

And the background is that in HPC AI some of the container images, they are huge like 20 Gigabytes, it can be image size and if the network bandwidth is not allowing to download session image in five seconds, then the owner of the workload that's been deployed might be just sitting and wondering what's going on, is?

O

Is the image available or should I wait for five minutes until runtime just fails with the timeout not being able to reach the registry, and for that purpose, it's nice to have the events published on the board object when the image has been pulled just so that the owner of the workload knows that something is going on, and then we can judge approximately how much time left. Okay.

A

Thanks Alex I think that in general sounds useful, but I want to make sure that we are also covering something that Docker shim used to have so with Docker. Shame right. If an image is taking too long, then the cubelet was able to talk with Docker and give it more time to pull the image right now with CRI. If we have a very big image, we could be timing out and we could like it could result in pulling trying to pull the image again.

A

So are we also trying to bring back that old Behavior where we see oh as long as the image pull is making progress, we give it more time because it will be very expensive to start from scratch again and there's no guarantee that the bull will succeed again. Within the same time,.

O

That's a good question: I will have to study that that I I don't know.

A

O

A

No worries I mean since you're touching this chord I would, rather that we address that as well of.

N

Course, with the progress timeout is probably the best for now. Just if you haven't had progress for.

A

The you know exactly yeah yeah yeah right. If you don't see progress for a minute or two, then you can cancel it. Otherwise, if it's going, let it continue.

N

A

Yeah, of course, we actually.

N

Support that, with a configuration already for now so adding it to the API, would be nice.

A

Yeah I mean, but we still respect the CRI timeout in the cubelet right, so if cubelet will have to be aware of whatever the runtime is supporting to allow extending it, yeah.

N

Back to the discussion earlier, how do you find out what that timeout is configured to be.

H

Better for them.

N

To give it to us.

O

Right, anyway, so uh uh to review the serial proposed change. I think the comments last time were that it would be nice to see the sketch of design of how how it will be used in the keyboard. So now that's there I didn't find anything obvious. That would be wrong with it, but I will now uh consider what Bruno just mentioned, of course, and otherwise it's open again for the comments.

O

N

The the hardest part is the when you're requesting percent the reason that's a hard. We don't know what the ultimate size is going to be on the image pole.

K

N

Yeah until until we're finished so calculating percentage tolerance but throughput is doable. That's for sure.

N

F

O

Do you think it's uh worth of getting rid of it straight away this moment? If the percentage is hard to calculate, should we just not consider it.

N

I think so it just with the current uh distribution spec API. We really don't know what the size is going to be.

O

Okay, so time based or just yeah but as I said sized in.

N

In time is the problem with large ones right it's more of a yeah timeout with progress. So if you haven't had you know a megabyte over the last 10 seconds, that's easy to do. um But if you say time out, if you're not finished within five minutes, that's good for small images but horrible for 20 gigabyte images.

O

O

Well, it can be probably both I'm out on no progress and Reporting the progress so that the user can see that something is going on. Not not just. We did not reach a timeout I.

N

Haven't read your cat, but the other thing that can happen is you can be trying to pull the same image from or multiple containers or multiple pods, and we do cash that up so only one's pulling at a time, but the others are waiting so again that timeout progress might needs to be tied to the original yeah. Just just a heads.

O

K

N

O

Is based on the existing full image request, so, whatever way was working until now, it's it's the same, but just was transmitting the progress back backwards to the equivalent or or the CLI to whoever requested image.

C

A

I mean Mike, maybe you can add some comments on the kit and I'll. Do that.

O

A

O

Is not for discussion just now, I'm, just printing out bringing this up that we're trying to or I'm trying to make this into the correct release. So.

A

N

O

N

I think this is this is.

A

Important uh I mean we'll be great if we can make progress on this one yeah.

D

So, let's include this in 1.28 candidate, because next week we are going to finalize resume it's make. This is candidate. Okay,.

A

O

Cool. Thank you.

A

Foreign thanks, uh so next one we have is PR, which is talking about disabling CPU, could have a guaranteed but uh Martin I know you are on the call.

A

Or he's not anymore, oh he's on Martin. You want to cover this one.

F

um Well, I first want to I want to hear the question and then I can provide some phone call.

A

Okay, so this is a request Ed by you, by use about.

F

A

Yes, CPU could offer guaranteed pods and it's on the agenda. So maybe you.

F

Wanna, okay, okay, I can I can discuss it. So it's actually. It was originally opened by archim who is no longer working with me, I resurrected it and rebased. It uh there's still some.

F

Well reservations about one of one of the pieces: this is the smallest possible patch that would work. However, there are reservations about the being disabled for guaranteed quality of service pods that have no CPU pin, meaning from security perspective and resource security perspective. It's actually opening the whole just a bit too much. At least that's the that's. The reservation. I'm hearing um I just added a comment to the pr uh I have a private Branch, where I'm playing with the with a different approach. That's slightly more secure.

F

It's also. The patch is also a bit more invasive, uh not too much uh uh I. I thought it would be worse, but I basically need uh some guidance. Probably I mean we have franchise conspati here, but we probably want to hear from Kevin uh who is not here about which approach is uh preferable to him since he's the one that needs to approve it, but we don't have him have him here.

F

So the the one thing we can discuss right now is I can take my private branch and basically post it to the PR, but I don't want to lose the current solution. Just yet so I can open a new PR and we can compare, but that that splits, the discussion and I don't want to do that either. So you know how what what you do? What should we do? What should I do with the approach here um by the way Derek? Is there.

I

I

um I was just catching up on the pr and I I was just trying to make sure I didn't misunderstand something, or maybe you could share what the concern was. uh I guess I would be apprehensive on uh eliminating the use of CFS quota for guaranteed pods. That um did not have exclusive CPUs, but if the, if the pr is restricted to pods that are already given exclusive CPUs I, actually don't understand uh the counter argument to um keeping CFS quota.

I

um Is there some uh relationship there that I might be missing that others are raising? That would give you a pause or no.

F

No, no, you actually described it perfectly. uh The reserve. The current PR uh is removing CFS quota, even when there are no pin CPUs, no exclusive CPUs and that's just a bit too much.

I

Yeah, so we definitely don't want to do that, but if the, if the okay.

L

I

What I'm gonna make sure I understand because uh yeah I think yeah that's too much uh so it would. We all feel good on the Sweet Spot. Then I was saying if you've been given.

F

No pins only totally makes sense, uh I think I have a solution for that, because I mean the current apis. Don't allow that you know don't allow that so I I had to do a few changes to well I had to add to two methods. uh Two couple of interfaces and now I think it's possible uh I linked my my private Branch there, as I said I can either put it into the pr directly losing the current solution or I can open a new PR.

F

It's probably up to you to to decide which one you prefer I think. If the current one is not acceptable, I think I'll I prefer just replacing the content with the new approach. I.

I

Guess I just want to make sure I understood what the original PR yeah.

F

Yeah exactly what is what you said.

I

Okay, yeah um I I wouldn't have any hesitation to merging a PR that eliminated CFS quota for containers that had exclusive CPUs. um Oh.

F

The the there is a different sorry, let's put it differently. Containers are you know easy, it's the it's the sandbox.

F

Basically, if you have a container that has exclusive CPUs, obviously you can remove CFS quota because you are going to be limited by your CPU Affinity. You will never be able to run on CPUs that are not yours, but the the parent C group, the parent slice, will prevent you from using them because the parent size, the the sandbox one, is actually limiting you as well. It's the.

I

Parent yeah, if the sum of uh containers in the Pod uh were guaranteed quality of service and all used integral cores, then yes, that would be perfectly fine merging that PR to not have CFS quota restricted. If there was a guaranteed pod that did not use integral cores, then we would be in a in a gray, Zone I guess, but if, if you wanted to get to that spot to be reaching like a desired outcome, I think that that's perfectly fine, so.

F

So the current idea is to to remove the sandbox CFS quota when there is a when the whole part is a guaranteed quality of service spot, and there is a container within that has exclusive CPUs.

I

um No I think it has to be that all containers have been assigned exclusive CPUs, meaning the part as a whole has a yeah.

F

That doesn't solve the issue most of your reports. Well, at least most of our reports. Don't look like that.

I

C

Maybe we can follow up.

I

And I was just trying to think of the simplest thing that I'd be like yeah. That makes sense, so you don't need it if uh I have to think through the uh the Pod with partial case to know if the hierarchy works as expected, um yeah, maybe that's the best thing uh to do next.

A

We're out of time folks, uh David did you have a quick comment? No.

J

I have a question with Pierre, but I can follow up on the on the GitHub here.

A

So we didn't get through everything and Sergey has a request for reviews on sidecar and Peter. If we can carry on your topic too next week, that's also review requests.

A

All right folks, uh thanks for joining, see you all next week.