From YouTube: Kubernetes SIG Node 20200811
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
B
As a highlight, I didn't get a chance to check it this morning, at least for PRs pending approval. I think yesterday, when I was looking at this, we were kind of mid-teens and I was working my way through there. I think, for the most part, everyone that was pending a cherry pick has been pulled back, and many of the other ones that were pending approval weren't really necessarily specific to just SIG Node responsibilities, so a number of them actually had other approvers apply their approval, so they're approved.

B
So I think, for the most part, the needs-approval queue is in decent shape for where we could do our work that was not intersectional with, say, SIG Windows or API Machinery. For the other queues, I did not have time earlier this week to pay them too much attention beyond what I looked at through the approval queue, so maybe David or Seth, you had a chance to review one.

C
So a lot of the PRs in node that don't currently have assignees actually have a number of the people who are on this call already reviewing them. So if you are reviewing PRs and you'd like to follow through with the review, feel free to assign yourself. That way we know that we don't need to look for someone else, and I've been going through some of those PRs and trying to assign the people who have been working on them.

A
Yeah, there are also some PRs assigned to people who may not be involved with SIG Node anymore, so I did pay extra attention to those PRs that are only assigned to, like, old members. There are a couple of those. Maybe from now on we have to pay extra attention to those, because we think they have an assignee, but actually that person is not actively on it.
D
From last week: how many created PRs, how many closed PRs? Is it useful, or is it not needed?

A
I think it's really useful and we can maintain it over time. Maybe there are some legacy things sitting there, but this one can actually help us see which of these things get an early status update. The recent ones, I think, will be reflected here, right? So there are the updated ones, and then we can pay extra attention to the updated ones that people still care about, so we can pick up from here over time. Hopefully we can cover the majority of those queues.

B
Yeah, one thing that I did... I don't know if anyone from SIG Windows is in today's call, but there are about five or so kubelet-oriented changes specific to Windows where it wasn't clear to me whether they actually had consensus in the SIG. One of the ones that I was at least trying to give some priority attention in review was the one stripping unnecessary security context on Windows, just because I know our own users are hitting that as well.
B
So is anyone from SIG Windows on the call today that maybe could summarize whether there's consensus on the approach for this? Because the PR looked fine, it just wasn't clear that it was an accepted path forward for that SIG. And if not, then maybe, given the queue I see here, we don't have a great way of distinguishing when SIG Windows is happy with something versus SIG Node right now on some of these items.

A
So Derek, in the past, a while back, I went to SIG Windows and asked them to have someone represent SIG Windows and join SIG Node, because it's kind of a joint product, and owning it jointly across projects was weird. That's why Patrick joined us, and then Patrick left, and so that's why. Maybe I have to go to SIG Windows again and ask them explicitly to find someone to represent SIG Windows and attend this meeting. They could take a turn.

A
They could do something, but it is true, because they have been pinging me, and the problem is I don't have enough context for those PRs. There are certain things that are controversial between the different PRs, and so that's why, hopefully, in SIG Node we could help, and also, I think, be more transparent here and have them tell us the planning and what the goal is. Otherwise it's hard for us. So that's why.
B
Yeah, I was so used to Patrick being here for many, many, many months that I didn't actually realize he might have been absent, so I was hoping he was still here. So yeah, if we can find a better ambassador from SIG Windows, well, not better, Patrick was great, but a present ambassador from the SIG, that would be awesome, and I'm happy to follow up on that myself, Dawn. So I can try to see what we can do on that one.

A
Yeah, so last time I went to them and asked, and that's why Patrick came to represent us. He represented both Microsoft, from the kernel team, as a representative of Windows containers, and also Kubernetes SIG Windows. So that's good coverage.

B
Okay, cool. Yeah, because it's hard, when you look at the PR data, to have a clear split on some of these things. Anyway, we'll follow up on that. That's probably the best summary of what I see.

A
Okay, so let's move to the next one. Rodrigo, do you want to talk about the sidecar containers? Yeah.
F
Yeah, sorry, hello. Okay, yeah, sorry. I wanted to do a friendly ping on the open PR. I know the 1.19 release is coming and KubeCon is coming, so I guess there won't be much activity, but yeah, I wanted to know if, on the other side, maybe before KubeCon, there was a chance to get a review, or, yeah, just to know what to expect.

B
Yeah, I'll review this this afternoon, so thanks for the call-out. And I don't think that, just because 1.19 is held up, we can't iterate on the updates to the KEP. So I will follow up on that this afternoon.

F
Oh okay, perfect. Thank you very much.

A
I think Seth and also Sergey already reviewed, right? So it's just... I think we did assign it to several people when we talked about this at the beginning of this quarter, so there's Seth, and there's Sergey, and then there's Eric, if I remember correctly, assigned to this.
D
Yeah, my biggest comment is about init containers. Can we, like, talk to the Istio team? It seems that having init containers run after sidecar containers are initialized is very important for certain scenarios, and my question is basically: what would change in the entire KEP if you just start sidecar containers before init containers?

F
So if we create a new definition of sidecar containers that start before init containers, not all sidecar containers that are possible now... you might need another init container for the sidecar containers, or the semantics can just start to get tricky. And so I'm open to that, if it's better for the long run.

F
I think we should do that, but I think that is quite a different problem from the one that we currently know. Like, sidecar containers as we know them today start after init containers, and there are a bunch of users doing that, and those are the problems that I think might be easier to start solving. But yeah, I want more opinions, of course.
B
So, Rodrigo, the one help I could ask for, if you could do this after the call: the KEP right now is under SIG Apps, but I don't really see SIG Apps as being, like, a long-term stakeholder here. If you could just move it under the SIG Node directory, that would probably ensure that we don't hit any other approval headaches. Sure, I know SIG Apps historically brought the use case forward, but it seems like, from an implementation ownership perspective, this is pretty clearly part of node.

F
But I mean, in the current open PR? That might take long to merge, or maybe it's just another small PR with that change.

B
If you want to have a PR to move it into node and out of that, and then do a separate one, that's fine, or you can do it on this existing one. It just seems like it's bucketed in the wrong SIG.

A
That's right, Derek. Since we already have the process based on the label, right, which is sig/node, that's easy for us to use for tracking, since the majority of the work asks this team to do the review: design review and implementation review, all those kinds of things. So it's maybe just trying to reflect the facts here; otherwise we may overlook this one.
G
Yeah, so hi everyone, I'm Swati. I'm here with Alexey and Francesco, and I just wanted to provide a quick status update on the topology-aware scheduler work. We've been talking a bit about it, so I thought it would be good to provide a status update and show a demo, so people know what's going on there. I have a few slides.

G
Can you see the screen? Yes? Okay, so yeah, basically the status update is that we've been working on these two components, one of them being the topology-aware scheduler plugin; the other one is the resource topology exporter. We have KEPs and implementations of both of these components.

G
As part of the resource topology exporter, we initially prototyped it using the container runtime interface; we've enabled it for containerd and CRI-O. This required no kubelet changes, but when we started looking into the Pod Resources API as a way of gathering resource information, we identified that there were some gaps that needed to be fixed, and we proposed a KEP and code corresponding to these as well. Alexey, if you want to talk through this one; he basically authored the KEP and implementation of this.
H
Yes, hello everyone. I have a couple of words about it. I think the changes in Pod Resources arose from...

G
Yeah, so basically the Pod Resources API, as it stands today, exposes information about devices, and what this KEP essentially does is enable and provide information about the CPUs, and the topology information corresponding to the devices. So that's the summary of this KEP. And then I'll be showing a demo; I'll probably hold that for the time being, but first talk about these two items which are currently in progress. So Derek pointed out that we should look for ways to merge the resource topology exporter into node feature discovery.
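For readers who want to poke at this, here is a minimal Go sketch (not part of the meeting material) of reading the kubelet Pod Resources API over its local socket, assuming the v1alpha1 List endpoint that existed at the time; the import path, socket location, and response fields may differ across Kubernetes versions.

```go
package main

import (
	"context"
	"fmt"
	"net"

	"google.golang.org/grpc"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1alpha1"
)

func main() {
	// Default socket path; deployments may mount it elsewhere.
	socket := "/var/lib/kubelet/pod-resources/kubelet.sock"
	conn, err := grpc.Dial(socket,
		grpc.WithInsecure(),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", addr)
		}))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := podresourcesapi.NewPodResourcesListerClient(conn)
	resp, err := client.List(context.Background(), &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}
	// As discussed above, List currently reports only the devices already
	// allocated to each container; the KEP adds CPU and NUMA topology detail.
	for _, pod := range resp.GetPodResources() {
		for _, c := range pod.GetContainers() {
			fmt.Printf("%s/%s container %s devices: %v\n",
				pod.GetNamespace(), pod.GetName(), c.GetName(), c.GetDevices())
		}
	}
}
```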
G
So discussions are currently in progress with the NFD maintainers. It seems positive at this point in time, and we're having discussions related to how we can go about it, basically the design discussions. So we have an issue created for this and a proposal doc. And then, in relation to the Pod Resources API, again we have an item in progress which Francesco, who is on the call, is working on.

G
We are proposing that we expose a watch endpoint in the Pod Resources API; we have a KEP for that, and that would enable us to make the resource topology exporter more event-based, as opposed to the current design, where it is polling. So that's the idea. So basically a lot of items are in progress, but all of these kind of help in enabling the topology-aware scheduler.

B
Okay, so if I recall, the reason... the earlier iteration of the KEP, before I knew that y'all started looking at node feature discovery, was wanting to basically consolidate, like, the security model for all of these components. Can you refresh my memory, if that's okay, or maybe for those who haven't tracked it?
G
So, as far as I understand, for the Pod Resources API endpoint, I understand that it's exposed per node, and you can gather the information from a specific endpoint; it's a socket file. In relation to the security aspects of it, I think I would need to talk to someone else who has more expertise in this area. I'd probably need to look into that.

B
On the security aspects, yeah, I'm happy to follow up on that. I was just trying to make sure I knew which docs... So, like, node feature discovery, to my knowledge, didn't deploy a new serving daemon per node. It just propagated the state of what was discovered on the node back to the API server, but it didn't have a new serving endpoint. So what I was trying to make sure, if my understanding is accurate, is that you all are proposing a new per-node serving endpoint, which I'm just thinking through for deployers who roll this out: an endpoint for the kubelet, as well as for this additional data.
K
Just a note about what Derek just meant. Hey, this is Francesco. NFD has TLS, but we haven't implemented cert rotation. So this is one thing that we would need to implement for NFD, but there is TLS for the gRPC endpoint, which is already there in NFD.

B
No, but is that, Francesco, on the serving side for NFD? Like, my recollection with NFD was that you had a per-node daemon that ran and then sent that information back up to the API server.

B
There was no serving port opened to the NFD daemon per worker, and so there wasn't any need to do anything other than handle client certs back to the API server; there wasn't necessarily a serving cert problem. And so all I was asking about was whether the proposal here requires additional cert management for serving per node, because obviously the kubelet serves an endpoint, 10250 or whatever, and managing certs for that serving endpoint is often a challenge in adoption for deployments, and so I was hoping we could get to a spot that didn't require any new serving endpoints per worker, because of the overhead of managing those certs.
K
So maybe we should think about reusing the gRPC endpoint we already have in NFD, but I think this is also a discussion with Markus, who is the lead on NFD. But thanks for pointing this out, yeah.

B
I guess, in general, whether you're proxying or doing something else, thinking about adoption of this stuff, it's much easier if we can avoid needing to serve endpoints on public ports per worker node in a cluster. And maybe, Dawn, you would agree with that as a general principle, but I'm just kind of on the lookout for things that introduce new ports needing to be exposed per node, and what serving certs they might need to then have rotated. And this is also just a general challenge.

B
We talked about cAdvisor and stuff in the past; like, there's a lot of operational benefit from the fact that the kubelet fronted it for a long time, and as we proliferate daemons out, it gets to be a pain.
G
Cool, thanks.

I
One question, which I asked at the end of my PR and in one of the KEP proposals, but I didn't get an answer: with the changes to the Pod Resources API, we will show only the devices which are allocated to a pod, but it doesn't give the information about what resources are available on a node.

G
So, at this point in time, we're kind of at the PoC stage. We are enabling the SR-IOV device plugin, and in that case we pass back the PCI... we pass a config file which gives information about the devices that have been enabled in the cluster. So that's the current stage, but what you're saying is correct: we need to look into how we gather information about all the devices that are available in the cluster.

H
Yes, but we can get allocatable resources just from the kubelet. The kubelet exports allocatable resources for the cluster onto the API.

H
Yes, but we can also export it in the same way as the kubelet did. But yes, of course, we don't know the configuration of all the device plugins.

I
So that comes to my question about the changes to the Pod Resources API: you are not really exposing the actual topology information about what devices are announced by device plugins; you only see, after the fact, the allocation, if some of them, as I said, are already allocated to something.
G
I guess this information is available within the kubelet, but the information about available devices itself would have to be exposed somehow. We haven't explored that yet, because we decided that, for the time being, we would just focus on a single device plugin, get an end-to-end working solution, and then maybe generalize it across various devices.

H
Now, Alexander is right: we need to know the exact resource names, and that resource name list usually comes from, for example, the SR-IOV device plugin config, or another device plugin's config; it depends.

G
And that's the current implementation. So we have it based on a config, and we're gathering information on all the devices that are exposed in the cluster, and then, based on that, we evaluate what is available and, you know, kind of subtract the already allocated ones. So the scheduler gets information about what is available now on a per-NUMA-node basis.

G
So before we dive into the demo video itself, I just want to give an overview of the environment that this demo showcases. We have two worker nodes in the cluster. Each node has 80 CPUs and 10 SR-IOV devices configured, and they're distributed across two NUMA nodes. So, as you see in this, we have 40 CPUs on each NUMA node and five devices on each NUMA node.
G
So, just so, when I run the demo there will be certain workloads already running in the cluster. These are the workloads that are running: I have a pod which is requesting five instances of the SR-IOV device and five CPUs, that is over here, and then I have a workload that is requesting two CPUs and two SR-IOV devices, and three CPUs and three SR-IOV devices, which you can see over here.

G
So from the scheduler's point of view, like kube-scheduler's point of view, both nodes appear to be exactly the same, because we have five SR-IOV devices available on both nodes and 75 CPUs. But when we look at the NUMA side of things, the picture is completely different. So if a request comes in for a pod which is something like this, where we're requesting four SR-IOV devices, this information becomes really valuable, because kube-scheduler could place it on either of the nodes.

G
But in a cluster where the single-NUMA-node policy for topology manager is enabled, if the scheduler places it on this node, the pod would end up with a topology affinity error. So the topology-aware scheduler plugin uses the information that we have gathered, which is more granular, to place it on this node, and in turn it gets placed on the first node. So basically we'd have something like this. So that's what I'm going to be demonstrating in the demo. Maybe let's go to the demo.
G
Okay, so I have here, showing that I have two worker nodes. There are three virtual masters as well on this cluster. The SR-IOV network operator has been deployed on this cluster, and that will basically show information about the allocatable devices.

G
So, as you see here, we have 10 instances of the SR-IOV resource on the first worker node, and the second worker node has 10 SR-IOV instances again. And now I will be showing, let me just go back here a bit, I'll be showing here the three workloads already running in the cluster, which you can see on the left as well, and then I'm showcasing here, sorry, I'm showcasing here the nodes that they're allocated on. So, for simplicity...

G
What I did in this cluster was that the PCI addresses that end in an even number are all on NUMA node zero. So here you can see that the PCI addresses of all the devices that have been allocated to this pod are on NUMA node zero, as you can see over here. And then for sample pod 2 you'd see this is again on NUMA node 0, which is over here, and then this is sample pod 3, which is on NUMA node 1.
G
So now I'll show the CRDs. In this case we have the NodeResourceTopology CRD in this cluster. When we show the instances, you see that there are no instances corresponding to the CRD, and then, when we deploy the resource topology exporter, those CRDs are populated.
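For readers following along without the slides, the CR instances being shown carry roughly this kind of per-NUMA-zone allocatable data. The Go sketch below is a simplified, hypothetical shape for illustration only; the actual NodeResourceTopology CRD in the out-of-tree repo defines its own, richer schema.

```go
// Hypothetical, simplified shape of the per-node topology CR discussed here.
// The real NodeResourceTopology API may use different field names and types.
type NUMAZone struct {
	Name      string            // e.g. "numa-node-0"
	Resources map[string]string // resource name -> allocatable, e.g. "cpu": "35"
}

type NodeResourceTopology struct {
	NodeName         string
	TopologyPolicies []string   // e.g. ["single-numa-node"], mirrored from the kubelet config
	Zones            []NUMAZone // one entry per NUMA node
}
```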
G
So here we are deploying the resource topology exporter. Because it's a DaemonSet, we have an instance for both nodes, and then we have the CRD instances populated corresponding to both nodes. Now, this command is going to show you, basically, the information corresponding to a specific node. So here you can see there are 35 CPUs available on this node and zero SR-IOV devices, which corresponds to this information over here, and then for the other node...

G
So now we deploy the topology-aware scheduler. We deploy it as a separate scheduler itself, in the kube-system namespace.
G
So this is the manifest file, and you see here that I have the scheduler name specified as "my-scheduler" to indicate that we need to use another scheduler in this case. Ideally, once we have the topology-aware scheduler plugin merged into the mainstream scheduler code, we wouldn't need to do this, but this is how I've done it for demonstration purposes. And now...
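As an aside, and not taken verbatim from the demo manifest, pointing a workload at the out-of-tree scheduler only requires setting schedulerName in the pod spec. A minimal Go sketch, with an illustrative scheduler name and a placeholder image:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "sample-pod"},
		Spec: corev1.PodSpec{
			// "my-scheduler" is the illustrative name used in the demo; any pod
			// without this field keeps using the default kube-scheduler.
			SchedulerName: "my-scheduler",
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "registry.example.com/app:latest", // placeholder image
			}},
		},
	}
	fmt.Println(pod.Spec.SchedulerName)
}
```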
G
Like, yeah, depending on whether you have the single-NUMA-node policy specified on certain nodes, that's only when the topology-aware scheduler plugin would kick in, and it will try to schedule pods based on that.

B
Okay, so the expectation would be that both the scheduler plugin and then the resource that's coming from NFD, or potentially coming from the exporter, that gives the associated topology policy on that node, they'd work in concert, just out of tree.

G
Yeah, and we would like to make it more configurable as well, so if someone doesn't want this plugin to be enabled, they could just simply disable the plugin. My understanding is that the scheduler config itself allows you to enable and disable plugins on your cluster.

K
So it would be really mandatory to be able to disable this scheduler, so that other schedulers can onboard and use the same CRDs, the CRs, and the information that we are exposing, either via NFD or other entities. I have several customers, really, from the HPC space, who want to enable schedulers that need this information.
A
Because of the different use cases: this one definitely gives us a boost of some performance, and this is sensitive only for some device-sensitive workloads. But there is also a trade-off: if that worker node is not that performance sensitive, maybe the pod is a batch workload, or even something that seems latency sensitive but actually isn't and depends on being a memory-intensive workload, in those cases this actually will hurt the utilization.

A
So this is why we want this to be extensible and out of the default tree, and so then, basically, at the cluster level, or maybe at a per-node-pool level, you could enable this kind of scheduling behavior. I think this is what we have tried to push so far.

G
Okay, so just to wrap up on this: we have the scheduler deployed, and when we have a specific pod being placed by that scheduler, we just create that pod, and here you can see that this pod gets created on the zeroth worker, which is this one, as expected.
A
Swati, we're running out of time; otherwise we would ask whether other people have more questions, so we can follow up offline on this one. Thank you for coming to give the status update and the great demo. If you have more questions, please talk to Swati, Alexey and Francesco, and let's move to the next topic. Is that okay? Thanks, thanks. Next one, Seth: do you want to open the discussion on that? Yeah.
L
I'm not sure if we talked about it last week; I think it was a different issue last week, but this one was interesting and I wanted to get some feedback. So when we do enforce node allocatable today, that puts a memory limit on the kubepods cgroup and sets the shares, and this is where it gets tricky: shares is a ratio, right?

L
It's a relative number, but we deal with CPUs in terms of millicores. So we keep them on the same scale so that they continue to make sense. But when you set the system reservation to be, like... we have customers that will set, like, a four or six core reservation on a 96-core machine, right, really big.

L
And so what will happen is, because you're not doing enforcement on the system reserved... so you can do "pods", "system-reserved", "kube-reserved", right. If you don't include "system-reserved" in enforce-node-allocatable, then the shares are not set on the system cgroup, and so even though you reserved, say, 10 cores, under contention the system slice, or, I'm speaking in systemd terms here, but the system cgroup, does not effectively have access to the reserved CPU count, because its shares isn't set. And that throws off the calculation system-wide, where the kubepods slice has the allocatable number of CPUs and the system slice always has 1024, because we don't set it. But if you add those two numbers together, it doesn't add up to the total number of millicores on the system.
B
So what do we see as the recommendation? I mean, today, if you had set enforce system-reserved... so, like, by default we say enforcement of node allocatable would be "pods", and so we write what's on the kubepods cgroup, but then you're right that nothing gets reflected on the system slice for CPU. That's always wrong.

B
So, like, I think the tension, or the vision I had when we first did this, was that we didn't want to cap memory, because memory was always going to be probably wrong and we wanted to allow some burst on the system. But reflecting on this now, especially with larger boxes, where you're not running four vCPUs but, in your case, you said 96, the imbalance on that ratio can get really...
L
...and now, like, three CPUs are unaccounted for in the shares.

L
Well, we can't really do that, because this is where the tricky part comes in. So if you set "system-reserved" as a key on enforce-node-allocatable, that makes it so that the user has to provide the system cgroup name as a parameter to the kubelet. In systemd's case that would be system.slice, but it makes that required at that point.
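For context, the coupling Seth is describing lives in the kubelet configuration: once "system-reserved" is added to the enforcement list, the kubelet also needs the system cgroup named explicitly. A minimal sketch using the v1beta1 KubeletConfiguration types; the values here are purely illustrative, not a recommendation:

```go
package main

import (
	"fmt"

	kubeletconfig "k8s.io/kubelet/config/v1beta1"
)

func main() {
	cfg := kubeletconfig.KubeletConfiguration{
		// Default is just "pods"; adding "system-reserved" makes the kubelet
		// write enforcement onto the system cgroup, which must then be named.
		EnforceNodeAllocatable: []string{"pods", "system-reserved"},
		SystemReserved:         map[string]string{"cpu": "4", "memory": "2Gi"},
		SystemReservedCgroup:   "/system.slice", // required once system-reserved is enforced
	}
	fmt.Printf("%+v\n", cfg)
}
```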
A
That's the initial design: it's separable, yeah. The system-reserved, initially, where this came from, is for any daemon not managed by Kubernetes, including the kernel, and then the kube-reserved is actually for all those daemons like the kubelet, the container runtime, and the other system components, kube-proxy back then; that's for the kube side, the daemons or DaemonSets Kubernetes provides, or whatever. That's kind of the original thinking, just to share here.

L
The only way I can see out of it is to have people start setting the "system-reserved" key for enforce-node-allocatable, and then setting CPU shares but not setting memory limit in bytes. When you do that, I think that is the behavior that most people are expecting. But if the existing behavior is to set memory limit in bytes, and for some reason people want their system daemons killed if they go over...

L
...that limit, which I don't think is the desirable outcome for anyone, but it could possibly be the current behavior and what people expect, then I'm not really sure. My understanding is that we do set memory limit in bytes on the system cgroup if you pass node allocatable... if you...
B
Enforcing things that are, I'm looking for a word, yeah, compressible, right: we should have a way of expressing whether to enforce compressible versus incompressible resources, and we have that globbed together into one concept right now, and that's probably a mistake.

L
As for a proposed fix that I wanted to draw attention to: I don't currently have one, because I don't have a solution that is transparent to the end user. It's going to change the behavior of something; it's just a question of which behavioral change is going to be the least disruptive and most aligned with what users expect, I guess.
L
For our Burstable QoS tier, shares allow... so in the pod spec, when you make a request, that maps to CPU shares; when you set a limit on CPU, that maps to a CFS quota. And shares allow a pod to burst beyond what it has requested, as long as there is no contention on the machine, but the shares are there to enforce fairness when there is contention.
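For reference, the request-to-shares and limit-to-quota mapping Seth describes is roughly the arithmetic the kubelet itself uses; a minimal Go sketch of that conversion (shares are relative weights, quota is absolute time per CFS period):

```go
package main

import "fmt"

const (
	sharesPerCPU = 1024   // cpu.shares granted per whole CPU requested
	quotaPeriod  = 100000 // default CFS period in microseconds
	minShares    = 2      // kernel minimum for cpu.shares
)

// milliCPUToShares: a CPU request in millicores becomes a relative weight.
func milliCPUToShares(milliCPU int64) int64 {
	if milliCPU == 0 {
		return minShares
	}
	shares := (milliCPU * sharesPerCPU) / 1000
	if shares < minShares {
		return minShares
	}
	return shares
}

// milliCPUToQuota: a CPU limit in millicores becomes an absolute cfs_quota_us.
func milliCPUToQuota(milliCPU, period int64) int64 {
	if milliCPU == 0 {
		return 0
	}
	return (milliCPU * period) / 1000
}

func main() {
	// Example: request 500m, limit 2 CPUs.
	fmt.Println(milliCPUToShares(500))          // 512 shares (relative)
	fmt.Println(milliCPUToQuota(2000, quotaPeriod)) // 200000us quota per 100000us period
}
```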
E
Yeah, I get the burstable part, that they can do more. I don't know, I've always been against CPU shares because of that: if you look at it from latency profiles, depending on whether you're looking at a workload in isolation, and depending on how much is run on a single machine, you get really good latencies and stuff, even for burstable tasks, and then the machine gets bin-packed, and now, as a batch user, my things are taking five times longer, and that's the sole reason why CPU shares, like...

M
So, well, also, there was a very bad kernel bug with CFS quotas, which was only fixed late last year. I can link to some issues, and there are some, well, even right now...

A
...questions; it is not fixed, some corner cases still have some problems there. So that's the problem, so we ended up having special treatment for different kinds of workloads. So by default we still didn't enable the CPU quota, the limit.
B
So yeah, I think, even on cgroups v2 hosts, right, and Renaud and Giuseppe, I know you guys are looking at this, we would still want to set CPU weight on the system slice, or system or kube slice, whatever people have their cgroup taxonomy set up as. That's to share appropriately between end-user pods and system services. So, like, I guess I don't have a great answer, Seth. I don't, because I feel like, if it were me... I don't think anyone in the world is setting enforce-node-allocatable in a way where they understand what it's doing, and I feel like I would improve the world if I did set shares on it, because it would match the expectation they probably have.

B
That's my first thought, because, taken to the extreme, you might want to restrict the amount of PIDs or tasks that you portion out, and the same with memory, and I personally probably would never run a kubelet worker where I enforced compressible or incompressible resources on the system daemons. But maybe others have a better understanding of the workload where they would. But that's probably the best pitch I would have, that we could probably queue for 1.20 and, like, make the world a lot better.
L
That's a fair perspective too. Actually, I like that. Yeah, I agree with that, and really the only hitch with that is that we can't assume what the system cgroup name is, right, unless you pass the "system-reserved" key into enforce-node-allocatable, which forces the user to provide that to us. So the only way I see forward is that users will have to set the "system-reserved" key for enforce-node-allocatable if they want the shares set on their system slice properly, because that forces them to provide us with the...

B
I'm sorry, Seth, couldn't we just find all peers of the kubepods slice and set it appropriately from that? Like, if you only find one peer, then you know what to set it to. If you find more than one, that may be an issue. I think, if you find just one, you could automatically know the right balance too, right: find all siblings of kubepods; if n equals one, you know to set shares to the alternate value.
A
Let me cut this a bit short, because we also have some other topics here, sorry, and we don't have the best answer here. Can we carry that one over? Because, like, there's no one-size-fits-all node allocatable config, and it's different if you are thinking about the technical use cases that are really sensitive to every single thing, and then you are running, alongside those, even machine learning workloads. So we'll carry that on as a separate topic later. Can we move to the next one? So...
N
Yeah, okay. So just to give a brief overview: my proposal is really just that we reopen this Kubernetes enhancement proposal for node readiness gates, which basically just adds a declarative API for defining a set of pods which all must be ready on the node before the node is considered ready. And I'm coming over from Istio.

N
I work on the CNI, or Istio CNI, over there, and our use case is kind of odd, but it is, I think, kind of representative of this class of problem. So, to give you an overview of what we're doing: Istio CNI is basically a binary which is installed on the node and is called just like a normal CNI binary is. It's installed by a DaemonSet which, you know, runs the installer and then adds the binary and also updates the configuration.

N
However, unlike most CNI plugins, it doesn't cover the entire networking stack. The only thing it does is actually exec into the pod namespace, for a given pod's network namespace, and execute some iptables rules to set up port capture.

N
So the issue here is that, since it's not a true CNI, there's nothing that actually stops a pod that is scheduled before that CNI installer or plugin is installed and chained to whatever your main CNI plugin is from being scheduled and starting up successfully without those iptables rules being applied. Which means that, if you have a workload that depends on running Istio, you will run into a problem where you have this pod created, but it can't actually execute, you know.
N
So it's basically just sitting there, dead in the water, and consuming resources on your computer. So we call this the CNI race condition over in Istio, and the problem here is that, because of the way this is implemented, we cannot actually repair these broken pods in place, because CNI is only called during initialization of a pod network namespace.

N
And so, while we do have mitigation measures, basically a container we add, which detects when the pod is broken by just running a couple of loopback commands to see whether or not the iptables rules are in place, if it's broken, the only thing we can do is delete the pod and let it reschedule itself, and that, you know, creates a lot of noise on the cluster.

N
You know, if a bunch of pods get scheduled because the cluster is in high contention, you may see pods get rescheduled, you know, 50 times each, for a large number of pods on a cluster. That creates a lot of noise in your metrics and makes it look like something horrible is happening, when really it's just trying to repair itself until the CNI is installed. So the CNI is a requirement for a lot of customers of Istio.
N
The CNI is a requirement for a lot of customers, because it allows people to actually run Istio without having to, you know, break their pod security policies and give the workload pods CAP_NET_ADMIN, which is required in order to run iptables inside of a pod network namespace from a workload pod.

N
The problem here is that the node itself, or the kubelet, has all of the information it needs to make the determination of whether or not it's ready. So if we had a declarative API where we could state the conditions in which that's true, and let the kubelet manage it, it has the ability to stop things from scheduling on it without any kind of lag.

N
Whereas, you know, the problem with running it as a taint controller, which is running outside of the kubelet, is that, you know, if a node comes up and reports itself as ready, and there are a large number of pods sitting there waiting to be scheduled, they can all be scheduled onto the node before that taint gets applied. And if that happens, you'll have a large number of these pods sitting there in an unusable state, consuming resources.
N
And, you know, if you don't have the repair controller enabled, because it makes too much noise in the cluster or whatever, they'll just sit there and, you know, be completely useless, which breaks your monitoring, which breaks, you know, autoscalers, and they won't ever be descheduled, because there's no mechanism for you to actually go clean them up without something that just deletes the pods.

N
The other aspect of that: you know, the recommendation was for this taint controller to also be coupled with the register-with-taints option in the kubelet, and the problem with doing that is that it's not always available, and even when it is available, a lot of times the team that is managing Istio at a large company does not actually have control over the arguments that are passed to the kubelet.

N
So, if you have complete control over the system, and it's one that allows you, you know, to actually set the register-with-taints option, yeah, you do have a workaround there. But, again, it requires you coordinating two components, you know, one inside of the cluster and one in kind of meta-configuration space, and, you know, it's just kind of not as clean as just having a declarative API where we can state...
N
You know, this is the set of, you know, label selectors and namespaces and node selectors, where a node with this node selector is not valid unless it has a pod, you know, matching this label inside of this namespace running on it. And so that's a basic summary of kind of the Istio CNI use case, which, again, I'm giving specifically as Istio CNI, but it generalizes.

N
I think it generalizes well to this class of problem, and it's something I think that we really do need to have implemented in some form.
A
I just want to summarize, in general: actually, we've talked about what, and after what kind of conditions, a given node is really ready to take the user workload. We've had those discussions in the past, and in the initial Kubernetes design we expressly said, like, the node management is out of the Kubernetes scope. So that's why, initially, I put those things in the init script.

A
I think that addressed a lot of problems in the past, because Kubernetes is mostly built on top of a cloud provider, or maybe used on a private, on-prem cluster run by each company for their own use, and so they could control the node initialization time. So then they would have the init script, and they could ensure this node, through the cloud provider and at node initialization time, and then they can say...

A
...oh, this node is tainted, and the gate is that several statuses are ready. And even when we introduced DaemonSets, the problem is that it's still a DaemonSet, and those also have risks. So that's why we always first need that control. So Istio's use case is a little bit different, but it should be common: when I want to join a node, how am I going to ensure, when I join the node...
A
...that I have, like, a clear, what's the word, a declarative state for how we're going to say this node is ready to serve. Because "this node is ready" could be, like, this node has booted up and is alive, and then this node is ready. Or it could actually be: okay, my node has some certain role, and that certain role requires certain functionality on board.

A
The functionality, for example, could be that the daemon set is running and the device plugin is running, and then I can claim I'm ready. So there's also, like, when we have SIG Cluster Lifecycle and we started talking about the Cluster API. I talked to an engineer working on that one. I said we do have this need... we do not, sorry, we need a script.

A
We also have, like, those things, like the initialization for the node at startup: how are we going to solve that problem in Cluster API? I didn't see them solve that problem, because I think that's, kind of, literally, the node comes up, and they mention the machine lifecycle, and when the machine comes up alive and then creates the node object and registers, there should be some state, and I can say: oh, I want to initialize this node, and for its ready state, but I have to see that there.

A
So I think this is kind of a common issue, but I guess every provider already has something for it, because, given the history, the original adopters have their own solutions. But I still think maybe it's a good time for us to revisit things, since we changed the Kubernetes scope in many ways; like, at least we have the Cluster API. So that's why.
N
So one additional thing I would add to that is that one of the problems with relying on the init scripts, and on things that are actually inside of kind of the meta config plane or meta configuration space of Kubernetes, like, you know, the kubelet arguments and stuff like that, is that, if you start relying on that, you're essentially tying your deployment of application-level resources, which may change the requirements...

N
...for, you know, what is needed for a node to be ready for a given node pool, and you're tying that to the configuration of the entire cluster as a whole, right. So, you know, if you wanted to push a change or something like that, you would potentially have to roll the entire node pool over with the new configuration settings, as opposed to just kind of configuring that inside of, you know, KRM and just applying that KRM...

N
...just like you would any kind of custom resource, redefining what it means to be ready, and then rolling your application over, you know, so that it is now reflecting the new state. It just drastically complicates upgrades and deployments, that kind of process.
A
Yeah, just, that's the legacy vision, because, due to the scope of Kubernetes, that's why we cannot... Derek, do you want to comment?

B
I guess I don't... I think it's a universal problem, I don't know. Or is the pitch here that we're going to bring back this KEP? Because I'd be interested in sharing challenges we have as well, that I don't think, honestly, are met by this KEP either. But what's the goal, I guess: do we want to revisit this problem, or are we going to say that it's, like, the infrastructure provider's challenge to work through?
N
On my part, my recommendation was just using that as a starting point, and if we need to remap that, you know, to meet additional requirements that come up, we could probably do so. But the issue is that, at the moment, there's kind of no proposal out there to solve anything in this class of problem.

N
As far as I can tell, this is the only thing that I saw in motion regarding that, and it was closed with the taint controller being kind of the recommended alternative, and that doesn't really meet all the requirements that there are. And so, you know, either reopening this, or opening a new one that kind of uses this, or is kind of a subset of it.
B
Yeah, so I'd have to revisit my understanding of what that KEP had been, but my, like, lived experience with this is that when a node is considered ready seems to vary based on both the vendor's choice that this node is now ready, and then the customer's desire for what needs to be on that node before they also then run other workloads, so that you end up with, like, a ring-type situation, where, like...

B
...if your provider of Kubernetes doesn't provide, say, log forwarding, but your organization demands that log forwarding is deployed to that cluster, and then you use Kubernetes itself to deploy fluentd down to that cluster, you end up with these security rings, or node readiness rings, which I know is kind of what the gate proposal was talking about. But it's hard to get to, like, one true solution. But I definitely can empathize that it's a problem.

B
I'd have to go back and think through, like, variations that people had done with tainting. The problem with taints is that, then, you need to have those system services tolerate all taints, which also then becomes a problem. So I can definitely agree that there is a challenge here.

B
It's just, it's hard to come up with one true answer, because one user might say, oh, sysdig or fluentd or insert-random-thing must be on this node before my workloads are ready to be supported, and another user might have a whole different list, and I haven't, like, seen something that solves it universally; there are a lot of proposals that just solve it for, like, that one user's group. And maybe my recollection of this KEP is inaccurate, but that's my experience.
A
I think we all agree about the problem, and, that said, the solution could be different, Derek. I think about the proposal... the core of the proposal is actually still per-cluster-based, per-provider-based.

A
So Istio would connect to that: based on the provider, you define a set of conditions, which are driven by the DaemonSet or whatever, and they have to give the signal whether it is ready or alive, and then the node can claim it is ready. So, based on that, you could be flexible based on the node pool, oh yeah, so I think that could work, and it could be...

A
...per cluster, and it could even be flexible, like per node, if you drive that at the machine level.

A
But I'm not sure the KEP itself captures all those details yet, because I forget the details, but I remember we discussed this with the people behind it before proposing it to the SIG. From the top of my mind, if I remember correctly, it didn't really solve some use cases, because there's the cluster level of admin items, and then there are some other people, like what you just mentioned, a universal thing; they may have, like, a subgroup of the nodes...

A
...where they have certain admin requirements, and how they are going to share those duties, and which one overrides which one. So there is some complexity there. But I think, at that time, I did think it was a little bit over-complicated as a set of use cases, so we didn't provide the solution. That's why, in the end, most of the vendors had already worked around the issue, so in the end...

A
...we didn't really push that proposal much harder, because the remaining cases were over-complications. But based on my understanding of these use cases today, I think that's enough to address the problem, though we should look into more detail. So I'm just sharing my experience and my memory here.
N
I can link you to the race condition that we've had previously and the other mitigation efforts we have in place, if that's something you're interested in, or I can write a new doc, if you would prefer that, which specifically goes into what the requirements are, or, like, what we would request, you know, as far as kubelet or node support. Which would you prefer?

B
I guess either, whatever is convenient to you. I guess the part I question is whether you can get down to a single signal versus many signals, and what I was trying to raise was that, like, a vendor working with the existing setup might say the node is ready at a different point than the vendor plus the customer, who says user workloads are only ready when the two are there. And so all that stands out to me is that, in the KEP that's linked, it's "ideally provide a single signal", and I think that's a premise I still question, versus having...
N
So I haven't added anything to the KEP, but my proposal would be more along the lines of this: what you do is we add a declarative API which basically states...

N
...it's a custom resource which defines, inside of it, a tuple, which is the name of the resource, the namespace that a DaemonSet or Deployment would be in, a node selector, which you can use to define a node pool and which is optional, so you can just remove it if you want it to be cluster-wide, and then a label selector for pods. And what it does is it specifies that a particular pod, that exists inside of that namespace, has to be ready, so that it can be...
N
You can't have people just randomly inject whatever workload exists on every node, or on a given node, before that node is considered ready. And so, you know, for example, Istio would have one of these, and then you'd define another one for the use case of sysdig or whatever, or you'd have another one for fluentd, and the node does not ever flip to ready and start accepting schedulable pods until all of those conditions are met. Because, you know, you can still have pods scheduled onto it that ignore the node readiness, you know, via tolerations or whatever, but until the node contains one pod meeting every one of those conditions, and all of those pods are in the ready state, it would just prevent the node from being considered schedulable.
N
That was my thought. So the signal is basically: does the pod exist, and is it ready? And, you know, people that wanted to take advantage of this may have to convert whatever their workload is into something that just sits there and waits in a ready state.
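To make the shape of that proposal concrete, here is a purely hypothetical Go sketch of the kind of declarative rule being described; every type and field name below is invented for illustration, since no updated KEP text exists yet.

```go
// Hypothetical only: "nodes matching NodeSelector are not schedulable until a
// Ready pod matching PodSelector exists in Namespace on that node."
type NodeReadinessGateSpec struct {
	// Optional: restrict the rule to a node pool; empty means cluster-wide.
	NodeSelector map[string]string
	// Namespace the required workload (e.g. the istio-cni DaemonSet) runs in.
	Namespace string
	// Label selector identifying the required pod on each node.
	PodSelector map[string]string
}

// A cluster might carry several such rules (Istio CNI, log forwarding, etc.);
// the kubelet would hold the node back until every rule is satisfied.
```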
B
Yeah, I guess we don't need this... I can't figure out where that breaks down, and so, either way, I think the problem makes sense. Whether it meets every scenario, I haven't given it enough thought, but I'll read the Istio one, Dawn, if you could open up access.

A
Yeah, okay, I'll link it back, I will. So, Derek, we don't need to solve this problem the first time. I just want to reintroduce the problem, because we did have this problem in the past, and we worked around it and also asked each vendor to solve it. But more and more, services are built on top of Kubernetes.
A
So they may not have the luxury of forcing the vendor, this kind of thing, so maybe we, from the Kubernetes side, can address this problem. So maybe it's time to open that issue, and we can discuss more, and it's not necessarily about what the implementation or the solution is. I just want us to understand it's also a real common issue for many users providing a service on top of Kubernetes.

B
The struggle I have is, like: should we just give up on the Ready condition? Is it a bad premise to begin with, and should we promote, instead, other more specific conditions? And so it's kind of like: are we truly locked into what we originally had, or is there just another thing that we can think through? And that's what I haven't come to grips with in my head.
A
We can carry on discussing, and I think I can see that both have pros and cons. One way maybe surfaces complexity to the entire ecosystem; the other one maybe just pushes that complexity down to the node, but how are we going to be able to do that? And also because, without that fine-grained readiness, or whatever the condition is... and maybe it's not the best way; the one-size-fits-all solution is not the best way. So we can...

A
We've really run out of time, and, Sergey, I notice that you asked one question, and that question is pretty straightforward; you already put it there, and we can see whether people are against what you ask, and maybe then we can talk, and then we can think about having, like, a Zoom meeting. Otherwise, maybe we just keep to today's format.
B
I'm pretty sure we can overcome this, so I'll follow up with ContribEx to see what the options are there, but I feel like we can. We can work out the credentials.

A
That's great. To wrap up: follow up with SIG Windows to make sure they have an active representative in SIG Node, and Derek is going to follow up with SIG ContribEx and figure out a solution for those kinds of things. And then we can carry on more discussion on the topology-aware scheduling topic, and also this node readiness signal, and also the system reserved, because that has always been an open issue. We need to move forward and figure out whether there's a more generic or better solution.