From YouTube: SIG Node Resource Management WG, 2022/03/28
Description
Meeting notes and Agenda:
https://docs.google.com/document/d/1ALxPqeHbEc0QOIzJ3rWWPpwRMRlYDzCv0mu2mR4odR8/edit#
A
All right, so welcome to this series for the pluggable resource management, the container compute interface driver extensions. I just wanted to give a short update today on where we are with the KEP definition.
A
So, basically, in the last days we were updating the KEP definition based on some of the feedback, and we got some suggestions back, so now this is on a branch. Most probably we will push it to our master today, and it will be ready for another pass from the reviewers when you have time. But just to go through some of the key things we changed: the summary is more or less similar to what we had before, and the motivation also remains largely untouched. The new stuff, or rather the changes, start from the compute specification options. As suggested by the group, we basically decided to pick one option: we had listed three options, with one leading option based on dynamic resource allocation claims.
A
So in the example we first show how you can use the attribute-based API through the claim mechanism, which actually changes nothing in the classical DRA specification type.
A
So basically we give some definition of possible attributes. We are thinking of having a core list which specifies how many cores you want to request; that can be a static number of cores, or it can be a range, though most probably Alpha will not support ranges. Ranges go a little bit in the direction of burstable quality of service.
A
So
if
you
want
to
spawn
a
container
which
can
burst
between
one
and
four
cores,
we
can
pick
basically
a
CPU
set
with
four
cores
and
and
cap
shares
which
allow
it
the
the
container
to
burst
between
one
and
four
but
yeah.
This
is
currently
out
of
scope
for
Alpha
this
the
ranges
we
will
stick
just
to
fix
the
number
of
of
requested
course.
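To make the burstable example concrete, here is a minimal sketch of the cgroup-level values such a one-to-four-core range could translate to; the struct and field names are purely illustrative, not part of the KEP:

```go
// burstableAssignment illustrates "burst between one and four cores":
// the container is pinned to a four-core cpuset, but its CPU weight is
// sized for roughly one core, so it only bursts when the cores are idle.
type burstableAssignment struct {
	CPUSet    string // e.g. "0-3": the four cores the container may run on
	CPUShares uint64 // 1024 is roughly the weight of one core in cgroup v1 terms
}

var example = burstableAssignment{CPUSet: "0-3", CPUShares: 1024}
```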
A
Then, for each kind of entry we have corresponding attributes. So you can have device affinity attributes, which are basically requiring some device affinity, trying to get some device affinity, or not requiring it. It is similar for memory affinity: "bind" is basically similar to single-NUMA, and "interleaved" is similar to some sort of NUMA-spread kind of semantics.
A
Later we will introduce attributes for huge pages and such, which will be needed for a better specification in terms of memory. Then we also have some CPU attributes which control isolation and sibling scheduling: basically, exclusive and shared isolation levels. With "exclusive" you get exclusive cores, similar to what the static CPU manager does with guaranteed quality of service; "shared" basically means you get a CPU set which can be shared with other pods' containers. Then there is core sibling required and denied; this means the following.
A
Basically,
if
you
want-
or
if
you
have
a,
if
you
are
requesting
for
certain
amount,
of
course,
they
will
try
to
use
the
logical
course
on
the
same
physical
core.
So
simply
required.
Is
this
option
core
sibling
denied
this?
If
you
have?
Basically,
if
you
want
to
take
one
of
the
logical
course
and
block
the
other
logical
core
from
from
being
used
from
other
thoughts,
so
this
is
possibly
denied.
A
There are some applications which want to take the full physical core for themselves, so that's this case. Then there is also an option, "preferred", which tries to get logical cores which are on the same physical core, but this is not a must.
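Taken together, the attributes described above could be modeled roughly as follows; this is a hedged sketch, and all type and field names are assumptions rather than the KEP's actual API:

```go
// Hypothetical claim parameters combining the attributes discussed:
// core count, device/memory affinity, isolation level, and sibling policy.
type CPUIsolation string

const (
	IsolationExclusive CPUIsolation = "exclusive" // like static CPU manager with guaranteed QoS
	IsolationShared    CPUIsolation = "shared"    // cpuset may overlap with other pods
)

type SiblingPolicy string

const (
	SiblingRequired  SiblingPolicy = "required"  // logical cores must share a physical core
	SiblingDenied    SiblingPolicy = "denied"    // block the sibling logical core from other pods
	SiblingPreferred SiblingPolicy = "preferred" // best effort, not a must
)

type MemoryAffinity string

const (
	MemoryBind        MemoryAffinity = "bind"        // single-NUMA semantics
	MemoryInterleaved MemoryAffinity = "interleaved" // NUMA-spread semantics
)

type ComputeClaimParameters struct {
	Cores          int    // fixed core count; ranges are out of scope for Alpha
	DeviceAffinity string // "required", "preferred", or "none"
	MemoryAffinity MemoryAffinity
	Isolation      CPUIsolation
	CoreSibling    SiblingPolicy
}
```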
A
This is a little addition to the KEP. Further additions follow in the architecture section. We removed a little bit there: before, we were mentioning that we were not sure how to associate pods and drivers.
A
Now
we
mentioned
that
we
will
use
the
resource
class
at
the
array,
for
that
and
CCI
resource
manager
basically
will
be
using
another
path
for
the
registration
of
drivers,
Warren
CCI,
which
is
not
to
have
conflicts
with
jury,
basically,
which
is
running
under
Warren
dra.
A
Right, so basically, after that you are creating the entry points to the plugins in /var/run, the DRA one; was it like that?
B
That's under... all that's in there is the CDI specs. Everything with the plugin, the kubelet registration, and the sockets between the kubelet goes back and forth. The plugins all happen in the standard plugin directories, which are at /var/run/kubelet, or I forget what the exact path is, but it's something underneath there.
B
Yeah, I mean, there's two sockets that are created. One is for the kubelet's connection to the plugin, and then the other is from the plugin to the kubelet, and there's two separate directories. One directory is called "plugins", where you create a driver-specific directory for each of your individual drivers, and then for the reverse connection it's under the "plugins_registry" directory, which is how, you know, CSI plugins and all other plugins register to the kubelet, yeah.
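For reference, here is a minimal sketch of the registration flow described above, using the kubelet's pluginregistration v1 gRPC API; the "CCIPlugin" type string, the driver name, and the socket paths are assumptions for a hypothetical CCI driver:

```go
package main

import (
	"context"
	"net"

	"google.golang.org/grpc"
	registerapi "k8s.io/kubelet/pkg/apis/pluginregistration/v1"
)

type registrationServer struct {
	endpoint string // socket the kubelet will call the driver back on
}

// GetInfo is called by the kubelet when it discovers the registration socket.
func (s *registrationServer) GetInfo(ctx context.Context, req *registerapi.InfoRequest) (*registerapi.PluginInfo, error) {
	return &registerapi.PluginInfo{
		Type:              "CCIPlugin", // hypothetical type; not an existing kubelet plugin type
		Name:              "cci.example.com",
		Endpoint:          s.endpoint,
		SupportedVersions: []string{"v1alpha1"},
	}, nil
}

// NotifyRegistrationStatus tells the driver whether registration succeeded.
func (s *registrationServer) NotifyRegistrationStatus(ctx context.Context, status *registerapi.RegistrationStatus) (*registerapi.RegistrationStatusResponse, error) {
	return &registerapi.RegistrationStatusResponse{}, nil
}

func main() {
	// The kubelet watches this directory for new registration sockets.
	lis, err := net.Listen("unix", "/var/lib/kubelet/plugins_registry/cci.example.com-reg.sock")
	if err != nil {
		panic(err)
	}
	srv := grpc.NewServer()
	registerapi.RegisterRegistrationServer(srv, &registrationServer{
		endpoint: "/var/lib/kubelet/plugins/cci.example.com/plugin.sock",
	})
	_ = srv.Serve(lis)
}
```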
A
I remember in my prototype I basically did a unique location for the CCI stuff, so that it doesn't get mixed with the DRA. Maybe I mixed up the paths here; I will double-check that, it might be wrong. But the point here is: I have a unique path to the socket, not using the DRA kind of sockets but having a CCI socket for the registration, so that those two are not mixed, more or less.
A
I will correct that; I will double-check it and make it correct. The other kind of addition was on the scheduler side. I was looking a little bit at the Kubernetes scheduler. The kubelet provides this so-called node listener architecture; the calls are here, yeah, which would be interesting. Just let me find the right spot in the editor. There is a pod resources server, part of the kubelet, which is responsible for exposing available CPUs, available devices, and available memory to the scheduler; the scheduler queries each node for those values through this listener, more or less. So I added some clarification that we have to provide correct information about what the allocatable CPUs and allocatable memory are, at least for the long-term beta release, where we want to coexist with the static kind of CPU management and stuff like that.
A
So
basically
those
we
have
two
ads
a
functionality
which,
where
the
CPU
provider
get
allocatable
CPUs
is,
is
calling
CCI
manager
to
to
get
the
allocated.
Those
CPUs
and
memory
before
this
was
calling
basically
CPU
manager
that
that
would
be
one
one
thing
we
we
have
to
to
enable
for
CCI
so
that
scheduler
knows
what
are
the
allocate
to
those
CPUs
allocatable
memory.
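As an illustration of the node listener being discussed, here is a hedged sketch of a client calling the kubelet pod resources API's GetAllocatableResources, the call whose answers the CCI manager would have to keep accurate; the socket path is the conventional one, and error handling is trimmed:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	podresourcesv1 "k8s.io/kubelet/pkg/apis/podresources/v1"
)

func main() {
	conn, err := grpc.Dial("unix:///var/lib/kubelet/pod-resources/kubelet.sock",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	client := podresourcesv1.NewPodResourcesListerClient(conn)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// This is the call that would have to be answered by the CCI manager
	// instead of the CPU manager, so the reported CPUs and memory stay correct.
	resp, err := client.GetAllocatableResources(ctx, &podresourcesv1.AllocatableResourcesRequest{})
	if err != nil {
		panic(err)
	}
	fmt.Println("allocatable CPU IDs:", resp.CpuIds)
	fmt.Println("allocatable memory:", resp.Memory)
}
```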
A
Yeah, so I mentioned in the KEP the component name, pod resources server. I don't know if this is sufficient; I can also mention the functions.
A
Right, this was a small addition, this paragraph. Then the other kind of addition was the checkpointing. Yes: as the CCI manager becomes the component responsible for CPU management, it has to provide checkpointing. So we currently have save and load functions, which will basically save the state to the store. And yeah, we can do it similarly to the CPU manager: we can have a CCI manager state.
A
Basically, this is saved, and loaded when we need it. That is another small change, so that we can get checkpointing in. Then we were discussing...
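A minimal sketch of what such a checkpoint could look like, modeled loosely on the kubelet's cpu_manager_state file; the CCIManagerState type, the field names, and the file layout are assumptions, not the KEP's actual format:

```go
package main

import (
	"encoding/json"
	"os"
)

// CCIManagerState is a hypothetical analogue of the CPU manager state:
// which CPUs each container was assigned, plus the default shared pool.
type CCIManagerState struct {
	DefaultCPUSet string            `json:"defaultCpuSet"` // shared pool, e.g. "0-1"
	Assignments   map[string]string `json:"entries"`       // container ID -> exclusive cpuset
}

// Save writes the state to disk so it survives a kubelet restart.
func (s *CCIManagerState) Save(path string) error {
	data, err := json.Marshal(s)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o600)
}

// Load restores the state on startup.
func Load(path string) (*CCIManagerState, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var s CCIManagerState
	if err := json.Unmarshal(data, &s); err != nil {
		return nil, err
	}
	return &s, nil
}
```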
A
Yeah, okay. So, going back to the KEP: the next kind of proposal was whether we can make the CCI manager component completely self-manageable, without the need for any DRA kubelet drivers or kubelet plugins. There was a suggestion that we could basically handle the claim logic, which is used to reserve resources for a given claim and to free them, inside the CCI drivers, and I was thinking of a first iteration of a possible interface.
A
We could do that by taking our admit and remove-container-resources functions and adding more or less the claim parameters to them. I have a little bit better view on that; you can see it nicely in the gRPC definition. So this is very similar to NodePrepareResource.
A
If you look at NodePrepareResource of the DRA, it defines those four fields, and we can pull them into the admit kind of request; this can help us more or less keep track of the incoming resource claims and do reservations. We will also use the resource handle as a good way to pass the arguments. Basically, before we had the CCI spec there; now we can just pass the resource handle, and that would be sufficient. And then, similarly, we use the remove-resources request basically to free claims, and...
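A hedged sketch of the interface being described: the four claim fields mirror DRA's NodePrepareResourceRequest (namespace, claim UID, claim name, resource handle), while the request and interface names and the extra pod/container context are assumptions:

```go
// Hypothetical CCI admit/remove interface built around the DRA claim fields.
type AdmitContainerResourcesRequest struct {
	// The four fields carried over from DRA's NodePrepareResourceRequest:
	Namespace      string // namespace of the ResourceClaim
	ClaimUID       string // UID of the ResourceClaim
	ClaimName      string // name of the ResourceClaim
	ResourceHandle string // opaque data stored by the controller at allocation time
	// Additional context a compute driver needs (assumed):
	PodUID        string
	ContainerName string
}

type RemoveContainerResourcesRequest struct {
	ClaimUID string // identifies the reservation to free
	PodUID   string
}

// CCIDriver is the hypothetical plugin-side service the kubelet would call.
type CCIDriver interface {
	AdmitContainerResources(req *AdmitContainerResourcesRequest) error
	RemoveContainerResources(req *RemoveContainerResourcesRequest) error
}
```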
B
It's the same DRA architecture we have today. It's just that, instead of what we have currently defined as the kubelet plugin for DRA, which returns a set of CDI devices, you guys can write your own variant of a kubelet plugin that has its own gRPC API, which is what you're presenting here, and potentially a different set of, you know, calls that are made back and forth at different points in the container lifecycle, in order to support CCI devices instead of CDI devices.
A
That
that's
the
point
we
we
call
them
a
little
bit
differently,
the
functions
we
we
and
we
have
a
little
bit
more
information
as
we
need
some.
Some
traditional
information
for
the
containers
and
Bots,
but
basically
we
have
the
admission
function
is
very
similar
to
more
hold
some
of
the
information.
What
not
prepare
resource
was
having
before
and
our
remove
container
resource.
Basically,
it's
we.
We
can
maybe
rename
them
to
admit
resource
or
something,
but.
A
Yep. That's the main change we made in the KEP so far. I think, other than that, we tried to address a lot of the other issues. Just at the end of the... oh yeah, there is one final thing, in the alternatives section at the end.
A
We have these kinds of annotation-based approaches that were suggested; we shortly describe them there.
A
Right, so yeah, we will merge that to master, to our kind of master, today most probably, and then, for all people interested: if you can take a look, we will send a link, and give us some feedback.
D
I have a question, if possible. (Sure.) Sorry if it's already answered; I was on vacation. So last time, at the last meeting I participated in, I asked a question about the other pods, the ones that are not referencing any claims and just request, like, some amount of CPUs and memory, right? And you explained that those pods could also be scheduled to the same node where this CCI driver runs. Is that correct?
D
Yeah. And how, in this situation, can we avoid, like, double accounting? I mean, on the one hand the CCI driver would maintain allocations of CPU and memory, like, in future, and it would also be possible, like, if those pods that are not referencing...
A
Basically, after making a call to the admission, we get the resulting resource set for the pod which was handled by the driver, and we put it in the store, keyed, let's say, with some container ID, together with the resource set. And later, let's say you have another pod which was assigned to that node by the scheduler. To get it assigned correctly to that node, we still have to consider the available resources.
A
This was the point I mentioned at the beginning: for correct functioning of the scheduler, the available resources have to be propagated down to the node listeners, or that class has to be available there. And basically, to ensure no double accounting, the resource store has to be used. So when new containers come in, they will be asking the resource store for available CPUs, and the resource store knows at any time which CPUs are available, more or less.
A
It will maintain a view of the available resources which will be correct at any time for us, but...
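Here is a minimal sketch of the resource-store idea just described, simplified to a plain CPU count rather than real cpusets; all names are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// ResourceStore records every admitted reservation so that admission of
// new containers sees a consistent view and nothing is double-accounted.
type ResourceStore struct {
	mu          sync.Mutex
	allocatable int64            // total CPUs on the node
	reserved    map[string]int64 // container ID -> reserved CPUs
}

func NewResourceStore(allocatable int64) *ResourceStore {
	return &ResourceStore{allocatable: allocatable, reserved: map[string]int64{}}
}

func (s *ResourceStore) used() int64 {
	var used int64
	for _, n := range s.reserved {
		used += n
	}
	return used
}

// Available is what would be reported up through the node listener.
func (s *ResourceStore) Available() int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.allocatable - s.used()
}

// Reserve fails if the reservation would exceed the node's capacity.
func (s *ResourceStore) Reserve(containerID string, cpus int64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.used()+cpus > s.allocatable {
		return fmt.Errorf("insufficient CPUs: want %d, have %d", cpus, s.allocatable-s.used())
	}
	s.reserved[containerID] = cpus
	return nil
}

// Free releases a reservation when the claim is removed.
func (s *ResourceStore) Free(containerID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.reserved, containerID)
}
```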
D
That would happen on the kubelet's side, right? Isn't it already too late then? So I was kind of thinking that some changes should be done on the scheduler side, to avoid this kind of situation where we actually schedule a pod that actually cannot be served; that's the problem.
A
To avoid that, basically, we had a small kind of verification. Let me find it again... yeah, here it is; I will show the code in a second. But basically there is a so-called pod resources server inside the kubelet, which is this kind of node listener. So the... yeah.
A
This is a code snippet of the pod resources server in the kubelet, and what the scheduler does is call this get-allocatable-resources for each node, and...
C
Another point: this is at least the case for the NUMA-aware scheduling solution we were working on, and this was the recommended solution. But in any case, as best as I know, and I'm not sure this changed with the DRA, I think no, it didn't, the scheduler is never, ever calling the node directly; it never communicates with nodes. And I'm...
C
The thing is, if I understand the concern correctly, and please correct me if I'm wrong: this is exactly the point. If you provide this data but you don't change the scheduler, the scheduler is not aware of this data, so we are back to square one, and I'm not sure this actually answers the question.
C
Okay, so the default scheduler doesn't need this data. The scheduler plugin which enables the topology-aware scheduling actually needs to consume this data, have the intermediate representation, actually the API object to represent this data, and then consume it. But all of those steps are explicit. So if, and I really mean "if", because I'm not up to date on the changes to the KEP, so what I'm saying could be obsolete already...
A
In any case, in terms of DRA scheduling, which is covered by the controllers, this is not really relevant. This becomes relevant as soon as you turn on static CPU management.
A
So if you have static CPU management and there are standard pods coming in with some guaranteed quality of service, that is when this actually becomes important. But in an environment where you don't have pods with static, with guaranteed quality of service and stuff like that, all the scheduling, at least for the pods with claims, will be covered by the controllers in the DRA.
A
So
one
one
of
the
kind
of
requirement.
What
we
will
have
is
for
Alpha
phase,
our
kind
of
pots,
which
will
get
claims,
don't
get
or
they
will
be
using
CPU
management,
Norm
and
additionally
use.
Actually,
we
would
avoid
specifying
request
limits,
request
the
CPU
request,
limits
and
and
and
yeah,
basically,
usually
in
the
container
spec.
You
have
request
limits
and
yeah,
as
as
the
the
whole
specification
of
how
many
cores
you
want
and
stuff
like
that
is
happening
through
the
resource
claim.
A
So we would require that requests and limits are basically left out, not included in the pod spec, and in that case scheduling is handled by the DRA controllers. So...
A
...make sure it's okay? Okay. This is maybe a nice simplification for the Alpha version. Later, when we have the static kind of, the guaranteed quality of service, we have to come back to this point and take a look exactly at the allocatable CPUs and allocatable memory, and whether we need to somehow propagate the data further to the scheduler. But for Alpha I think we are fine, as long as our pods do not have requests and limits; and in any case we are not using the static policy so far.
D
But even in this case, some pods would be scheduled to the same node, and they will anyway consume some CPU.
A
We have, similar to the static CPU manager, a distinction between kind of the exclusive pool and the shared pool, and yeah, so basically those pods we can put in the shared pool.
D
And this CCI driver would not actually work with those, with that pool, with the shared pool?
A
It can also work with the shared pool. If I switch to the KEP: we have basically an isolation level called "shared". So if you want to put certain applications on the shared pool, you can do that with that flag. But in that case we don't care if they overlap, because they are shared; it's known by contract that they can overlap.
A
Yeah, this was an addition we did to the KEP, a little bit more specification of the attribute-based stuff and of what the claim kind of integration can look like. That was the addition.
D
What I would like to ask about is the situation when we have a CCI driver and some DRA drivers. How would they... I assume that they would just be, like, filtering claims by driver name or something like that, right? So as to understand which claims should actually be served by a certain plugin, which...
A
By driver name, yeah. And, as we were discussing at the beginning, we can create a unique socket for the CCI kind of registration.
D
Like, all those plugins, they will be filtering all pods referencing any claim, so that kind of concerns me in terms of performance. So, like, if... if...
B
I don't follow. So, in the same way that, you know, standard DRA works: I have my controller, and my controller allocates resources however it knows how to allocate resources, based on being called out to by the scheduler, right? None of that changes in this world. Yeah, yeah.
B
It knows how to route the request to the standard DRA path or the CCI path, based on the driver name.
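A tiny sketch of the driver-name dispatch just described, reusing the hypothetical CCIDriver interface from the earlier sketch; everything here is illustrative:

```go
// routeClaim decides whether a claim is served by a registered CCI driver;
// if not, the claim falls through to the standard DRA plugin path.
func routeClaim(driverName string, cciDrivers map[string]CCIDriver) (CCIDriver, bool) {
	d, ok := cciDrivers[driverName] // e.g. "cci.example.com"
	return d, ok
}
```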
D
Yeah, but wouldn't it, like, raise some performance issues? Because, like, for CDI, for these DRA devices, the amount of pods referencing those claims is kind of, like, not that big, unlike in this case with CPU and memory. So, I mean...
D
That is the concern, yeah. And, like, this code path would be called much, much more frequently than now, so it potentially can create some performance issues.
B
Yeah, I mean, that was one of the initial arguments for keeping the CPU manager in the kubelet to begin with. But I don't see there being much more overhead doing this than if we were to, you know, turn the CPU manager into a plug-in via any architecture, right? The minute you decide to have this CPU allocation be done by a plug-in... and so that's the way we're going to do it going forward.
D
Yes. So basically, in the case of, like, many claims, a much bigger amount than for the devices. So basically, like, let's imagine that every pod scheduled to the node would be referencing some claim, because it wants to use some CPUs and memory; in this case the amount of queries to the API server, because we need to, like, for each...
D
Okay, so this should be somehow captured.
A
We can just mention it, basically: processing the claim will still require the connection, or, within the kubelet, you still need to contact the Kubernetes API server for that. Until that is optimized away, it's, yeah, it's a limitation which is there, yeah.
B
But it slows down everyone else, if...
B
Yeah, we'll see. I mean, by design, pod admission is not and cannot be parallelized, but pod creation, and starting a container and running it, can be, in theory, although it's not at the moment. Yes.
A
Yeah, we will mention it in known limitations, something like that.