From YouTube: Kubernetes SIG Node 20200521
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A
A
C
A
D
Sure, so just a quick background on this — I didn't put in a slide because I just threw these together right before the meeting started — but the motivation here is to allow us, within the device manager, to have some influence over what allocations to make from the devices given to us by the various plugins that hook into the device manager. Because right now we've got the topology manager, which aligns based only on per-NUMA-node-level information.
D
So that's just a quick background before I get into the specifics. The first thing I wanted to show here was a quick overview of how the topology manager currently interacts with the different providers. Basically, we run through this quick five-step algorithm where, for every container in the system, the topology manager will call out to a hint provider to say: hey, give me some hints about where you have the ability to allocate this container in an aligned manner right now.
D
D
It just maintains a list, and it pulls the first two, three — however many you've asked for — off the top of that list, makes sure that they are aligned with whatever the merged hint was that came in, and then allocates those out. There's no opportunity for the device plug-in to come in and say: oh, you know what, given the current state of allocation, and given that merged hint that was just given to you, maybe I have some more information about what might be better.
D
[The idea is to] also gather some sort of preferred devices — given that I'm asking you to allocate two devices, which two are the preferred set of devices that you would want me to allocate? Then I can gather that information, combine it with the merged hint that I got, along with whatever the available device list I have is, and make a more informed decision about which set of devices I actually want to allocate out. And the types of things that this device plug-in might do — it might say:
D
Okay, you know what, I want to constrain myself to some topology in a way that I don't want to be integrated back into the kubelet, because it's nothing but inter-device constraints: it's all about this device type that I'm currently operating with, and so I'm the only one that needs to know this information about it.
D
A
D
So yeah, I work for NVIDIA — GPUs are the obvious answer for us. We have an implementation somewhat similar to this in our internal cluster right now; it's based off of a fork of Kubernetes that we run, and that's kind of the first use case. But the API is designed to be more general, for any device type. It doesn't have to be GPUs, because it's just about: give me the list of preferred allocations for whatever device types you happen to have, right?
G
D
E
D
Yeah, the thing that we're talking about today, though, isn't the topology manager — it's kind of after the topology manager has done its stuff. Now the hint provider — in this case, specifically the device manager — is trying to do its allocations based on alignments that the topology manager has told it about, and I don't want it to do that based just on those node-level constraints; I want to make sure that I give the plug-in a say.
D
H
D
The idea, at least for this first generation of this, is that it would be best effort. So you're getting a preferred allocation, but the final decision is ultimately left up to the device manager — which devices it wants to allocate. Okay, yeah — obviously you could have more sophisticated constraints put in place, in the same way that the topology manager itself will attempt to, you know, fail admission of the pod.
A
One follow-up question: so for some devices, when the call from the device manager to the plug-in is that GetPreferredAllocations — is it likely that some may provide the same hint, or is this a different API? So it's a little bit different, I guess, yeah.
D
Again, so it's not about providing hints — it's probably more clear when we get to an example in the slides — but it's not about providing a hint, it's about providing the preferred allocations. GetPreferredAllocations — you can have a little bit more information than this, but it's going to say something like: hey, device plugin, I'm about to allocate two devices; tell me which set of two devices you prefer me to allocate. All right.
D
So this is the actual proposed change, at least from the API standpoint. There would be a new API call called GetPreferredAllocation, which takes a PreferredAllocationRequest and returns a PreferredAllocationResponse. The PreferredAllocationRequest has three fields: one of them is the set of available device IDs.
D
[The second is the must-include device IDs.] This says: given the available device IDs, whatever preferred allocations you come up with have to include this subset of devices. The reason for passing this along is that we want this API to remain stateless. We don't want to assume that the device plug-in knows anything about what devices have already been allocated — what devices may have been allocated to admitted containers in the past that we want to make sure we reuse in future allocations, and so on.
D
And so this is just a way of giving them all the information they need to say: okay, I have to include these guys, and this is the list from which I can allocate the specific number of devices. Then, based on that, it's going to send back a PreferredAllocationResponse, which is just a list of preferred allocations, which currently just contains a list of devices.
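Pieced together from the description above, the request/response pair might look roughly like the following Go sketch. These are plain structs standing in for the protobuf-generated types in the device plugin API; the field names are paraphrased from the talk, not copied from the real definitions:

```go
package main

import "fmt"

// PreferredAllocationRequest mirrors the three fields described in the
// talk: the devices still free, the devices any answer must contain,
// and how many devices to pick.
type PreferredAllocationRequest struct {
	AvailableDeviceIDs   []string
	MustIncludeDeviceIDs []string
	AllocationSize       int
}

// PreferredAllocationResponse wraps a list of allocations behind an
// extra level of indirection so more fields (e.g. a cost or priority)
// can be added later, as the speaker notes on the next slide.
type PreferredAllocationResponse struct {
	PreferredAllocations []ContainerAllocation
}

// ContainerAllocation currently just carries a list of device IDs.
type ContainerAllocation struct {
	DeviceIDs []string
}

func main() {
	req := PreferredAllocationRequest{
		AvailableDeviceIDs: []string{"gpu0", "gpu1", "gpu2", "gpu3"},
		AllocationSize:     2,
	}
	resp := PreferredAllocationResponse{
		PreferredAllocations: []ContainerAllocation{
			{DeviceIDs: []string{"gpu0", "gpu3"}},
		},
	}
	fmt.Println(req.AllocationSize, resp.PreferredAllocations[0].DeviceIDs)
}
```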
D
But if you go to the next slide — I wanted to make sure to leave this last message with an extra level of indirection so that it could include more fields later on as needed. So maybe we don't want to just say "this is the list of devices"; we might also want to add a priority, or a cost of allocating this set of devices versus another one.
D
Okay, next slide. So this might make it a little bit more clear exactly what this proposal is saying — this is a more concrete example. So imagine you've got this setup on the left, which is the actual topology of a DGX-1 Volta machine: it has two sockets, forty CPUs on each socket, and four GPUs on each socket. Those constraints — I've got this socket, I'm on this NUMA node, I'm connected to these CPUs — those are kind of per-node constraints.
D
Those are constraints I would want to eventually incorporate into my topology manager and node-level topology hints. But a constraint that should probably never be encoded there — which this API now allows us to encode — is the interconnection of different GPU devices using an interconnect technology called NVLink. So, in the green and gray bars down below, you can see how, for example, GPUs 0 and 3 are interconnected by two lines, a gray one and a green one. That represents that these guys have a strong NVLink connection between them.
D
If the request that I got coming in was — hey, no GPUs have ever been allocated yet on this system, there are seven of them available, someone's asking for two — what is the set of two devices that you would prefer me to allocate? Then, based on running some internal algorithm to figure out what the best set of two devices to allocate would be, it's going to send back this response saying:
D
If you're going to allocate two, I'd rather have you allocate either 0 and 3, 1 and 2, 4 and 7, or 5 and 6. I don't want you allocating 4 and 0, for example, because that would be a NUMA hop — they're not interconnected by NVLinks — among other topology constraints that it takes into account. Next slide.
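As a toy version of the kind of internal algorithm being described, the sketch below hard-codes the strongly linked pairs from the DGX-1 slide ({0,3}, {1,2}, {4,7}, {5,6}) and returns only those pairs whose members are both still available; the helper names and structure are mine, not NVIDIA's actual implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// strongPairs lists GPU pairs joined by a strong NVLink connection on
// the example DGX-1 topology from the slide.
var strongPairs = [][2]int{{0, 3}, {1, 2}, {4, 7}, {5, 6}}

// preferredPairs returns the strongly linked pairs whose members are
// both in the available set, in sorted order.
func preferredPairs(available []int) [][2]int {
	avail := map[int]bool{}
	for _, id := range available {
		avail[id] = true
	}
	var out [][2]int
	for _, p := range strongPairs {
		if avail[p[0]] && avail[p[1]] {
			out = append(out, p)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i][0] < out[j][0] })
	return out
}

func main() {
	// All eight GPUs free: every NVLink pair is a preferred answer.
	fmt.Println(preferredPairs([]int{0, 1, 2, 3, 4, 5, 6, 7}))
	// GPU 3 already taken: {0,3} drops out, so GPU 0 would only appear
	// in a non-preferred (NUMA-hop) combination.
	fmt.Println(preferredPairs([]int{0, 1, 2, 4, 5, 6, 7}))
}
```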
D
Likewise, if a request for four came in rather than two, you'd want it to be able to return a similar set of preferred allocations. Well, this time it's obviously the four on the left versus the four on the right. This is actually the same allocation that the device manager by itself would come up with, if it was just doing allocations based on the constraints.
D
Obviously, the previous one wouldn't necessarily have been the one that the device manager would come up with even today. And then the last example I have, on the next slide, is just incorporating this notion of having the must-include device IDs, where, if I pass the available ones as all eight, and I want to make sure that it always includes both 0 and 4, then it could run its algorithm under the hood and figure out: okay.
D
And that's basically all there is to it. Again, there's not a whole lot to this proposal, other than adding this one quick call-out to help the device manager be a little bit more informed about its allocation decisions, instead of just picking at random from the device list that it's given a priori. If you go to the next slide.
D
A couple of things to point out: we would make GetPreferredAllocation a completely optional call. So if a device plug-in doesn't want to implement this yet, or doesn't actually care about making any preferred allocations — it's equally fine to get two devices from anywhere amongst the ones it's made available — it doesn't have to implement this at all. The device manager will just notice that it's not implemented and not do anything with that information.
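Since the call is optional, the device-manager side presumably needs a fallback path. A minimal sketch of that shape — with the plugin modeled as a nullable function rather than the real gRPC client, which is entirely my own framing:

```go
package main

import "fmt"

// preferFn stands in for an optional GetPreferredAllocation
// implementation; nil models a plugin that did not implement the call.
type preferFn func(available []string, size int) []string

// pickDevices asks the plugin for a preference when one is offered,
// and otherwise falls back to today's behavior of taking the first n
// devices from the available list.
func pickDevices(available []string, n int, prefer preferFn) []string {
	if prefer != nil {
		if p := prefer(available, n); len(p) == n {
			return p
		}
	}
	return available[:n]
}

func main() {
	avail := []string{"gpu0", "gpu1", "gpu2", "gpu3"}
	// Plugin without the call: arbitrary pick.
	fmt.Println(pickDevices(avail, 2, nil))
	// Plugin with a preference: its answer is honored (best effort).
	nvlinkAware := func(a []string, n int) []string {
		return []string{"gpu0", "gpu3"}
	}
	fmt.Println(pickDevices(avail, 2, nvlinkAware))
}
```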
D
The same is true if someone disables the topology manager and the plug-in also doesn't implement any of these new API calls. And the last thing to point out, as I mentioned at the very beginning: any of these returned preferences that come back will always be honored in kind of a best-effort fashion. It's not meant to be as configurable and tunable as, you know, the node-level topology hints that we're able to start up the topology manager with.
D
The first one being: you can imagine I've got a hundred devices rather than just eight, and they say, hey, give me the list of preferred allocations for two devices. It's possible that you could end up with some explosion of possible combinations that you would want to send back — because you don't really care too much about which sets are allocated, but you want to make sure it is constrained somewhat.
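The scale of that explosion is easy to see with a binomial count — eight devices give 28 possible two-device sets, while a hundred devices give 4,950 (the numbers are just arithmetic, not from the talk):

```go
package main

import "fmt"

// binom computes C(n, k): the number of distinct k-device sets that
// could be returned from n available devices.
func binom(n, k int) int {
	if k < 0 || k > n {
		return 0
	}
	result := 1
	// Multiply before dividing; the running product is always
	// divisible by i, so integer division is exact.
	for i := 1; i <= k; i++ {
		result = result * (n - k + i) / i
	}
	return result
}

func main() {
	fmt.Println(binom(8, 2))   // 28
	fmt.Println(binom(100, 2)) // 4950
}
```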
D
D
D
A
D
D
Actually, I don't see the difference in the question. I think the short answer probably is no, because again, the final decision is ultimately left up to the device manager, no matter which way this works out. So even if the plug-in sent back a preferred allocation that included something that was impossible to combine with the NUMA constraints that came from the topology manager, the device manager wouldn't do that allocation. It would instead ignore what the plug-in said and only allocate according to the topology manager constraint.
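That precedence rule — topology manager constraints always win over plugin preferences — could be sketched as a simple subset check. The framing and helper names here are mine, not the actual device manager code:

```go
package main

import "fmt"

// allowedBy reports whether every device in a proposed allocation is
// permitted by the topology manager's merged hint, modeled here as a
// plain set of allowed device IDs.
func allowedBy(allocation []string, allowed map[string]bool) bool {
	for _, id := range allocation {
		if !allowed[id] {
			return false
		}
	}
	return true
}

// choose honors the plugin's preference only when it fits the
// topology constraint; otherwise it ignores the preference entirely
// and uses the device manager's own pick.
func choose(preferred, fallback []string, allowed map[string]bool) []string {
	if allowedBy(preferred, allowed) {
		return preferred
	}
	return fallback
}

func main() {
	// The merged hint only permits the NUMA node 0 devices.
	allowed := map[string]bool{"gpu0": true, "gpu1": true}
	// A preference inside the constraint is kept.
	fmt.Println(choose([]string{"gpu0", "gpu1"}, []string{"gpu1", "gpu0"}, allowed))
	// A preference crossing NUMA nodes is discarded.
	fmt.Println(choose([]string{"gpu0", "gpu4"}, []string{"gpu0", "gpu1"}, allowed))
}
```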
C
[How would we] account for this — like, would we be writing a mock device plugin stub that exercises this additional call? One of the things I'm trying to think about when we look at all these things is: what's the sustainable way we in SIG Node ensure that we don't regress as we get more of these.
D
Yeah, I think that's probably a combination of, obviously, extensive tests where we kind of mock out what we expect the device plug-in to do, based on things that come from a device plugin that's just mocked out in the test. But I think also we would want to build an extensive suite of integration tests, or the e2e tests that we have. But again, I think you could do it with a mock plug-in, a stub plug-in.
D
Like you mentioned, it doesn't necessarily have to be a full-on device, because it's very easy to write one of these stub plugins — the plug-in interface is very, very simple. All the plugins do is enumerate their set of devices up to the device manager; the device manager notes those down, and then any time it decides to make an allocation, it informs the device plug-in, who then has the opportunity to tell it:
D
Oh, here are some things I want you to pass down to the container runtime, based on the fact that you have decided to make that allocation decision. So standing one of these things up to — not necessarily test that a device plug-in itself works, but that the interface between the kubelet and the device plug-in works — is actually pretty straightforward.
D
I
Sorry for the audio quality, I'm a bit on the road, Kevin. One thing which I wanted to note — what I listed in the last talk, where I had a proposal to enhance the hints before selecting, like, inter-devices — what I'm thinking is it would be a lot more useful to actually extend the topology hints to include a cost matrix, within each device and between the devices.
I
Because when it makes a decision about selecting multiple devices, or selecting multiple instances within one device, it will be on the topology manager side. Because if we add one more call, this additional call might be used as a way to trick the topology manager. So, like, imagine the situation where the topology manager says: I have these six devices available — and a clever device plug-in says: oh, so somebody is going to allocate some subset of devices from these six available.
D
So that you can kind of, a priori, as the person that's about to deploy a pod to this node, know: if this set of constraints is true, I know I'm going to get this kind of alignment. Whereas this is going to be a bit more up to the different plugins, in some ways, as to what devices you'll actually get allocated at the end of the day.
I
My worry would be — if the plugins report, like, a graph of interconnects between the devices, and between instances of devices within one domain, then it will be one single, clear algorithm. If we do an additional call, it means some preferences will be dynamically changing, if the plug-in implements some clever logic in this additional call, yeah.
D
D
I don't understand that piece of it — the on-the-fly changing of the preferences — because again, it's really just about doing a little bit better than what we have today. Right now, all that happens is the device manager says: I've got this list, I'm going to pick two random things out of it. I don't see how it could be any worse.
I
A
D
H
D
It can use that as a pre-filtering step to decide if it even makes sense to call out to this API call to the plugin, because it still has a little bit of leeway in the options it could make. And then, now that it says — okay, there are three devices sitting here in front of me that I could allocate from — why don't I go ahead and ask the plug-in which two of those three I should go ahead and use? So it doesn't interfere with the higher-level algorithm.
I
A
I
What I would want to ask is whether this call will be within one domain of devices. Well, I know, like, NVIDIA just recently introduced the A100 devices, where you have not only GPUs, but you also have, like, connectivity close to the GPUs — so you very soon will end up with this scenario where you need to have a pipeline of devices. So I would suggest, let's think about it — actually exposing, like, a real topology between multiple domains of devices, so.
D
I
For example, like GPU plus some NIC, or GPU plus an FPGA, or something like a generic PCI device. So, like, for example on the DGX A100 you have GPUs that could be connected to NVMe directly — so you might want at some point to allocate a GPU plus an NVMe for your workload. That was my one short comment.
D
D
Yeah, I still think we should take this offline, because I still think that having this — again, like I said, it's kind of just the last level of filtering, if you want to call it that — would still be beneficial, even if we have solved the constraints we're talking about in more of a cost-based analysis, for me.
D
J
A
E
I also had one question — at the device-ID level, I think this is covered by the device manager, or is there a separate management level? Because, yeah, this also ties back to the earlier question about the slides: you pass the topology hints through to the device plug-in, right, and on the other hand there is a high-level policy that will, like, hint about the device.
D
E
D
D
A
D
A
B
D
C
I appreciate that — that was kind of the earlier iterations I had seen. I was struggling to think: if we kept adding — if we added a pod policy as a peer to single-numa-node or restricted — like, I really wasn't sure if we were reaching a state where I would know how to recommend to an end user how to deploy the system anymore. So I think the scope variant makes sense.
C
D
C
Like, how do — do folks feel that container-by-container use cases are more prevalent, or pod-by-pod use cases are more prevalent? It seemed like VG, if I recall, felt that his use case in particular mandated that it had to be pod, but I was curious — on the same note — if there was another use case that said the container use case was required as well.
I
Well, based on our trials, we're mostly thinking about scenarios like the ones VG described: you might have, say, two data-processing containers which need to be strictly aligned, but then you can have, like, a bunch of sidecars which you really don't care about, because the amount of CPU time, the amount of data traffic that goes through them, or, like, shared memory, doesn't really affect performance much. So it might create more problems in this, like, more-than-two-container setup. So.
C
C
Do people feel that we could combine, say, this enhancement with some of the concepts that you had presented last week or two weeks ago in SIG Node, to get to a more unified end state where, rather than have the kubelet be configured one way, we could allow the pod author to express the performance constraint they desired?
I
Well, as a first step, I think having an additional policy which aligns everything there is, as in 140 — it's good progress. We only just need to be prepared for scenarios where it might not be available, like the amount of resources. So we again come back to a scenario of rejecting the pod from one node. So we just need to be prepared for that scenario, yeah.
C
So what I feel like we're cautious about is: once we add this — in my view — we can never take it away, right? Like, I'm not naive enough to not understand that global 5G networks will start deploying with this on, and, you know, it will be the thing that is expected to work for the next 50 years, right? And so I just wanted to make sure that, as a community — and especially you all, your view — was being heard, because I felt like it was possible.
C
It's not clear that you'll have a one-answer-fits-all outcome on a node forever — and maybe that is the outcome for the next 12 to 24 months. But I just felt like this was a forum for us to ask if some of the ideas that you had presented should be collapsed into this, so that the desired UX could be expressed away from the node configuration, but on the pod, as desired.
I
The UX — I think, well, we already have, in the pod spec, the node affinity and anti-affinity pattern which we are using for the scheduler. So if we can expand the pod spec to include also this, like, container affinity and anti-affinity, I think then we can achieve all those variants of combinations. Regardless of how it will be implemented on the backend, it will be just, like, one single UX for the user.
I
The question is, like — well, based on what I've observed of pod spec changes that come into Kubernetes, it's a very long road to go. So if for some reason we need to enable something right now, and we have the topology manager right now — maybe it is a temporary solution until we get the pod spec to define it, yeah.
C
I want to combine Istio and core Kubernetes and all these things at once, and the intersection of, like, the final outcome pod spec is not always as simple or as well understood. So I'm just trying to understand: if we did pod-level scope alignment, are you getting the right outcome if that is a user who is using Istio or anything else?
C
It's just something I'm trying to be sure that we're careful about, or at least calling out. Same with everyone in, like, the CNF space — it's hard to, like, show someone: this is the pod spec for a CNF function. And if folks will go and write other mutating webhooks that change that CNF at deployment time — so, in general, I'm kind of wondering: is it better
C
if we find a way to blend the hinting model you had, Alex, with the static configuration model that this proposal defined? And I'm also very open to being convinced that that's just a horrible problem to try to work out, and let's make progress as a community and do incremental steps. But I do want to use this as a forum to, like, actually have an open conversation about it.
C
D
Yeah, no, I completely agree with what you're saying. I hadn't thought too much about the fact that 25 years from now someone might still be using this, but I do tend to agree that, you know, once something's in — whether it's beta, even alpha — some people will always come back and complain: hey, this was there, why did you take this away from me, what's the alternative now? So I agree, we should definitely think carefully about whatever flags we add.
I
My way of thinking is, like: if a new feature will supersede whatever was there previously, then it will be easy to obsolete. But yeah, I agree, we have an opportunity right now to define how it will be visible to a user for the next several years. So let's come up with something usable.
I
C
I don't want to make a final approve-or-not-approve call on any particular enhancement until we feel, as a group, we've gone through the design space — and whether we feel like, after today's call, we've done that, or we have a couple of other calls that we want to have. I really want to use this: we have 20 super-sharp engineers together who have a chance to
C
try to get as much alignment as possible in one design space. And if it turns out that there are enough unique swimlanes that we need to take three different approaches, that's fine. But right now I'm not feeling comfortable approving this enhancement until I feel like the other 18 folks on this call are in complete alignment as well. Does that make sense?
B
D
C
A
A
A
J
C
Maybe what we can do on this one, since we are over time — I think there was some confusion this week about whether we were supposed to hold a meeting on Tuesday morning or not. Do we want to have a meeting to allow you to go through these slides before next Thursday? Which I'm more than happy to do. Maybe we can leave this call with parting thoughts on having folks review the slides and think about your question — but do we want to meet sooner?
C
Should we optimize our design thoughts around clusters for specific use cases, where the way we approach flags and configuration would be: this is what the recommended kubelet config for an AI cluster would look like — and does that look different than the recommended config for a cluster running CNFs? And I think it's helpful for us if we, as a community, come to grips with the idea: do we see node pools configured for particular workload types? And maybe I'm putting thoughts in your head, Kevin, which aren't accurate, like I.
D
That's definitely true, and not only that — I'll just add one more thing: we constrain, or plan on constraining, the way that the nodes are organized — what types of machines are there, what labels you put on these nodes — in a way that lets you run scheduling algorithms differently, and so on. So.
C
And so that's kind of the design tension I have when we talk about UX, and maybe some of what Alex's proposal did. I kind of view Alex's proposal as: I have a general-purpose pool of compute, let me give you hints on the right way of using it. And then I view some of the work that Kevin you're pushing, or some of the work that VG and Samson are pushing, as saying: I have a particular class of work that I want to run, and it runs best on a cluster that's been tuned for that workload.
C
Whether that is AI inferencing models versus packet-processing environments — I think we are well served understanding how to evaluate proposals if we say: as a Kubernetes deployer targeted on this workload type, I recommend we structure nodes as X, Y, Z, and start to define the dividing lines that may or may not exist between those different workload types. I think that is very helpful as well, because then the UX around flags is less interesting, right? Because vendors, like Intel, will point out documentation.
C
C
Do we want to solve general-purpose, or do we want to get more domain-specific? And maybe, for each of the proposals that people have out there right now, we can better understand each other as a community if we say: this is the workload I'm targeting, and this is the recommended node configuration for that workload type. I think that would be helpful for me personally, and may be helpful for all of us, to understand how to evaluate if there is a one-size-fits-all answer or not for some of these things.
I
Actually, one thing which I haven't shown in my presentation, but we also thought about this kind of user experience — or sysadmin kind of experience — is what, like, our proxy does: when it starts, it expects to get a config, and we have a daemon — actually a DaemonSet — which reads a ConfigMap and pushes the configuration to this proxy daemon, and it can dynamically update. So it can dynamically reconfigure itself, and it also understands, like, several levels of fallbacks.
I
So there is one ConfigMap for the whole cluster, like a default. Then there is one ConfigMap which can be defined per group — so you can label your node with a specific group, and it will fetch the ConfigMap specific for that group. And then there is a ConfigMap specific to one particular node, where the node name is used as a key to look up the ConfigMap with the configuration. All these kinds of overrides allow a specific configuration particular to that node. So it's possible.
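The fallback order described — node-specific ConfigMap, then group ConfigMap, then cluster-wide default — can be sketched as a simple lookup chain. The map and key names here are illustrative, not the actual ConfigMap names used by that daemon:

```go
package main

import "fmt"

// resolveConfig walks the described fallback levels: a ConfigMap keyed
// by node name wins, then one keyed by the node's group label, then
// the cluster-wide default.
func resolveConfig(node, group string, byNode, byGroup map[string]string, def string) string {
	if cfg, ok := byNode[node]; ok {
		return cfg
	}
	if cfg, ok := byGroup[group]; ok {
		return cfg
	}
	return def
}

func main() {
	byNode := map[string]string{"edge-7": "node-specific.conf"}
	byGroup := map[string]string{"packet-processing": "group.conf"}

	// Node override present: most specific level wins.
	fmt.Println(resolveConfig("edge-7", "packet-processing", byNode, byGroup, "default.conf"))
	// No node override, but the node's group has one.
	fmt.Println(resolveConfig("edge-9", "packet-processing", byNode, byGroup, "default.conf"))
	// Neither level matches: cluster default.
	fmt.Println(resolveConfig("edge-9", "ai", byNode, byGroup, "default.conf"))
}
```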
C
C
Do we focus on purpose-built clusters, or on general-purpose or multi-purpose clusters where each purpose is tied to a node pool or not — as, like, a first thought. And where I see the tension is — let's just pick on edge workloads: I see a lot of folks that say, I want to run clusters on the edge, and initially they have a very particular use case they want to run on the edge — let's just say packet processing, right? And so there it's like: the hardware needs to be awesome, everything needs to be awesome.
C
My workload needs to run as fast and awesome as it always did, and it needs to do that for the next 12 months. But then, you know, maybe 24 months from now, I want to start running other general-purpose workloads on that little cluster at the edge. And I want to know if everyone's goals kind of feel like those goals or not — or are we going to say, like: hey, for the next 12 months, as a community, our recommendation is?
C
I
C
The reality of that, though, is always tied back to: what are the workloads you want to deploy? And so, like, if you have one workload that deploys X, and it brings a certain version of cert-manager, and then you have another workload that deploys Y that brought a different version of cert-manager — then, like, inevitably you have a tension there too. And so.
C
If we came out and said: we think this enhancement is awesome for this use case, and we're going to go and try to explore it for that use case, and this enhancement is awesome for this other use case — then we always have the clear use case in mind, or the consumer in mind, for why we're pursuing any given enhancement, rather than leaving it up to the deployer to guess why we built something, I guess, and/or didn't take a more general-purpose approach. So either way, that's my general ask, if we can just be like.