From YouTube: Kubernetes Resource Management WG 20170425
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A
So on the agenda we have four items. The first is to give an update on the face-to-face, to see if there were any questions about it, raise any concerns, or gather feedback on the proposed schedule that Vish threw out there for slotting items. Has everyone had a chance to review that proposal, or proposed schedule, and if so, are there any questions, comments, or concerns that people want to raise?
D
Question, yeah, about the schedule. One other schedule item: before, we had a slot we'd said, but I don't think so, Vish. I have asked about who wants to step up and lead the discussion, but this is still an open question; there's no owner. I've seen a lot of interest more in the resource sharing, and then the last one: could we move it up, only for pod preset? For the pod preset, remember, we were thinking about the ordering of those discussions as a whole.
A
From my perspective, I believe a lot of the issues that we face as a project are stuck on how we do the resource isolation bits, and absent an answer on that, or once we get an answer on that, I think some of the pod preset parts become more naturally expressible, or just less contentious. So actually, from my perspective, if need be, depending on where we land on the first half of the discussion, I'd be happy to make the Tuesday afternoon 3:00 to 4:30 slot support that.
A
How do we make this easier for end users to consume around more enhanced pod presets? Given that, I'll actually probably prepare materials for both topics, and we can use that as kind of scratch space depending on how we go. That's it, thanks. Yep, any other questions, comments, or concerns?
A
So is there any concern about people being able to achieve that, and if not, do they want other folks to try to fill in those gaps? I know for myself, Vish and I have been collaborating on some stuff, but I don't know if anyone else feels like they're overburdened and would like someone else to step up.
C
I suspect that you would need to be escorted by a Googler once you get inside badged areas, and I don't know if the building that we are going to be meeting at has a lobby. If there is one, you can settle down there, and I should be there, and I'll have a few more people to help.
C
But yeah, I mean, if there is no lobby, we'll make sure that one of us is there outside the door just to let people in. But I'm basically hoping that the list that we have in the doc is a sign-up list of attendees, and then maybe we can print badges for that list today or later.
G
It got wider; it looks like the aspect ratio of a screen, but there's no content.
A
Okay, I'm going to have much less luck with sharing my screen, so I'll talk through the demo and you guys can trust me that it works. Basically, I opened the PR trying to explore this. Originally, at the end of Q1, Seth Jennings from my team had opened up a PR to propose adding support for pre-allocated huge pages, and in preparation for next week's face-to-face...
A
I tried to put together a prototype demonstrating some of the ideas around that PR proposal, just as a verification, especially now that we have certain things like pod-level cgroups that should make it easier. So, in the link to the issue in question, it's a very quick prototype that demonstrates cAdvisor adding discovery capability on the default huge page size and the number of pages configured for that size.
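
As a rough illustration of the discovery step described above, here is a minimal Go sketch that enumerates pre-allocated huge pages from sysfs. The paths are the standard Linux locations; the function and its names are illustrative, not the actual cAdvisor code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// hugePagesInfo maps a page-size directory name (e.g. "hugepages-2048kB")
// to the number of pages pre-allocated at that size.
func hugePagesInfo() (map[string]uint64, error) {
	const sysDir = "/sys/kernel/mm/hugepages"
	entries, err := os.ReadDir(sysDir)
	if err != nil {
		return nil, err
	}
	pages := make(map[string]uint64)
	for _, e := range entries {
		raw, err := os.ReadFile(filepath.Join(sysDir, e.Name(), "nr_hugepages"))
		if err != nil {
			return nil, err
		}
		n, err := strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
		if err != nil {
			return nil, err
		}
		pages[e.Name()] = n
	}
	return pages, nil
}

func main() {
	pages, err := hugePagesInfo()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for size, n := range pages {
		fmt.Printf("%s: %d pages\n", size, n)
	}
}
```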
A
We could support nodes with more than one huge page size, or not. And then, generally speaking, when going through this, it seems like the individual container runtimes do not yet support controls letting you set per-container huge page allocations, and so at best, if we were looking to roll this out in the near term, I thought doing per-pod huge page allocations was perfectly fine. But yeah.
A
The API in the prototype right now was basically, I mean, the history of this is that a machine used to just have one size, and so in /proc/meminfo we just saw one thing. But then, if you have machines that have two-megabyte huge pages and, you know, one-gig pages, you get to vary by size. So right now the resource request that you make would be "hugepages" plus the size of the page in KB, and that kept the naming convention that at least I see in sysfs.
A
When
you
look
to
see
how
you
traders
are
configured
basically,
the
naming
convention
would
be
the
same
okay,
and
so
that
has
a
nice
side
effect
of
as
a
pod
spec
author.
If
I
knew
I
wanted
one
gig
pages
versus
two
Meg
pages
or
16
Meg
verses,
16,
gig,
I
forget
the
other
sizes.
I
wouldn't
have
to
have
another
piece
of
information
to
have
my
pod
land
on
a
node
that
had
that
particular
thousand
elbow.
It
was
just
actually
expressed
in
combination
with
the
huge
page
request:
okay,.
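
A hedged sketch of what that size-qualified request could look like using the Kubernetes Go API types. The resource name "hugepages-2048kB" follows the sysfs naming convention the speaker describes; the exact name used by the prototype is an assumption here, and the image is hypothetical.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	container := v1.Container{
		Name:  "hugepage-consumer",
		Image: "example/app", // hypothetical image
		Resources: v1.ResourceRequirements{
			// Requesting 8 pre-allocated 2 MB huge pages. The page size is
			// carried in the resource name, so no separate node selector is
			// needed to land on a node configured with that size.
			Requests: v1.ResourceList{
				v1.ResourceName("hugepages-2048kB"): resource.MustParse("8"),
			},
			// Request equals limit, keeping the pod guaranteed for this
			// resource, per the discussion above.
			Limits: v1.ResourceList{
				v1.ResourceName("hugepages-2048kB"): resource.MustParse("8"),
			},
		},
	}
	fmt.Printf("%+v\n", container.Resources)
}
```

Because the page size is carried in the resource name, the scheduler can match the pod to a node advertising capacity for that exact size with no extra node selector, which is the side effect described above.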
A
Yeah, so basically the representation was the sheer number of huge pages at a particular size, and then nodes expressed that capacity. I'm not aware if it requires more complexity than that, but that was the general gist. And then, from an accounting standpoint, because huge pages basically represent reserved memory, I don't like the idea of having best-effort pods get higher resource guarantees for any resource, generally.
A
So
me,
it
seemed
like
first
of
all
and
guaranteed
pots
of
the
only
things
I
could
consume
these,
and
then
it
could
be
vish
when
I
was
thinking
through.
If
you
want
to
get
by
the
resource
representation,
it
could
be
that
certain
size,
huge
pages,
are
more
valuable
than
others
so
having
the
size
of
the
page
expressed
in
the
request
X
and
makes
quota
work
easier,
like
namespace
like
quota,
because
you
can
allocate
them
differently
without
any
other
decoration.
So
that
was
the
idea
there
and.
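
To make the quota point concrete, here is a sketch of a namespace quota that caps each page size independently; the names again follow the convention discussed, not the final upstream API.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	quota := v1.ResourceQuota{
		Spec: v1.ResourceQuotaSpec{
			Hard: v1.ResourceList{
				// Plenty of the common 2 MB pages...
				v1.ResourceName("hugepages-2048kB"): resource.MustParse("512"),
				// ...but only a few of the scarcer 1 GB pages.
				v1.ResourceName("hugepages-1048576kB"): resource.MustParse("4"),
			},
		},
	}
	fmt.Printf("%+v\n", quota.Spec.Hard)
}
```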
C
So I was hoping, to make the specification portable, maybe applications can specify minimum requirements; like, they can set a request for huge pages rather than limits, and they would just be slotted to the closest bucket, similar to how storage works.
F
I think both modes are needed. While this would also be interesting, there are cases where there is a hard requirement. Say, if you want to run a low-latency application, then you would want the limit to be satisfied too; just satisfying the request with whatever is available may not be the right approach. Sorry.
F
In a lot of cases you need both types of allocation, both the guaranteed model and also the best-effort model. The best effort is basically, I guess, when request is less than limit, right? Correct, that's the case; and when request is equal to limit, then it's basically an absolute guarantee.
A
Yeah, so my assumption was that any best-effort pod, so basically pods that have no requests or limits for any resource enumerated, would not be able to consume huge pages at all. So when the pod-level cgroup is manifested, the number of huge pages allowed for that pod would be strictly accounted at zero. Whereas for huge pages generally, I guess it's worth calling out: I did not think they're a resource that's safe to overcommit, typically. And so, yeah.
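
A minimal sketch of the strict accounting just described: writing a hugetlb limit into a pod-level cgroup using the cgroup v1 file layout. The cgroup path and pod name are hypothetical; the point is that a best-effort pod would get a limit of zero.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// setPodHugetlbLimit writes hugetlb.<size>.limit_in_bytes in a pod-level
// cgroup directory (cgroup v1 layout).
func setPodHugetlbLimit(podCgroupDir, pageSize string, limitBytes int64) error {
	f := filepath.Join(podCgroupDir, fmt.Sprintf("hugetlb.%s.limit_in_bytes", pageSize))
	return os.WriteFile(f, []byte(fmt.Sprintf("%d", limitBytes)), 0o644)
}

func main() {
	// Hypothetical pod-level cgroup path for a best-effort pod.
	dir := "/sys/fs/cgroup/hugetlb/kubepods/besteffort/pod1234"
	// Best-effort pods are strictly accounted at zero huge pages.
	if err := setPodHugetlbLimit(dir, "2MB", 0); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```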
A
Yeah, I guess what I was going to say is I just wanted to draw attention towards the use case and get eyes on the PR from people who haven't had a chance to look at it. As a goal, I agree with you, Vish, that getting an implementation into 1.7 would be a stretch. I would like, in the spirit of the performance-sensitive workload topics that we've been discussing, to at least get general design agreement worked out, and that's mainly what this PR was trying to motivate, and I believe it's actually been helpful.
C
Yeah, so that's what I meant; by expose, I meant Borg exposing it. Yeah, I need to check on that.
C
I mean, when it comes to virtualized environments, we really don't know what we're dealing with, right? Everything can be virtualized these days, like PCI and like NUMA, and even with the hardware trends we don't really know what we're working with. So that's a general problem.
B
We've created a new PR pointed against upstream, just so that it's easier to find and easier to link against the feature issue and the other related issues. So hopefully we can get some comments on it. It's a ton of code because some of it is generated, and, you know, the unit tests take up a lot of space.
B
So it's like a thousand lines of tests, but the PR description has a section which should be helpful, kind of as a tour, to show how the different pieces hang together. And so here we have a link straight to the gRPC protocol, which is probably a good first place for reviewers to start, and then the event...
B
...dispatcher is the new piece in the kubelet that manages the communication between the kubelet and the isolators. And then these two links are the wiring from existing components into the event dispatcher, plus unit tests, and then there's a helper library that we used in our example isolators. And then these are links to the example...
B
Isolators
there's
a
no
op
one
which
just
logs
basically
and
I,
think
it
injects
an
environment
variable
into
a
container
just
doesn't
as
a
as
an
example
and
then
there's
the
CPU
affinity
isolator,
which
we've
demoed
a
couple
times
in
this
meeting
before
yeah.
So
that's
pretty
much
it
I
can
show
you
where
the
the
current
state
of
the
events.
So
we
have
these
four
events
that
are
emitted
by
the
event
dispatcher
it.
B
Currently
we
have
a
pod
pre-start,
pod,
post-op
and
container
pre-start
and
container
post
out
and
then,
when
you
reply,
UK
and
send
these
lists
of
isolation,
controls
and
the
controls
we
have
implemented
so
far
are
these.
We
have
CPU
sets
CPUs
and
MEMS,
and
then
we
have
container
environment
variables
which
you
can
inject
into
the
container
environment.
And
then
we
also
have
the
C
group
huge
TLB
limit,
so
you
can
set
so
those
are
what's
been
implemented.
They're
really
just
examples.
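
For readers following along without the PR open, here is a hedged Go sketch of the event and control shapes just listed. The real PR defines these over a gRPC protocol; the type and field names below are illustrative, not the generated API.

```go
// Package isolator sketches the shapes of the dispatcher protocol.
package isolator

// EventKind enumerates the four lifecycle events the dispatcher emits.
type EventKind int

const (
	PodPreStart EventKind = iota
	PodPostStop
	ContainerPreStart
	ContainerPostStop
)

// ControlKind enumerates the isolation controls mentioned in the meeting.
type ControlKind int

const (
	CgroupCPUSetCPUs   ControlKind = iota // cpuset.cpus
	CgroupCPUSetMems                      // cpuset.mems
	ContainerEnvVar                       // injected into the container env
	CgroupHugetlbLimit                    // hugetlb.<size>.limit_in_bytes
)

// IsolationControl is one directive an isolator returns in its reply.
type IsolationControl struct {
	Kind  ControlKind
	Name  string // e.g. env var name, or hugetlb page size
	Value string
}

// Isolator is the interface an example isolator would implement: receive a
// lifecycle event and reply with a list of isolation controls to apply.
type Isolator interface {
	Notify(event EventKind, podName string) ([]IsolationControl, error)
}
```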
B
If we were going to break this up into PRs to target, you know, that we were actually trying to merge, we would probably start with one and then expand on them one by one in separate PRs. So yeah, that's the update. It's there for review; if you're interested, please take a look. Happy to have any comments, and this will be one of the topics on the agenda for the face-to-face we have coming up. And that's it, thanks.
A
Thanks, Connor.
J
So I just want to limit this conversation to the runtime isolation part. I think the scheduling part, like whether we do it and where, is something that should be addressed later, and the runtime part is, in our opinion, the basic thing to start with. So I'm going to start with the simple premise that no container engine or runtime works natively with GPUs, and nvidia-docker, which is the product we started with, only supported Docker.
J
The idea is that, basically, depending on the container engine or the container runtime you're going to run, you're going to do different steps, and those steps are basically the different actions you're going to take. We put them in a library, and for each container runtime we'll have a small hook that will call into the library. Is that something that makes sense to everyone?
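
A small sketch of that split, one shared library plus a thin per-runtime hook; all names here are illustrative, not the actual nvidia-docker code.

```go
// Package gpuhook sketches the library-plus-hooks split described above.
package gpuhook

// Library holds the runtime-agnostic steps: which device nodes and driver
// files must be exposed for a given GPU.
type Library struct{}

// PrepareDevice returns the host paths that must be made visible to a
// container (or VM) that was granted the given GPU.
func (l *Library) PrepareDevice(gpuID string) ([]string, error) {
	// Typical NVIDIA device nodes; the exact set is driver-dependent.
	return []string{
		"/dev/nvidia" + gpuID,
		"/dev/nvidiactl",
		"/dev/nvidia-uvm",
	}, nil
}

// Hook is what each container runtime implements: translate the library's
// generic output into that runtime's own configuration (mounts, device
// cgroup entries, VM passthrough, and so on).
type Hook interface {
	Apply(containerID string, devicePaths []string) error
}
```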
C
I'm again stating my opinion here: the runtime is not the right abstraction for hardware devices in Kubernetes. The runtimes are meant to be imperative; they're not meant to have any sort of intelligence as to how a pod or its containers are being isolated, or what sort of devices they get access to. So that's basically not the right abstraction.
C
The point is that GPUs are just one among the many devices that we need to deal with at the node level, and the kubelet, or one of the kubelet's extensions, has got to deal with a whole bunch of other resources. So we have to take GPUs into account, sure, and maybe GPUs have a quirkiness that requires some extensions at the kubelet level, but then the extension is not at the runtime. This is basically what I'm trying to say.
C
What you're saying is sort of similar to what Connor was demoing a little earlier, in that it could just be an isolator extension that's going to speak to a set of devices, and that isolator is going to take care of doing what is necessary for giving a pod and its containers access to one or more devices. So the problem here is that the...
K
It'll be helpful if you can, like, with a VM you have to do different things than with a container. So the kubelet wouldn't know that; we need to add it at the actual runtimes: NVIDIA GPUs within Docker, for example, or NVIDIA GPUs within runC, or within something else. I mean, it's totally different every time.
C
No, there's no disagreement there, and that's probably a runtime problem; maybe we're both just talking about two different problems. What I'm saying is that the kubelet is going to say what device a given pod and its containers are going to use. Are you following? So could we...
K
No, because, again, in the VM case we need to do a PCI passthrough or stuff like that; in the case of vGPU, for GPU virtualization, like I said, you also have SR-IOV, and we have emulated devices. I mean, there are a bunch of virtualization technologies for devices, and the cgroup is just one of them, sure.
C
But at this point it's still not clear what the requirements are and how that maps to the user-facing API. I'm happy that we have some agreement on what the responsibilities are at the different layers. I feel like there are a few open questions. One is: what is the exact interface between the kubelet and the CRI? Because right now that's just device files, which would then have an associated major/minor number. Well...
K
It's very much comparable: when you run your CUDA code, you can specify an environment variable, something like CUDA_VISIBLE_DEVICES, and it effectively isolates those devices for the CUDA application. And so for containers it would be exactly the same thing: you would say NVIDIA_VISIBLE_DEVICES, and then the NVIDIA runtime will see that and isolate those devices accordingly in the implementation.
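
A hedged illustration of that environment-variable pattern using the Kubernetes container type: the device list rides in NVIDIA_VISIBLE_DEVICES and a GPU-aware runtime interprets it before starting the container. The image name is hypothetical.

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	c := v1.Container{
		Name:  "cuda-app",
		Image: "example/cuda-app", // hypothetical
		Env: []v1.EnvVar{
			// Same idea as CUDA_VISIBLE_DEVICES for a bare CUDA process:
			// a GPU-aware runtime reads this and exposes only GPUs 0 and 1
			// to the container.
			{Name: "NVIDIA_VISIBLE_DEVICES", Value: "0,1"},
		},
	}
	fmt.Printf("%+v\n", c.Env)
}
```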
C
On the other hand, you're saying that there would be use cases around virtual GPUs. I'm more interested in knowing how you plan on exposing such features; like, if you're going to share a GPU across five or six different pods, how is that being exposed to end users?
C
Okay, I'm just trying to think out loud at this point, to better understand what exactly should be at the CRI level. I think we need a list of the different means of consuming GPUs. The only use case I know we support now is providing access to a complete device, in the form of an actual device node with a major/minor number. If there are additional scenarios, I think we need a list.
J
It would just, like, ask for a certain number of GPUs and then specify the different constraints. Those constraints would be, for example, the compute capability or the memory. I think the simple use case would be the memory: a user would say, I want two GPUs with at least eight gigs of memory. And of course that's not possible today; you can't express that with labels or anything, because if your node has a mix of GPUs it doesn't work.
K
Well, not really. I mean, the requirements the user specified would basically let the kubelet select the GPU, but apart from that, our input is just: here is this device, and we'll handle it. That's about it; how you deal with the devices is up to the implementation, okay?
C
So it's less about isolation, more about "expose this device to this given set of containers". Okay, yeah. That seems sane to me. I think we still have to have a separate conversation about what the API should be, because the environment variables don't seem nice, but that's a much easier discussion. Okay, yeah.
J
I think the basic example would be the one I just presented, which is: a user wants two GPUs with at least eight gigs of memory, and I mean that's something that can only be solved at the scheduler level. And I think one of the discussions we had in the SIG, and one of the concerns we had, is: what is the minimum setup?
J
I'll take this as a yes. And after that, we'll go through the scheduling: the scheduler would apply all its prioritization functions and then call the matching extenders, and finally it'll come to a decision. We're basing ourselves on a scheduler PR that, at this point, delegates the bind responsibility to an extender, and that's to do the resource management part, where you need the extender to be aware of which node has been picked in order to do proper resource management. Is that something everyone's following?
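
A minimal sketch of that extender flow: the scheduler calls out over HTTP after its own predicates and priorities, and the extender also owns the bind step so it learns which node was picked. The types below are trimmed stand-ins for the scheduler's extender API, not the exact upstream structs, and the inventory helpers are hypothetical.

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// filterArgs is a cut-down stand-in for the scheduler's ExtenderArgs.
type filterArgs struct {
	PodName   string   `json:"podName"`
	NodeNames []string `json:"nodeNames"`
}

// filterResult lists the nodes that can satisfy the pod's GPU constraints.
type filterResult struct {
	NodeNames []string `json:"nodeNames"`
}

// bindArgs is a cut-down stand-in for the scheduler's ExtenderBindingArgs.
type bindArgs struct {
	PodName string `json:"podName"`
	Node    string `json:"node"`
}

// nodeHasFreeGPUs and reserveGPUs are hypothetical helpers standing in for
// a real device inventory.
func nodeHasFreeGPUs(node string, count int, minMemBytes int64) bool { return true }
func reserveGPUs(node, pod string)                                   {}

func main() {
	// Filter: keep only nodes whose free GPUs meet the constraints, e.g.
	// "two GPUs with at least 8 GiB of memory each".
	http.HandleFunc("/filter", func(w http.ResponseWriter, r *http.Request) {
		var args filterArgs
		if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		feasible := []string{}
		for _, n := range args.NodeNames {
			if nodeHasFreeGPUs(n, 2, 8<<30) {
				feasible = append(feasible, n)
			}
		}
		json.NewEncoder(w).Encode(filterResult{NodeNames: feasible})
	})

	// Bind: because the extender performs the bind, it learns which node
	// was chosen and can record the GPU assignment before the pod starts.
	http.HandleFunc("/bind", func(w http.ResponseWriter, r *http.Request) {
		var args bindArgs
		if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		reserveGPUs(args.Node, args.PodName)
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8888", nil))
}
```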
J
It could be, basically, external hardware constraints, not CPUs, you know. I think so, and the supporting part for GPUs is described a bit at the beginning of the document, and basically, if it's not done right, then it's usually not worth launching your task on GPUs that are not next to each other. So, for example, on the four GPUs with PCIe...
K
The topology is really complicated, and we would like, for the MVP, not to care about topology. It's a broader topic that needs to involve a lot of vendors, and, I mean, basically to make a topology decision you need a view of the whole system, not just the GPUs. So for now, if we could just focus on having basic constraints for the GPU itself, like the requirements on memory or compute capability; if we could just do that as an MVP, that would be great right now.
F
Yeah, I mean, yes: on one side you can argue topology is really complicated, but, you know, in fact other ecosystems, for example OpenStack, if we can model it as a network interconnect, it's pretty much, you know, with that approach it's not horrendously complicated. I agree there is some multi-vendor environment work involved, but it can be modeled reasonably and it can be solved in an open-source framework. That's what I found.
A
I think we're coming up on the end of the hour here, so I think the best thing to do is for everyone to get a chance to read all these materials more closely before next week, and we'll proceed from there. I'm looking forward to everybody participating next week; I'm excited to see a lot of participation in this space since the start of this new year. So, yeah.