From YouTube: Kubernetes Resource Management WG 20180314
A: Okay everyone, welcome to the March 14th edition of the Resource Management Working Group community meeting. It looks like the main agenda today is to discuss RDMA and device plugins. We're going to start with our guest — I hope I said your name correctly — and he's going to talk about his experience with device plugins for GPUs and RDMA. Go ahead, the floor is yours. Take your time.
E: Sorry to interrupt here. We would like to use our RDMA cgroup as well — his plugin currently does not use this — and I have slides that I would like to walk through once he's done with the demo, to first give a big picture of what modes and knobs we have at the RDMA level, and the requirements that we would like to integrate into Kubernetes at the device plugin level, as well as in CNI and other places.
B: Right — again, it's very, very simple, I'd say. First you install the driver software and build the device plugin, then you can start the device plugin manually or deploy it with Kubernetes. Then you can run a container — you can run a container as in this example.
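To make that last step concrete, here is a minimal sketch in Go (using the Kubernetes API types) of a pod that requests one RDMA device through an extended resource. The resource name rdma/hca, the image, and the command are assumptions for illustration; the actual demo may use different names.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Hypothetical extended-resource name advertised by the RDMA device plugin.
	rdma := corev1.ResourceName("rdma/hca")

	pod := corev1.Pod{
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
		ObjectMeta: metav1.ObjectMeta{Name: "rdma-test"},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:    "perftest",
				Image:   "example/rdma-perftest", // placeholder image
				Command: []string{"ib_write_bw"},
				Resources: corev1.ResourceRequirements{
					// Requesting one device lets the kubelet ask the plugin to
					// map the matching /dev/infiniband entries into the container.
					Limits: corev1.ResourceList{rdma: resource.MustParse("1")},
				},
			}},
		},
	}

	out, err := yaml.Marshal(&pod)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```

Applying the marshalled manifest with kubectl would let the scheduler place the pod only on nodes where the device plugin has advertised that resource.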
E: Maybe I'll take a quick first question. So for RDMA there are two things needed. One is the kernel-level drivers, which usually get loaded as part of the PCI device; the networking driver gets loaded for Ethernet as well as for the InfiniBand network. So the udev system and the kernel's auto-load bring up these drivers. On top of that there are userspace libraries and applications which access these RDMA devices.
D: I'll just say one more thing: on our distribution — sorry, Red Hat's distribution — there is a systemd unit called rdma that gets all of the drivers lined up, assuming they're upstream. I notice here you have something called Mellanox OFED, so I'm wondering whether you are using the MOFED user space in this demo — like the customized Mellanox OFED — or what you're using.
E: I think so. Usually Mellanox OFED brings lots of new features every quarter or so, and the OS-based packages usually don't get those features at the same pace. So users usually uninstall the drivers and userspace packages that come with the OS and install the Mellanox OFED software, which is equivalent to what they have in the OS.
E: Let me switch to a different screen — okay, let's see... great, okay, all right. So let's go over it. Before I jump to the slides, let me step back a little bit on RDMA. RDMA is a kernel-bypass method that allows an application to talk directly to the hardware, bypassing the networking stack and system calls, and that allows us to go at a much, much faster rate: two nodes can talk to each other in 0.7 microseconds to exchange messages from application to application. This is done through the RDMA subsystem, which has existed in the kernel for more than twelve years.
E: Since it's a kernel-bypass method, various security aspects have been added to make sure that the right isolation and the right access are provided to each containerized environment. So if we jump to the slides here, there are two use cases. The first one is that we have customers who start using it in this mode: there is one IB device per pod, and all the containers share this InfiniBand or RDMA device. InfiniBand and RDMA are usually used interchangeably.
I would update this slide to say RDMA, but for this conversation we can just take it that it's an RDMA device. So in this mode we would like to have one RDMA device, shared or dedicated, per pod and all of its containers. Now, from an orchestration point of view, what constitutes an RDMA device? First, an RDMA device consists of one character device, under /dev/infiniband — uverbs0, 1, 2, 3, with the number based on the device. Then it has got sysfs files.
The third one is the RDMA CM device. This basically allows applications to establish connections between pods and between containers, so they use this device. This device is common across all the pods and all the containers — there is one such character device per system, through which all the applications perform the listen, bind, connect and similar calls, much like what they would do with the socket API.
The fourth thing that constitutes an RDMA device is that there is one net device for each RDMA device, and this net device lives in the network namespace of the pod, shared among all its containers. The intent is for Kubernetes to isolate these four entities of an RDMA device on a per-container basis — I'm sorry, on a per-pod basis.
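To make those four pieces concrete, below is a small discovery sketch in Go of how a plugin-like agent might enumerate them on a node. The sysfs paths used (/sys/class/infiniband_verbs/uverbsN with its ibdev file, and /sys/class/infiniband/&lt;ibdev&gt;/device/net) are my assumption of a typical upstream layout; drivers and distributions can differ.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// rdmaDevice groups the entities discussed above for one adapter.
type rdmaDevice struct {
	CharDev string   // e.g. /dev/infiniband/uverbs0
	Sysfs   string   // e.g. /sys/class/infiniband_verbs/uverbs0
	CMDev   string   // /dev/infiniband/rdma_cm, one per system, shared
	NetDevs []string // net devices backed by the same adapter, e.g. ib0
}

func discover() ([]rdmaDevice, error) {
	var devs []rdmaDevice
	uverbs, err := filepath.Glob("/sys/class/infiniband_verbs/uverbs*")
	if err != nil {
		return nil, err
	}
	for _, sysfs := range uverbs {
		name := filepath.Base(sysfs) // uverbsN
		d := rdmaDevice{
			CharDev: "/dev/infiniband/" + name,
			Sysfs:   sysfs,
			CMDev:   "/dev/infiniband/rdma_cm",
		}
		// The uverbs entry names its parent RDMA device (e.g. mlx5_0); that
		// device's net interfaces are listed under its PCI parent in sysfs.
		if ib, err := os.ReadFile(filepath.Join(sysfs, "ibdev")); err == nil {
			netDir := filepath.Join("/sys/class/infiniband",
				strings.TrimSpace(string(ib)), "device", "net")
			if links, err := os.ReadDir(netDir); err == nil {
				for _, l := range links {
					d.NetDevs = append(d.NetDevs, l.Name())
				}
			}
		}
		devs = append(devs, d)
	}
	return devs, nil
}

func main() {
	devs, err := discover()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, d := range devs {
		fmt.Printf("%+v\n", d)
	}
}
```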
The second point is with regard to this character device, and the third one is that the RDMA CM device should be accessible to all: the kernel implementation of that character device ensures that whenever it is accessed, it provides access only to the network devices that belong to that namespace. So this device is okay to share among all the pods and container instances.
E: The libraries do make use of this character device file. Typically, they open the character device and then issue read and write calls on it, and that sets up the connection between the two nodes. Once that connection is set up through the character device, all the data-path operations to send and receive messages are done by the user talking directly to the hardware.
E: Okay, moving on to the second use case. Here there is a limitation on scalability — on how many containers and pods we can run — depending on how many devices we have in the system. This is typically done through SR-IOV or another virtualization mode, and some users would like to run at really large scale, more than three digits. Use case one can do up to a hundred or so, or a smaller scale.
E: The second use case is where the user wants to run thousands of them. One RDMA device can typically handle really large-scale operation — the servers are in the range of 256 to 500 GB of RAM or more, and the device can scale up to a very large number, millions of connections — and in this mode we would like to have one device that is shareable across multiple pods.
E: So in this case, what you will see is that there is only one device. Previously we saw multiple character devices, multiple sysfs entries and so on; here there would be only one such device and one sysfs entry, or at most two of them. In this case the isolation of the resources should be done through the RDMA cgroup, which says how many resources of one device you can consume, and in this case there would be one network device — one net device — per namespace.
That would help establish the connections between the containers and pods. So the net device is pretty much the same between the two modes; the only big difference we see here is one character device and one sysfs entry versus multiple of them. This is a relatively less secure model, but there are use cases where people tend to have a single tenant and...
E: Very much — and I mean Tencent, for instance, is one of the perfect examples where they are using it outside of the labs, you know, and I think there are other MPI users too. It's just that I'm probably not in sync with them. But you know, even if they don't go for a scale of thousands, I would still expect that they might come up with a few — four to eight pods running on a single system.
D: I think the thousands thing is — it's funny, because we hear these huge high-density requirements and we're like, wait, how come process isolation isn't enough for you? I mean, once you start talking about all the additional kubelet plumbing on top, you're paying a lot of overhead, and I'm not really sure that people fully calculate the upsides and downsides of these things.
I: So this is Renaud from Nvidia speaking, and I'm actually working on a design document for sharing devices; I'm about to release that document either today or early tomorrow, and there are two parts of it that specifically match very much what you're suggesting here. The first one is sharing a device between multiple containers inside the same pod — that's something we do have a use case for with GPUs at Nvidia, and I think you provided an additional use case.
E: Yeah, and as kernel features grow it'll eventually become more secure; right now this is the current state. Currently, what we have to make them secure is various kernel knobs: one is cgroups, to isolate the resources, and there are SELinux policies that can be set up to isolate who can talk to whom — that is done through the SELinux policy setup, which is part of the 4.14 kernel.
E: There are two resources that the cgroup allows you to control right now. The RDMA level has queues — queue pairs and completion queues — through which it does the RDMA transactions to send and receive messages, and that's one type of resource, which we call HCA objects. These are all the objects of the HCA: shared queues, CQs, queue pairs, memory regions, protection domains — there are six or seven such objects. These are controlled through the HCA objects limit of the cgroup, and the second one is called HCA handles.
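As a sketch of what those two limits look like with the cgroup-v1 rdma controller: a line of the form `<device> hca_handle=<N> hca_object=<N>` is written to the group's rdma.max file. The cgroup path below is an assumed example and is not something the kubelet wires up today.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// setRDMALimits writes per-device limits into an rdma cgroup (cgroup v1).
// Format per the kernel cgroup docs: "<device> hca_handle=<N> hca_object=<N>".
func setRDMALimits(cgroupPath, device string, handles, objects int) error {
	line := fmt.Sprintf("%s hca_handle=%d hca_object=%d", device, handles, objects)
	return os.WriteFile(filepath.Join(cgroupPath, "rdma.max"), []byte(line), 0644)
}

func main() {
	// Assumed example group for one pod; a real integration would use the
	// pod's actual cgroup as created by the runtime.
	group := "/sys/fs/cgroup/rdma/pod-example"
	if err := os.MkdirAll(group, 0755); err != nil {
		panic(err)
	}
	// Allow this group 2 HCA handles and 2000 HCA objects on mlx5_0.
	if err := setRDMALimits(group, "mlx5_0", 2, 2000); err != nil {
		panic(err)
	}
	fmt.Println("rdma.max configured for", group)
}
```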
A: Okay, I apologize for my laptop. What I was saying was that the device plugin is the way to go for exposing any sort of countable hardware devices, and when it comes to hardware devices that don't need scheduling, I would recommend going down the CSI route or just using hostPath volumes, because those volumes were meant to also expose character device files, for example, or block device files. It doesn't really matter, as long as you don't need any sort of scheduling primitives.
Go with that. And I was also saying that unless we get to a point where some subset of these hardware devices becomes so ubiquitous that workload portability becomes a blocker for Kubernetes adoption, I don't think we really need to prioritize resources or anything of that sort for these kinds of special devices.
H: I think that's perhaps a separate discussion, by the way. I even agree that if all people want to do is to create a character device in the container, using the device plugin is perhaps a bit heavyweight, because you're on your own to manage your daemons — there should be some easier way to run the device plugin as a service.
But if you also want to do other things, like monitor the device health or automatically install the drivers and such, perhaps you can use the device plugin. But I agree with him: if all you want to do is to import the device into the container, we may want to look at other ways that are more straightforward for a vendor to do this.
E: One is the character device; the second one is the sysfs mount points; the third one is a query to basically allocate a device or a set of devices when the pod is started. I think all three requirements sort of tend to be matched by the device plugin right now, but I would let others comment as well.
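As a rough sketch of how those three requirements map onto today's device plugin API, the fragment below shows only an Allocate handler that passes the uverbs character devices and the shared rdma_cm device, and bind-mounts the sysfs tree. The import path, the device-ID naming, and the mount choices are my assumptions about a typical v1beta1 plugin, not the exact plugin being discussed.

```go
package rdmaplugin

import (
	"context"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// rdmaPlugin is a fragment of a device plugin server; only Allocate is shown.
type rdmaPlugin struct{}

// Allocate maps the requested uverbs character devices, the shared rdma_cm
// device, and the sysfs tree into each container that asked for the resource.
func (p *rdmaPlugin) Allocate(ctx context.Context,
	req *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {

	resp := &pluginapi.AllocateResponse{}
	for _, creq := range req.ContainerRequests {
		cresp := &pluginapi.ContainerAllocateResponse{}
		for _, id := range creq.DevicesIDs {
			// Device IDs are assumed to be uverbs names such as "uverbs0".
			cresp.Devices = append(cresp.Devices, &pluginapi.DeviceSpec{
				HostPath:      "/dev/infiniband/" + id,
				ContainerPath: "/dev/infiniband/" + id,
				Permissions:   "rw",
			})
		}
		// The connection-manager device is shared by every container.
		cresp.Devices = append(cresp.Devices, &pluginapi.DeviceSpec{
			HostPath:      "/dev/infiniband/rdma_cm",
			ContainerPath: "/dev/infiniband/rdma_cm",
			Permissions:   "rw",
		})
		// Expose the sysfs entries read-only so userspace libraries can
		// look up device attributes.
		cresp.Mounts = append(cresp.Mounts, &pluginapi.Mount{
			HostPath:      "/sys/class/infiniband_verbs",
			ContainerPath: "/sys/class/infiniband_verbs",
			ReadOnly:      true,
		})
		resp.ContainerResponses = append(resp.ContainerResponses, cresp)
	}
	return resp, nil
}
```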
E: I agree. And for the second use case, since there is only one device that we want to share among all the containers, there's practically no need for a device plugin. We just need a knob at the kubelet or Kubernetes level to allow mapping that character device into the runtime environment, and if that is done, I think that is sufficient. In that case we would only use the cgroup and the P_Key policy.
E: Sure, I want to go through just the last three slides — most of this is straightforward, and I have a document as well that elaborates more. So, in terms of requirements, the first issue that we face today is that the device plugin does not know anything about pods, and therefore when a pod is allocated it just sends an allocate request — "give me one device" — without knowing anything about the pod ID or pod name. So what happens is this.
When multiple containers in the pod run, new devices get allocated for each container, and that's not desired. So we would like to extend the device plugin API so that the allocate request tells us the pod name or pod ID, and the device plugin can then do one device allocation for all the containers of the pod.
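To illustrate the kind of extension being asked for, here is a purely hypothetical sketch — none of these fields exist in the real v1beta1 API. It only shows how pod identity in the allocate path would let a plugin hand out one allocation per pod and free it when the pod goes away.

```go
package rdmaplugin

// Hypothetical extension sketch: today's v1beta1 AllocateRequest carries only
// device IDs per container. The proposal discussed here would add pod identity
// so a plugin can hand out one device per pod instead of one per container.

// extendedContainerRequest mirrors ContainerAllocateRequest plus pod identity.
// The extra fields are illustrative only; they are not part of the real API.
type extendedContainerRequest struct {
	DevicesIDs []string
	PodUID     string // proposed: lets the plugin group containers of one pod
	PodName    string
	Namespace  string
}

// allocation records what was handed out for a pod so later containers of the
// same pod, and the eventual cleanup when the pod goes away, can reuse it.
type allocation struct {
	CharDevs []string
}

type podAwarePlugin struct {
	byPod map[string]*allocation // key: pod UID
}

func newPodAwarePlugin() *podAwarePlugin {
	return &podAwarePlugin{byPod: map[string]*allocation{}}
}

func (p *podAwarePlugin) allocateForPod(req extendedContainerRequest) *allocation {
	if a, ok := p.byPod[req.PodUID]; ok {
		// Second or later container of the same pod: reuse the device(s).
		return a
	}
	a := &allocation{}
	for _, id := range req.DevicesIDs {
		a.CharDevs = append(a.CharDevs, "/dev/infiniband/"+id)
	}
	p.byPod[req.PodUID] = a
	return a
}

// releasePod frees the pod's devices once the pod is deleted, so the same
// resources can be allocated again.
func (p *podAwarePlugin) releasePod(podUID string) {
	delete(p.byPod, podUID)
}
```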
A: This sounds a lot like the CNI side, and we haven't really solved how we take care of accounting for network devices — that is currently an open problem. I mean, if you can find some way to work around it for now, by all means go ahead with that, but before opening up the API any further, I personally would like to see the overall picture and overall alignment, both CNI and device plugins, before extending the device plugins for this specific use case.
E: So when the CNI plugin is started, the CNI plugin moves one network device — let's see, if you look at this diagram, for each InfiniBand device (I should have shown it here) there is also one net device. So the CNI plugin moves the network device into that namespace, but it doesn't have the details of the uverbs0/1/2/3 devices that it also needs in order to do a bind.
E: Yes — so you would like to avoid the hacks, and we can discuss the proposal in future meetings. The idea we were thinking of is to have the pod ID exposed to the device plugin, for two reasons. One is for allocation, and the second is to free that resource when the pod goes away, so that the same resource can get allocated again — therefore knowing the pod ID is a good thing to have.
The second reason is that if the CNI plugin also knows about the pod ID, then, along with the device parameters, it can map the right network device — the one that belongs to this character device — into the network namespace. So maybe, after this meeting, we should look at his document where he has described these details.
H: Definitely — I think you probably want to work with him to make sure your use case is also captured in his document, because I think there are a lot of missing pieces to really make it work: setting up the net device under the network namespace so that the pod can actually use it.
E: If you look at the slide here: if the CNI plugin does this based on the network device, then for this network device it needs to know what the character device is and what the sysfs files are. It can possibly find that out by itself, but it still needs access to the mount namespaces of the host and the container in order to do the bind mount.
D: I don't think there's a lot of debate about it — some minor concerns around making sure we have Bobby in the room when we need him, as it relates to any more intense scheduler topics. Yeah, there's a finalized agenda on the mailing list; I sent it a couple of days ago — well, last week, I believe. I just dropped the list and the link in the chat, so now, as Jinyang and others are sending docs to the list, I'm adding them to the individual topics.
A: It's not just that — I think we need representation from scheduling in general. I think they have a meeting tomorrow; maybe, Jeremy, you and I should actually go to that meeting, talk about this case, and make sure that enough eyeballs are on this for the face-to-face meeting, and then we can have more discussion. Yeah.
A: The point of this working group is that we actually have representation from different SIGs, and if, for example, you restrict this to a single SIG, then we'll just operate as a SIG, right? So I'm just saying that we need to do our own due diligence in advertising across all the stakeholder SIGs.