Kubernetes WG Resource Management, 23 May 2018

From YouTube: Kubernetes Resource Management WG 20180523

Description

Meeting Agenda:

https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU

A

Okay, so the recording is now turned on. Welcome to yet another meeting of the Resource Management Working Group; today is the 23rd of May. We have quite a few topics on the agenda. I think the first one is SR-IOV device plugins. I'm not sure who that is, but I'm assuming you're on this call. Can you speak up and get going with the topic?

B

Hi, this is Abdul here, and Louise should be here as well. Louise will be presenting this proposal and I'll run the demo.

B

Yeah I can hear you.

A

Yeah, we can see you now or I can see it now. Yeah.

B

We can see it is that.

C

So basically, we wanted to bring this forward to start a discussion on creating a unified story for network devices. Our plan is to present to both this Resource Management group and SIG Network, to get a unified view on whether this is a good way forward. The plan today is to talk about our SR-IOV implementation, but with the view that this could work for other networking devices; this is just an example implementation.

C

So if there are any questions, just stop me and I'll take those. The motivation behind starting this was that CNI is very good at what it does, but it does not cover the entire lifecycle of a network device. For the cases where there's limited network availability, in our case SR-IOV VFs, there's no mechanism to advertise these and to have the scheduler account for them when pods are being scheduled.

C

So if you are running your pods with just CNI, with SR-IOV CNI enabled, and you run out of VFs, your pod is going to fail, and you'll have to look into the kubectl describe output for the pod to see why it failed, as opposed to seeing that the scheduler is out of VFs. So that's the first major reason: we wanted to have a more granular view of limited network availability.

C

Then as well, there's no NUMA alignment with CNI. A lot of the latency-sensitive applications that we have from customers use CPU pinning and also require that the network traffic comes from the same NUMA node as the CPUs they're pinned to. With the NUMA manager proposal, we have device plugins and the CPU manager coordinating, so that would solve that issue for us. And then lastly, CNI has no mechanism to manage device mounts.

C

As you know, one of our use cases is DPDK, and we currently have to run a privileged pod, so it can see all the devices, as opposed to only the device that was allocated to it, which we could do with device plugins.

C

So there have been a couple of other pieces of work done in this area. Vikas has a proposal which also outlines a unified network device plugin and CNI plugin, and then there was a proposal by Fabian and Peter on extending the resource API that's returned to kubelet to account for networking resources. So our proposal isn't brand new. It builds on these and tries to create one unified story for network devices. There are four components to the project.

C

There's the SR-IOV network device plugin, which is a device plugin implementation for SR-IOV. Then there's a CNI shim, which is basically a bridge to communicate between the device plugin and the CNI, and it does that via gRPC. Then there's a meta plugin, which in our case is Multus, which enables us to have multiple interfaces in our pod. And finally, we have SR-IOV CNI. With the work in the Network Plumbing Working Group on network objects, we've aligned this project with that work as well.

C

So this is the block diagram of our components. The SR-IOV device plugin is responsible for discovering the SR-IOV VFs available on the node and then sending those back to kubelet, advertised as extended resources on the node. Then on Allocate, the SR-IOV device plugin will store the mapping of pod information to VF information, which can be used later. Here we're proposing extending the device plugin API to pass the pod information to the SR-IOV device plugin, as opposed to reading the checkpoint file.
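As a rough illustration of the bookkeeping described in this turn (not the project's actual code), the pod-to-VF mapping kept at Allocate time could be modelled as below. The names `VFInfo` and `AllocationStore`, the use of the pod UID as the key, and the sample values are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class VFInfo:
    """One SR-IOV virtual function (hypothetical shape, for illustration)."""
    pci_address: str  # PCI address of the VF
    pf_name: str      # physical function the VF belongs to
    vf_id: int        # index of the VF on that PF


class AllocationStore:
    """Remembers which VFs were handed to which pod when Allocate was called."""

    def __init__(self) -> None:
        self._by_pod: Dict[str, List[VFInfo]] = {}

    def record(self, pod_uid: str, vf: VFInfo) -> None:
        # Called from the (hypothetical) Allocate handler.
        self._by_pod.setdefault(pod_uid, []).append(vf)

    def lookup(self, pod_uid: str) -> List[VFInfo]:
        # Called later, when the CNI side asks which VF belongs to a pod.
        return list(self._by_pod.get(pod_uid, []))

    def release(self, pod_uid: str) -> None:
        # Forget the mapping once the pod is gone.
        self._by_pod.pop(pod_uid, None)
```

Keeping the lookup keyed by pod is what lets a later CNI call, which only knows the pod, find the VF that kubelet already allocated.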

C

We just think it's a cleaner option if the information can be passed from the kubelet to the device plugin. Then Multus is responsible for delegating to the CNIs. It reads the network CRDs referenced in the pod spec and then calls the relevant CNIs. In our example case we use flannel and the CNI shim. What the CNI shim does is create a gRPC connection to the SR-IOV device plugin and, using the container args passed into the CNI shim,

C

it gets the pod name and pod namespace and sends those to the device plugin. The device plugin can then use this information to get the pod UID, look up the correct VF mapping, and send that back to the CNI shim. The CNI shim then sends that information to SR-IOV CNI, which does the actual plumbing of the VF into the pod's network namespace.
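The first step of the flow described above, recovering the pod identity from the arguments handed to the CNI plugin, can be sketched as follows. This assumes the container runtime passes `K8S_POD_NAMESPACE` and `K8S_POD_NAME` in the semicolon-separated `CNI_ARGS` string, as Kubernetes runtimes conventionally do; the function names are hypothetical and the gRPC call to the device plugin is left out.

```python
from typing import Dict, Tuple


def parse_cni_args(cni_args: str) -> Dict[str, str]:
    """Split a CNI_ARGS-style string ("K1=V1;K2=V2") into a dict."""
    pairs: Dict[str, str] = {}
    for item in cni_args.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            pairs[key.strip()] = value.strip()
    return pairs


def pod_identity(cni_args: str) -> Tuple[str, str]:
    """Return (namespace, name) of the pod this CNI invocation is for."""
    args = parse_cni_args(cni_args)
    # A KeyError here means the runtime did not supply the assumed keys.
    return args["K8S_POD_NAMESPACE"], args["K8S_POD_NAME"]
```

A shim built this way would then send the namespace and name to the device plugin and wait for the VF details in the reply.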

C

So that's pretty much it for the presentation; I said I'd keep it short and sweet. Then Abdul can go on and show the demo. Are there any questions before we go into the demo?

C

Maybe we go ahead with the demo, it's not too long, and then we can take all the questions, if that's okay. I'll stop sharing, and Abdul, if you want to share, yeah.

B

Thanks, Louise.

B

Let me know when you can see my screen.

B

Can you see the screen yeah.

B

So I will just briefly go over the configuration that I have on my setup. As Louise mentioned, we are using Multus as our CNI plugin. That enables us to add additional networks on top of the default Kubernetes network. Here we can see the Multus configuration for the CNI. This configuration calls Multus, and Multus will call other CNI plugins as required.

B

For this we are using the latest version of Multus, which follows the standard recently proposed and implemented by the Network Plumbing Working Group. With this new version we have a standard way of specifying the specs for the CNI and the plugin. As the next file, here we have the YAML files that describe the network objects for the CNI shim network, and then we have this simple pod spec. So here we can see

B

we are using this CNI shim network here as an annotation, and then here we are requesting one SR-IOV VF for this pod. So I'll go on to actually running this pod. Let's start the device plugin.
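The pod spec shown in the demo is not captured in the recording, so the following is only a guess at its general shape: a pod that names an additional network in an annotation and requests one VF as an extended resource. The annotation key (the Network Plumbing Working Group convention), the network name, and the resource name are all assumptions, and the manifest is built as a plain Python dictionary for illustration.

```python
from typing import Any, Dict

# Assumed annotation key; the exact key depends on the Multus version in use.
NETWORKS_ANNOTATION = "k8s.v1.cni.cncf.io/networks"


def build_pod_manifest(name: str, image: str, network: str,
                       resource_name: str, vf_count: int = 1) -> Dict[str, Any]:
    """Build a pod manifest that asks for an extra network and for VFs."""
    quantity = str(vf_count)
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            # Tells the meta plugin which additional network to attach.
            "annotations": {NETWORKS_ANNOTATION: network},
        },
        "spec": {
            "containers": [{
                "name": name,
                "image": image,
                "resources": {
                    # Extended resources are requested in whole units,
                    # with requests equal to limits.
                    "requests": {resource_name: quantity},
                    "limits": {resource_name: quantity},
                },
            }],
        },
    }
```

Note that the pod has to mention the network twice, once as an annotation and once as a resource request, which is the duplication the presenters later say they would like to remove.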

B

So, as you can see at the bottom right-hand side, the SR-IOV device plugin pod is running. I will just tail the log so we can see the interaction and what's happening.

B

So, as you can see, the device plugin discovered the SR-IOV capable NIC, and then it found the VFs here and their corresponding PCI addresses. You can see ListAndWatch advertising these devices to the API. So now I'm going to create our example pod.

B

So you can see the pod is being created. It's running, so I will look at the network interfaces in the pod.

B

Here are the IP addresses. As you can see, this is the default Kubernetes network, with the default IP address for this cluster, and this is our additional network, which is the SR-IOV VF, and here is the IP address

B

that we gave in the IPAM configuration in the network object that we defined before. I can just go over there. As you can see, this is the subnet mask and IP address range that we gave, and the pod actually got an IP address from that range, along with all this IPAM information here.

B

You can see the default gateway and the routing information as well, and we see that the SR-IOV CNI plugin successfully added the routes for this interface too. We can also see, at the top of the screen, that the device plugin got the request from the CNI plugin and sent back the VF that was allocated for this pod when the pod was created.

B

So this is the phase one implementation we have done, and this demo shows that. As you can see in our proposal document, there is a phase two, and there are a few other tasks we need to do. It would be great to get some feedback from this community so that we can improve it. I think that's the end of this demo. Are there any questions regarding the demo or the configuration?

A

So one of the claims in the doc is that you need a one-to-many or many-to-many mapping between network CRDs and device plugins. That sort of feels like an anti-pattern, in the sense that as of now each device plugin is exposed through a compute resource name, so there's a one-to-one mapping there. I wonder why you need that many-to-many mapping.

D

Abdullah, the any and we have done yeah I- can hear you.

D

So I was working with both Louise and Abdul on this one too. On the question regarding the one-to-one mapping: for networking, each and every network object is quite different. In this case we have this shim CNI, which acts just like a gRPC client connecting to the server. It connects to the device plugin, gets the PCI address, and gives that information to the SR-IOV CNI. What happens is that each network object will have different IPAM information or a different VLAN ID.

D

So that's why we said it's not a one-to-one mapping. The device plugin in this case is like an intelligent agent which gives us the resources, but at the end of the day the network object is not one-to-one with the device plugin, because we're just getting the resource information and after that we're putting the network configuration on top of those resources. So those resources will be different for each network. That's the difference from the previous proposals.

D

If you look at them, they have a one-to-one mapping, kind of a thick plugin model, that they proposed.

D

Yeah, I think the resource name is the kind of socket connection through which the shim CNI will communicate with the device plugin. Right, Louise and Swati? Am I saying that right?

B

So we are passing in the socket name to the shim CNI. I will just show that quickly. As we can see here, we have this device plugin name in this field, SR-IOV net. Based on that, we connect to this device plugin using this.

A

Resource name so I would naturally.

B

So the CNI shim doesn't need to know about the resource name here. All it needs to know is which device plugin it needs to connect to, and the device plugin will actually do the resource allocation itself. So here we only want to communicate from the CNI shim to that device plugin. In this case, every device plugin...

A

Translate this name in.

B

We are using the canonical location, the kubelet /var/lib/kubelet/device-plugins location, and the SR-IOV net socket resides in there. So we make use of that location, and using this name we are assuming that there is a socket available to connect to. Okay.
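To make the lookup concrete: assuming the canonical location mentioned here is kubelet's device plugin directory, `/var/lib/kubelet/device-plugins`, and assuming the socket file is simply the configured plugin name plus a `.sock` suffix (the suffix and the function name are guesses for illustration), the shim could derive the socket path like this.

```python
import posixpath

# Directory where kubelet expects device plugin sockets to live.
DEVICE_PLUGIN_DIR = "/var/lib/kubelet/device-plugins"


def plugin_socket_path(plugin_name: str, base_dir: str = DEVICE_PLUGIN_DIR) -> str:
    """Map a device plugin name from the CNI config to a Unix socket path."""
    if not plugin_name or "/" in plugin_name:
        # Refuse names that could escape the plugin directory.
        raise ValueError("invalid device plugin name: %r" % plugin_name)
    # The ".sock" suffix is an assumption made for this sketch.
    return posixpath.join(base_dir, plugin_name + ".sock")
```

With this convention the shim needs only the plugin name, not the resource name, which matches the point made in the discussion that the shim is only used to find and connect to the right plugin.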

A

So this is mainly for discovery, and for connecting CNI plugins to the corresponding device plugins.

A

The other thing is network resources: are they per container?

B

As we know, network resources are shared by the pod, across all its containers. I think we also mentioned this in our proposal document: this is one of the challenges or limitations that we need to address. For the moment we are defining the resource request on a container, but we know the network is actually shared by the pod across all the containers.

B

I think this is the mechanism we currently have with the kubelet. There is very little room for us to change anything in that scope.

B

We don't have a defined particular requirement. We still need to figure out how best to approach this. It works now as it is, but when we take into account NUMA alignment, that will make things more difficult, because you might have memory and CPU from one NUMA node and then you need to make sure that the network device, for example, is also coming from the same NUMA node if we want to give good performance for the application.

B

So, yes, we still don't know how we can address this one, but surely we can look into it.

A

I mean, if I were to just think aloud, you literally need all containers in a pod that are sharing such a network device to be on the same socket, right?

B

Right, yes, for optimal performance.

C

I think not necessarily. In some cases we could have our high-performance application, like a VNF, in one container in a pod, and then we could have a telemetry or logging container in there as well, and in theory that one wouldn't really care whether it's on the same socket.

A

The answer to my.

B

This is exactly what we are trying to achieve here, and hopefully we can get some more ideas on how to address this.

A

Okay, so I guess the next question that I had was: kubelet currently does device allocation. It wasn't clear in the doc whether the plugin was expected to do device allocation or the kubelet. Can you clarify that?

C

Kubelet does the allocation. The only difference is that the device plugin stores the allocation so that the CNI can get that information.

E

That's actually also my question. If the device plugin also stores that information, what if the device plugin fails? How will it re-establish that state?

B

Our current implementation doesn't persist this information, but we are planning to keep a local file so that, in case of a restart, it can read that file and bring that state back.
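The local-file idea mentioned here was only a plan at the time, so the following is a sketch of one way it could work rather than a description of the implementation. It assumes the state is a JSON-serialisable mapping from pod UID to a list of VF PCI addresses; the function names and the file layout are hypothetical.

```python
import json
import os
import tempfile
from typing import Dict, List


def save_state(path: str, state: Dict[str, List[str]]) -> None:
    """Write the allocation state so that a crash never leaves a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # Atomic rename: readers see either the old or the new file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_state(path: str) -> Dict[str, List[str]]:
    """Read the state back after a restart; a missing file means no state."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}
```

As the following turns point out, any such file is a second copy of state that kubelet also tracks, so it can drift from kubelet's view; the sketch does not address that.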

E

Okay, because I think during the initial design of the device plugin, the reason we chose to do allocation, and also to store that information in the local checkpoint file, was to make device plugins stateless. And now, if we say that both the device plugin and the kubelet store this information, that is something that is new to us.

F

To build on James's point: if the device plugin is also storing the device information, it seems prone to errors, especially since the device plugin might end up with a different state. So it might be more interesting that when you register the device plugin, if kubelet has that information, it could give it to the device plugin again.

E

Currently, what information is passed from the device plugin to the shim?

B

So the VF information, which includes the VF PCI address, the corresponding PF that the VF comes from, and the VF ID. Those are the three pieces of information we are getting from the device plugin.

E

And you mentioned that there's also some information stored in the pod annotation. How is that pod annotation information passed to the CNI shim, or to some other component?

B

The pod annotation for the network object, is it? Yeah, that is passed through to the CNI shim, that's right. The CNI shim gets this because we are using Multus, and there is a network object created using the CRD.

E

Okay, so the CNI shim will watch the CRD?

B

We don't need to watch the CRD. Basically, I'll go over the pod spec there. Here is the CNI shim YAML file; we create a network object using this file. This is the name we have given it, CNI shim net 1, and this is the pod spec here. We specify that this pod wants to connect to this network, on top of the default flannel network that we have configured. So this will be an additional network.

C

I think what you're asking is: Multus uses the in-cluster config to get the pod annotations and to get the CRDs.

B

Yes. The way Multus works is that it sees this pod annotation, then it does in-cluster communication to get the details of this network object, which is the CNI shim network. It will get exactly this configuration, which will be passed to Multus.

A

So just throw an idea: what would be what, if you.

B

We thought about it, but what if we plumb the network in the device plugin, and for some reason some other resource requirement isn't met and the pod fails to be created? Then we would have to deallocate or tear down the network that...

A

That shouldn't be a blocker, in the sense that there are admission checks which will ensure that kubelet can satisfy all the resource requirements before admitting a pod. So admission checks are sort of atomic. And then your second problem is that you have to perform some deallocation.

C

Just the other thing: when the device plugin Allocate is called, the network namespace isn't created yet. I saw some proposal where you plumb the device into the network namespace of the actual device plugin pod, then watch for the pod's namespace being created and then plumb it in there. But that seems kind of a hacky workaround, as opposed to a good interface to follow.

G

In addition to what we must problem is for details in our create call.

B

Yes, this is exactly the point Louise also made, that we have a challenge there: we would actually be going outside of kubelet's flow for creating the network. So I don't think that is the right approach.

A

Think the community.

A

Having free start.

A

In theory, and I might be missing or glossing over some major pain points, if you are able to get a hook in the device plugin between the pod sandbox being created and the first container being started in there, you should be able to configure the network namespace, right? So all you need is the network namespace information.

E

But I think before we go down that path, we need to ask: given that it's a common operation needed by a lot of network hardware, wouldn't it make sense to do it through the CNI,

E

instead of trying to duplicate the same functionality in each component?

A

In theory, we could just imagine CNI being invoked from within the device plugin. I was recommending creating a new interface. Well, I just want us to really think aloud about all the possibilities before we constrain ourselves to whatever interfaces we have today. That's probably the reality, in a sense: if you want to ship it today, you have to make do with what is available. But if the discussion is about how we improve this, then I would want it to not have too many constraints.

F

Just a side note: I've been taking notes for the past five minutes and I might have missed some of the conversation, so if people want to complete those, sure.

A

The only other question that I had was this: I'm assuming that VLAN is not part of your resource allocation problem, in the sense that you can have a single set of VF interfaces which can belong to any number of VLAN IDs, and so it's a separate config knob.

B

Yes, the SR-IOV CNI plugin that we're using supports VLANs as well. I didn't show it in the demo, but in this SR-IOV configuration here we can specify the VLAN ID, and the VF will get that VLAN ID configured.

E

A question about the implementation: you mentioned you expect to have different phases of implementation, and you also mentioned some NUMA-aware allocation. Does the current implementation include anything NUMA-related?

B

No, no okay! It.

C

It should, in theory. If the current NUMA manager proposal were to go through, then with the device plugin as it is in phase one, where you explicitly request the extended resources, if you were also to request integral CPUs, the NUMA manager would coordinate so that they end up on the same socket, without any changes to any of the components that are there.
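The coordination described here can be illustrated with a toy placement check. This is not the NUMA manager proposal's algorithm, which is not detailed in this meeting; it only shows the constraint being discussed, that the CPUs and the VF should come from the same NUMA node. The function name and the per-node counters are assumptions.

```python
from typing import Dict, Optional


def pick_numa_node(free_cpus: Dict[int, int], free_vfs: Dict[int, int],
                   cpus_needed: int, vfs_needed: int) -> Optional[int]:
    """Return the lowest-numbered NUMA node that can supply both requests.

    free_cpus and free_vfs map a NUMA node id to the number of free
    exclusive CPUs and free VFs on that node. None means no single node
    can satisfy both, so the resources could not be aligned.
    """
    for node in sorted(free_cpus):
        enough_cpus = free_cpus[node] >= cpus_needed
        enough_vfs = free_vfs.get(node, 0) >= vfs_needed
        if enough_cpus and enough_vfs:
            return node
    return None
```

For example, with two free CPUs on node 0 and eight on node 1, a request for four CPUs and one VF can only be aligned if node 1 also has a free VF.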

C

Because the CPU manager is making the decision on the integral CPUs and the device manager is making the decision on the VFs, it won't have any...

C

So the main open item we have is the NUMA alignment, because one of the major use cases we have is that the pod spec writer would only have to request one thing in their pod spec. In our current phase one, you have to request both the network CRD annotation and the extended resource. I want to limit the room for error there so that you only need one, either the annotation or the extended resource request. But to do NUMA alignment

C

we need to have the extended resource request, and, as we said before, the networking is actually pod level.

C

So whether we put an extended resource request on every container, or on just one container, and how you know which container is the one that needs to consume the VF, is a major open item we have.

E

But you don't expect a device to be shared by multiple pods; you expect a device to be shared by multiple containers in a single pod, right?

E

Great, that's, correct, yeah!

E

I'm just curious about the use case. What are the target workloads for this?

B

Mostly the VNF type of workload, where you want to separate the control plane and the data plane. Usually SR-IOV VFs are used for that high-performance data plane.

E

So this is to be run on switches or like a high performance.

B

Machines, yes. Telco use cases usually require this SR-IOV type of workload.

E

Okay. Are you running any containers in those environments currently?

B

I think, I'm not sure about production level, but there is definitely a lot of activity going on, trying to create VNF applications in containers.

D

Hi guys, can you hear me actually sorry.

D

I was outside, actually, and I wasn't able to talk to you guys; I've found a new place. Previously there was a problem on my side, maybe. I listened to the conversation about how the device plugin could invoke the CNI. So I just want to get the community's view: is the community looking at it such that multiple networking should be done in a device plugin model, or should it be done in a CNI model?

D

There are a lot of proposals going on, so I just need to get the community view on that particular point.

D

So in this case, you want to invoke the networking through the device plugin, right? Sometimes that's kind of reinventing the wheel, redoing what the CNI has done before. So I just want to get the opinion on...

E

So it actually wasn't our intention for the device plugin to really cover network resources. As its name suggests, it's really intended just to manage device resources. I also feel like it's probably better to solve this problem in SIG Network, because the folks there have knowledge of networking. And as we have seen, when trying to make network resources fit the device plugin API, there are some things that don't really match; for example, a network resource is not a container-level resource. So I actually hope the problem can be solved

E

in the CNI context. Okay.

F

Just a small note: I think one of the objectives at the beginning was to support network resources, and this was put off. I think you are right that SIG Network probably has a lot more knowledge, but one of the goals of this Resource Management group was also to have people from multiple SIGs meet and take decisions that are cross-SIG, and network devices seem like something that does cross SIGs.

E

That's a good point, and we have had similar proposals for networking in the past, I think. Maybe we need to make sure we have some SIG Network representatives participating in this meeting so that we can get some opinions from their side.

A

That's a good point. We should make sure, maybe as part of the agenda, when there are any network topics being added, that I also notify SIG Network and try to ensure that there are one or two representatives present. This also seems like a really big problem. From talking with them, the picture they paint is that there are so many different problems that are getting interconnected.

A

There are issues with managing hardware devices, or logical devices that are associated with hardware, and then there's the whole gamut of networking features that have been built over the years and that also need configuration. These two are getting clubbed together.

A

SR-IOV is a classic example, where you have to handle not just this logical resource allocation, but you also have to deal with VLANs and a whole bunch of other networking configuration. The number of permutations in that space is quite vast, and I suspect only a few people in the community are actually grasping all the different sets of problems that are related to the same area.

A

So again, in a roundabout way, I'm saying folks in SIG Network are probably thinking about this a little bit harder than we are here. I think this community would probably have stronger opinions on how to deal with the actual resource allocation than when it comes to interactions with CNI and with networks.

C

We plan on presenting to SIG Network next week. Maybe you guys can attend that as well; we can send out when we're going to present, if anyone wants to attend and hear the discussion there as well.

F

It's actually a really good idea for us to attend and discuss what next steps we can take, because I don't think we, right now, as a community, can... well, I mean, unless you have suggestions, I don't see a lot of next steps for us other than presenting this design document to SIG Network.

A

Yeah. I mean, someone who really cares about this problem, has users, and is probably building products out of this would be the ideal one to champion it, because I'd say everyone else is pretty much busy. So you would have to figure out a way to get the right set of decision makers together on this topic.

B

We have shared this proposal with the SIG Network community as well and, as Louise mentioned, we are planning to do a presentation and demo there as well.

C

With regard to passing the pod information to the device plugins, is that something that's never going to go ahead? Or, if there is a solid set of use cases for doing that, is it a plausible change to the API that could be made?

E

I think we can discuss that, but I also want to clarify whether the expectation is for the device plugin to be stateful, which I hope to avoid.

C

Okay, and would there be any openness to kubelet storing additional information for a device in that model?

E

It's currently storing that, but I think the general ask is to make this information available to other components. I think we are also discussing this in different contexts, like for monitoring and maybe for other purposes, and hopefully we can come up with a solution that allows other components to discover this information easily.

A

Is it just the information that kubelet already has today, or was the question about additional information?

C

It's additional information, yeah yeah.

A

You might want to file an issue just about that.

C

Sure, and there may already be an issue. Okay, great, I'll try and find one if there is.

D

I think Vikas has filed an issue on that one, actually. I don't know, maybe...

E

Yeah, I think there's an issue about passing more information during Allocate.

C

I think that's about passing more information to the device plugin, as opposed to the device plugin passing back information, but I'll check for sure.

F

There was also a mention of sending something to the device plugin, or maybe not to the device plugin; I just want to make sure I noted this right. Sending the CRD and the extended resource request: was that to the device plugin, or was that to the NUMA manager?

C

It was just so that the user wouldn't have to request both in the pod spec. They would request just one, and then an admission controller or some other component would add the other one, either the extended resource or the CRD, as needed. But that information doesn't go to the device plugin.
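The admission-time fix-up described in this answer was only an idea at this point, so the following is a hypothetical sketch of the mutation rather than anything the project shipped. It assumes a pod manifest as a plain dictionary, an annotation key following the Network Plumbing Working Group convention, and a known pairing of one network name with one extended resource name.

```python
import copy
from typing import Any, Dict

# Assumed annotation key; the exact key depends on the Multus version in use.
NETWORKS_KEY = "k8s.v1.cni.cncf.io/networks"


def complete_pod(pod: Dict[str, Any], network: str, resource: str) -> Dict[str, Any]:
    """Add whichever of the two requests (annotation or resource) is missing."""
    pod = copy.deepcopy(pod)  # never mutate the caller's manifest
    annotations = pod.setdefault("metadata", {}).setdefault("annotations", {})
    containers = pod.setdefault("spec", {}).setdefault("containers", [])

    current = annotations.get(NETWORKS_KEY, "")
    listed = [n.strip() for n in current.split(",") if n.strip()]
    has_network = network in listed
    has_resource = any(
        resource in c.get("resources", {}).get("limits", {}) for c in containers
    )

    if has_network and not has_resource and containers:
        # Which container should carry the request is an open question in
        # the discussion; the first one is picked here arbitrarily.
        limits = containers[0].setdefault("resources", {}).setdefault("limits", {})
        limits[resource] = "1"
    elif has_resource and not has_network:
        annotations[NETWORKS_KEY] = ",".join(listed + [network])
    return pod
```

The arbitrary choice of the first container is exactly the unresolved point raised earlier: the network is pod level, while extended resources are requested per container.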

F

Okay, thank you.

A

Yes, the next item that you know that we had was.

G

Basically, it's my question, which I wanted to bring to this group, because I started to notice that the bot has started to comment on many of the related issues that they're in a rotten state. We had the proposal from, I think, the beginning of April, with comments like "people will comment on it during April", but I think since April nothing has happened. So is anyone working on it, and what is the current state?

E

I think Vikas has a POC available.

E

We are still trying to finalize the design, so that's why he hasn't been updating the POC, mostly because, before the design is finalized, I don't think we should spend too much time updating the POC, since it may change depending on the outcome of the design. We are hoping to get this finalized, hopefully this quarter, and then I think a lot of parts of his POC can be reused. I think we can definitely use that as a basis going forward.

G

Yeah, but even for the design discussion, I haven't seen comments in the last couple of weeks. I think it's a bit stalled.

E

We've had some offline discussions, and perhaps not all of them made it onto that PR, but we will try to capture those conversations there. So there have been discussions going on, and we will try to make them public.

I

I would say, frankly, given that it's not going into the actual code base this quarter, many of us are probably focusing our energy on closing out what we're actually trying to get into the code base in this current release. Thank you.

G

What is the general feeling? Can we get, I don't know, an implementation in the 1.12 release cycle, or what?

G

We have several features we're working on, like FPGA support and so on, and we need to understand what we are doing: whether we depend on the current device plugin implementation or, better, whether we would like to use the new resource classes.

I

The device implementation.

E

Right, so it's not going to replace the device plugin. But I guess your question is perhaps about FPGAs, since we have heard some similar requests in the past. I guess in the current model you are having to advertise a special resource name for each FPGA and your...

E

Driven by that, you can make that information as a part of the metadata information in the.

E

Class, not you the recipient, moto. What's one.

G

Of the things, but oaken is like a bit a user experience, how we can utilize well like special Hardware.

E

Mm-hmm, and you are trying to build something for FPGAs on top of the resource class POC, I guess.

H

I hope so I hope.

I

Vikas's POC was done about a year ago; I'd be shocked if he's rebased it since then, although most of the changes were in the scheduler. I don't think he's on this call now, but you could follow up and ask him to resurrect or rebase that branch.

G

I'll try to ping him on Slack and see what its current state is.

G

Besides FPGA, we have other scenarios. Actually, we have another item on the agenda, which my colleague wanted to talk about, but I'm not sure we have enough time for it right now.

E

Yeah, feel free to share any pain points in modeling FPGA resources with the current...

A

If not, we can punt it to the next meeting.

J

Hello, this is Krisztian, Krisztian Litkey from Intel. It's not really urgent, so what we could do is that I give you a sort of high-level introduction to what we are trying to achieve, and when we run out of time we can continue during the next one. Or, if you prefer, we can do the whole thing in the next one. Both are fine with me.

I

I'd prefer that. I think between now and two weeks from now there will probably be a lot of mental context switching, and maybe the participant list will be different, so if we could just queue this up for next time, that would be...

A

And if you can write even one page, or five or six lines, about what you're thinking, that'll also help us get on the same page.

J

Sure, actually I have a yeah. Yes, I can do that.

J

Basically I have a document; I just couldn't share it because I created it with my Intel account. I will just recreate it with my personal account, so I will be able to open it up for everybody, and I will add it to the agenda. If I add a link to it, is that fine?

F

Thank you very much.
