Kubernetes WG Resource Management, 9 May 2017

From YouTube: Kubernetes Resource Management WG 20170509

Description

Meeting Agenda:

https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU

A

So yeah, welcome to this meeting of the Resource Management working group. I know many of us are fatigued from discussing resource management topics last week, but there are two items on the agenda today. One is a recap of what we discussed, plus some discussion of next steps. The second item is a more detailed dive on the container device interface, which I think will dovetail a little into some of the topics we discussed last week as well.

A

So with that, Connor, since you so happily volunteered, or were asked, to give the recap, do you want to go ahead?

B

Sure. These are notes that Derek took, basically condensed from the raw notes that David Ashpole graciously wrote for us during last week. So, the summary: we had people from a diverse group of companies, and there were a few work streams that were decided on out of the three days. I think we came down to, it looks like, five that are broken down here. The first one was support for resource classes.

B

This will be a repeat for a number of you who were at the last meeting, but it is a level of indirection between what's in the pod spec and what is in the node capacity, to allow concrete resources to be aggregated into slightly more abstract things that end up in the user's pod spec. Resource classes are mentioned in a few of the other work streams, so this may be at least a soft dependency for a number of the other items that we have here.
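To make the indirection concrete, here is a minimal sketch that was not shown in the meeting. It assumes a resource class is simply a named set of concrete node resources; the names `ResourceClass`, `class_capacity`, `fits`, and the resource names are hypothetical and are not part of any Kubernetes API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceClass:
    name: str            # abstract name that would appear in the pod spec
    concrete: frozenset  # concrete node resources this class aggregates


def class_capacity(rc, node_capacity):
    """Sum the node's capacity over every concrete resource in the class."""
    return sum(qty for res, qty in node_capacity.items() if res in rc.concrete)


def fits(pod_requests, classes, node_capacity):
    """Check pod requests, resolving class names through the indirection."""
    for name, qty in pod_requests.items():
        rc = classes.get(name)
        available = class_capacity(rc, node_capacity) if rc else node_capacity.get(name, 0)
        if qty > available:
            return False
    return True


# Hypothetical node with two concrete GPU models aggregated under one class.
node_capacity = {"vendor-a/gpu-model-x": 2, "vendor-a/gpu-model-y": 1, "cpu": 16}
gpu = ResourceClass("gpu", frozenset({"vendor-a/gpu-model-x", "vendor-a/gpu-model-y"}))
classes = {"gpu": gpu}
```

With this shape, a pod that asks for `gpu` never names a concrete model, which is the portability property discussed above.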

B

There are links to the detailed design that we have in those docs. Probably more work is needed on all of these. OK, item two was extensible device support. Just remembering back, I think NVIDIA agreed to be the guinea pig to implement a device plugin and also help out with the implementation of the delegation side in the kubelet.

B

But the main idea here was that you would have a finely shaped API that's intended for an external component to advertise some device resources. Part of that information includes the topology subgraph, which the kubelet would add to its local node hardware locality information and use in the device assignment decisions. It would also be responsible for initializing, tearing down, and

B

Health checking those devices that are consumed by user containers. I think there's a bit of design left to do on this one.
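The responsibilities listed above (advertise, initialize, tear down, health check) can be sketched as a small interface. This is only an illustration under assumptions: the actual API was still to be designed at this point, and `Device`, `FakeDevicePlugin`, and `allocatable` are hypothetical names. A single NUMA node number stands in for the topology subgraph.

```python
from dataclasses import dataclass


@dataclass
class Device:
    id: str
    numa_node: int        # stand-in for the advertised topology subgraph
    healthy: bool = True


class FakeDevicePlugin:
    """Hypothetical external component that owns a set of devices."""

    def __init__(self, devices):
        self._devices = {d.id: d for d in devices}
        self.in_use = set()

    def advertise(self):
        return list(self._devices.values())

    def initialize(self, device_id):
        if not self._devices[device_id].healthy:
            raise RuntimeError(f"{device_id} is unhealthy")
        self.in_use.add(device_id)

    def teardown(self, device_id):
        self.in_use.discard(device_id)

    def health_check(self):
        return {d.id: d.healthy for d in self._devices.values()}


def allocatable(plugin):
    """Kubelet-side view: healthy devices that no container is using."""
    health = plugin.health_check()
    return sorted(d.id for d in plugin.advertise()
                  if health[d.id] and d.id not in plugin.in_use)
```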

B

I'm not sure exactly I mean.

A

I would basically describe it like this, and Vishnu, jump in if you want. For this one, the high-level things were identified. I think the document that Vishnu and myself set up builds on a lot of the items that were presented in the original CDI, or container device interface, proposal, but adds some additional stuff

A

That was not there. So basically, we need someone to go and take the next step and build a proposal that kind of blends the two. But right now, we don't have the detailed framework in mind for how the kubelet will trust individual device creators and stuff like that. So it's somewhere.

C

A

More like what are the corrector point we needed so.

C

I've been spending some time thinking over the basic design idea for the device interface. I mean, just to say: if someone in the community wants to go and take that on, they can really feel free to do so. But I think there's one thing to keep in mind, which has probably already been said here: the resource class API is probably a prerequisite for anything you want to do with devices, so that's probably going to have a higher priority relative to extensible devices. So the existing NVIDIA integration

C

May continue to function as it is and might not meet everyone's use cases, but even for that integration we need an extra step.

A

So I cannot cover that part, but yeah.

D

Jen research.

A

Classes our product, to example, devices it.

B

It seems like, if somebody wanted to prototype, at least they could fake the resource classes with OIRs, just to get something working, and then transition onto resource classes as soon as they become available. But that probably wouldn't go upstream.

C

B

B

So the third item here is some enhancements to how the kubelet, by default, treats CPU resources. I think the concrete goal we had here was to achieve good performance, or at least more predictable performance, when compared with VM-based orchestration systems. That goes back to what Derek described as disappointing bake-off results when he was asked to compare VM-based deployment versus the CFS CPU sharing that is the equivalent of the default in the kubelet.

B

So the idea is to try to reduce process scheduler latency by giving out exclusive cores to some guaranteed pods, based on how the resources are requested. Two strategies were proposed. One is called the dynamic strategy, which would be based on feedback for the pod, using that to do something smart with cpusets. And then the static strategy would be,

B

As soon as a pod that needs an exclusive core lands, to shrink the shareable pool to make a hole for that container. Then you would have a process that has exclusive access to that core for its lifetime.
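The static strategy described here can be sketched as a shared pool that shrinks when a container needs exclusive cores and grows back when it exits. This is an illustrative model only, assuming plain integer CPU ids; `StaticCpuPool` is a hypothetical name and not the kubelet's implementation.

```python
class StaticCpuPool:
    """Toy model: CPUs are either in the shared pool or pinned to one container."""

    def __init__(self, cpus):
        self.shared = set(cpus)
        self.exclusive = {}  # container name -> set of CPU ids

    def allocate(self, container, n):
        # Assumption: always leave at least one CPU for everything else.
        if n > len(self.shared) - 1:
            raise RuntimeError("not enough CPUs left in the shared pool")
        picked = set(sorted(self.shared)[:n])
        self.shared -= picked          # make a hole in the shareable pool
        self.exclusive[container] = picked
        return picked

    def release(self, container):
        # The cores return to the shared pool when the container goes away.
        self.shared |= self.exclusive.pop(container, set())
```

A dynamic strategy would presumably keep the same two operations but decide when to call them based on runtime feedback rather than at admission time.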

B

Yeah, was there anything else that we wanted to mention about this item, or does that pretty much cover it?

A

We basically want to explore those two strategies first, before seeing whether or not they are sufficient. It might be that there's a third or some other pluggable strategy required in the future that we can evaluate down the line, but I think that covered it pretty well.

A

The only thing I'll add is that the dynamic strategy will most likely be the default, so we need some people to prototype this before getting a more informed proposal out. I know that, at least at Red Hat, Seth here on the call is going to start to prototype something. I know some members from Intel have expressed an interest as well. So it'd be good if people who are interested could speak up, and we can surely collaborate to build a good outcome for all. But yeah.

B

That would be great. I know some of the members of our group started just by seeing what they could do in terms of refactoring the external isolator work to be the kind of internal delegate that we talked about during the meeting, to change it into just a CPU manager interface inside the container manager package.

A

Yeah. I don't remember, Seth, how much you've had time to look at this, but if you want to give an update on your piece of the world, go ahead.

E

I'm also looking in the container manager. The cpuset cgroup is kind of strange. It's got a cpuset.cpu_exclusive property, but we don't want to set that, because that might cause nightmare scenarios with races and things like that. What it does is this: you're defining the cpuset.cpus for your cpuset, and if you set exclusive to one, then if any of the CPUs in your cpuset are included in a sibling,

E

The set fails. If not, then your set of exclusive succeeds, and then those cores can't be assigned to sibling cpusets. So yeah, it's kind of complicated, but I think what we want to do is basically make cores explicitly exclusive by just removing them from all the sibling sets and not setting cpuset.cpu_exclusive. But yeah.

E

Basically, what that means is that whenever a guaranteed pod lands on the node, you're going to have to go through and touch every cpuset, at the QoS class, pod, and container level, to reserve that core for the guaranteed pod. So it's tricky.
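The approach described above, removing reserved cores from every sibling set instead of setting cpuset.cpu_exclusive, can be sketched as a pure function that computes the new cpuset.cpus value for each cgroup. This is a simplified illustration: it treats the cgroups as one flat map, ignores the rule that a parent cpuset must contain its children's CPUs, and does not write to the cgroup filesystem. The cgroup paths and function names are hypothetical.

```python
def format_cpuset(cpus):
    """Render a set of CPU ids in the kernel's list format, e.g. {0,1,2,5} -> "0-2,5"."""
    ranges = []
    for cpu in sorted(cpus):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu           # extend the current run
        else:
            ranges.append([cpu, cpu])     # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def reserve(cgroups, owner, cores):
    """Return new cpuset.cpus strings: owner keeps only `cores`, everyone else loses them."""
    cores = set(cores)
    updated = {}
    for path, cpus in cgroups.items():
        new = cores if path == owner else set(cpus) - cores
        updated[path] = format_cpuset(new)
    return updated


# Hypothetical cgroups that all start with access to CPUs 0-3.
cgroups = {
    "burstable/podA/c1": {0, 1, 2, 3},
    "besteffort/podB/c1": {0, 1, 2, 3},
    "guaranteed/podC/c1": {0, 1, 2, 3},
}
```

Every entry in the returned map changes, which is the "touch every cpuset" cost mentioned above.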

A

A key thing to call out is that the code paths for either strategy should not be diametrically different. They have very similar problems: they both need to schedule a CPU, they both need to release a CPU, and they both potentially need to know how fragmented they are. So I think the hope is that, from a code perspective, we can get a base primitive that works for both, and then look at the fragmentation. So yes.

B

I mean, it can all be abstracted away. It's just a hierarchical pool abstraction or something that they both manage. But anyway.

D

B

We don't have to design it here, but yeah. Seth, if you wanted to collaborate on the same piece of code, we'd be happy to try to figure that out. OK, yeah.

E

As soon as I have code. I'm just writing a prototype delegate and getting the prototype working, such that I can make a meaningful proposal. But as soon as I have code that is meaningful, I can share that. OK.

B

Same on our end. We'll just stay in touch. OK.

F

B

OK, so the next item was huge pages. Huge pages will become first class, and we will support both modes of consuming huge pages: either directly, using mmap or shmget, or via hugetlbfs.

B

So there's some work already on a huge-pages-backed memory volume driver. I guess for this one it may be a matter of how much can be parallelized. Can we set up the new resource and the setting up of the cgroup limits in the kubelet in parallel with the volume plugin?

B

Does one have to block the other? I don't think so. It stands out

A

B

Out of all of them in terms of work that could immediately start. Yeah.

A

I think the thing we need to iron out on this is in the prototypes that are out there. Let me single out one question: if we choose to map huge pages via resource classes and have an indirection, we'd like to just say "this is the amount of huge pages I want", independent of the actual page size. So if you just say "I want a gig of huge pages" and you don't care if it's 2 megabyte or some other alternate architecture

A

Size, then you might want to use the resource class model. We discussed this, and so maybe resource classes come first. But if we don't feel like that's a strong requirement, then yeah, this one feels pretty parallel. So.

C

The idea that diversity, consistent quality and not worry about any application requirements are met by static, punish me. I. Couldn't.

A

Understand something so I'm.

C

Saying that having resource classes in the mix would basically mean that you can have portable API specifications, and you don't have to statically tie them to deployment-specific requirements.

A

I'm sorry, I'm unable to connect that. What I expected you were saying is that I could make my resource class for memory be backed by either normal memory or not normal memory. No?

C

I'm saying that if you ask for huge pages that are, like, five gigs, and if that's part of your resource name, then you are sort of tied to a specific deployment, right? Yeah.
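The portability point can be illustrated with a little arithmetic. In this sketch, a request is stated in bytes and resolved against whatever page size the node offers; the resource name format `hugepages-<size>Mi` and the function names are assumptions made for illustration, not an agreed naming scheme.

```python
MIB = 1 << 20
GIB = 1 << 30


def pages_needed(request_bytes, page_size_bytes):
    """Round a byte request up to a whole number of huge pages."""
    return -(-request_bytes // page_size_bytes)  # ceiling division


def concrete_request(request_bytes, node_page_size_bytes):
    """Resolve a size-only request into a (concrete resource name, page count) pair."""
    name = f"hugepages-{node_page_size_bytes // MIB}Mi"  # hypothetical name format
    return name, pages_needed(request_bytes, node_page_size_bytes)
```

The same "one gig of huge pages" request resolves to 512 pages on a node with 2 MiB pages and to a single page on a node with 1 GiB pages, so the pod spec itself does not need to name a page size.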

A

Yeah, either way, I agree with that. So I guess that's the one edge case on the huge pages stuff. I think the other edge case was just ensuring that memory accounting is working as we expected, so being able to verify that the memory cgroup is not double counting huge page consumption outside of hugetlb. But yeah, generally, we should be close to being able to get a final proposal on this stuff.

B

Cool. OK, so the second to last one here is NUMA. The intention is for the kubelet to understand the node topology. I guess we talked about an initial round of introspection into the CPU socket topology, and then, when device plugins advertise their own resources, those graphs would be appended to the known topology of the node, and they would all be considered when making local resource allocation decisions on the kubelet side.

B

So I guess the one hard design constraint that we decided during the meeting was that the node topology will not be advertised to the scheduler. The scheduler will just deal with counting resources, and the kubelet will be concerned with mapping the allocated number of resources onto concrete resources once the pod is assigned to the node.
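The split described here, where the scheduler only counts and the kubelet maps counts onto concrete resources, can be sketched as follows. This is an illustration under assumptions: the topology is reduced to a dict keyed by NUMA node, a device plugin's graph is reduced to a device-to-node map, and all function names are hypothetical.

```python
def merge_plugin_devices(topology, advertised):
    """Append a plugin's devices to the node's known topology."""
    for device, numa_node in advertised.items():
        topology[numa_node]["devices"].append(device)


def scheduler_view(topology):
    """What the scheduler would see: plain counts, no topology."""
    return {
        "cpu": sum(len(n["cpus"]) for n in topology.values()),
        "device": sum(len(n["devices"]) for n in topology.values()),
    }


def pick_local(topology, n_cpus, n_devices):
    """Kubelet-side choice: first NUMA node that can satisfy both counts locally."""
    for node_id in sorted(topology):
        node = topology[node_id]
        if len(node["cpus"]) >= n_cpus and len(node["devices"]) >= n_devices:
            return node_id, node["cpus"][:n_cpus], node["devices"][:n_devices]
    return None  # no single node fits; a real policy would need a fallback
```

A scoring policy could replace the first-fit loop in `pick_local` without changing what the scheduler sees.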

A

So I don't know, Vish, if you want to add more color here. I think, generally speaking, in order to go deeper on NUMA, we need to get the graph modeled in the kubelet. There are a lot of prereqs we need to land before then, and I think we'll learn some things along the way, like which resources we should give better weights. I wouldn't be surprised if, in a few months' time, we have a bunch of scoring policies.

A

Class or something different, to say, like, what's a minimal acceptable score for this thing to meet. I think we would all generally agree that the kubelet will be NUMA-aware, and how advanced the policies get is still to be seen.

C

Yeah, I think we need a more detailed proposal. OK.

A

A

C

Not let's be done to like.

C

Its okay whitter.

A

Yeah, completely. I mean, I would want to get resource classes, devices, and CPU figured out before tackling NUMA.

A

Do we agree on this? Yeah.

G

A

Yeah, but I think, generally speaking, it's helpful for people to know that the kubelet sees NUMA on the horizon and that it needs to be somewhat intelligent about it at a future time.

B

OK, and then the last topic that we talked about was performance measurement. The main idea is to provide some way that we could witness performance regressions that happen specifically with node-local policies, and also to compare proposed isolation strategies against each other in different situations, hopefully on commonly deployed workloads in the cloud. There's one example linked here, which is a project that happened at Intel, and it's mainly about allowing the user to program experiments. Each experiment consists of a bunch of different

B

Configuration values that get unfolded into a matrix, and what you end up with is a big graph of sensitivity profiles for, specifically, a high-priority workload versus some co-located aggressors. So you end up with a graph like this. The main point was not to push this project specifically as a solution, but just to use it as inspiration for what kind of data we might want when we are making decisions about these policies, because I know a bunch of the people on the call

B

Have real-world experience with being fooled by intuition in this space. It's easy to make a mistake and do something that doesn't behave as you expected.
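The "configuration values unfolded into a matrix" idea is essentially a Cartesian product. The sketch below is not the Intel project's code; it only shows the unfolding step, and the configuration keys and values are made up for illustration.

```python
from itertools import product


def unfold(config):
    """Expand {key: [values...]} into one dict per combination of values."""
    keys = sorted(config)
    return [dict(zip(keys, values))
            for values in product(*(config[k] for k in keys))]


# Hypothetical experiment: one high-priority workload at several load
# points, run against different co-located aggressors.
experiments = unfold({
    "aggressor": ["none", "memory-bandwidth", "l3-cache"],
    "load_pct": [25, 50, 100],
})
```

Each resulting dict is one cell of the sensitivity profile, so measuring the high-priority workload once per cell fills in the graph.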

A

I'm wondering how we can take this performance measurement data and then make something more intelligent, so that we get affinity and anti-affinity scheduling rules assigned intelligently. I don't know if you've thought about that much, Connor, but generally speaking, I know we've spent a lot of time talking about initial resources, and potentially a thing in the Kubernetes project where, based on historical usage, you can see

A

What your default resource assignment should be, but I think it'd be good to understand.

D

A

We as a project can get more intelligent in understanding that this pod is an aggressor to another pod, kind of like what that graph is showing, right? Yeah.

B

I think, yeah, one paper that comes to mind is Quasar, from Stanford, by Christina Delimitrou

C

B

And Christos Kozyrakis. But anyway, it was all about offline profiling and right-sizing and doing co-location to reduce, basically, the completion times for a big set of batch tasks.

B

But I would think that would be like a v2 or v3 or v4.

A

No, I understand. I'm more wondering, with tools like Swan, when I get the data, how can I use the existing primitives we have today in our scheduler to respond to that data? I'm wondering if there are particular affinity rules or particular label patterns we can apply to pods to make things easier to consume.

A

So like, if you know that your I.

A

I've got to read through your stuff a little more, but I feel like we might get a large amount of decent data, and it's unclear to me that we need new code to respond to it. How do you know where it's safe to schedule, and how do you use affinity or anti-affinity to hit your desired outcomes?

B

Maybe we can defer to the Google folks on the line and see if they have some work that they can talk about.

C

A

B

A

You know, when I asked you about the priority stuff, I learned that you guys have like twenty-eight thousand different priorities, so I thought maybe you might have some

F

Type of performance measurement.

A

Stuff. So hey, thanks for giving a summary. Are folks who were unable to attend

A

OK with what was described thus far? I recommend people read through the notes and give feedback if they are unhappy about anything. Any other things?

G

OK. Oh, if you're speaking, we don't hear you.

A

So it could be David Oppenheimer because he's the only one: that's not unmuted mmm yeah I.

D

Heard something I thought he.

B

D

B

D

G

A

He's made it like: it's not I. What.

G

What was the question? We heard something, and then your name showed up on the screen twice. I thought maybe you were speaking and your mic was not working well. But anyway, it looks like you were not.

A

OK, so assuming there aren't any immediate things that get people too upset,

A

We can move on to the second topic on the agenda. So Dennis, I don't know if you're there. I think you wanted to give a rundown on your thoughts behind the container device interface proposal. Maybe you can walk us through that now, and then we can give you some of our perspective back on items that we might have touched on at the face-to-face.

F

Yes, I'm here. Can you hear me? Yep, fine. I think you basically already mentioned in the summary that you are going to implement this part using a gRPC API, and NVIDIA is going to do it. So.

A

I don't think the mechanism or the transport for it is defined. It's more a question of what the use case requirements are. I think we kind of nailed down the set of use cases we're looking for. And then, what are the high-level design goals you need to achieve? I think we nailed those down, but the actual mechanics, I think, are still to be determined.

F

A

I think the only major thing that differentiates them is that your proposal (sorry, it's been a bit since I read it) worked at the pod scope and was not tied to the kubelet.

C

And it's also not a binary; it's a long-running process. Yeah.

F

So yeah, I just tried to mimic CNI, the Container Network Interface, because I thought, well, it works already, and it has some elements that are similar. I did that for device discovery; I implemented that in the proposal after your suggestions. So it has device discovery, allocation, and deallocation. In this version of the proposal, communication happens like this: the command goes through environment variables, while the output goes through JSON sent to standard out.

D

Just so you know, this is Dan Williams from SIG Network, and also a CNI maintainer. We're sort of moving away from environment variables for some of this stuff, because we thought they weren't quite as flexible. So instead of environment variables, we're moving to push that information through the standard input JSON configuration instead.

F

That was exactly my idea. The configuration file I would also include in the standard input together with the commands, because I think they are somehow connected. I also wrote that in a comment on the pull request yesterday, I think. Yeah.
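The direction discussed here, with the command and configuration arriving as JSON on standard input and the result leaving as JSON on standard output, can be sketched as a pure request handler. Everything specific in this sketch is an assumption for illustration: the field names (`command`, `device`), the device paths, and the way allocation state is passed in are hypothetical and do not come from the CDI proposal or from CNI.

```python
import json

DEVICES = ["/dev/example0", "/dev/example1"]  # hypothetical device nodes


def handle(request_text, allocated):
    """Turn one JSON request string into one JSON response string.

    `allocated` is a set of device paths already handed out; a real
    per-invocation binary would have to persist this somewhere.
    """
    req = json.loads(request_text)
    cmd = req.get("command")
    if cmd == "discover":
        result = {"devices": DEVICES}
    elif cmd == "allocate":
        free = [d for d in DEVICES if d not in allocated]
        if free:
            allocated.add(free[0])
            result = {"device": free[0]}
        else:
            result = {"error": "no free device"}
    elif cmd == "deallocate":
        allocated.discard(req.get("device"))
        result = {"ok": True}
    else:
        result = {"error": f"unknown command {cmd!r}"}
    return json.dumps(result)
```

A plugin binary would wrap this as `print(handle(sys.stdin.read(), state))`; keeping the handler free of I/O makes the protocol easy to test.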

B

F

The main difference between what you worked out at the face-to-face and what I propose is that this is a binary and the other one is a long-running process. I don't think it makes much of a difference, at least for our use cases.

A

F

Know I guess one.

A

Thing I'm curious about is whether there were particular devices that you are looking to support that might have unique constraints that we didn't discuss. We spent a lot of the time discussing that, and there were a host of challenges around

A

GPUs that I feel are unique, but.

C

We did talk about the scenario where, as a vendor, you don't really care about writing a plugin. If I recall correctly, there was general agreement on that. You can have one more level of indirection, where you're basically having a proxy that proxies any API that you have. That could be an executable, for example, and then the proxy bridges it over to the gRPC endpoints in the kubelet.

C

At the end of the day, it's a similar mechanism inside. How you characterize it is not yet defined, but you have the actual devices or block devices, they have topology associated with them, and then the kubelet is going to orchestrate access. Yeah.

A

But I guess the other thing that I was thinking about, and Vishnu, I agree with your point there, was that we spent a lot of time discussing how to get kernel drivers as well as user space drivers for the device, and then we spent a fair amount of time discussing the allocate step,

A

If that was going to be used. I feel like we got a lot of pushback on the GPU side that they did not want to use the allocate step as a way of exposing devices to the container; they wanted to do it with OCI hooks. But I wasn't sure, Dennis, if there were particular devices you were looking at that might have similar challenges that might help us

A

Maybe better understand the constraint space, like other equivalent kernel driver or user space driver issues that you would go through with many of the devices that you were looking to integrate with.

F

Well, our use case is also GPUs, but we also looked at AMD GPUs, which don't have the user space constraints, because the kernel driver and the user space are basically decoupled.

F

So that's probably a bit easier, but apart from that, there is no difference. Our use case is visualization, so that's why I also included the TTY as another example, because to get a NIC server running, you sometimes need a TTY.

A

So I guess I'm trying to think about how to proceed here.

A

Are you interested in pushing the device proposal forward further? Are you agreeable to maybe looking at the documents that we put out, digesting those, and then potentially iterating on your existing proposal? Maybe we could use that as the PR to rally around. I guess what I'm trying to figure out is how long-term engaged you are looking to be in the project to push this forward.

F

Well, I already read most of the documents from the face-to-face meeting, and if it's still an option for you to follow this proposal, sure, I can go through that and iterate with you. But I don't think it makes much sense to go different routes. That's why I would like to coordinate with you on one proposal. Yeah.

A

I was trying to think if you could take that on. It would be great if you had time to. A lot of things are not defined, right? I think the mechanics of how to hook up device discovery local to any DaemonSet pods that are running there needs

A

A lot of detail. I think one of the things that came up there is we wanted the kubelet to be able to do a list and a watch on individual device plugin providers, rather than polling, so that we can respond more quickly when we know a device failed. So if this is a topic that you're particularly interested in and you want to evolve

A

Your proposal to take account of the things that we discussed and assume a world where it's more native to the kubelet, I would be happy if folks in the community wanted to do that.
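The list-and-watch idea, as opposed to polling, can be sketched with an in-process queue: the consumer takes one full snapshot and then applies pushed change events as they arrive. This is a toy model under assumptions; `DeviceProvider` and `drain` are hypothetical names, and a real provider would stream events over whatever transport the final design picks.

```python
import queue


class DeviceProvider:
    """Toy device plugin provider that pushes health changes to watchers."""

    def __init__(self, health):
        self._health = dict(health)
        self._watchers = []

    def list(self):
        return dict(self._health)       # full snapshot

    def watch(self):
        q = queue.Queue()
        self._watchers.append(q)
        return q                        # stream of (device, healthy) events

    def set_health(self, device, healthy):
        self._health[device] = healthy
        for q in self._watchers:
            q.put((device, healthy))    # push, so nobody has to poll


def drain(events, view):
    """Apply every pending event to the cached view without blocking."""
    while True:
        try:
            device, healthy = events.get_nowait()
        except queue.Empty:
            return view
        view[device] = healthy
```

Registering the watch before taking the snapshot avoids missing an event that lands between the two calls.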

A

F

Sure, that would be interesting to me. I would like to do that. But who should I coordinate with? Are multiple people working on this or interested in this?

A

So basically it needs someone to both prototype to inform the proposal and write the proposal. From a coordination standpoint, if we assume that the concept of a resource class will exist, then I assume Vishnu, myself, and maybe Renaud from NVIDIA are probably the ones to coordinate with. But I know, personally, I won't have cycles to work on a proposal for the next few weeks, and I assume Vishnu is in a similar boat.

A

So I'm just wondering if you have time, and if you want to go and take the time to drive the ball forward a little further. OK.

C

So yes, speaking of timelines, I think trying to come up with some alpha API in the 1.8 timeframe might be more realistic. Yeah.

A

That was my expectation.

C

A

I'm not expecting anything in 1.7, right? I think the feature freeze has passed on that. I'm more just saying that if people have time and cycles to dedicate and think on this, and understand where we as a broader group felt any proposal should go, then I don't discourage people from iterating on a proposal.

A

C

A

Just said, like.

C

A

C

Can be frustrated that they are not receiving any feedback or they are not making any progress. That's the reason why I wanted

D

C

To set the expectation, so that someone doesn't go and spend a whole bunch of time and then nothing really happens, or it doesn't get much attention in 1.8. That was the intention. Yeah.

F

Could you please repeat the last part? It was a bit mixed up. OK.

C

So I'm saying that oftentimes what happens is there is amazing work happening, but it doesn't happen together and it's not part of the roadmap, and so oftentimes those amazing patches, proposals, and designs don't get the right kind of attention. I want to avoid that with this one, and that's why I was trying to set the expectation straight: it's probably unrealistic to get a working implementation, even if it's alpha, in the next release.

C

So I was just suggesting that having a fully scoped-out prototype and proposal might be a more realistic goal for the next phase.

F

Mm-hm. Yeah, my timeframe would be like half a year, maybe.

C

That's why I think having a small, tractable API and having a few implementations against the API, you know.

F

So I would coordinate with Renaud? I'm

C

Not sure. I don't think that he was on the call today. He's an intern, so I don't know how long he will be there. If Renaud is not responding, then I can probably help you, because it's something that I intend to focus on, but I won't be 100% contributing to it. OK.

F

A

And thanks as well, Dennis. So I think we've gone through the two agenda items here. Are there any other topics that people want to raise for today?

B

At the end of the face-to-face, we put some kind of ringleaders onto the topics just to get a roadmap together. I was wondering if we wanted to have a consistent format for what the roadmap should look like for each one, or if there are some examples that we should follow.

A

You mean thing leaders without a portaloo yeah.

C

I mean, basically, we listed all six major items with a brief description. Then we tried to identify some basic points of contact, or high-level owners, who would then have some form of a roadmap. We wanted to bring all those features together and figure out an overall roadmap that makes sense for a year or something, because you don't want to do all of that together. So I'm wondering if I can merge all that together and then have a clear long-term roadmap, even if the long term is just a year.

C

To begin the process, we tried to identify some owners for each of those major topics. I guess each of the owners will try to come up with some form of detailed work plan, even though it may not be accurate. It's supposed to give some idea of how long each one is going to take, and depending on that, we can stagger things or we can do things in parallel. So.

A

I guess what I'm missing is: where was that done? It should be in the meeting notes doc at the end, yeah.

B

I don't see that. Oh, there's actually a link from, I think, one of the first lines. It's "topics and contacts".

B

It's a separate doc that's only one page, with each item and the contact listed. I just dropped it into the chat. OK.

A

So, given that.

A

Yeah, given that folks at Red Hat weren't there at that point, we may want to augment that, I think.

A

Let me take a moment to digest this, but as I noted there we have more staffing available to work in these spaces, and so it might.

C

A

The same set of names, yeah.

C

So those are just a syllabus for now, um because you just thought that we could, you could make you know I think so we decided you need. Okay,.

D

C

Just a starting point: yeah.

A

C

Make vlogs so looking.

A

At this list, the only thing I would add is what we discussed here, which was that we're very motivated to do something in the CPU space in the near to mid term, and so Seth has been tasked with that. As long as Seth and Connor are communicating, I'm happy. I will go through the rest of this afterwards in more detail. On the contacts for resource classes, Vish, that's really a two-part problem.

A

That's a scheduler problem and a node problem, and so I wasn't sure who we assumed was going to go and prototype the concept to ensure that even basic counting works. There are potentially people on my team that I could sign up to do that, but that would just be verifying that the thing will actually work, and not the final thing.

C

Which feature would that be?

A

This would be the resource class feature. I think one of the things we weren't sure about with resource classes is whether the scheduler will be able to properly count the resources consumed. And, David, just for context, I have a couple of people on my team that are largely dedicated to working on scheduling, so that would be Avesh and some other folks. So I was just kind of thinking that, for folks that are more active on the scheduling side, having them do an initial prototype just to validate the concept might be useful.

A

C

Interaction, that's called on, like defaulting in social, so Anita.

A

That's a whole separate topic for quota, I think. I noted it, but I was a little concerned that we haven't had community participation in quota aside from some folks here. But I thought that was really tangential to the resource class stuff. Yeah.

G

I guess, if you get.

A

The resource accounting right it.

G

Can probably cover the quota counting as well, I mean.

A

Well, all quota counts is pods, right? So as long as you're counting the same thing we're counting. Maybe there's an assumption on my side that quota will not count the concrete resources, but the resource names identified in the resource classes. Oh yeah.

C

So that's the part that I want to flesh out.

D

C

Then I also wanted up on to like flex a little bit more hope which Vegas would work, because you want to keep adding you use API. Are these the resources being added on but sort of ties into four pieces too?.

A

C

A

I will go and review this list and add other related contacts. There are more people than just those present on this call that can contribute to the effort. OK.

C

That's also, maybe you can think of offline data. Come on. I am.

A

C

On this book, all.

A

G

Everyone I think.

A

We're done okay,.

G

Thank you guys, bye, thank.

A

You I'll go there.