From YouTube: Kubernetes Resource Management WG 20181024
Description
Meeting Agenda:
https://docs.google.com/document/d/1j3vrG6BgE0hUDs2e-1ZUegKN4W4Adb1B6oJ6j-4kyPU
A
Hello, Brett. So, welcome everyone to the October 24th Resource Management Working Group meeting. Apologies for those who are watching this on video; I am starting the meeting a little bit late, and so thanks for those who stuck around. We have one item on today's agenda, so it's all yours, Christian, feel free to talk to your KEP.
B
It's about a bug, a bug report on Kubernetes, this issue about AVX. The idea there is that if we are running AVX workloads mixed with these regular workloads, then AVX will actually push down the base frequency very easily, which means that the non-AVX workloads are running slower. The problem here is that it makes things slower and makes those workloads harder to reason about, the workload speed, when these AVX workloads are running there too.
B
So the idea is that we would have two new configuration values in the kubelet, which would read this best-effort cpuset and an other cpuset, and that would mirror this: best-effort pods would be limited to the best-effort cpuset, and the burstable and guaranteed pods would be running on the other cpuset. By default, both of these cpusets would actually be all CPUs, meaning that everything would run everywhere, and only if needed could the system be partitioned into smaller pieces.
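To make that proposal concrete, here is a minimal Go sketch of the idea, with hypothetical field and type names (the actual KEP's configuration names are not quoted in the recording): the kubelet would carry two cpusets, defaulting to all CPUs, and pick a pod's pool from its QoS class.

```go
// Minimal sketch of the proposed split: two configurable cpusets, with a
// pod's pool chosen by its QoS class. Names here are illustrative only.
// By default both pools cover every CPU, so nothing changes until an
// operator narrows one of them.
package main

import "fmt"

// QoSClass mirrors the Kubernetes pod QoS classes.
type QoSClass string

const (
	Guaranteed QoSClass = "Guaranteed"
	Burstable  QoSClass = "Burstable"
	BestEffort QoSClass = "BestEffort"
)

// cpuPools holds the two cpusets the talk proposes adding to the kubelet
// configuration (field names are made up for this example).
type cpuPools struct {
	BestEffortCPUs string // cpuset for BestEffort pods, e.g. "12-15"
	OtherCPUs      string // cpuset for Guaranteed and Burstable pods, e.g. "0-11"
}

// cpusFor returns the cpuset a pod of the given QoS class would be pinned to.
func (p cpuPools) cpusFor(class QoSClass) string {
	if class == BestEffort {
		return p.BestEffortCPUs
	}
	return p.OtherCPUs
}

func main() {
	// Default: both pools are all CPUs, so every pod can run everywhere.
	pools := cpuPools{BestEffortCPUs: "0-15", OtherCPUs: "0-15"}

	// An operator mitigating AVX-512 down-clocking might narrow the pools,
	// e.g. confine BestEffort (AVX) pods to cores 12-15.
	pools = cpuPools{BestEffortCPUs: "12-15", OtherCPUs: "0-11"}

	for _, c := range []QoSClass{Guaranteed, Burstable, BestEffort} {
		fmt.Printf("%-10s -> cpuset %s\n", c, pools.cpusFor(c))
	}
}
```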
B
As for examples, the first variant is that these AVX workloads would be run as best-effort workloads. So it means that everything else would have a CPU resource request set, and so they would be running on this other pool, and only the AVX workloads would run on this best-effort cpuset. That's sort of the idea.
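As a rough illustration of that variant (pod names, images, and values are made up; the API types are the standard Kubernetes ones), leaving out resource requests puts the AVX pod in the BestEffort QoS class, while a CPU request keeps the ordinary pod out of it:

```go
// Illustrative pod specs for the variant above: an AVX-heavy pod left with
// no resource requests (classified BestEffort, so it lands on the
// best-effort cpuset), and an ordinary pod with a CPU request (Burstable,
// so it lands on the other cpuset).
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// No requests or limits anywhere: QoS class BestEffort.
	avxPod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "avx-crunch"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "worker",
				Image: "example/avx-workload:latest",
			}},
		},
	}

	// CPU request set: QoS class Burstable, so under the proposal this pod
	// would be scheduled onto the "other" cpuset.
	regularPod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "web-frontend"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "server",
				Image: "example/web:latest",
				Resources: corev1.ResourceRequirements{
					Requests: corev1.ResourceList{
						corev1.ResourceCPU: resource.MustParse("500m"),
					},
				},
			}},
		},
	}

	fmt.Println(avxPod.Name, regularPod.Name)
}
```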
B
But we say here that it can negatively affect cluster efficiency, which is sort of true, and it has some other side effects. Most likely people don't want to do it; we heard from our customers that they are not really happy about this approach as AVX-512 mitigation, so they would rather split the CPUs inside the nodes and not the nodes themselves.
B
Yeah, this is actually a really good question, because that sort of comes down to the question of how exactly these AVX workloads affect the frequency, the base frequency, for these non-AVX workloads, and that sort of depends on many things. For instance, as I understand it, it's different between different processors and so on.
B
So probably there will be some tooling, maybe it's not done by us, but we do have some work that helps the system administrators analyze what sort of hit there would be. For instance, if you think of the cleanest separation of AVX and non-AVX workloads, for example, they are running on different sockets or on whole different processors. But if they're running on the same processor, then what do we have to expect there? What kind of performance changes? That sort of depends.
A
I guess, from what you've walked through thus far, it seems pretty clear on the low-level side. It's just: how do we make the set of knobs? I guess that's what I'm looking at on my side, reviewing this.
B
Yeah, of course. It's the kind of thing where the ideal solution would be that you could actually adjust the CPU pools without having to restart the kubelet, but it's still sort of under investigation on our side what kind of restrictions exactly are there when you do this. Because, I mean, the kind of simple solution
B
is that if you want to adjust the cpuset sizes, let's say, then you just drain the node, adjust the cpusets, and then run the kubelet again. But it could be that there are some operations or some adjustments which you could always make without having to do that, which would mean that you just let the CPU manager kind of reassign the pods to the CPUs. And then this...
B
It's typically a question of mixed workloads, because the whole thing is a problem only when you have both AVX and non-AVX workloads running at the same time. Because, I mean, AVX workloads are good in the sense that they actually perform really well, and the only problem comes when you have these kinds of workloads running mixed on the same node.
B
So it depends. I mean, I don't have any kind of concrete examples of what kind of workloads could be mixed like this, but of course there would be two pods, because in order to have this, the QoS classes are pod-level things. So you'd need to have two pods, one running AVX and one running non-AVX, which would then need to be scheduled on the same node; that's what you mean, I think, yeah. But another case is just that...
B
That's right. So, for example, you might identify the AVX workloads if you actually know what they are, because this KEP kind of presumes that this is cooperative, so the AVX workloads are somehow known to be AVX workloads. So there would be this label, or only this convention, that they just have their resources set so that they fall into certain QoS classes.
B
The problem, or the real problem, isn't, I mean, sort of different types; if they're both AVX workloads, it's to some extent fine, because the base frequency is anyway brought down and the AVX workloads are just getting lower performance. We are talking sometimes about these kinds of sustained AVX workloads, when you are executing a lot of these AVX, AVX-512, instructions on the same CPU, but that's not the problem this KEP is about.
D
One of them is the SKU, so basically what type of CPU you are running on; the other thing is what kind of AVX instructions you are running; and the third thing which affects it is how many cores within the package are running AVX instructions. And one of the basic ideas of this KEP is that by being able to limit the maximum number of cores that are executing AVX instructions within a node, you are able to limit the down-clocking effect to an accepted worst case for that node.
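A rough sketch of that capping idea follows, with entirely illustrative numbers and made-up names (the KEP's actual mechanism is not quoted in the recording): if a node only ever lets a bounded number of cores run AVX-512 at once, the worst-case base frequency seen by the rest of the node is known in advance.

```go
// Sketch of the capping idea: each node advertises a maximum number of
// cores that may execute AVX-512 concurrently, and AVX pods are only
// admitted while that budget has room. The frequency table is purely
// illustrative; real values differ per SKU and per instruction mix.
package main

import "fmt"

// worstCaseBaseMHz maps "cores concurrently running AVX-512" to an assumed
// worst-case package base frequency. Entirely made-up numbers.
var worstCaseBaseMHz = map[int]int{
	0: 2600, 4: 2400, 8: 2200, 16: 1900,
}

type node struct {
	avxCoreCap  int // max cores allowed to run AVX-512 concurrently
	avxCoreUsed int // cores currently claimed by admitted AVX pods
}

// admitAVX tries to reserve `cores` for an AVX pod and reports whether the
// node-level cap still holds.
func (n *node) admitAVX(cores int) bool {
	if n.avxCoreUsed+cores > n.avxCoreCap {
		return false // would push the node past its accepted worst case
	}
	n.avxCoreUsed += cores
	return true
}

func main() {
	n := &node{avxCoreCap: 8}
	fmt.Println(n.admitAVX(4)) // true
	fmt.Println(n.admitAVX(4)) // true, cap now fully used
	fmt.Println(n.admitAVX(2)) // false, pod must go elsewhere
	fmt.Printf("worst-case base frequency stays around %d MHz\n",
		worstCaseBaseMHz[n.avxCoreCap])
}
```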
A
Like, there's this idea that the operator knows which workloads are using AVX versus not AVX, and they've steered their pods appropriately to take that into account. And then we're kind of contrasting that with this: well, I could be running in a cloud, and I'm kind of a black box and I don't know anything about my end-user workload, but I pre-configured my nodes with these separate cpuset policies. I guess I'm kind of wondering who the consumer of this feature is. I'm still hung up a little on why I would not just have dedicated node pools for one class of workload versus another, and whether the major workload requirement itself, if I'm sensitive to this issue, is that it requires two pods to be co-resident, like one that is using AVX and one that's not using any...
B
AVX, then. I mean, splitting the cluster into these AVX nodes and non-AVX nodes, that works too. So no problem about that, but it's just sort of maybe a bit of an easy solution, because you can actually use that for any other thing. So, for example, if I have extended resources for, let's say, GPUs, you can just say that these are the GPU nodes and these are the non-GPU nodes.
A
Well, having an extended resource for a GPU is useful, so you can have a count of how many GPUs are schedulable in your cluster, and the presence of a GPU extended resource on the node directs scheduling to that node. Whereas this is not quite the same; it's more like I have a very performance-sensitive workload that takes advantage of AVX, or is averse to the usage of AVX, and I want to steer that workload to a particular node pool that is not impacted by that noisy neighbor effect you're describing.
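For reference, the node-pool splitting being compared here can already be expressed with plain node labels and a nodeSelector (or taints and node affinity); the label key and value below are made up for illustration.

```go
// Sketch of the node-pool alternative: label the AVX-capable pool and steer
// pods with a nodeSelector. The label key/value are hypothetical; taints and
// tolerations or node affinity would work similarly.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "latency-sensitive-service"},
		Spec: corev1.PodSpec{
			// Keep this pod off the AVX pool so it never shares a package
			// with down-clocking AVX-512 workloads.
			NodeSelector: map[string]string{
				"example.com/avx512-pool": "false",
			},
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "example/service:latest",
			}},
		},
	}
	fmt.Println(pod.Spec.NodeSelector)
}
```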
D
You do not need to set aside, for instance, time slots during the day when you can schedule your AVX workloads, but somehow you want to prevent them from severely affecting the performance-sensitive ones. Then one option is that you limit how large a number of cores within a node can concurrently execute AVX workloads, and by that you are indirectly limiting the base frequency downscaling effect of the noisy-neighbor AVX-512.
B
...if you have spare cycles. And one other possibility is that if you have, like, a specialized cluster, for instance a single-node cluster, or a cluster that consists of special nodes which need to be performing many things because they happen to be somehow special, then you actually could run AVX workloads on those nodes too if needed, while it would not affect the other workloads there so much.
A
Partitioning within the node and the CPU manager use cases we've explored previously were pretty clear to me on who the beneficiaries were, whether it was DPDK or some other class of applications like that. That was, that is, clear, I guess. But in this particular case, the who, as in who is not able to run a workload well on Kubernetes as it exists today, and in what use case or environment, with this issue with AVX, is not as clear to me. I apologize, because like I said...
B
Yeah, but that's a fair point, and I think for many users just sort of splitting up the cluster would work just fine, so I definitely agree with that. But I maintain that there are these kinds of special cases where it actually would be beneficial to have this, yeah.
A
So that's what I'm looking for: what is the special case where the cost outweighs the benefit, I mean where the benefits outweigh the cost, I guess. Is it like an edge appliance, Kubernetes-in-a-box type case, where you have very particular, intimate knowledge? That's what I'm just hung up a little bit on, us understanding what the special cases are.
B
I can't give you any kind of concrete, like, clusters where this is now definitely somebody who will benefit from this, but we have understood that there are these people who would actually want to use this, okay.
B
I think this was sort of the main point. Then we have the implementation details, and I have a small proof of concept implementing this, but I think that is kind of secondary. I think the first question that I'm hoping to get comments on is whether this KEP is something that could actually, at some point, be accepted, and what kind of changes or other ideas might come up regarding this whole AVX-512 mitigation issue.
A
Actually setting it, whereas typically that cpuset is set based on the sum of, first of all, the guaranteed pods on that node, with their requests getting subtracted from what a best-effort pod could use. And so one of the things you were showing here was basically having a cpuset that was fixed, that would just run best-effort pods. But I guess I was wondering if, rather than fixing that cpuset, you could make that cpuset just further restricting. But I guess that's kind of what the static CPU manager already does in practice, by just moving non-guaranteed workloads to those other cores. And so, if I was to compare, like, a user who's running the static CPU manager or not...
D
So usually, again, if you look at the details, in some cases it might matter which exact cores, I guess, because it's so complex, but I don't think that would be significant. So I don't think you could come up with a workload or a configuration where it would show a significant performance difference between two setups which only differ in which particular cores within the same package you are running the AVX-512 on, such that it would end up affecting the other non-AVX workloads on the CPU significantly differently.
A
Right, or just, like, turn on the static CPU manager policy and set kube-reserved really high, and then the impact would be that your guaranteed workloads would run on the integer CPU cores that you requested, and all your best-effort workloads would run in that last bucket. Would you not have the same basic effect as pre-reserving particular cpusets for best effort?
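For context, the setup being described corresponds roughly to the existing kubelet settings sketched below. cpuManagerPolicy and kubeReserved are real KubeletConfiguration fields; the values are illustrative, and whether this reproduces the proposed split exactly should be checked against the CPU manager documentation for the Kubernetes version in use.

```go
// Emits a KubeletConfiguration fragment for the comparison being made:
// with the static CPU manager policy, Guaranteed pods with integer CPU
// requests get exclusive cores, and everything else (Burstable/BestEffort)
// shares whatever remains.
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	kubeletConfig := map[string]interface{}{
		"kind":             "KubeletConfiguration",
		"apiVersion":       "kubelet.config.k8s.io/v1beta1",
		"cpuManagerPolicy": "static",
		// Reserving CPU shrinks the pool available for exclusive allocation;
		// the amount here is only an example.
		"kubeReserved": map[string]string{
			"cpu": "2",
		},
	}

	out, err := json.MarshalIndent(kubeletConfig, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```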
B
It's true that, if I remember correctly, the static policy doesn't reserve a specific subset from the whole CPU set; it just makes sure that the reservation must be nonzero, sort of, so that these other pods can also execute there. So it actually might work; that needs checking.
A
Okay, well, I guess that's the set of feedback I would have looking at this initially, and maybe some of that feedback was good and useful and maybe other parts less so. But I want to thank you for presenting today, and I'll just note we were running late on starting the meeting, so we can bring this topic up again at a subsequent meeting, where we can hopefully get some discussion and other folks giving their feedback after watching the video. So thank you, everyone. Anyhow, that's all we're going to cover for today. Otherwise, I'll give the time back.