From YouTube: Kubernetes SIG Scheduling Meeting - 2019-03-14
A: All right, let's start the meeting. As you know, this meeting is recorded and will be uploaded to the public internet, so whatever you say is likely to remain there for a very, very long time. With that, let's start talking about some of the items that we have. Unfortunately, we don't have that many participants in today's meeting, so some of the feature owners may not be present today. We can actually go through some of the items that we have in the recording, so hopefully those people will get back to us later.
One of the items for 1.15 is the new equivalence cache and the new equivalence class. Hopefully we can get these to a level that is performant enough soon, so we can have those in 1.15. We definitely need to have a proper KEP for these in order to merge them in 1.15. Kube-batch is one of the other items, which is still in the incubator.
Jonathan actually helped us with refining the design, so that is in. I will probably start creating a few issues for building some of the smaller items — smaller extension points — in the scheduler for the framework. I mean, we already have a couple of extension points based on the older design, the previous design. Those probably need to be modified slightly to make them fit better with the newer design. So probably either myself or Jonathan —
one of us will probably make those changes, and then we will create some other issues for other folks to contribute to. Hopefully we can get a major portion of this into 1.15. We also still would like to work on implementing scheduling policies — these are the policies that specify what kind of scheduling features people can put on their pods. That's another item.
As soon as you're here: we are working on resource bin packing. I left some comments, and I saw that you answered some of those comments already, with respect to basically modifying one of the existing priority functions instead of just creating a new one for resource bin packing. So, I mean, based on your comments it looks like you're fine with that design as well. So hopefully we can get that in in 1.15 as well. Is that fine with you? Do you want to discuss it?
D: Yes.

A: Sure. Supporting non-preempting priority: I know that Valerie has been working on this. You ran into an issue, I believe, and Alex — no, not Alec; yeah, Alex — I believe he actually managed to fork that and fix the issue. There was a test failing with non-preempting priority, and he managed to fix that issue. So hopefully we can get that in 1.15 as well. We might actually need a KEP for this — I think we do — yeah, so we don't have a KEP right now, correct.
We need more scheduling metrics; Wei has volunteered to take care of that. All right, and then there is one relatively large feature that we would like to add to the scheduler, and that's being aware of physical nodes in a cluster. The problem here is that, in many cases, people run Kubernetes on-prem, basically on physical hosts, and when they do so, they usually run Kubernetes on top of a virtualization layer — for example, they run vSphere, or they run KVM or something like that — and then create virtual machines.
These virtual machines become the nodes of the cluster. So now imagine, you know, you have, let's say, three physical hosts in your cluster, and each runs 10 virtual machines, so 10 nodes of the cluster land on a single physical machine. Kubernetes, in its way of spreading pods, spreads pods among the nodes that it sees. The problem is that some of these nodes — or, in some situations, many of these nodes — land on a single physical host. Now, let's say you run a web server which has ten replicas.
It spreads these ten replicas among ten nodes, and those ten nodes happen to be on the same physical host. So when that physical host goes down, all of those ten replicas go down, and your web server — your web service — will see an outage. This is not ideal, and a lot of folks are running into this issue. It's not even limited to physical hosts or on-prem clusters.
It kind of applies, to an extent, even to the cloud providers too, because they have a very similar setup. I mean, your Kubernetes cluster would be even more reliable if you could do the same thing for cloud platforms. So we are planning to build something that allows Kubernetes to be aware of physical hosts, and then labels the nodes of the cluster with their physical hosts as well. Those physical hosts essentially become another failure-domain label in Kubernetes, and Kubernetes then starts spreading pods among those different failure domains — the different physical hosts. We currently, as I said, only have a way to spread pods among the nodes of the cluster. With this, we would change our priority function to spread among physical hosts. If that is not possible — for example, in the case where the nodes are not labeled with a physical host —
then we don't have any other option than spreading among other nodes in the cluster, some of which could happen to be on the same physical host. So I have filed an issue for this — it's in our spreadsheet — and we would like to work on it. I believe Alex has already volunteered for doing this, but this, I believe, is a more-than-one-person project. We need more volunteers. So basically, there are multiple aspects to this.
One is to define, you know, the label for the failure domain, and add functionality to the scheduler to spread pods among physical hosts. That's one part of the project, which is probably a one-person project. But there is a much larger part of it, which is essentially building admission webhooks. That's one option — you know, you may come up with other options as well, and those might be even better — but the one that I thought about is just building an admission webhook for the various virtualization layers.
For example, we could build one for vSphere, we could build one for AWS, we could build one for GCP, and so on. What this does is communicate with some of these APIs — for example, with the vSphere APIs — read the physical-host information from vSphere, and then, at the time of node creation, add the label to the node.
For example, let's say we call it failure-domain — I don't know, k8s.io or whatever — slash physical-host, or something like that, and then that admission webhook is responsible for adding these labels. We probably need to build a few of these admission webhooks; we can start with vSphere for now.
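As a rough sketch of what such a mutating webhook could compute — the label key, the AdmissionReview shapes, and the `mutate_node` helper are all illustrative (the meeting had not settled on a name), and a real webhook would serve this over HTTPS and read the host from the virtualization layer's API:

```python
import json

# Hypothetical label key -- the actual name was undecided in the meeting.
PHYSICAL_HOST_LABEL = "failure-domain.example.io/physical-host"


def mutate_node(admission_review: dict, physical_host: str) -> dict:
    """Build an AdmissionReview response that labels a Node with its
    physical host. `physical_host` would come from the virtualization
    layer's API (e.g. vSphere) in a real webhook."""
    uid = admission_review["request"]["uid"]
    node = admission_review["request"]["object"]
    labels = node["metadata"].get("labels")
    if labels is None:
        # No labels yet: add the whole label map in one patch op.
        patch = [{"op": "add", "path": "/metadata/labels",
                  "value": {PHYSICAL_HOST_LABEL: physical_host}}]
    else:
        # JSONPatch (RFC 6901) escapes '~' as '~0' and '/' as '~1' in keys.
        key = PHYSICAL_HOST_LABEL.replace("~", "~0").replace("/", "~1")
        patch = [{"op": "add", "path": f"/metadata/labels/{key}",
                  "value": physical_host}]
    return {
        "apiVersion": "admission.k8s.io/v1beta1",
        "kind": "AdmissionReview",
        "response": {
            "uid": uid,
            "allowed": True,
            "patchType": "JSONPatch",
            # A real response carries this base64-encoded over HTTP.
            "patch": json.dumps(patch),
        },
    }
```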
In the same issue, I commented that there is already a vSphere plugin that has this functionality. We can actually borrow ideas from there, or at least look at it for reference as to which vSphere API we should call to find the physical host, and then build that admission webhook. So if any of you folks — or other folks who are not here and listen to our recording in the future — are willing to participate, please go ahead and mention that in the issue. We can probably divide the work among multiple folks to take care of different parts of it.
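As a sketch of how such a label would be consumed once a webhook sets it — the key `failure-domain.example.io/physical-host` is hypothetical (no name had been chosen) — today's soft pod anti-affinity could already prefer spreading the ten web-server replicas across physical hosts:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 10
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: web
              # Hypothetical label key added by the admission webhook.
              topologyKey: failure-domain.example.io/physical-host
```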
Basically, you can specify a set of operators, and you can say that you want the labels of the pod to be in this set, or not to be in this set. But we also want to support less-than and greater-than operators, so that people can specify a range of values and say that if the pod, for example, is in this range, then we have affinity to it, or we have anti-affinity to it. So Leon is working on that; the KEP is out, and we need to implement this.
One of the main concerns here is that pod anti-affinity is already slow. We have made it a lot faster than before — in the past it was super slow, like a thousand times slower than the other predicates. It's now bearably slow, like 10 times slower, but still significantly slower than the other predicates, and by introducing these operators we might make it even slower. So we definitely need to make sure that this is not going to impact performance as much.
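For comparison — node affinity (unlike pod affinity) already has `Gt` and `Lt` in its operator set, with values compared as integers; the proposal discussed here would add analogous range operators on the pod-affinity side. The existing node-affinity form, with an illustrative label key:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: example.io/cpu-generation  # illustrative label key
          operator: Gt                    # existing NodeSelector operator
          values: ["5"]                   # parsed as an integer
```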
C: It has this issue because it would make the design confusing and awkward in providing the even-pod-distribution function. So basically, I think, after the offline talk with Bobby, plan B — which is to make the even pod distribution a standalone predicate and priority — makes more sense, because it may be more well-defined and maybe less error-prone, and, you know, it's a standalone thing. Basically, my idea is that we can extract some existing fields in affinity into the even-spread expression.
So that means we have to implement similar things to affinity, to express which pods should be grouped together — so that they are more attracted to each other, to be placed together, right? And based on that, we should have a top-level maxSkew setting to control the degree of imbalance — whether they have to be perfectly even, or they can tolerate some degree of skew, sort of like that. And we can also — we should — have a top-level...
The definitions are exactly the same — they are kind of identical — but we are not forcing people to have the same definition in both the spreading and the pod affinity here, right? So it's good to have them both there if they want, all right, so we don't import them. The only issue is pod anti-affinity, because these work independently, so we are not forced to change the semantics of the current pod anti-affinity, which is to run only one pod exclusively in a topology.

A: Yes, yes.
So, thank you — thank you for sharing this. I agree with many things that you said. By the way, first of all, I apologize that I didn't mention this in, you know, our plan for 1.15: it is actually one of the important projects that we would like to implement in 1.15. The idea here is that we are going to add a new feature to spread pods evenly among different failure domains. Basically, it's very similar to anti-affinity, but with anti-affinity — like hard anti-affinity —
A
We
were
letting
only
one
part
to
exist
in
a
particular
failure
domain.
For
example,
if
the
failure
domain
was
a
node,
we
were
spreading
parts
among
the
nodes
and,
if
more
than
one
part
had
to
land
on
the
same
node,
and
there
was
no
other
node
in
the
cluster,
the
end
up,
I
wouldn't
get
scheduled
right
now.
Let's
say
that
you
have,
for
example,
10
replicas.
You
have
5
nodes
with
this
new
feature.
You
evenly
distribute
these
pods
among
these
five
nodes,
so
each
part
gets
each.
Each
node
gets
two
parts.
This is the idea behind evenly distributing pods among failure domains. Regarding the topology key for distribution here: I honestly prefer to go with the same topology key, because it's essentially the same thing — it's the same concept as we have in affinity — and I feel it makes more sense for our API to be consistent. But once this feature is built, I think we need to go back and change certain things in anti-affinity; that's what we had from the very beginning.
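For reference, this even-spreading design later landed in Kubernetes as pod topology spread constraints; the ten-replicas-across-five-nodes example above would be expressed roughly like this (field names shown as they eventually shipped — they were still under discussion at the time of this meeting):

```yaml
spec:
  topologySpreadConstraints:
  - maxSkew: 1                        # tolerated degree of imbalance
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule  # hard constraint; ScheduleAnyway = soft
    labelSelector:
      matchLabels:
        app: web
```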
In a cluster that has anti-affinity, that anti-affinity could get violated by the pods being scheduled, so anti-affinity is a little hard to check and takes a lot of cycles. If we make it limited to a node only, then it becomes a much, much simpler check: when we run predicates, we can just easily check whether there is any anti-affinity in the other pods on a particular node. So, yeah.
B: I had one comment, which is that, on the pod spreading, I definitely see — like, most of the users that wander into our Slack asking questions about the scheduler are basically asking for this feature, so I think it's going to be really valuable when we deliver it. I also think it supersedes maybe some of the other affinity changes that are in the pipeline, and I'm wondering if there's some —

A: Yes.
So particularly, this one addresses one other feature that we were targeting in the past, which was supporting, like, max pods or something like that for anti-affinity. Basically, the idea there was that, instead of having anti-affinity to only one pod — which is the current state — we could have another parameter in our anti-affinity API to say what the maximum acceptable number of pods that we have anti-affinity to is. So, instead of one, we may say: okay, we can have, like, three pods in this failure domain, and after the three we no longer want to have any more pods in the failure domain. With this option, that particular feature is no longer needed.