Kubernetes SIG Scheduling, 21 Feb 2019

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Kubernetes SIG Scheduling Meeting - 2019-02-21

Description

No description was provided for this meeting.
If this is YOUR meeting, an easy way to fix this is to add a description to your video, wherever mtngs.io found it (probably YouTube).

A

Yeah we just started recording guys, you may know this meeting is recorded and will be uploaded to public Internet. So one please go ahead and ask your question. You or you wanted to ask some questions about equivalence class.

B

Because I have done most of the work but I find that because I will operator, Inc equivalence class in the Quinn house and is a scheduler. But it will have the load performance because of the block. I see.

A

So, are you talking about that first version of equivalence class, or we were talking about the more recent one that you've done mostly yeah? Okay, so you have implemented that and you've noticed that it actually causes slowed down because of locking.

A

Is it as the is this the new implementation that I suggested with that one map that has basically information, or maybe our keys for on schedule of all parts? Yeah, yes, I, see I, see yeah. What one of the issues that we've seen in the past has always been the case that you know adding more shared data structures such as a map could potentially cause a slowdown.

A

It's a little bit unfortunate that this one also causes to slow down. So let me think so. Okay for for the information of others, what we are doing here is that once the scheduler finds out that a part is unschedulable, it looks at a part and if it has a tenant reference, which essentially means that a part belongs to another collection like, for example, a replica set and so on.

A

It will add that pan reference as the key to a set, what it means is that- and this set basically is a set of parts that are potentially on your schedule potentially or unschedulable. So it means that all other parts which had which belonged to the same collection will also be on a schedule above because they they share the exact same specifications.

A

So all those parts will be on a scheduler ball unless something changes in the cost. So what we did for this particular design is that we add this reference to the map, to cache it basically or to a map or assets in Co implementation. We don't have a set, but essentially this is a set.

A

We don't store any particular value for it, but anyway, once this is this key is in the set. It means that all other parts which are from the same collection will also be considered on a schedule abode. This is a performance improvement, because the scheduler does not need to check all the other cards in the collection. You can just look up the map and figure out that okay, this other part, is also under schedule because it shares the same exact parent.

A

One of the issues that we are seeing now is that checking this map at every scheduled cycle causes a slowdown, because we need to acquire a lot to access the map.

A

So a couple of suggestions for this one potential suggestion you may want to consider using a cinch map. Sync map is one of these maps and go implementation, which is apparently more efficient compared to acquiring a lock over a regular map. I, don't know if you have tried that long, but that might help a little bit. If it doesn't, we need to revisit this I.

A

Don't have any great suggestion right now on top of my head, but I can take a look. Another thing that I was conceived, I was thinking is that do we actually need to acquire a log to access this map? The reason that I am saying this is that the scheduler accesses this map in its single-threaded a main loop right, and it only adds to the map in its own single threaded main loop as well. So it may not require any locks.

B

That's who, when I move the post to active cone I, will later the map yeah.

A

Yeah you're right: okay, yeah, the deletion is a problem, you're right, so, okay, so again for the information of others. When another event happens in the cluster that could potentially make parts of schedule of all. We need to remove all these entries from this set so that the scheduler retries to scheduling some of these parts once the schedule determines that these are still on on the schedule. Body we'll add them back to the set. But we need to remove these entries at some point.

A

And that point is you know the fighter that the event happens in the cluster that makes potentially makes parts of schedule. So an example of such events is deletion of a part from the cluster or addition of another node to the cluster and things of that sort which happen in parallel to the main scheduling loop. That requires locking you're right, you're, absolutely right long I can I, can take another look and see if we can find any better solution for this. But thanks for working on it, and thanks for sharing the problem with us.

B

Sure I, maybe I, can try the sick map first and the two-seed or redoubt yeah.

A

Yeah, that would be probably something relatively easy to do, because you just need to replace the existing map with a sync map and remove the locking itself, maybe that'll help yeah yeah sure. Thank you for the update.

A

All right, I also have a good news. Today. Our scalability results are in and we hit a pretty aggressive goal that we had to schedule a hundred parts per second in a 5,000, not cluster, so that this is actually great. This was the result of multiple algorithmic optimizations that we made to do scheduler in the past six months or so, and the last one of them was one PR that got merged recently, basically just yesterday, and today's results will pretty promising.

A

So we actually see that the average in our scaleable results hits 100 parts per second and that's our rate limitation. Essentially, the scheduler money potentially even exceed 100 parts per second, but because of this rate limitation that we have in our scalability test, it cannot really go beyond 100, but if that rental rate limitation is lifted, it could potentially even go beyond 100 parts per. Second I am trying to work with the scalability team to maybe raise that rate limitation because going forward.

A

If we make any further optimization to do scheduler, we will not see the results we need to have this raised limitation completely lifted or at least raised to a larger number so that we can keep observing our improvements. Also, if there is a change that reduces performance, but if it reduces performance marginally, for example, since we don't know whether we achieve 110 pots per second or not, if there is a performance in the gradation that reduces us from honeypot per second hundred ten pots per second 200 pots per signal will stop.

A

We will still not see it so for all of these reasons, it's necessary for having that rate limitation lifted. We will see where we can go with that, and hopefully the scalability we'll be able to help us with changing that rate temptation.

A

There was another.

C

A

All right sorry way go ahead. Yeah, that's really a milestone, yeah, alright, so there is another PR that Wang had sent out. Thank you very much one for that PR. That one is to basically not update the API server. Every time. Did the scheduler try, scheduling the part and determines that the part is on a schedule recently. We actually made this change, but every time that the scheduler finds out that the part is unschedulable, it adds the timestamp to the part.

A

This timestamp is then used in our scheduling to you to sort pods which have the same priority. Basically parts with your more recently retry go to the back of the queue. This is like a fairness, improvement mechanism in the scheduler, but updating the API server at every scheduling cycle is not necessarily efficient, especially in larger clusters, with many unschedulable pods. This eats up a lot of our bandwidth to this to the API server. I was just telling you about the at the rate limitation that exists.

A

That rate limitation applies to any request that we send to the API server, including request to update party status. So it's important to save that bandwidth as much as possible and this PR that one has sent and is almost ready. I had just a couple or like several minor comments in it, so it's gonna be merged. Soon is gonna. Remove that recent change that updates the timestamp of a part after every scheduling attempt. Instead it keeps that timestamp in the scheduler internal state.

A

There is really no reason to update the API server about this time stylist times. This timestamp is something that is only valuable to the scheduler, so hope fully. This will emerge soon and will improve for further the efficiency of schedule teaser, because some of the updates today they had I believe Klaus.

A

Did you add another item to the agenda? I believe you have something there that you want to talk about right. Oh yes,.

D

I will we try a mystery? We try to reuse some part party in the defaults teller. Now when we try to use the image relocate a party we fund light it's a depending on the meter Missa data and we only depend on the total total number in this prior tax. But when we build metadata from the factory, we have to pass silver Lister. Citrus salad is their service. It is service, lynnster and maybe some other Lister, so yeah so I think.

D

Maybe we can enhance such candle really has this factory, maybe I, don't know how I any solutions right now but I'd like to rest here is that maybe we can simplify the interface so for them whole April. Some user just want to one priority tiles, and this for this party only depend on maybe one or two misses data in the you know in the meter. So we can. We can have some way to build that easily.

A

Yeah yeah does this require changing the interface for all of our priority functions. I.

D

Don't think so I 212 idea in my mind is that so analyse the weekends to some skip for them. How we can skip to initialize the metadata in the in the minute data if we don't have for long way, if we don't have Lister.

A

If we don't the Lister, we cannot have any metadata. Oh I, see because yeah, because we.

D

Only know for you major locality: we only need a no number total, no number right yeah for the other metadata. We don't need that. So if we don't kill the early enough information, we the metadata felt really good enough to do date. You.

A

Know every time that something like this comes up, I I, think why we don't have the scheduling framework yet yeah. A lot of a lot of those issues would have gone away by a sort of automatically by the framework, but angle, right, I, see your point. I don't have any great solution at the moment. On top of my head, I'd-- terrifies, this is worth exploring. We should take a look and see how we can improve the situation. Yes, all right, so it looks like they also had an update for us. Thank you very much way.

A

You Ray has added a new cap for evenly distributing pods in a in an arbitrary topology. So this is actually a an effort that we recently been working on. The idea here is that we do. Basically, the idea here is that we want to distribute a number of parts in a particular topology domain. A topology domain, for example, can be a zone or multiple zones, or can be even a node, can be a region or whatever, depending on what labels you put on your notes. So today we have entire affinity.

A

The problem with anti affinity is that, once you put anti affinity- and let's say your parts, then you basically tell the scheduler to not put more than one of such parts in that particular topology that you, you have defined in the anti affinity rule. For example, the topology can be note by saying that you haven't ever you tell the scheduler, don't put more than one pods of this type on a node.

A

Now, let's say that you have five nodes in a cluster. You have tens of such parts, so putting on a a fan. Idiom on pods causes five of them to not get scheduled at all oftentimes. Actually, users want to achieve distribution instead of just having an affinity really so in the case that they have like ten parts and five nodes in the cluster. The oftentimes want to have like two parts on each node, not necessarily one part and five of them getting on the scale.

A

So we are we're trying to address this problem by adding a new feature to kubernetes, where users can specify a topology very similar to enter a finale. But this is basically the difference between this. An anti affinity is that this one tells the schedule to distribute pod evenly in a participant apology, so they has added an egg. Every singly I am trying. Okay, let me copy the link or meeting notes.

A

So after this is merged, you know implemented and merged. We may revisit the anti affinity today and say: affinity is one source of scalability issues in this scheduler. As many of you know, we've been trying to address this and we have achieved quite so much. As you may be aware, we have achieved over a hundred or close to 120 X performance improvement for that feature, but the feature was a thousand times slower than many of other scheduling features. So even after 120 X performance improvement, we are still like a 210 times slower than other features.

A

Eight to ten times is still a pretty large number. Most of the reason for this is slower. Performance of the feature is because it's very very flexible. We haven't a if any. Basically, once we have an anti affinity paradigm for infinity faster than every other part that is being scheduled, we need to check it and make sure that the entire fancy rules of the running part are honored, so you're changing this. Well, we're not changing it. Yet we are thinking about changing it so that we only provide anti affinity.

A

On the same note, not necessarily in any arbitrary topology, and then we add this new feature to evenly distribute in arbitrary topology the reason how basically are reasons for making these changes that most people really want to use anti affinity in arbitrary topology domain to achieve evenly into an even distribution of distribution of their parts. So once we have another feature which provides even distributions, we probably won't need to have an affinity in arbitrary topology domains and at an eternity on the node with most probably be enough for pretty much all use cases.

A

So these are some of our reasonings for revisiting yantai infinity.

C

Due to this I I also emphasize in my character right now, we just unpublished limit topology to some, for example, knows for.

A

C

Affinity in the car and also for part affinity, we I also described the semantics of even prosperous reading in part affinity. Also, I suggest to inform to initial implementation, also applied to to note only, and also for no definitive I, also.

A

So the initial implementation of evenly distribution or initially implementation of affinity on indefinitely so.

C

Basically, I I try to apply this under three affinity directory. We have no definitely part affinity and entire finis, but for the latter tools, two words affinity and anti affinity, I suggest to impose only on note I said pick up: yeah, because yeah because of the performance concern all.

A

Right make sense all right. Thank you very much. Yeah I will definitely take a look at your cab, other other people who were interested. Please go ahead and take a look at that. The link is in our meeting notes, all right. We are 24 24 minutes and our meeting I don't have any further updates for you. I do only think that I want to mention is that this week is a little bit tough for me we're at Google.

A

We are in this like performance review cycle, so most of us are busy with other stuff, so I apologize for not being very responsive and reviewing all your peers in a timely fashion, but I will get to those I know way. I owe you one review for one of your peers and also I. Don't review your captivation yeah, that's totally fine!

A

All right is there any.

C

Other comments, yeah go ahead: yeah I just want to yeah emphasize that an idea in mind cab that so originally, when we think about this pinpoint from a customer, we I proposed a original proposal named max past apology right, yes that, but we discussed, we exchanged a lot of five years and the following thing: that ideas you have to define a specific number. This is pretty challenging for the user price, but they are the to tune or something cry.

C

A

Work really I mean, if you think about real use cases providing a max is not gonna really be useful, and that Magnus was essentially our reaction to the fact that users wanted to to achieve even distribution, and so.

C

Yeah, so for improvement, so introduce that kind of value called max Q max skill described imbed the degree of imbalance of the past spreading in the cluster. For example, the default value will pull up a table. One. That means four for ten past apply on the five nailed Custer the department grow out could be one on each note and the sixth one, for example, is deploying on the first note and a single same one can only be deployed on the rest for most Destin. So.

A

Is this only required? Why don't we just add one I mean why don't we remove this whole, like idea of having a particular number and just because we.

C

I want, maybe it still gave the flexibility, for example, the distribution can be, for example, 2/0 for some instances if they want o example. If right now, the distribution is 2/1 right. If the skill value we said is 1, then it cannot be 3/1. It can only be true such to write. The max is between just described the imbalance, so we will see, of course, if we hard call it tomorrow and only provide a boolean value to use it's it's more easy.

A

This is I actually like I. Actually, like your idea, I initially I thought that you are its I didn't actually get it. That you are doing is Q value. I thought that you are adding a number that specifies how many more.

C

No, it does you that it doesn't describe. The abstract number of totally is justified, described at the difference. The max difference right.

A

Right that makes sense. Okay. Thank you very much. That's a good idea! Actually, all right, we have two more minutes left. Is there any other questions or comments.

A

Going once points alright, thank you very much for attending I will see some of you pops next week, because.

C

A