From YouTube: Kubernetes SIG Scheduling Meeting - 2018-08-30
A: All right, let's start with some of the stuff that we've been working on. Let me give you a couple of updates about some of the things. First of all, sorry that I was away for a couple of days, or more than a couple of days. I was at an internal event somewhere else, not here, and I was in all-day meetings, so I'm a little bit behind on code reviews, but I will get to them.
A: Hopefully today, and if there is anything left for 1.12, please let me know. I will be happy to take a look and work with you to get them in for 1.12. We only have several more days; code freeze is on September 4th, so there are only several days left and we need to work on any PR that needs review or needs more work to prepare it for 1.12.
A: One of the things that we've been working on is the affinity and anti-affinity predicate. Overall, we are seeing more than a 100x performance improvement given the two optimizations that we've done for it, which brings us to about 10x slower compared to other predicates. As you know, some of our other predicates are really, really fast, and being 10x slower for such a complex predicate is not bad at all; I'm very happy to see that this has happened.
A: You may remember that the affinity and anti-affinity performance was so bad that we had to go and update our public Kubernetes documents and basically ask people not to use these predicates if they have a cluster with more than a couple hundred nodes. Now, with these improvements, we are hoping that we can remove such constraints and make these predicates usable by everybody in much larger clusters. So that's one of the updates.
A: Another update is about the descheduler. It's unlikely that we can actually make it a standard component of Kubernetes in 1.12, but there was some discussion about how to actually ship the descheduler in the future: simply making it a part of the scheduler, maybe running it as another thread of the scheduler, versus running it as a separate component. My own vote was to basically have a separate component, and I had several reasons for it, mostly around reliability. Basically, I wanted to make sure that the descheduler is not going to interfere with the scheduler.
A: For example, if there is a bug in the descheduler causing a crash or something, the scheduler is not going to get affected. When you are in the same process, managing it is going to become a little harder. It's also more flexible if it's a separate component: it allows us to run it somewhere else, and it allows us to not run it at all. Maybe some of the users don't want to run it at all, so those users have that option.
A: Of course it could be configurable, so even if it were one single component it could be configurable and users could disable it. But configuring some of the components in some of the bigger deployments of Kubernetes, like for example GKE and some of the other, similar hosted Kubernetes offerings, is not possible for users. So, as a result, users are stuck with whatever their provider or service provider has chosen for them.
A: If it's a separate component, it leaves it up to the user whether to run it or not, and making it more portable through the interfaces and whatnot makes it a little bit easier to manage. It's also more scalable, because we don't need to run the scheduler and the descheduler on the same node anymore; they can be separated and run on multiple nodes. One of them may run on the master node and the other anywhere else in the cluster.
B: As I've mentioned, I think even in the email I mentioned that I believe it has to be a separate component, because you could use it from the autoscaler or some other component. So the autoscaler could call the descheduler so that a rebalancing of nodes happens if a new node gets added to the cluster. So there are advantages to running it outside of the scheduler. The other thing is that the implementation also becomes a bit difficult.
B: There is nothing like dynamically loading code in Go; they have something similar to that, from Go 1.7, called plugins. I was doing some research about it, but dynamically loading, or the equivalent of dynamically loading, this binary might be difficult implementation-wise. And then the descheduler could be used for other aspects as well; it could add to the core scheduler some of the functionalities that we have in mind for the roadmap.
A: Yeah, thanks. A quick thing about those plugins, those Go plugins, for the scheduling framework: I was also thinking about maybe using some of those Go plugins, but Brian Grant told me that there are multiple issues with those Go plugins and they are not recommended for implementation. There are some details about it, but I don't want to take up the team's time.
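For context, here is a minimal sketch of what dynamic loading with Go's plugin package looks like, the mechanism discussed above. The descheduler.so path and the exported Run symbol are hypothetical; the snippet only illustrates the mechanics, and in particular why this route is awkward: plugins must be built with an identical toolchain and dependency versions as the host process and cannot be unloaded.

```go
// Hypothetical sketch: loading a descheduler built as a Go plugin.
// The file name and the exported "Run" symbol are illustrative only.
package main

import (
	"log"
	"plugin"
)

func main() {
	// plugin.Open only works on binaries built with `go build -buildmode=plugin`,
	// using the exact same Go version and dependency versions as this process.
	p, err := plugin.Open("descheduler.so")
	if err != nil {
		log.Fatalf("loading plugin: %v", err)
	}

	// Look up an exported symbol and assert its type.
	sym, err := p.Lookup("Run")
	if err != nil {
		log.Fatalf("looking up Run: %v", err)
	}
	run, ok := sym.(func() error)
	if !ok {
		log.Fatal("unexpected type for Run symbol")
	}

	if err := run(); err != nil {
		log.Fatalf("descheduler run failed: %v", err)
	}
}
```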
C: Is that going to affect us in the future? For example, with the scheduler framework we're talking about breaking out components into plugins and saying that these plugins are configurable: which ones you run, what their weights are, and so on and so forth. So it seems to me that if we say that people can't control which scheduler they use, that also means that they can't use the scheduler framework, or at least can't change its settings, and therefore maybe can't even add plugins like we are saying that they can.
A: That's a valid point. Actually, you know, even without the scheduling framework, with our current scheduler we have a configuration with which you can actually disable some of the predicates and priority functions, very similar to the scheduling framework where you can add or remove some of the plugins. But in GKE, users don't have those kinds of controls. There are ways; maybe GKE, for example, could implement some UI that allows users to change those configurations, but those UIs are not in place.
A: Those configurations are not in place, so currently users cannot really change anything. It's not impossible to think that we could build some sort of mechanism, for example, I don't know, in the Google console, to allow users to change some of these weights or add or remove some of these plugins, but there is also the supportability aspect of the problem. Because of that, there are a lot of objections to allowing users to change their scheduler configuration.
A: Some people think that users may not know exactly what they are doing and they may shoot themselves in the foot. So there are some of those issues, and SREs may also not be aware of these; at first glance they may not realize that the scheduler is different from the default scheduler and that that's why their cluster is behaving differently, so providing support could become more difficult. So there are some of these concerns that should be figured out, but yeah, anyway.
A: This is slightly different than just changing the behavior of the scheduler itself; this is more than just the scheduler itself. The descheduler, for example, can help you balance your cluster better and things of that sort. So it's slightly different, and because of that some users may actually want to run the descheduler separately, yeah. All right, are there any other comments or questions?
A: Okay, one more update that I have for you: we probably won't be able to move the equivalence cache to beta. We're seeing some issues and we are not 100% sure about the current implementation, so we're still exploring some of the options here, and we are very close to the code freeze. So the equivalence cache, which is a mechanism for us to improve the scheduler's performance further, is probably not going to make it to 1.12.
A: It's not going to make it as a beta. Something else that is being worked on is pod scheduling policies. Yassine has updated the document and I think we are almost there. The final comment, from two of our senior contributors, Brian Grant and Tim Hockin, was that we should not build this as, I don't know what to call it basically, but anyhow, we shouldn't make it like a v1 API.
A: We should go with a CRD in the beginning, and then at the next level or next stage, after we get some feedback from users, maybe we promote it to a standard API of Kubernetes. That's actually a valid point and a good comment, and we are probably going to do that. We are going to start working on the implementation, hopefully for 1.13, and we will hopefully see at least the alpha version of scheduling policies in 1.13.
A: I'm not talking about those scheduler configurations; this is pod scheduling policies. The scheduling policies are more than that. These are similar to, like, security policies or, if you like, an RBAC kind of thing. It basically tells users what kind of scheduling requirements they can put on their pods.
A: Today we don't have any such policies. For example, imagine a multi-tenant cluster: I can go and put some anti-affinity rules on my pod, and if I get lucky and my pod gets scheduled, I can have anti-affinity to everything else in the zone. In that case, once my pod is scheduled, nobody else can schedule any pod in that zone.
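To make that scenario concrete, here is roughly what such a constraint looks like when written with the client-go API types. The "app" label selector is a made-up example, and the comment reflects the symmetric, zone-wide behavior described above; this is an illustration, not code from the meeting.

```go
// Illustrative only: a required pod anti-affinity term scoped to a zone.
// The "app" selector is a made-up example.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	affinity := &v1.Affinity{
		PodAntiAffinity: &v1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []v1.PodAffinityTerm{
				{
					// Matches essentially every other tenant's pods.
					LabelSelector: &metav1.LabelSelector{
						MatchExpressions: []metav1.LabelSelectorRequirement{
							{Key: "app", Operator: metav1.LabelSelectorOpExists},
						},
					},
					// Zone-wide scope (2018-era zone label): as described above,
					// once this pod lands, no matching pod can land in the same zone.
					TopologyKey: "failure-domain.beta.kubernetes.io/zone",
				},
			},
		},
	}
	fmt.Printf("%+v\n", affinity)
}
```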
A: So admins want to prevent some scenarios like this. Another example is that some customers go and taint their more expensive nodes, for example nodes with GPUs, with a special taint, so that pods that don't need those kinds of expensive resources never land on those nodes. But nothing is there to prevent users from putting the toleration for those taints on their pods.
A: So, as a user in that multi-tenant cluster, I may as well go add a toleration for those taints and put my pods that don't need those expensive resources on the nodes that have those expensive resources. So there are some of these concerns, and we're trying to address them with pod scheduling policies, yeah.
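As a concrete sketch of the taint and toleration example above, using the client-go types: the admin taints GPU nodes, and today any tenant can add the matching toleration to a pod that does not need a GPU, which is the kind of thing a pod scheduling policy would restrict. The taint key here is a made-up example.

```go
// Illustrative only: an admin-applied GPU taint and the toleration any
// user can currently add to their own pod spec. The key is an example.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// Taint the admin puts on expensive GPU nodes so ordinary pods stay off them.
	taint := v1.Taint{
		Key:    "example.com/gpu-node",
		Value:  "true",
		Effect: v1.TaintEffectNoSchedule,
	}

	// Toleration a tenant can freely add today, defeating the taint even if
	// the pod requests no GPUs; pod scheduling policies would govern this.
	toleration := v1.Toleration{
		Key:      "example.com/gpu-node",
		Operator: v1.TolerationOpEqual,
		Value:    "true",
		Effect:   v1.TaintEffectNoSchedule,
	}

	fmt.Printf("taint: %+v\ntoleration: %+v\n", taint, toleration)
}
```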
A: This is slightly different than just putting those on the pod. In the pod spec you put the scheduling requirements; basically you say, for example, I want to have anti-affinity, or I want to have a toleration, things like that. But the scheduling policies actually specify who can put what kind of affinity, or who can put what kind of toleration.
A: All right, so I have one more update. Well, actually, I don't know if Harry is here; I wanted to get an update from Harry about the image locality priority. That priority function is already there, and I guess we have been able to move most of it, but we haven't been able to get the tests and e2e tests working, because we had some issues with uploading the images to GCR.
D: There are two major issues. One was the update of the flag in the feature gates, and that caused some unit tests and e2e tests to fail. So yeah, thanks to Harry for pushing one of the PRs to get those resolved. I think the changes are fine; we need approval, because the PRs need sign-off from reviewers we haven't been able to get hold of.
D: The PRs have been approved, and have been for several rounds. Yeah, remember last week I mentioned another one raised by me; that PR can be closed because the changes have been covered by another change. This one is 66526, so for this one, once that failure is resolved, we should be good for this feature. Okay.
E: Yeah, so I wanted to give an update on the scheduling SIG e2e tests that we ran. There are forty-one tests in total; out of that, thirty passed and eleven failed. Two of them were related to the environment, you know, the NVIDIA GPU resource and that kind of thing, and even the default scheduler was failing for those as well, so we didn't really bother looking into that. And four of them are related to events.
E: Essentially, if the scheduler cannot ever schedule the pod because of a label, you know, the pod has a label requirement which doesn't exist on any node, so it's never going to get scheduled, it sends out an event, it seems. So the test checks for that, whether the event has been sent and all that.
E: So yeah, in the test cases there are essentially four failures which are related to events; essentially they are checking for events. There were three of these kinds of negative tests, and because we don't have the event handling in place, they fail because we're not sending out events. And then for all the successfully scheduled pods it also sends out one event as well, it seems.
A: I see. Basically, if the pod is unschedulable, we update the status of the pod to reflect the reason that it is unschedulable, and then when we update the status of a pod, we get events for the fact that the pod is unschedulable. That's probably why you are not seeing them: basically, maybe you are not updating the status of the pod.
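A rough sketch of the two pieces being described, recording a FailedScheduling event and setting the PodScheduled condition on the pod status, written against the 2018-era client-go API. The clientset and event recorder are assumed to be constructed elsewhere; this is an outline, not the actual default-scheduler or Poseidon code.

```go
// Sketch: what the default scheduler roughly does for an unschedulable pod:
// record a FailedScheduling event and set the PodScheduled=False condition.
// 2018-era client-go signatures; newer versions take a context and options.
package sketch

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/record"
)

func markUnschedulable(client kubernetes.Interface, recorder record.EventRecorder, pod *v1.Pod, reason string) error {
	// The event that e2e tests typically look for.
	recorder.Event(pod, v1.EventTypeWarning, "FailedScheduling", reason)

	// Reflect the failure in the pod's status so it shows up in `kubectl describe`.
	// A real implementation would update an existing condition instead of appending.
	pod.Status.Conditions = append(pod.Status.Conditions, v1.PodCondition{
		Type:               v1.PodScheduled,
		Status:             v1.ConditionFalse,
		Reason:             "Unschedulable",
		Message:            reason,
		LastTransitionTime: metav1.Now(),
	})
	_, err := client.CoreV1().Pods(pod.Namespace).UpdateStatus(pod)
	return err
}
```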
A: About events: we have two completely different categories that are both called events. One is informer events. Informer events are the events that you get for updates of any object in Kubernetes; the API server sends out those events. I'm not so sure if you're talking about informer events, maybe.
E: On the scheduler-related issues, one more is ephemeral storage, and we don't check for that right now. For ephemeral storage, we only check CPU and memory; we don't check whether there's enough ephemeral storage, so that's one thing we're going to fix. And then the last one is the max-pods thing. Currently, the max-pods logic which we have for a node only considers the pods scheduled by Firmament, so we are going to add informers for all other pods.
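For the ephemeral-storage gap mentioned above, a minimal sketch of the missing check: sum the pods' ephemeral-storage requests and compare against the node's allocatable amount. The helper names are hypothetical, init containers and overhead are ignored, and this is not the actual Poseidon/Firmament code.

```go
// Sketch: the kind of ephemeral-storage fit check described above.
// Given a node and the pods already assigned to it, decide whether another
// pod's ephemeral-storage request still fits.
package sketch

import v1 "k8s.io/api/core/v1"

// podEphemeralStorageRequest sums the ephemeral-storage requests of a pod's containers.
func podEphemeralStorageRequest(pod *v1.Pod) int64 {
	var total int64
	for _, c := range pod.Spec.Containers {
		if q, ok := c.Resources.Requests[v1.ResourceEphemeralStorage]; ok {
			total += q.Value()
		}
	}
	return total
}

// fitsEphemeralStorage reports whether adding pod to node keeps the summed
// ephemeral-storage requests within the node's allocatable amount.
func fitsEphemeralStorage(pod *v1.Pod, node *v1.Node, assigned []*v1.Pod) bool {
	allocatable := node.Status.Allocatable[v1.ResourceEphemeralStorage]
	var used int64
	for _, p := range assigned {
		used += podEphemeralStorageRequest(p)
	}
	return used+podEphemeralStorageRequest(pod) <= allocatable.Value()
}
```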
E: That's done, and we have this; essentially it handles the coexistence of Firmament, or Poseidon, with the default scheduler. There are two problems: one is that we are breaking things, for example the one I mentioned, and the other one is the default scheduler. We need to account for, you know, the pods which are scheduled by the default scheduler or some other scheduler. Once we fix that, we don't see any other gap functionality-wise; it's all in there.
A: Maybe the latter problem, well, I don't know how you implemented things, so I cannot really say with confidence, but the latter problem seems a little easier. I guess the PV issue might be slightly more complex, because PV is more than just a predicate currently in the default scheduler. A pod that requires a PV, usually, if the required persistent volume is not bound to a node, goes through something like two cycles of scheduling. In the first cycle...
E: But that's done... ah, my bad, I was told that that code is not there. So they're saying that's done; that code is already merged, it's already upstream, the dynamic volume provisioning. It has been a couple of releases already, so obviously we're not going to be able to do all of that now. Oh, I was thinking of the simple scenario; it's like a node affinity thing: the pod comes in, I say I have a requirement for this PV, and it finds a node for that.
E: Right, that part we're not going to be able to do; that's disconnected, so yeah, we don't do that currently, and I'm not worried about it. But at least we want to be able to do basic PV handling. Essentially, ephemeral storage is a similar kind of thing, actually, ephemeral storage and static PV storage; that's something we need to do. Hopefully that should bring down the failures in the storage SIG tests, which we are noticing.
A: Actually, I haven't seen this issue with the scalability. If it is serious, we probably will not be able to graduate it to beta in 1.12. Well, we are only several days away from the code freeze, so we may not be able to do that, but I will go and take a look to see if this is something that can be quickly fixed, which I doubt, given that it seemed like a major scalability issue, yes.
A: No, yeah, it would be great actually if you can help us, and anybody else who is interested; if you guys can help us, it would be great to be able to promote it, because it has been a while that we have really wanted to promote this to beta, and it resolves some things, or at least makes many things a little bit cleaner. So it would be great if we can promote it, if you guys have the time and you think we can reasonably get it into 1.12.