From YouTube: kubernetes SIG Scheduling meeting - 2018-11-29
Description
No description was provided for this meeting.
A: All right, I just started recording. As you know, this meeting is recorded and will be uploaded to the public Internet, so whatever you say is going to be visible to the public. Today I have a couple of updates for you. Folks, there was an issue recently in the scheduler. This issue was initially discovered by our scalability team at Google.
A: So I was honestly a little bit surprised by this behavior, because I felt like pods with the same priority shouldn't necessarily go to the head of the queue. But later on, when we looked at the Pop algorithm of the heap, we realized the reason. The Pop algorithm of a heap (you can read it in the textbooks, and it's the same implementation in the Go libraries) basically swaps the head of the queue with the last element in the queue, the tail, and removes that former head from the end of the queue. Then it starts running heapify, which essentially re-arranges the heap based on the heap property. But since all these pods have the same priority, the pod which is now at the head of the queue, and was previously at the tail, stays at the head. This is a problem, of course, because the pod which was at the tail of the queue was a pod which was tried recently by the scheduler and found unschedulable. So it goes to the head of the queue, the scheduler tries it again, and if you have a number of these recently attempted pods, the scheduler will just keep retrying these unschedulable pods, which now occupy the head of the queue.
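The behavior described here can be reproduced with Go's `container/heap` directly. This is an illustrative sketch, not the scheduler's actual code; the `pod`, `podHeap`, and `popThenPeek` names are invented for the example:

```go
package main

import (
	"container/heap"
	"fmt"
)

// pod is a minimal stand-in for a queued pod; the real scheduling
// queue stores full pod objects.
type pod struct {
	name     string
	priority int
}

// podHeap orders pods so the highest priority is at the head.
type podHeap []pod

func (h podHeap) Len() int           { return len(h) }
func (h podHeap) Less(i, j int) bool { return h[i].priority > h[j].priority }
func (h podHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }

func (h *podHeap) Push(x interface{}) { *h = append(*h, x.(pod)) }
func (h *podHeap) Pop() interface{} {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// popThenPeek pushes three equal-priority pods in order, pops once,
// and reports which pod was popped and which pod is the new head.
func popThenPeek() (popped, newHead string) {
	h := &podHeap{}
	heap.Push(h, pod{"pod-a", 100})
	heap.Push(h, pod{"pod-b", 100})
	heap.Push(h, pod{"pod-c", 100})

	// heap.Pop swaps the head with the tail, removes the old head,
	// then sifts the former tail down. With equal priorities the sift
	// does nothing, so the most recently pushed pod stays at the head.
	p := heap.Pop(h).(pod)
	return p.name, (*h)[0].name
}

func main() {
	popped, head := popThenPeek()
	fmt.Println("popped:", popped)  // pod-a
	fmt.Println("new head:", head)  // pod-c, not pod-b
}
```

So after one Pop, the last-added pod (the one most recently moved to the tail, i.e. the most recently attempted pod) sits at the head, which is exactly the retry loop described above.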
A: This is a problem, and we have sent a PR to address it. The way to address it is that the scheduler looks at the priority of the pods and orders them in the queue based on their priority.
A: But if the priority is the same, then it looks at the last schedule time. Pods which are unschedulable already have this timestamp (I mean, before this change; we've always had it). So we look at that timestamp, and if pods have the same priority, we order them by the timestamp. If these are new pods that were never scheduled, or the scheduler has never attempted to schedule them, we look at their creation time and sort them based on their creation time.
A: I think this also addresses a problem that Sudesh was trying to address; I mean, this is of course more than that problem only. Sudesh was trying to change the sorting algorithm of the scheduling queue to sort pods which had the same priority by their creation time, and I was a little bit concerned that a pod which is created early and is unschedulable could have this problem that we just talked about: it would always go to the head of the queue and the scheduler would keep retrying it.
A: We would like to plan the items for 1.14 right now. Unfortunately, there are not that many people in the meeting today. We have a few items that I would like to discuss with the owners and assignees, basically to confirm that they can work on them in 1.14. I'll speak about them anyway; hopefully the owners will take a look later at the video, or maybe the meeting notes, and then they can come back to us if they cannot work on them.
A: One is to change the design of the scheduler equivalence cache. Jonathan (a.k.a. misterikkit) and Harry (a.k.a. resouer) were supposed to work on it, and I assume they are still going to work on it and finish its design and implementation in 1.14. So that's going to be there. We are hoping that Klaus can have gang scheduling, an early version of gang scheduling, in kube-batch in 1.14. Of course he already has that prototype, but we were hoping that he can...
A: Now, we had a plan to implement scheduling policies. The problem is that we have had a lot of discussions about how to design scheduling policies, and the policy team, or our SIG Architecture people, have had various opinions about that, so that work is still going on. We'll see how we are going to proceed with that idea, but at the moment I am not super optimistic that we can have something in the open source for 1.14.
A: So far, we have seen pretty promising numbers for the benchmarks that those guys released. One of the features that they are prioritizing is gang scheduling. I don't know how much overlap we have between the two, but given that Poseidon is a completely different scheduler, it's not going to be easy for a lot of folks to completely switch to Poseidon, at the moment at least, while it's still maybe not as robust as the default scheduler.
D: Another thing: I think we also had some discussion about the quota and the pod group controller. That part is not implemented yet, but we'd like to have at least a pod group controller. [Several sentences unintelligible.] Yes.
A: So, yeah. Ravi is not here today, if I'm not mistaken. We've had this idea of moving the descheduler in as a static component of Kubernetes, and I would like to hear his opinion about that. And then, oh, Wei is here. I know that we didn't pursue the idea of supporting multiple pods for inter-pod affinity in 1.13, because we decided that 1.13 was more of a stability release, but in 1.14 we would like to pursue that feature. Do you think you'll have the time to work on it?
A: Great, thank you. We would also like to move the pod limits priority function to beta; Ravi was working on it. We had to take it out in 1.13 sort of at the last moment, because some of the tests were not ready, but hopefully we can have it in 1.14. I don't know about moving balanced number of attached volumes to beta. That is still not very clear, because we have some performance concerns about it. We will see.
A: Another item that we would like to have in 1.14 is a backoff mechanism for unschedulable pods. The idea here is that a pod that has been tried many times and is unschedulable shouldn't keep being retried. Let's say that this is a high-priority pod: this pod will obviously go to the head of the queue and block everything behind it. So we should have some kind of backoff mechanism, so that this pod is backed off and is rescheduled with some delay if it turns out that it's unschedulable and the scheduler has tried it a few times already. That's something that we are going to add.
A: Another thing that we are also trying to do (this is kind of similar and related to the idea of having a backoff mechanism) is optimizing node status updates. Today the scheduler retries all the unschedulable pods at every node update.
A: This is needed, but the problem is that nodes send node updates for all heartbeats as well. A node sends heartbeats every 10 seconds. Basically, the scheduler receives a heartbeat, which appears very much like other node updates that may carry more information. So, for example, if a node has a new condition, that is going to be a node update; if it has a new taint, there is going to be a node update; but sometimes the node doesn't have any new information, and the node update is simply just a heartbeat. We receive those heartbeats once every 10 seconds for each node, so you can imagine that in a cluster with a thousand nodes we receive (by we, I mean the scheduler receives) a hundred node updates on average per second, and as a result the scheduler keeps retrying unschedulable pods.
A: All the while, there is no significant change in the cluster that could make these unschedulable pods schedulable. So what we are trying to do is to look at these node updates and find out whether they are actually changing something meaningful that could potentially make a node feasible, and in that case we retry scheduling pods. Otherwise we just don't want to keep retrying those pods again and again.
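The filtering idea can be sketched as a comparison of the scheduling-relevant fields of the old and new node objects; a pure heartbeat changes none of them. This is a simplified illustration, with invented field names rather than the real Node type:

```go
package main

import "fmt"

// nodeInfo holds only the fields this sketch inspects; a real Node
// object carries much more.
type nodeInfo struct {
	allocatableCPU int64
	allocatableMem int64
	taints         []string
	heartbeatTime  int64 // bumped by every heartbeat
}

// updateIsMeaningful reports whether a node update could change a
// scheduling decision. A pure heartbeat only bumps heartbeatTime, so
// comparing the scheduling-relevant fields filters heartbeats out.
func updateIsMeaningful(old, new nodeInfo) bool {
	if old.allocatableCPU != new.allocatableCPU ||
		old.allocatableMem != new.allocatableMem {
		return true
	}
	if len(old.taints) != len(new.taints) {
		return true
	}
	for i := range old.taints {
		if old.taints[i] != new.taints[i] {
			return true
		}
	}
	return false
}

func main() {
	old := nodeInfo{allocatableCPU: 4000, allocatableMem: 16 << 30, heartbeatTime: 100}

	heartbeat := old
	heartbeat.heartbeatTime = 110 // nothing else changed
	fmt.Println(updateIsMeaningful(old, heartbeat)) // false: skip retries

	tainted := old
	tainted.taints = []string{"node.kubernetes.io/unreachable"}
	fmt.Println(updateIsMeaningful(old, tainted)) // true: retry unschedulable pods
}
```

Only updates that pass this kind of check would trigger a retry of the unschedulable pods, cutting out the hundred-per-second heartbeat noise described above.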
A: I have one item on my agenda: to work on scheduler performance. This is something that we would like to pursue for larger clusters. Scheduler performance in larger clusters still isn't that great. Right now in our scalability tests, scheduler throughput is about 55 pods per second or so in 5000-node clusters. We would like to improve that, hopefully by another 20 to 30 percent; we will see how much we can get. I will be working on that in 1.14, I guess.
A: Yeah, by default we have sort of a worst fit, and we have a priority for most-used, which basically picks the node that has the highest amount of resource utilization. But I guess what you're talking about there is best fit, essentially the closest match. Okay, yeah.
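The two strategies being contrasted can be sketched as scoring functions on a 0-10 scale, mirroring the shape of the least-requested ("worst fit", the default spreading behavior) and most-requested (packing) priority functions; the exact formulas here are a simplified assumption:

```go
package main

import "fmt"

// leastRequestedScore favors the emptiest node: the less a node has
// requested relative to its capacity, the higher it scores.
func leastRequestedScore(requested, capacity int64) int64 {
	if capacity == 0 {
		return 0
	}
	return (capacity - requested) * 10 / capacity
}

// mostRequestedScore favors the fullest node that still fits the pod:
// higher utilization yields a higher score, packing pods together.
func mostRequestedScore(requested, capacity int64) int64 {
	if capacity == 0 {
		return 0
	}
	return requested * 10 / capacity
}

func main() {
	// A node with 2 of 10 CPUs requested:
	fmt.Println(leastRequestedScore(2, 10)) // 8: attractive for spreading
	fmt.Println(mostRequestedScore(2, 10))  // 2: unattractive for packing
}
```

A true "best fit" in the bin-packing sense, as raised in the discussion, would instead score highest the node whose remaining free space most closely matches the pod's request.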
A: Sounds great; if you can work on it, that would be great. That certainly requires us to seriously consider the performance impact, I mean, seriously consider the performance change. We don't want to introduce another feature that would cause a major slowdown in the scheduler; that's not really something that we should pursue.
A: We can consider adding more features now if they are not making it too slow again. So yeah, yes, yes, yeah, sure. But that's a good point that Klaus brought up; please make sure to look at that PR Klaus has open. I'm pretty sure that PR won't work anymore, because we have made a lot of changes to the code. Yes.
F: I was working on moving the event handlers as part of the cleanup, so I updated it with what we wanted. If you can take a look and see if it's in the right direction: I'm struggling a little bit with how to make the tests pass. I think it might be something with how... I might have made a mistake in the PR. So I want to put more effort into it, as long as it's in the right direction.