From YouTube: Kubernetes SIG Scheduling meeting - 2019-01-10
A
Hello and welcome to the SIG Scheduling meeting, and happy new year. This is the first scheduling meeting of 2019; I hope you have all started a great new year. As you know, this meeting is recorded and will be uploaded to the public internet. All right, with that, let's start the meeting. I have a few updates, probably more than usual, given that we didn't have any meetings in the past couple of weeks.
A
So, let's see. I'm going to try to start with the more important ones, because we may not get to all of the updates this week, but hopefully we will cover as many as we can in the time we have. All right. First of all, let's go over some of the project updates. As you know, we have a relatively long list of items for 1.14, and some of these items are already done.
A
One of the items that is already done is adding a backoff mechanism for unschedulable pods. This was an item that had been pending for a while, given the complexity of the area, but it's already merged. The idea is that pods which are unschedulable are not going to be retried in a tight loop; they basically yield to other pods. Our current scheduling queue has a pretty sophisticated mechanism to give precedence to pods which have higher priority, but if such high-priority pods are unschedulable, they could block the head of the queue. In order to avoid that, we have added this backoff mechanism: when a high-priority pod is unschedulable, it is subject to backoff so that it does not block the queue. Basically, that item is done.
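For illustration, here is a minimal Go sketch of the backoff idea described above. The constants and names are assumptions made for the sketch, not the actual kube-scheduler implementation: each failed scheduling attempt doubles a per-pod backoff so an unschedulable high-priority pod stops monopolizing the head of the queue.

```go
// Minimal sketch of per-pod exponential backoff for unschedulable pods.
// Constants and structure are illustrative, not the real kube-scheduler code.
package main

import (
	"fmt"
	"time"
)

const (
	initialBackoff = 1 * time.Second  // assumed initial backoff
	maxBackoff     = 10 * time.Second // assumed cap
)

// podBackoff tracks how often a pod failed scheduling and when it last failed.
type podBackoff struct {
	attempts    int
	lastAttempt time.Time
}

// duration returns the exponential backoff for the current number of attempts.
func (b *podBackoff) duration() time.Duration {
	d := initialBackoff
	for i := 1; i < b.attempts; i++ {
		d *= 2
		if d >= maxBackoff {
			return maxBackoff
		}
	}
	return d
}

// recordFailure is called when a scheduling attempt finds the pod unschedulable.
func (b *podBackoff) recordFailure(now time.Time) {
	b.attempts++
	b.lastAttempt = now
}

// readyToRetry reports whether the pod has waited out its backoff and may be
// moved back to the active queue, where priority ordering applies again.
func (b *podBackoff) readyToRetry(now time.Time) bool {
	return now.Sub(b.lastAttempt) >= b.duration()
}

func main() {
	b := &podBackoff{}
	now := time.Now()
	for i := 0; i < 4; i++ {
		b.recordFailure(now)
		fmt.Printf("attempt %d: retry after %v\n", b.attempts, b.duration())
	}
	fmt.Println("ready immediately after failure:", b.readyToRetry(now))
}
```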
We have also optimized node status updates. In large clusters, node status updates are very frequent: every node sends a status update every 10 seconds, and every time one of these node updates arrives, the scheduler retries scheduling of unschedulable pods. But a lot of these node updates are essentially no-ops; they are simply heartbeat updates, and there is nothing changed on the node. So we've added an optimization that checks what has changed on a node, and if there is nothing that could make the node more schedulable, the scheduler is not going to retry. This item is done.
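A rough sketch of that update-filtering idea follows, using the Kubernetes API types. The helper name and the exact set of fields compared are assumptions for illustration, not the real scheduler function.

```go
// Sketch: only retry unschedulable pods when a Node update could actually
// affect scheduling; ignore heartbeat-only updates.
package main

import (
	"fmt"
	"reflect"

	v1 "k8s.io/api/core/v1"
)

// conditionSnapshot keeps only the condition fields that matter for scheduling,
// dropping heartbeat timestamps that change on every update.
func conditionSnapshot(conds []v1.NodeCondition) map[v1.NodeConditionType]v1.ConditionStatus {
	m := make(map[v1.NodeConditionType]v1.ConditionStatus, len(conds))
	for _, c := range conds {
		m[c.Type] = c.Status
	}
	return m
}

// nodeSchedulingPropertiesChanged reports whether the update could change
// scheduling decisions: spec (including taints), labels, allocatable resources,
// or condition statuses.
func nodeSchedulingPropertiesChanged(oldNode, newNode *v1.Node) bool {
	return !reflect.DeepEqual(oldNode.Spec, newNode.Spec) ||
		!reflect.DeepEqual(oldNode.Labels, newNode.Labels) ||
		!reflect.DeepEqual(oldNode.Status.Allocatable, newNode.Status.Allocatable) ||
		!reflect.DeepEqual(conditionSnapshot(oldNode.Status.Conditions), conditionSnapshot(newNode.Status.Conditions))
}

func main() {
	oldNode := &v1.Node{}
	updated := &v1.Node{}
	// A pure heartbeat (nothing relevant changed) should not trigger retries.
	fmt.Println("retry unschedulable pods:", nodeSchedulingPropertiesChanged(oldNode, updated))
}
```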
A
We have landed another feature to improve the performance of the scheduler. As you may know, we have been trying to score fewer nodes than the whole cluster, especially in larger clusters, in order to improve throughput. In the feature's initial phase we started with scoring 50% of the cluster, and then in a subsequent phase we added a dynamic percentage: in larger clusters this percentage goes lower. So in, say, a 5,000-node cluster, as soon as we find 10% of the nodes feasible, 500 nodes in this example, we stop scanning for more nodes and use those 500 nodes to schedule the pod. The percentage is higher for smaller clusters: in clusters of around a hundred nodes or fewer, it's going to be 100% of the cluster, and in clusters with a couple of thousand nodes or so, it's going to be about 30 percent of the cluster. Anyway, there is a linear formula for this. It has improved scheduler performance quite a bit in 5,000-node clusters.
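Below is a sketch of a linear formula consistent with the numbers mentioned above. The exact constants are assumptions chosen to match the discussion (100% for roughly 100-node clusters, about 10%, i.e. roughly 500 nodes, for a 5,000-node cluster), not necessarily the values in the shipped scheduler.

```go
// Sketch of "how many feasible nodes to find before stopping the search".
package main

import "fmt"

func numFeasibleNodesToFind(numAllNodes int32) int32 {
	const (
		minFeasibleNodes = 100 // small clusters: just check every node
		basePercentage   = 50  // starting percentage for mid-size clusters
		minPercentage    = 5   // floor for very large clusters
	)
	if numAllNodes <= minFeasibleNodes {
		return numAllNodes
	}
	percentage := basePercentage - numAllNodes/125 // linear decrease with size
	if percentage < minPercentage {
		percentage = minPercentage
	}
	found := numAllNodes * percentage / 100
	if found < minFeasibleNodes {
		return minFeasibleNodes
	}
	return found
}

func main() {
	for _, n := range []int32{100, 1000, 2000, 5000} {
		fmt.Printf("%5d-node cluster -> stop after %d feasible nodes\n", n, numFeasibleNodesToFind(n))
	}
}
```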
A
I don't see Harry here. On the design of the new equivalence cache, Harry was working on that; I don't know what the latest status is. He had a prototype, but we haven't seen a PR yet. Next, gang scheduling. Klaus, you're here; I know that the proposal is already merged. I would like to hear your thoughts, and thanks for the update in the community meeting this morning, by the way.
B
Sure, my pleasure. Yes, I gave an update about gang scheduling this morning. I didn't leave time for them to ask questions, so I didn't get much feedback for now, but as far as I know there are several people who would like to try running batch workloads or machine learning containers. I think co-scheduling is a good starting point for us to move on. Yes, for gang scheduling, the basic feature is implemented in kube-batch, and we are working on the remaining pieces and a controller for this part.
B
For kube-batch, yes, I'm also trying to align with the scheduling framework, as I said before. Yes, the feature set of kube-batch is almost there; in the short term maybe we will add a few more features there, so I'm trying to align with the scheduling framework so we can share some components, yeah.
A
Okay, excellent, thank you. Now, speaking of the scheduling framework, Jonathan has worked on the scheduling framework a bit, mostly around polishing the design and reflecting some of the changes that we made afterwards. One of the important ones was, of course, the fact that we didn't want to build another scheduler from scratch; we decided to bring all the ideas of the scheduling framework into the current scheduler, in order to keep backward compatibility and everything. So the new revised version is there.
A
There is a PR; I have linked it in our meeting notes, the document which is also in the calendar invite, so you can check it out. Please go and read it if you care about the scheduling framework, because this is an important time to give us feedback. Later on it will be harder to change things, and of course after it's implemented and out there it becomes even harder, and further, once things become beta or GA, they become almost impossible to change.
B
Some people asked last week how to use the default scheduler as a library, as an SDK, for example for the performance part. So yes, I think we need to show how to use that. Then the community can say, "oh, this is what I want" or "this is not what I want", so they can give us feedback, yes. So I would like to have, I believe, two examples. The first one is how to use the scheduling framework.
B
The other one is how to use our default algorithms, such as the predicates and priorities. Yes, I think there have been some issues introduced, or issues where we cannot use the predicates and priorities from outside the scheduler; you know, some people have submitted pull requests about this part. Yes, because...
A
It's still a little early for the scheduling framework to have code examples, at least not in its design proposal. But definitely, as we are building the scheduling framework and adding extension points, we're going to have some examples along the way, and of course we're also going to build some tests, which will also work as examples of how to use the framework. That's a good point; thank you for the feedback. Yeah.
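Since examples come up here, a purely hypothetical sketch of what "a plugin at an extension point" could look like follows. The interface and type names below (FilterPlugin, Status) are invented for illustration; the framework's real API was still being designed at the time of this meeting.

```go
// Hypothetical shape of a scheduler plugin registered at a "Filter" extension
// point. Names and signatures are illustrative only, not the framework's API.
package main

import "fmt"

// Status is a hypothetical result type returned by extension points.
type Status struct{ Reason string }

// FilterPlugin is a hypothetical extension point: decide whether a pod fits a node.
type FilterPlugin interface {
	Name() string
	Filter(podName, nodeName string) *Status // nil means the node passes
}

// zonePlugin is a toy plugin that rejects one fixed node, just to show the shape.
type zonePlugin struct{}

func (zonePlugin) Name() string { return "example-zone-plugin" }

func (zonePlugin) Filter(podName, nodeName string) *Status {
	if nodeName == "node-in-wrong-zone" {
		return &Status{Reason: "node is outside the requested zone"}
	}
	return nil
}

func main() {
	var p FilterPlugin = zonePlugin{}
	if s := p.Filter("my-pod", "node-in-wrong-zone"); s != nil {
		fmt.Println("filtered:", s.Reason)
	}
}
```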
A
All right. There have been several issues in the scheduling queue, and also a race condition in setting the nominated node name in the preemption logic of the scheduler. All of these issues are already fixed, and cherry-pick PRs have also been sent out, so hopefully those issues will be addressed pretty quickly, yeah.
A
These are issues that mostly affect larger clusters, or clusters with a lot of pending pods and things like that. All right, so one more update is about affinity and anti-affinity. There is a thread started by Wojtek; for those who don't know him, he is one of the top contributors to Kubernetes.
A
Of course, I don't agree with all the details in his proposal, but the discussion is going on. He was thinking that maybe we should remove the feature, or make it very limited, to node-level affinity only, but I don't think that would be enough, because there are users who want to use the feature for use cases such as "I want to put all my pods in the same zone" in order to avoid network delays, and also to avoid network charges, which are pretty common in most cloud providers.
A
If you go across zones, that is. So I think there are use cases for affinity to zones, and to larger collections of nodes in general, other than just a node itself.
Anyway, the discussion is ongoing; nothing is finalized, and I don't think we have concluded anything yet. If you are interested, feel free to go take a look at the discussion and participate.
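For reference, the zone use case mentioned above roughly corresponds to pod affinity with a zone-level topologyKey. The "app: web" label and the beta zone label key below are illustrative only.

```go
// Example pod affinity: co-locate with pods of the same app at zone granularity.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	affinity := &v1.Affinity{
		PodAffinity: &v1.PodAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []v1.PodAffinityTerm{{
				// Co-locate with pods labeled app=web...
				LabelSelector: &metav1.LabelSelector{
					MatchLabels: map[string]string{"app": "web"},
				},
				// ...at zone granularity rather than on a single node.
				TopologyKey: "failure-domain.beta.kubernetes.io/zone",
			}},
		},
	}
	fmt.Printf("%+v\n", affinity)
}
```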
A
These are most of the things that I wanted to talk about today. There are a few items where I would like to hear from the contributors about their status. Let's see, Ravi is here. Ravi, as far as I can tell, there is currently no plan for making the scheduler a separate component of Kubernetes. Is that right? Is that still okay?
A
So we would like to keep the list shorter. Also, having people on the reviewers list causes GitHub to auto-assign some of the PRs to them, and as a result PRs may not get reviewed by anyone for a long time, which also gives a bad experience to our contributors. So I would like to clean up some of those entries. Basically, the reason I asked the question is exactly that I wanted to make sure that, if he's not working on this, we remove him from the list. Yeah.
C
So because of that, it is kind of becoming a problem, because from a security perspective we do not want nodes, or kubelets, to update the taints on themselves, because they could steer workloads towards them, or say "I do not want a particular pod to land on me." So updates are not possible, but we can register the taints during kubelet initialization, that is, add taints to the node when the kubelet finishes initialization. I have created a PR for that, but it won't cover the case where there are updates, meaning...
C
We will eventually do it, but Klaus has a point where he mentioned that after about 300 seconds the pod would usually get evicted anyway, so even if it takes some time for the node controller to apply the taint, as long as it's within 300 seconds, how come... The 300 seconds is coming from the toleration, the default tolerations.
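For context, here is a minimal sketch of the default toleration being referred to, as added to pods by the DefaultTolerationSeconds admission plugin; the key and the 300-second value are shown to the best of my understanding of the defaults.

```go
// Sketch of the default "not ready" toleration that drives the 300-second eviction.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	seconds := int64(300) // pod is evicted 300s after the NoExecute taint appears
	toleration := v1.Toleration{
		Key:               "node.kubernetes.io/not-ready",
		Operator:          v1.TolerationOpExists,
		Effect:            v1.TaintEffectNoExecute,
		TolerationSeconds: &seconds,
	}
	fmt.Printf("%+v\n", toleration)
}
```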
C
I'm not talking about the time that is needed for applying and removing the taint; I'm talking about after the taint has been applied. For example, for a NoExecute taint, the default toleration time is 300 seconds, correct.
So in both cases, I do not know the exact time, like how long the node controller takes before applying the taint. When I tested it locally, it was happening almost instantaneously on the clusters that I tested. But the toleration is something that Klaus mentioned on the PR: it might be difficult, especially in an online environment where some of the services would say, okay, after 300 seconds I am going to be unavailable; that might be kind of a difficult proposition for them. So he has given a couple of solutions, like how we could add those taints within the node status.
C
As of now, the taints are actually in the node spec. So he has a few suggestions, and he wanted a discussion to be started within SIG Scheduling so that we could come up with some proposal and eventually close this out. Without that, we have to disable TaintNodesByCondition, yeah.
A
That's a little unfortunate, because we had to do that for 1.12 and 1.13 in GKE, given this condition that exists, but we would like to re-enable it if possible. I don't have a full understanding of the proposal that you mentioned, but one question that I have, and Klaus is here himself too: I am aware that when you specify a toleration, you usually have the timeout, the 300 seconds or whatever, and afterwards the pods are evicted, but...
B
I think for the fix, I'm okay if we have something in the short term, yes, to handle this part. But TaintNodesByCondition is already in beta, and the thing that worries me is the race condition, so I'm thinking whether we should rethink our design for this part, yeah. So yes, for the fix, I think we need to, I will, create a solution for this part; we don't want to block the users' use case, yeah. And for the long term, I would like to resolve this problem.
B
Yeah, I think there are two threads here. The first one: we need to have a clear fix for our releases. And the other is that we need to rethink whether TaintNodesByCondition is the right direction; if there is a race condition, we may have some other options.
A
So one question: before this feature, TaintNodesByCondition, we had node status, or node conditions, which were things like "not ready" or whatever, and the scheduler was taking those into account. Kubelets are allowed to change those, right? I mean, weren't there the same security concerns about that?
A
That was the first solution that came to my mind as well; maybe that's something we can do. It feels a little bit like a reaction to the problem, and a bit of a hack to me, but yeah, that's one of the immediate solutions that comes to mind. Okay. I don't think we can actually solve the whole problem right now; I will take another look at the PR and share my thoughts.
A
So I think your proposal is fine, actually. Initially I was not so sure about it, but after you explained it, I think it's fine. We can have LT and GT operators; they should be okay. I don't think it's going to make performance a lot worse, but we need to actually make sure that's the case; we basically need to have some performance tests. Are you going to work on it? Do you want to send a PR, or... yeah.
A
It is related to priority, but what you are describing is one of the things that resource quota controls. Basically, resource quota controls many things, including priority classes; it also controls the number of pods, for example, that users can create, and the memory, CPU and other resources that users can consume, things like that. So one of the things it controls is how many resources you can have at a particular priority.
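For illustration, one way this shows up in the API is a ResourceQuota whose scope selector matches a priority class, limiting what can be consumed at that priority. The class name "high" and the limits below are examples only.

```go
// Example ResourceQuota scoped to pods using the "high" PriorityClass.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	quota := v1.ResourceQuota{
		ObjectMeta: metav1.ObjectMeta{Name: "high-priority-quota"},
		Spec: v1.ResourceQuotaSpec{
			// Limits on pod count, CPU and memory consumed at this priority.
			Hard: v1.ResourceList{
				v1.ResourcePods:   resource.MustParse("10"),
				v1.ResourceCPU:    resource.MustParse("4"),
				v1.ResourceMemory: resource.MustParse("8Gi"),
			},
			// Only pods using the "high" PriorityClass count against this quota.
			ScopeSelector: &v1.ScopeSelector{
				MatchExpressions: []v1.ScopedResourceSelectorRequirement{{
					ScopeName: v1.ResourceQuotaScopePriorityClass,
					Operator:  v1.ScopeSelectorOpIn,
					Values:    []string{"high"},
				}},
			},
		},
	}
	fmt.Printf("%+v\n", quota)
}
```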