From YouTube: Kubernetes SIG Scheduling meeting - 2018-10-11
A: So, all right, let's start. I have a couple of updates, but not that many, and then we can open it up to any questions and comments you might have. One of the updates is regarding gang scheduling. As you know, this has been an effort that has been going on for a while, and the most recent version of the gang scheduling proposal is close to final. This is a feature required for some HPC (high-performance computing) workloads as well as some machine learning workloads, and more generally it is sometimes needed in batch processing. So if you are interested in the feature, please take a look at the proposal. The proposal is linked from our spreadsheet, and the spreadsheet is linked from our meeting notes, which are in the calendar invite. That's one update.
A: In the first PR of the scheduling framework, we are going to introduce the interfaces for a couple of extension points, particularly the "pre-bind" and "reserve" extension points. If you are familiar with the framework proposal, these are two extension points invoked after a node is chosen: these two extensions are called once we try to update the scheduler cache, and before binding a pod to a node. So for the very first draft, we are going to introduce these extension points.
C: One question: pre-bind and reserve, as stated, are more of an FYI to the extension points that this is the node being chosen. So is there any particular reason that the first implementation is starting with these extension points, in some sort of reverse chronological order?
A: Reserve is more like an FYI. It is mostly used by plugins that want to update, for example, a cache or some state that they are keeping. Pre-bind is basically an approval plugin: all the pre-bind plugins must return true for the pod to be bound. If any of them returns false, the pod is not going to be bound; it basically gets rejected and goes back to the scheduling queue for another attempt.
A: The reason that we chose these two particular extension points to implement first is that there are a couple of features that would require them. The most important one is dynamic volume binding. Dynamic volume binding is today implemented as a sort of hard-coded part of the scheduler, and in fact it was one of the things that motivated us to think about a scheduling framework: we are not very happy with the fact that a lot of its logic is integrated inside the scheduler core, so that's one of the reasons we would like to take it out. Gang scheduling is also going to benefit from a couple of these extension points, both reserve and pre-bind, though it probably needs one more before it can actually be implemented using the current scheduling framework. But anyway, these are also needed. Yes, thank you.
B
You're
probably
wondering
on
question
on
the
framework
so
I
think
right
now,
the
most
I
recommend
in
the
way
of
the
use
of
its
minor
efforts
to
eat,
is
to
use
the
extender
right
to
rise
on
several
extender
and
implement
their
specific
knowledge.
So
basically
in
the
timeline
perspective.
So
when
do
you
think
it's
mature
for
a
user
to
use
the
framework
as
the
most
recommending
way
to
customize
schedule,
yeah
sure.
A: We can possibly think about moving some of the extension points to beta. 1.14 is probably going to be the earliest, but again, it depends on the preferences of our users. Some of the companies that we work with, for example, never use any of the alpha or beta features; they want only GA features. So depending on the criteria and restrictions that you have in mind, you may want to wait until it's GA, and that could possibly be a year from now.
A: The reason that I did this is that a lot of our PRs are assigned to these folks and need to be reassigned, and sometimes the authors of these PRs are not fully aware of the fact that some of these folks are not very active. As a result, their PRs remain unattended: basically, nobody looks at them. We wanted to avoid this situation, so I actually sent a PR to change our reviewers, and I hope folks are okay with that. I particularly wanted to bring this to everyone's attention; Ravi is here, but one of the others is not. If you see him, please tell him that I actually moved him from the reviewers list of the scheduler. If he thinks that he has the time to help us more with the reviews, I will be happy to add him back.
F: Actually, I did not get time to work on those, and there is one more feature on the way: I have started on taint-based evictions and have increased the test coverage. I did not create a PR for that yet; I was testing it locally. The other thing that I've noticed is that it's not as fast as normal evictions are, meaning the way we are calling the deletion from the node controller code: we were previously directly making calls to the delete API for the pod on the node, and it's not as fast as that.
D: Yeah, I also have a question related to what Bobby just asked, which is why I was expecting something like a race condition or an issue between the taint-based controller putting taints on nodes and the kubelet starting evictions. The documentation says that there is throttling applied here, in order to avoid a kind of massive eviction of pods on a node. But I would also strive for the simplest solution, with the kubelet, which seems to be a little bit easier to manage from an operator's perspective when configuring those things.
A: So if, for example, in a cluster you suddenly decide to take down a whole zone with, I don't know, a thousand nodes, then you will have a thousand kubelets on a thousand different machines that can work in parallel to evict all the pods, as opposed to a single controller on the master node that has to take down tens of thousands of pods.
F: It is the job of the node controller. I mean, the eviction could happen from the kubelet side, but the node controller is the one that is responsible for saying: okay, I'm going to apply a taint, or I'm going to set the node as unavailable. But I'm not 100% sure if the kubelet is going to do the evictions based on conditions that are set by an external component. If the kubelet sets those fields itself, then the kubelet evicts everything itself; I think that's the way it is currently designed.
F: The kubelet reacts to the... the main job of the kubelet is to identify problems on the node, and based on that it is going to take an action. It's not that the external influences or the triggering point should come from an external entity; rather, the kubelet itself should do it. That is what I think, okay.
B: Ravi, take a look at the PR. There is one that Kraus raised, I think at the last minute in 1.12, for the node lifecycle controller, which fixed the performance of tainting nodes by condition. In the beginning, Kraus wrote a solution based on the taint manager code, I guess, which used a slice, and that kind of scheme didn't perform well. So finally he changed it to a rate-limited queue, and that seems to work. If you want, I can point you to the PR, yeah.
F
That'll
be
great
that
pier
I
think
it
was.
There
was
a
scalable
shoe
which
otech
has
created
for
the
five
thousand
nodes
limit
and
cross
has
created
that
pier
to
solve
that
problem,
but
I
am
Not
sure
the
scalability
test,
if
it
has
started
working
fine
because
of
the
change
the
because
after
the
change
has
gone
in,
there
were
only
a
couple
of
times
where
the
test
was
failing
and
then
after
something
the
test
has
started.
Passing
so
I
do
not
know
if
the
change
that
cross
has
create
has
introduced
has
solve
the
problem.
A
Is
that
well,
it's
actually
not
related
to
eviction,
but
there
was
this
issue
with
112
that
if
P,
if
a
cluster
runs,
110
or
or
older
parts
110
is
actually
supported,
we
support
cubelets,
which
are
two
versions
behind
the
control
plane.
So
if
you
run
a
cluster
112,
you
can
run
one
time.
There
was
an
issue
with
those
classes
sent
a
PR.
It's
not
measure
as
far
as
I
can
tell
yet.
A
At
least
it
was
not
measure
until
last
night.
So
the
issue
was
that
110
didn't
one
thing:
he
didn't:
have
the
capacity
or
ability
to
check
node
affinity.
So
this
was
conflicting
with
the
new
demons
of
the
scheduling
by
the
default
scheduler
and,
as
a
result,
some
of
these
demons,
scheduled
by
the
invite
the
default
scheduler
we're
getting
rejected
by
those
older
cubelets
in
the
cluster
class,
has
a
PR
out
that
is
going
to
fix
this
for
112,
newer
versions
of
the
cubelet
like
111
and
on
will
have
that
change.
F: Yeah, so actually I initially looked at the issue; I had a discussion with the person who filed it. I was initially thinking that we could backport the change that Kraus had made, for 1.11 I believe, and I wondered whether we could backport it to 1.10, but Jordan said that we will not backport the API changes to 1.10.
A
Yeah
I
guess
we
do
need
to
really
I,
don't
know.
We
don't
really
need
to
carry
all
these
changes
forward,
because
110
is
the
last
version.
Basically,
that
is
supported.
Do
it
with
112
and
from
the
next
release.
113,
we
are
not
going
to
support
cublas
older
than
111.
111
will
be
fine,
so
it's
it
makes
sense.
To
just
add
this
change
to
the
two
one
thing
cubelets
for
now:
I
guess
for
for
112
reasons.
Yes,.
A
True,
okay,
so
one
Ravi,
you
have
one
more
item
on
your
plate,
which
is
deprecated
critical,
pod,
annotation
I,
don't
know
if
you
were
gonna
have
time
for
doing
this,
but
given
the
number
of
contributors
that
we
have
on
a
lot
of
people
who
are
asking
for
more
work,
if
you
don't
have
the
time,
please
feel
free
to
tell
me
so
that
we
can
reassigned
to
someone
else.
Yeah.
F: One question related to this, the alpha pod affinity: I think there is an incubator project, kube-batch, that Red Hat has started using extensively. We found a bug in the latest version once that got merged, so I pointed it out offline, but I'm curious whether this is something that we are interested in taking forward, the patch, that is.
C: One sort of a little naive question, so please feel free to ask me to go look into it further. I just had this question from the new scheduling framework proposal. To quote from there, we mentioned that keeping backward compatibility with the v1 scheduler is a non-goal, and specifically that v1 extenders won't work in the new framework. So with this statement, did we mean that...? But the new framework doc also says that it will still have out-of-process plugins supported. So does that mean the out-of-process...?
A: That was the initial plan, actually; it changed later on, and maybe this part is just not updated. Correct, that was our initial plan: our initial plan was to build a framework from scratch, sort of. That got changed later on; we felt like it would be too much to build, as well as to roll out. In particular, rolling out a new scheduler which was not backward-compatible would take an almost infinite amount of time in clusters that run production workloads.
A
C
Okay,
thanks
and
just
one
more
thing
related
to
that
I
mean
the
in
process
will
definitely
be
the
most
driving
forces
like
I,
guess,
performance
to
prevent
the
marshalling
and
and
marshalling,
but
other
than
that,
so
in
process
would
definitely
mean
that
we
would
need
to
recompile
or
like
I
from
the
documented
did
not
seem
like
a
one
statement.
I
can't
seem
to
look
at
exactly
where,
but
one
statement
referred
to
that
we
might
not
need
to
recompile
I
was
kind
of
confused
at
how
would
we
have
in
process
plugins,
which
would
not
recompile
ation?
C
A
Actually,
given
that,
given
that
we
don't
have
a
clean
concept
similar
to
like
dll's
in
go
yeah,
we
compilation
is
gonna,
be
actually
yeah
all
right.
You
need
to
copy
files
into
our
country
scheduler
and
maybe
just
register
your
plugins
in
our
registration
process,
but
otherwise
other
than
this
and
the
final
you
need
to
recompile.
Hopefully
you
don't
need
to
make
any
other
changes
or
you
don't
need
to
do
any
complicated
marriages
of
the
code
with
my
quick
compilation
is
ended
and
that's
a
bummer
but
yeah.
B
About
the
10
best
eviction,
so
what
update
is
that
I
have
some
conversation
with
surviving
in
the
car,
so
Robbie
and
I
are
working
on
that
so
right
now,
Revere
most
focused
on
the
increase
that
that's
coverage
of
a
urine
test
and
booking
on
the
e
to
e
test.
So
once
we
are
down,
we
kind
first
working
on
improve
the
coverage
on
any
questions
estimated.
So
and
another
thing
is
actually
it's
about
me.
So
actually,
I
didn't
I,
didn't
understand
a
tank
based
addiction
crackery.
B
So
my
my
understand,
who
was
that
the
picture
can
chose
that
if,
if
there
is
a
no
executing
and
it
will
give
Vic
to
the
part
right
but
and
it'd
be
his
behavior
is
not
controlled
by
10
best
eviction,
it's
just
controlled
by
internal,
no
execute
and
manager.
No
matter
can
basically
be
changing
in
be
enabled
or
not.
So
what
exactly
can
base
the
eviction?
Controls
is
actually,
if
there
is
some
already
known
already
or
ritual
condition
comes
out,
and
this
feature
is
enabled
in
well.
B: ...an e2e test already exists, and when I checked the e2e tests, what I found funny is that the test was named with a specific tag, so it's not enabled by default in CI, because it's treated kind of like a unit test. So I raised an issue, and I also proposed to rename it and to increase some coverage: to test multiple pods, and also to test multiple tolerations which have different toleration seconds on them.
B: One last thing about me: the inter-pod affinity, which, as mentioned, we are going to support for multiple pods. For that feature we need to kind of redesign it, right? So I have worked out a draft of a solution but haven't checked it in yet. I'm going to create some benchmark tests; I have had some conversations with Harry, and I will create the benchmark tests and verify that it works well. I hope I can check it in before Jonathan's refactoring, because it touches kind of all the predicates and priorities.