From YouTube: Kubernetes SIG Scheduling Weekly Meeting for 20230309
Description
Kubernetes SIG Scheduling Weekly Meeting 2023-03-09T17:52:05Z
B
Hi everyone. Today is March 9th, and this is the SIG Scheduling bi-weekly meeting. This meeting is recorded and will be uploaded to YouTube, so please adhere to the Kubernetes code of conduct.
B
We've got a few topics on the agenda. First one, Aldo: we have a race condition in pod info and the nominated pods.
C
So basically, Patrick was working on dynamic resource allocation, and his new tests triggered a data race in scheduling. Wait one second, I just want to share my screen now.
C
All right, so yeah, this is the issue. Patrick found this data race while adding this new test, but after further analysis it was not exclusive to the new tests; the new test just made it easier to trigger.
A
We hit it in preemption, because the pod is unschedulable. So basically the race is between two goroutines: one claims the pod as unschedulable, and the other way, the event handler is updating the pod, so these two goroutines can race. Some of the existing tests — although not too many — like the preemption ones, can also hit this kind of code path. That is why he can reproduce it using the preemption test suite.
C
This goroutine that is writing comes from the scheduling events — the event handlers — and, at the same time, there is a read from the scheduling cycle, which obtains the information about the nominated pods.
C
So
that's
that's
the
the
race
that
was
found
in
both
both
scenarios.
Well,
this
changes
in
in
Patrick's
scenario,
but
the
the
update
is
the
same.
The
the
same
grid
so
after
some
debugging
well
Patrick
proposed
one
solution
which
was
adding
a
lock
or
a
an
atomic
pointer,
but
we
figured
that
might
not
be
enough
because
there
are
other
fields
in
the
in
the
bot
info
that
could
have
also
hit
the
data
race,
but
we
discovered
that
we
have
two
objects.
What
is
this?
C
So
we
have
two
objects
in
I.
Think
it's
this
one.
Yes,
we
have
two
locks,
different
locks
between
two
objects,
the
port
nominator
and
the
scheduling
queue,
and
they
both
share
the
same
pointers
towards
the
the
cute,
the
port
info.
That
holds
the
information
about
about
queuing.
C
So
so,
basically,
the
The
Proposal
that
I
came
to
is
that
we
somehow
make
those
two
objects
use
the
same.
Lock,
here's
a
PR
it
way
was
able
to
test
it
with
the
preemption
path
and
it
it
solves.
It
seems
to
solve
the
problem.
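The shape of the race and the shared-lock fix described above can be sketched roughly as follows. All type and field names here are illustrative stand-ins, not the actual kube-scheduler types: the point is only that two structs holding pointers to the same pod info must guard them with one lock instance, not two.

```go
package main

import (
	"fmt"
	"sync"
)

// podInfo stands in for the queued pod info shared by both objects.
type podInfo struct {
	nominatedNodeName string
}

// nominator and schedulingQueue both hold pointers to the same podInfo.
// If each had its own lock, an event-handler write and a scheduling-cycle
// read could race; sharing a single RWMutex serializes them.
type nominator struct {
	lock *sync.RWMutex // same instance as the queue's lock
	pods map[string]*podInfo
}

type schedulingQueue struct {
	lock      *sync.RWMutex
	nominator *nominator
}

func newSchedulingQueue() *schedulingQueue {
	lock := &sync.RWMutex{}
	return &schedulingQueue{
		lock:      lock,
		nominator: &nominator{lock: lock, pods: map[string]*podInfo{}},
	}
}

// update simulates the event-handler goroutine writing the pod.
func (q *schedulingQueue) update(name, node string) {
	q.lock.Lock()
	defer q.lock.Unlock()
	if p, ok := q.nominator.pods[name]; ok {
		p.nominatedNodeName = node
	}
}

// nominatedNode simulates the scheduling cycle reading the pod.
func (q *schedulingQueue) nominatedNode(name string) string {
	q.lock.RLock()
	defer q.lock.RUnlock()
	if p, ok := q.nominator.pods[name]; ok {
		return p.nominatedNodeName
	}
	return ""
}

func main() {
	q := newSchedulingQueue()
	q.nominator.pods["pod-a"] = &podInfo{}
	var wg sync.WaitGroup
	// Concurrent writers and readers: with the shared lock, `go run -race`
	// reports no data race on the shared podInfo.
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) { defer wg.Done(); q.update("pod-a", fmt.Sprintf("node-%d", i)) }(i)
		wg.Add(1)
		go func() { defer wg.Done(); _ = q.nominatedNode("pod-a") }()
	}
	wg.Wait()
	fmt.Println("done, nominated on:", q.nominatedNode("pod-a"))
}
```

Running this under Go's race detector with two separate mutexes instead of one shared instance reproduces the kind of report Patrick hit.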
C
Patrick is still going to test it against his new tests, but I also want to write a specific integration test, hopefully to trigger this reliably before the fix. That's the only thing holding me back from merging this PR, so I'll add the integration test and then it should be ready for merging.
C
There was some discussion about whether there are other possible data races — for example, when popping a pod from the queue, we use that same pointer in the scheduling cycle. So there was a question about whether that could trigger another data race, but it turns out it's not a problem, precisely because we're popping it out of the queue.
C
So if there is an update that comes from outside — that comes from the event handlers — while we are scheduling, well, there is no object in the queue, so there wouldn't be an in-place write. That should be fine, and also, if that were the case, we would probably have found the data race way earlier. So I feel pretty confident about this fix, but, as I said, I will add the integration test.
C
If you have any questions or any other concerns, please don't hesitate to put them in the PR. We will cherry-pick this — I think all supported versions are affected by this data race. And that's it about this point, unless there are any questions.
A
Okay, so the second one: we have a new alpha feature. Basically the background is similar to the matchLabelKeys that we introduced in pod topology spread; we want to introduce it to pod affinity and pod anti-affinity as well. So basically the logic is pretty much the same. I reviewed the logic and there's a bunch of tests, so basically it looks good. I think sometime next Tuesday will be code freeze.
C
Quick question about the API: previously, I suppose the label selector was mandatory?
A
A
I
think
it's
much
all.
For
example,
if
you
specify
a
pipe
and
the
I
said
no
op.
C
A union — okay, and then the two queries are ANDed. Okay, I think I can give a review from the API point of view, at least a first pass.
B
Okay, thanks. Just related: there is also another PR, related to sidecar containers — there's a refactoring.
B
Overall it's okay. It's always scary, because these kinds of refactorings can make unnoticed changes to some assumptions and semantics.
B
Hopefully it's okay! It will be useful in the future as well, because we had pod overhead and pod — what was the other one — the resize, where we had to change things in multiple places. So hopefully this will help us in the future to centralize it. I don't think it went all the way — there's still more room to improve — but I think they did enough.
B
That's the one where it makes things a bit ugly — you know, the non-zero thing, right? Yeah, that's the one that still sticks out like a sore thumb in the way that we calculate things, but overall I think they at least unified it.
B
The way we iterate over the containers — we have the main containers, the init containers, and most likely, in the next PR that they have, there are going to be the sidecar containers — at least the iteration is now unified over all places. But for that specific one, yeah, they had to do a bit of quirky logic to take into account the non-zero defaults for scoring.
B
Sergey mentioned there is a very slight chance — I don't know if they're going to get it in in the next two days. The main PR is up for review; I think this one is a refactor that's supposed to make things easier for them.
B
Okay: PodGroup objects for the MPI operator.
C
Yeah. So — I don't want to spend too much time on this, but I wanted to share a discussion that we were having in the MPI operator. So I guess, let me share my screen again.
C
So the MPI operator, and in general Kubeflow, currently has support for the PodGroup from Volcano, and the main training operator — the Kubeflow training operator — also added support for the coscheduling plugin: the PodGroup from the scheduler-plugins repo. And similarly, there is another open issue to add support for...
C
What's this other one — Koordinator, with a K. So basically, I'm not liking the idea that we dump a lot of APIs into Kubeflow to support different schedulers, so I was kind of playing devil's advocate, asking for a more unified solution that doesn't involve importing all these dependencies — at least in the MPI operator, but the same thinking extends to the rest of Kubeflow.
C
The main problem is that if we move the dependencies out of the MPI operator, then it means that, on the other side, the schedulers have to add the code to support the pod groups — sorry, the schedulers need to add code to handle the specific job objects, in this case MPIJob, TFJob, and so on and so forth.
C
So, in a way, the dependency nightmare just switches from one repo to the other repos, and in general, I guess this is an N-to-M problem, right? You have N schedulers and you have M jobs, and there's going to be this dependency nightmare somewhere. So I was advocating for not having this nightmare in Kubeflow.
C
But then it's unclear what the correct solution is. It's a long thread, but one of my proposals was: okay, if we look at the Kubeflow objects — let me just share that real quick. So if we look at the Kubeflow objects...
C
They all look similar — that's the good thing about it; they all look the same, rather. There's this object called RunPolicy, which is in the spec — the MPIJob spec, and similarly the other specs have this runPolicy field — and then, within the RunPolicy field, there is, I think it's called, the schedulingPolicy field.
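The shared shape being described can be sketched roughly like this. The field names are abbreviated from memory as an illustration of the idea — a controller-neutral schedulingPolicy embedded in every job spec — and should be checked against the actual kubeflow/common types rather than taken as the real API.

```go
package main

import "fmt"

// SchedulingPolicy is the gang-scheduling knobs shared by the Kubeflow
// job kinds in this discussion (field set is illustrative, not exhaustive).
type SchedulingPolicy struct {
	MinAvailable  *int32
	Queue         string
	PriorityClass string
}

// RunPolicy is the common policy object each job spec embeds.
type RunPolicy struct {
	SchedulingPolicy *SchedulingPolicy
}

// MPIJobSpec sketches one of the job specs; TFJob, PyTorchJob, etc.
// would carry the same RunPolicy field.
type MPIJobSpec struct {
	RunPolicy RunPolicy
	// ...role-specific replica specs omitted
}

func main() {
	min := int32(4)
	job := MPIJobSpec{RunPolicy: RunPolicy{SchedulingPolicy: &SchedulingPolicy{MinAvailable: &min}}}
	// A separate controller (or the scheduler plugin itself) could read
	// this common field to build a PodGroup, without the operator
	// importing any scheduler-specific APIs.
	fmt.Println("minAvailable:", *job.RunPolicy.SchedulingPolicy.MinAvailable)
}
```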
C
But this is just for Kubeflow, right? If we think of Ray, or we think of Spark, likely they don't have this unified spec.
C
So that's still a problem. So, again, to summarize: my overall idea was that, well, we have these similar specs, so there could be a separate controller — or even the scheduler plugins themselves, sorry — that can use this API to build the PodGroup.
C
Then there is no dependency from the MPI operator on scheduler-plugins, and with that I would probably also advocate for removing the Volcano code from the MPI operator — pushing Volcano to do the same, to own the problem, basically: the problem of having to deal with this integration. And my hope is that — well, of course, this is not maintainable, right?
C
For example, the idea of having a suspend subresource — or you can rename it to a job-queueing subresource, or there can be other names — so CRDs would be able to advertise that they're schedulable as a group, and things like that, in a unified manner.
C
Sorry — yeah, I guess, in a unified manner. Or the other alternative, which is to actually upstream the scheduler-plugins PodGroup API into mainline Kubernetes, which, last time we discussed it, was pending on a better understanding of autoscaling and of the usage of the PreEnqueue extension point.
C
So anyways, I don't expect that we will hold off from doing just what we have been doing, which is adding the coscheduling plugin dependency in the operator. But at least I wanted everybody to think about the bigger picture, and that what we are trying to push for is not maintainable for any project — whether it's Kubeflow or scheduler-plugins or Volcano.
C
Yes. So that's the summary of the discussion. If you have any thoughts now, or if you want to go into the issues and share your thoughts, that would be great. Or if anybody wants to start designing or thinking about these solutions — about a subresource, or upstreaming the PodGroup — that would be great. Anyways, I'll stop here for questions.
A
How many operators are there supporting — using the same spec, like schedulingPolicy, and leveraging the PodGroup? The MPI operator, the training operator — anyone else?
C
In terms of repos, it's two; in terms of APIs, there are more — I think there are around five or seven. There are TFJob, PyTorchJob — I don't remember the other names, but...
C
They all share the same code, except for the MPI operator, which is in a separate repo; the rest share the same code, so they look the same. Yeah, I guess the good thing about this schedulingPolicy struct is that, well, it's kind of serving as a playground. Maybe that's all the fields we need; maybe we will discover that we need more fields, and then maybe this schedulingPolicy can actually also be upstreamed, or be part of this subresource that we define for jobs.
B
That would create, you know, potential unification, right? But again, all the solutions need convincing and collaboration from different repos: either, for example, these repos change in the way that you described — advertising a subresource or whatnot — or we build a unified API that handles all of this for everyone.
A
Yeah, one second, let me share my screen. I think this is one that maybe you or Aldo mentioned a little bit before.
A
So, yeah, the background is that we do have score plugins, and one user asked: how can I make this behavior win over the other score plugins? Our official answer is to use the weight for the individual plugin, right? That's correct, but don't forget that we have percentageOfNodesToScore.
A
So let's think about a simple case. Suppose we have a 20-node cluster and percentageOfNodesToScore equals ten percent. To simplify the real logic — we actually have a minimum of 100 nodes to score, but just forget about that — simple math says 20 multiplied by 10% equals two. Okay. Now we have a deployment with two pods, and they have a preferred pod affinity that says they want to be collocated on the same node. Pod A1 is scheduled, and only two nodes are scored, because of percentageOfNodesToScore: it evaluates node 1 and node 2, and, let's say, randomly lands on node 2.
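The sampling arithmetic above can be sketched as follows. This deliberately drops the real kube-scheduler's 100-node floor and its adaptive default percentage, exactly as the speaker does, so the function name and shape here are a simplification rather than the actual implementation:

```go
package main

import "fmt"

// numNodesToScore is a simplified version of the feasible-node sampling
// discussed above: score only percentage% of the cluster's nodes.
// (The real scheduler additionally enforces a minimum of 100 feasible
// nodes and an adaptive default percentage — ignored here on purpose.)
func numNodesToScore(numAllNodes, percentage int) int {
	n := numAllNodes * percentage / 100
	if n < 1 {
		n = 1 // always score at least one node
	}
	return n
}

func main() {
	// 20 nodes at percentageOfNodesToScore=10 → only 2 nodes are scored
	// per scheduling cycle.
	fmt.Println(numNodesToScore(20, 10)) // 2
}
```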
A
So when pod A2 comes, it does want the node that pod A1 landed on, but internally we have a scoring index that moves by one position each cycle, so as time goes on, scoring can start at any index.
A
Let's say, okay, this time when pod A2 comes, the evaluation scope is node 6 and node 7. In this case the preferred pod affinity directive is a no-op, because it cannot evaluate node 1 and node 2 — they're crossed out — so pod A2 lands somewhere randomly, and the scoring plugin doesn't work at all.
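The moving start index in this scenario can be sketched as a sliding window — names and the 1-based node numbering here are illustrative, but the behavior mirrors the described round-robin offset over the node list:

```go
package main

import "fmt"

// window returns the 1-based node numbers evaluated in one scheduling
// cycle: `size` nodes starting at the round-robin offset `start`,
// wrapping around the node list of length numNodes.
func window(start, size, numNodes int) []int {
	out := make([]int, 0, size)
	for i := 0; i < size; i++ {
		out = append(out, (start+i)%numNodes+1)
	}
	return out
}

func main() {
	// Pod A1's cycle started at offset 0: nodes 1 and 2 were evaluated.
	fmt.Println(window(0, 2, 20)) // [1 2]
	// By the time pod A2 arrives, the offset has advanced to 5:
	// only nodes 6 and 7 are evaluated, so the affinity score that
	// wanted node 2 never gets a chance to run — a no-op.
	fmt.Println(window(5, 2, 20)) // [6 7]
}
```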
A
This issue is pretty obvious in an idle cluster, because in an idle cluster, once you start a pod, you reach the sample size right away; but in a very busy cluster, you may need to go through all the nodes just to reach the sampling size. That is what I observed in our internal production cluster.
A
So this is one case, and there is another. This case happens with a scheduling directive we provide on the pod, but there can be other scenarios where the directive is applied on the node — for example, the PreferNoSchedule taint that can be applied to the nodes.
A
Maybe there are customized situations where they don't want incoming pods to land on a node, but they don't want to put a strict NoSchedule taint on it — just a preference. In this case, again, when the scheduler's internal scoring index lands such that it picks up exactly the nodes with the PreferNoSchedule taints, the pod lands there anyway. So that is also not ideal, and it makes PreferNoSchedule just a no-op.
A
So what I want to propose is to have a way for the scheduler plugins to offer options to sort the candidate nodes. In our pre-filter right now, we enable the user to provide a list saying: okay, you must try these nodes, right? That is the current situation — there's basically no sorting there; it's just a mandatory node list. But in this case it would be a prioritized list: some nodes I want to prioritize, and some nodes I may want to deprioritize.
A
So
providing
a
sorting
function
out
there
can
make
this
make
the
if
sampling
logic
more
accurate.
So
this
is
basically
my
idea
and
this
logic
should
be
applied
to
the
prefilter,
either
through
a
parameter
or
through
some
like
preamp,
so
pre-filtered
extension,
something
so
that
can
provide
a
sorting
function.
So
the
pre-filter
button
can
Implement
that
so
that
is
Sovereign
basic
idea.
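One possible shape for that contract is sketched below. The `nodePreference` type and `sortCandidates` helper are purely hypothetical names for the proposal being discussed, not an existing scheduler-framework API: a plugin expresses which nodes it would like evaluated first, and the sampler orders its candidates accordingly instead of starting at a random offset.

```go
package main

import (
	"fmt"
	"sort"
)

// nodePreference is a hypothetical plugin-supplied hint:
// a higher value means "evaluate this node earlier".
type nodePreference func(node string) int

// sortCandidates stably orders the candidate nodes by descending
// preference, leaving equally-preferred nodes in their original order.
func sortCandidates(nodes []string, pref nodePreference) []string {
	sorted := append([]string(nil), nodes...) // don't mutate the input
	sort.SliceStable(sorted, func(i, j int) bool {
		return pref(sorted[i]) > pref(sorted[j])
	})
	return sorted
}

func main() {
	nodes := []string{"node-6", "node-7", "node-1", "node-2"}
	// The inter-pod-affinity plugin would prefer the node(s) where
	// pod A1 already landed.
	preferred := map[string]bool{"node-1": true, "node-2": true}
	pref := func(n string) int {
		if preferred[n] {
			return 1
		}
		return 0
	}
	fmt.Println(sortCandidates(nodes, pref)) // [node-1 node-2 node-6 node-7]
}
```

With this ordering, the two-node sampling window from the earlier example would evaluate node-1 and node-2 first, so the affinity score is no longer a no-op.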
A
Also, there's this obvious issue — I think, Abdullah, you filed one a long time ago where there is a similar idea.
B
There's an issue where people are basically assuming that, if a node exists with the preferred assignment, the scheduler will actually pick it — but for a very long list of reasons, the scheduler doesn't. One of them is obviously this one: we are always looking at a subset of the nodes, right? Exactly. I mean, we tried to fix the weighting issue between scoring plugins, but this one, I think, is a bigger problem. Right, yeah.
C
But I think, intrinsically, this is going to hurt performance, because basically you need to evaluate all nodes in a way. So...
B
I understand what you're proposing; I'm just saying that the most extreme case, where you want to handle all the corner cases — like the one that Aldo mentioned — is basically to run the score before, and then apply the filter.
A
There's definitely no conflict; they work together. It's just that, with this percentage of nodes to score, I want to prioritize some nodes; otherwise I cannot find the desired nodes that I want. Basically, the node sampling is pretty random — that's the key issue here.
A
We add this to the scheduling framework, so plugin authors can choose how to implement it. For this case, when pod A2 comes, it carries the preferred pod affinity, so the inter-pod affinity plugin can implement the contract to say: okay, I want to prioritize evaluating node 1 and node 2, and I'm fine if these two nodes don't fit pod A2 — then I go to the others, I don't know. But if they do fit, I want to give them a pretty high score.
C
Let's simplify and assume there exists only one score plugin. Sorting-wise, why is sorting any different from scoring 100% of the nodes?
B
For this one, you could potentially have a heuristic that iterates over all the nodes but doesn't really need to be as expensive as the scoring — something much simpler, potentially, to do the sorting — depending on the plugin, of course.
B
I guess my point is: is it possible that we can sort at a lower cost than scoring all the nodes?
A
Yeah, there can be that kind of optimization. That is why another possible option is to ask the scheduler plugins to return a list of prioritized nodes, and maybe a deprioritized list as well. There can be other implementation options, so basically, in the proposal, I will list all the options: sorting is one, and simply giving a prioritized list is also one.
A
If we give this option to choose which nodes are prioritized, it depends on the implementation and on the interface we give to them. If we say, okay, just name the nodes — that is one contract — then they can basically use some kind of selector, a node selector, to give us the node list. On the other hand, if we want the contract to be more generic, maybe sorting is just a big O(n log n) operation cost.
B
A contract like the pre-filter, or whatever extension you're going to run before Filter, that returns whatever prioritization or subset of nodes — yeah.
A
I would say sometimes there can still be some performance gains compared to brute-force 100% scoring. So yeah, let's see how the proposal can resolve your concern, comparing its benefit to the score-100%-of-nodes solution.
C
I think the design has to include specific examples, and not just what the API looks like — because if it's going to end up costing the same amount of CPU as just scoring 100 percent, I don't see a reason why we should do it.