From YouTube: Kubernetes SIG Scheduling Weekly Meeting for 20230810
Description
Kubernetes SIG Scheduling Weekly Meeting 2023-08-10T17:02:14Z
A: Hi everyone, welcome to today's SIG Scheduling meeting. The meeting is being recorded, so be respectful to each other; we'll be publishing it to the internet. Okay, let me share my screen.
A: The first item: last weekend I found some potential issues, so I will start with the original bug I found and give you a heads up.
A: It's a kind of regression in 1.27, because when I upgraded scheduler-plugins from 1.26 to 1.27, I spotted this issue. The issue is that we introduced a skip machinery into PreFilter.
A: Every PreFilter plugin is sort of wired up with a hook, the PreFilterExtensions (AddPod/RemovePod), because those need to be called during the preemption dry run. Some PreFilter implementations rely on reading a pre-calculated state under a particular cycle-state key, and that causes problems if the plugin is skipped and doesn't write that key. And also, in the latter phase, when preemption calls the AddPod and RemovePod functions, it will hit errors and make preemption totally non-functional.
A: If you look at this logic: if a PreFilter plugin returns Unschedulable, it just returns, without setting the skip-filter-plugin state. So the fix was to ensure that this state is always populated, so that in the latter phase we don't get this issue. It happens on 1.27, so we should cherry-pick it to 1.27. And this turned into another discussion with Kensei, which is that this might be a general problem.
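To make the first regression concrete, here is a minimal, hypothetical sketch of the fix described above: the set of skipped PreFilter plugins is written to the cycle state even when a plugin returns Unschedulable and the loop returns early. The types (`Status`, `CycleState`, `PreFilterPlugin`) and the state key are simplified stand-ins for illustration, not the real scheduler framework API.

```go
package main

import "fmt"

// Simplified stand-ins for the framework types discussed above.
type Code int

const (
	Success Code = iota
	Skip
	Unschedulable
)

type Status struct{ code Code }

func (s *Status) IsSkip() bool    { return s != nil && s.code == Skip }
func (s *Status) IsSuccess() bool { return s == nil || s.code == Success }

type PreFilterPlugin struct {
	Name string
	Run  func() *Status
}

// CycleState stands in for the per-scheduling-cycle state map keyed by
// well-known state keys (such as the skipped-plugin set).
type CycleState map[string]any

const skippedKey = "skip-prefilter-plugins" // hypothetical key name

// runPreFilterPlugins sketches the fix: the skip set is recorded in the
// cycle state no matter how the loop exits. Before the fix, an early
// Unschedulable return skipped this write, so the preemption dry run
// (AddPod/RemovePod) later read a missing set and misbehaved.
func runPreFilterPlugins(state CycleState, plugins []PreFilterPlugin) *Status {
	skipped := map[string]bool{}
	// defer guarantees the write happens even on the early return below.
	defer func() { state[skippedKey] = skipped }()

	for _, pl := range plugins {
		st := pl.Run()
		if st.IsSkip() {
			skipped[pl.Name] = true
			continue
		}
		if !st.IsSuccess() {
			return st // early return: the defer still records `skipped`
		}
	}
	return nil
}

func main() {
	state := CycleState{}
	plugins := []PreFilterPlugin{
		{Name: "A", Run: func() *Status { return &Status{Skip} }},
		{Name: "B", Run: func() *Status { return &Status{Unschedulable} }},
	}
	st := runPreFilterPlugins(state, plugins)
	fmt.Println(st.code == Unschedulable, state[skippedKey].(map[string]bool)["A"])
}
```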
A: Say plugin B is placed after plugin A; then plugin B doesn't have the chance to do its PreFilter calculation, that is, to pre-calculate the state which will be used in the latter phase. Yes, sorry, in the latter phase that is used for preemption. Then, once the pod turns out to be unschedulable, plugin B will raise an error and kill the whole preemption process. So this is the general problem I want to bring up today. Kensei?
A: One proposal is that we can say: okay, continue to run PreFilter even if a plugin returns Unschedulable. That will be sort of wasteful, and it's also not aligned with the current implementation; it means going through all the PreFilter calculations no matter whether they will be used or not. So this is one proposal, and he raised a PR, yeah, to run all PreFilter plugins when the preemption happens in the same scheduling cycle. I think... yeah, go ahead.
B: I have some questions. I think there is one bug that affects the kube-scheduler, right?
A: Yes, but luckily, or unfortunately, all our scheduler plugins return UnschedulableAndUnresolvable rather than Unschedulable, which means they block the preemption anyway. So the vanilla scheduler will be good. It just will be...
B: But I think there was a bug where we were not injecting this status into the nodes when we had Unschedulable.
A: Yeah, that's a second regression, which is also real. Kensei and I were discussing this, and we were both skeptical about our memory. I do think that UnschedulableAndUnresolvable should block the preemption, but the behavior is not like that. So we tried to find the PR that introduced the regression, and we did track it down. This is the third thing I want to mention later, but it definitely needs to be cherry-picked to 1.27 and 1.26. So this...
A: Yes, correct, they are affected. That means this semantics doesn't work at all. You will waste some cycles on preemption, because, okay, the plugin already tells you that this will be unresolvable, so why spend extra effort on the preemption dry run, etc.? That definitely needs to be checked.
A: Yeah, so this issue is not a regression. I would say it's a minor issue that may exist in all versions, because in-tree plugins always return UnschedulableAndUnresolvable, so the vanilla scheduler won't be impacted; it's just out-of-tree plugins. Yeah, Elia, go ahead.
C: Hi, I have a slightly related question, but maybe slightly different, so let me know if it's not applicable. This plugin specifically will disallow preemption in the scheduler. However, the pod can still be evicted through a drain, right? So let's say somebody wants to remove the pod by other means that go through eviction and PDBs; it will still be evicted, right? So in a sense, the functionality in the scheduler and the other paths to preempt will be slightly inconsistent. Is that a concern, or...
C: So in this case we will say: oh right, even though otherwise this pod should be scheduled here, because it cannot run anywhere else, we're going to exclude it. Which is not the same as if somebody decides to drain, you know; then that pod will be rescheduled or removed without reservations about its schedulability anywhere else. What I'm trying to get to is that those two paths will kind of behave slightly inconsistently with each other, right?
A: If you are talking about the regression where UnschedulableAndUnresolvable was not honored, then yes, it will impact the preemption path, because the preemption path doesn't need to be executed in this case, right? But it has happened since 1.26, so that may trigger some totally unnecessary eviction of pods. Yeah, you're right; that can cause your symptom. Yes, so...
B: Wait, but I think the bigger problem here is not that preemption might do unnecessary preemptions, but rather that it would crash. It could potentially crash, right? Because not all PreFilter plugins run.
A: Yeah, it won't crash, I would say, because once you hit the first Unschedulable status you return, right? And when you return, this semantics tells you: I don't want to continue to preemption. But you did. So the symptom is that, although it's not totally hopeless for preemption for this pod, it still goes through preemption, and then there may be some unnecessary eviction, but it still cannot find a good spot or something. Yes.
B: But if we stopped adding the status to the status map, then the scheduler, the preemption, would still try to...
A: Okay, so yeah, maybe this is confusing. Basically, this discussion involves two regressions: one is that the skip result is not populated; the other is that UnschedulableAndUnresolvable is not honored. He posted fixes for both regressions, and that inspired another discussion: whether blurring the difference between Unschedulable and UnschedulableAndUnresolvable is bad.
A: So the more general discussion is: for out-of-tree plugins, or for future in-tree plugins, that may use Unschedulable, Kensei says maybe we should continue running the PreFilter plugins for failures that only return Unschedulable, because that is a lightweight error, meaning that a further preemption may help. So this is one option. The other option is, as I said here, maybe open up an option in the scheduler config to let users choose whether they want to continue running PreFilter after the first Unschedulable or not.
A: The other is to try to infer the intention of the user without opening up a configuration. Basically, we can detect whether a PreFilter plugin implements the PreFilterExtensions or not. That is the critical part; that is what impacted the preemption, because preemption continues to call the AddPod and RemovePod functions.
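A rough sketch of that inference, using simplified, hypothetical stand-ins for the framework interfaces (in the real scheduler framework, a plugin's `PreFilterExtensions()` method returns nil when it does not implement AddPod/RemovePod):

```go
package main

import "fmt"

// Trimmed-down, illustrative mirrors of the interfaces mentioned above;
// the real ones live in the kube-scheduler framework package.
type PreFilterExtensions interface {
	AddPod()
	RemovePod()
}

type PreFilterPlugin interface {
	Name() string
	// Returns nil when the plugin does not implement AddPod/RemovePod.
	PreFilterExtensions() PreFilterExtensions
}

// A plugin without extensions: its pre-computed state is never consumed
// by the preemption dry run.
type noExt struct{}

func (noExt) Name() string                             { return "no-ext" }
func (noExt) PreFilterExtensions() PreFilterExtensions { return nil }

// A plugin with extensions: preemption will call AddPod/RemovePod on it.
type withExt struct{}

func (withExt) Name() string                               { return "with-ext" }
func (w withExt) PreFilterExtensions() PreFilterExtensions { return w }
func (withExt) AddPod()                                    {}
func (withExt) RemovePod()                                 {}

// usedByPreemption infers the intention without a new config knob:
// only plugins that expose PreFilterExtensions feed the preemption dry run.
func usedByPreemption(p PreFilterPlugin) bool {
	return p.PreFilterExtensions() != nil
}

func main() {
	fmt.Println(usedByPreemption(noExt{}), usedByPreemption(withExt{}))
}
```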
A: Right now... but, you know, plugins can be combined. So if we don't support that, we should add a sort of documented limitation: okay, if your PreFilter returns Unschedulable, you may place only one plugin of this kind, right? And maybe also place it at the end of your PreFilter plugin list. The other thing we can do: for the scheduler-plugins project, I think only one plugin, which is called CapacityScheduling, which is the...
A: Yeah, that would be a no-op for the in-tree plugins.
D: So I'm one of the engineers on the Apache YuniKorn project, which does make use of these APIs. I am...
D: We're not returning just Unschedulable; we're returning UnschedulableAndUnresolvable specifically because we don't want preemption to occur. Currently we actually have the entire preemption plugin disabled in our configuration, but being able to selectively control that, instead of globally, would probably be preferable. I like the semantics of Unschedulable meaning "we might try it," and UnschedulableAndUnresolvable almost...
A: Yeah, okay, I think the conclusion is that both of the two regressions will be cherry-picked, and then for this one we can just let it soak for a while, because it doesn't impact the vanilla scheduler, and it's also a very rare case that you have two PreFilter plugins returning Unschedulable at the same time. So if you want to understand the impact of this general discussion, you can read through this doc about where and how this can be triggered.
B: Yes, so I was debugging a customer issue, and we found that, you know, in Kubernetes we have this asynchronous nature. So what could theoretically happen is that you could schedule some user pods on a node before the system pods from a DaemonSet are created, because of this asynchronous nature, right? And, well, this is okay for service workloads.
B: Sometimes it's not okay. Certain applications, like gaming, video calls, or certain AI/ML frameworks, are not very resilient to this kind of preemption, so they need stronger guarantees. Now, these kinds of workloads also don't want to be evicted...
B: ...if there is some kind of temporary disruption on a node; for example, a node temporarily becomes not ready or unreachable, things like that. These applications don't want to be evicted in that case, so we provide semantics for not being evicted, right? That's through the NoExecute taint and the corresponding toleration.
B: So these users use the toleration for the not-ready NoExecute taint, right, which is supposed to give you this behavior of not being evicted. However, the actual semantics of NoExecute is that it also restricts scheduling.
B: So if there is a NoExecute taint, your pod cannot be scheduled; but additionally, if you provide a toleration for this taint, then you are allowed to schedule before the node becomes ready. So...
B: So the semantics of NoExecute is eviction and scheduling, whereas the semantics of NoSchedule is just scheduling. So yeah, if you have a toleration for NoExecute, you're tolerating being scheduled early and you're tolerating disruptions, or sorry, not being disrupted. Yes, so in a sense you can think of NoSchedule as a subset of NoExecute.
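The effect semantics being described can be summarized in a small sketch. The effect names match Kubernetes, but the helper functions are illustrative only, not a real API:

```go
package main

import "fmt"

// Taint effects as described above.
type Effect string

const (
	NoSchedule Effect = "NoSchedule"
	NoExecute  Effect = "NoExecute"
)

// restrictsScheduling: both effects keep new pods off the node,
// unless the pod tolerates the taint.
func restrictsScheduling(e Effect) bool { return e == NoSchedule || e == NoExecute }

// evictsRunningPods: only NoExecute also evicts pods already running on
// the node (again, unless tolerated). In this sense NoSchedule's behavior
// is a subset of NoExecute's.
func evictsRunningPods(e Effect) bool { return e == NoExecute }

func main() {
	for _, e := range []Effect{NoSchedule, NoExecute} {
		fmt.Printf("%s: blocks scheduling=%v, evicts running pods=%v\n",
			e, restrictsScheduling(e), evictsRunningPods(e))
	}
}
```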
E: Right, so you can have tolerations for them separately, but that's just the semantics of NoExecute.
B: So if there is a NoExecute taint and you tolerate that taint, you will be scheduled and you will be exempted from evictions. That's the behavior. What these workloads want is being saved from eviction, but they don't want to be scheduled early. They don't want to be scheduled before the node is ready.
B: So, basically, what they want is for the system pods to be ready, and only once the system pods are ready do they want to be scheduled; and once they are scheduled, they don't want to be evicted.
B: ...adding these taints for not-ready without the NoExecute effect, but at the same time we need the taint as it exists. So what I just said would be a breaking change, right? So we would need to make a NoExecute taint...
A: So if the symptom is something like a network partition, the node is already supplied with a system NoExecute taint, and then suppose you want to tolerate this kind of network partition, because maybe you are more confident, or you have an operator to handle the disruption of these pods. So you want to tolerate the system's automatically added NoExecute taint, so your pod already carries the NoExecute toleration.
B: Yes, that is correct. The problem is that we cannot change the behavior of NoExecute; that would be breaking, because you could have, for example, a user-defined system DaemonSet that needs to be scheduled before the node is ready, or is actually part of the readiness checks.
C: I have a clarification question, maybe. So, if I understand, the gist of the issue is that no...
B: Yes, but the problem is that, as I pointed out, the not-ready taint, there's only one, and it's NoExecute.
B: Yes. Now we have... the problem is that we still have a backwards-compatibility problem here, because if we add this extra taint, there might be pods, system pods, that don't tolerate it, and then they would... yeah.
C: Yeah, I'm looking at the nodes right now, at some of the examples where I see the nodes being not ready, and I do see a collection of taints. So you have node.kubernetes.io/unreachable:NoExecute, then you have node.kubernetes.io/unschedulable:NoSchedule, and, yeah, you have node.kubernetes.io/unreachable:NoSchedule as well. So the unreachable key comes with both NoSchedule and NoExecute, right?
B: For pods that... let's say for system pods, we were assuming that the only taint is NoExecute, so they explicitly tolerate NoExecute.
C: Yeah, that's what I'm doing; I'm specifically challenging that assumption, based on what my understanding is, and it could be entirely incorrect. I'm sorry if I'm wasting your time. While it is possible to have only one NoExecute taint, in reality, when nodes go not-ready or become unschedulable, they typically come with a set of taints, one or more, and usually they come accompanied by a NoSchedule taint with one of the different keys. But again, I could be wrong about that. So that's why I'm kind of trying to challenge the assumption.
A: Oh, and also one thing: I'm not sure if you're using just one single node or several nodes. There's an internal "fully disrupted" mode, where a different taint gets applied, or something that may make the behavior a little different. So, for example, if I'm just using one node, then maybe I won't see the expected behavior, because that is controlled by an internal special state called fully disrupted.