From YouTube: Kubernetes SIG Scheduling Weekly Meeting for 20220616
A: This meeting is being recorded. All right, thank you, Wei, for starting the recording. As you might know, this meeting is being recorded and it will be uploaded to YouTube. Remember to adhere to the CNCF code of conduct.
A: Sorry, yeah, I mean you are good to go. Oh yes, is my screen being shared? Yeah, I can see your screen. Okay, perfect. So.
A: Let's quickly recap some of the SIG Scheduling KEPs. The first one is kind of obvious; I guess we have been working on this for a lot of releases. We've gone through v1alpha1 and v1alpha2, and we've been through beta 1, beta 2, and beta 3.
A
So
after
all
these
iterations,
we
we
think
is
the
time
to
finally
graduate
the
scheduler
component
config
to
ga
and
well
that
practically
means
we're
just
copying
the
code
copying
the
latest
beta
3
api.
We
have
and
renaming
it
to
v1.
So
this
this
was
fairly
straightforward.
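For reference, a minimal sketch of what pointing at the GA version could look like, assuming the graduated types are published under k8s.io/kube-scheduler/config/v1 as proposed; the schema is the v1beta3 one, and only the apiVersion changes:

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	schedv1 "k8s.io/kube-scheduler/config/v1" // assumed package path once the GA API lands
	"sigs.k8s.io/yaml"
)

func main() {
	// Same schema as v1beta3, served under the new v1 group/version.
	cfg := schedv1.KubeSchedulerConfiguration{
		TypeMeta: metav1.TypeMeta{
			APIVersion: "kubescheduler.config.k8s.io/v1",
			Kind:       "KubeSchedulerConfiguration",
		},
	}
	out, err := yaml.Marshal(&cfg)
	if err != nil {
		panic(err)
	}
	// Roughly the YAML you would pass to kube-scheduler via --config.
	fmt.Print(string(out))
}
```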
A: I'm not seeing... I'm not looking at the call, so please speak up if you want to say something. Okay, the next one is a new feature.
A
This
we
actually,
oh,
yes,
this
is
a
new
one.
Basically,
today,
when
you
define
a
pot.
A
You
have
to
specify
how
to
match
the
pots
by
specifying
the
the.
A
Of
of
the
label,
this
has
been
working,
okay,
except
when
you,
when
you
have
a
rolling
upgrade.
A
So
when
you
have
a
rolling
upgrade
each
each
version
has
a
different
label,
and
you
might
want
to
limit
your
spreading
to
specifically
this
version
right
or
this
version,
or
this
replica
set
so
that
you,
once
your
replica
set
finishes.
Upgrading
or
I
mean
the
old
replica
set-
is
fine,
fully
removed.
A
Your
new
replica
set
is
fully
spread
it
so
for
that
alex
introduced
or
is
proposing
a
new
api
that
simply
expands
the
the
topology
spread
policy,
so
you
can
just
say:
derive,
derive
the
label
from
the
current
part
and
use
that
to
match
with
other
parts.
A
This,
as
far
as
I
know,
is
already
yes,
this
is
tracked,
so
this
is
good
to
go.
It's
just
pending
implementation
and
reviews.
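For illustration, a minimal Go sketch of the proposed field, assuming it lands as matchLabelKeys on topologySpreadConstraints and using a k8s.io/api version that already contains it; the scheduler would look up the listed keys (typically pod-template-hash) on the incoming pod and fold the resulting key=value pairs into the selector:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	constraint := corev1.TopologySpreadConstraint{
		MaxSkew:           1,
		TopologyKey:       "topology.kubernetes.io/zone",
		WhenUnsatisfiable: corev1.DoNotSchedule,
		// Static selector: matches pods from every revision of the Deployment.
		LabelSelector: &metav1.LabelSelector{
			MatchLabels: map[string]string{"app": "web"},
		},
		// Proposed field: the value of pod-template-hash is taken from the pod
		// being scheduled, so skew is computed per ReplicaSet revision and a
		// rolling update does not distort the spreading of the new revision.
		MatchLabelKeys: []string{"pod-template-hash"},
	}
	fmt.Printf("%+v\n", constraint)
}
```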
A: Okay, the next one, let's look at this. The next one is a graduation from alpha to beta.
A: For the topology spread policy, currently the calculations are based on the nodes that exist, and this is somewhat limiting if you have the cluster autoscaler: maybe your nodes today are only in two zones, but you want to spread across three zones. So with this simple minDomains API field you can now say: I want to spread among zones, but I also want there to be at least three zones. So if the scheduler doesn't find a third zone, it would consider that there is a zone which has zero nodes, and if the skew doesn't work out with that theoretical zone, then the pod doesn't get scheduled.
A: And then the autoscaler would kick in, and you will have perfect spreading after the autoscaler adds nodes. So this was already released as alpha in 1.24; we're targeting beta now in 1.25. So that's what is left here.
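For illustration, a minimal sketch of the minDomains field using the k8s.io/api types (the field is alpha in 1.24); note that it only takes effect together with whenUnsatisfiable: DoNotSchedule:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	minDomains := int32(3)
	constraint := corev1.TopologySpreadConstraint{
		MaxSkew:     1,
		TopologyKey: "topology.kubernetes.io/zone",
		// minDomains is only honored with DoNotSchedule.
		WhenUnsatisfiable: corev1.DoNotSchedule,
		LabelSelector: &metav1.LabelSelector{
			MatchLabels: map[string]string{"app": "web"},
		},
		// If fewer than 3 zones currently have eligible nodes, the missing
		// zones are treated as having 0 matching pods; pods that would exceed
		// maxSkew stay Pending, which lets the cluster autoscaler bring up a
		// node in a new zone.
		MinDomains: &minDomains,
	}
	fmt.Printf("%+v\n", constraint)
}
```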
A: So what is this one about? Pod topology spreading currently ignores taints and tolerations. So this KEP is introducing two fields, one for node affinity and one for tolerations, to determine whether or not they should be taken into account when calculating skew. As a feature, I think this one, yes, also has the same problem: it has an outdated template, but the code is already there.
A: I think the code is even already merged, so the KEP needs to be updated. Yes, so that's also going for alpha in 1.25.
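For illustration, a sketch of the two proposed fields, assuming they land as nodeAffinityPolicy and nodeTaintsPolicy with Honor/Ignore values and using a k8s.io/api version that contains them; the values shown mirror today's implicit behavior:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	honor := corev1.NodeInclusionPolicyHonor
	ignore := corev1.NodeInclusionPolicyIgnore
	constraint := corev1.TopologySpreadConstraint{
		MaxSkew:           1,
		TopologyKey:       "topology.kubernetes.io/zone",
		WhenUnsatisfiable: corev1.DoNotSchedule,
		LabelSelector: &metav1.LabelSelector{
			MatchLabels: map[string]string{"app": "web"},
		},
		// Should the pod's nodeAffinity/nodeSelector be respected when
		// deciding which nodes count toward the skew calculation?
		NodeAffinityPolicy: &honor,
		// Should node taints (versus the pod's tolerations) be respected when
		// deciding which nodes count toward the skew calculation?
		NodeTaintsPolicy: &ignore,
	}
	fmt.Printf("%+v\n", constraint)
}
```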
A: And I noticed, Wei, you added this one today. I suppose this is currently under discussion, because there is a lot missing in the KEP currently.
B: We will try to polish all the missing pieces, like the implementation details, as well as the explanation of how to integrate with the CA, the cluster autoscaler, and some other things. So, basically...
B: I think the idea was brought up in a previous meeting that we want a Kubernetes-native API to represent a scheduling capability to schedule a group of pods all together, or to schedule enough of them. So that's basically the scheduling directive we want to introduce natively. That's the background. In terms of the design and the implementation, there are still some details that need to shape up, so Alex and I are working on it, and hopefully we can catch up with the freeze date next week.
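Since the design is still being shaped, the following is only a hypothetical sketch of what such a "group of pods, or enough of them" object could look like, loosely modeled on the PodGroup CRD used by the coscheduling plugin in kubernetes-sigs/scheduler-plugins; none of these names are settled by the KEP:

```go
package main

import "fmt"

// Hypothetical, illustrative types only; the real KEP may use different
// names, or attach this to the pod spec instead of a separate object.
type PodGroupSpec struct {
	// MinMember captures "schedule enough of them": members start running
	// only once at least this many pods in the group can be placed.
	MinMember int32
	// ScheduleTimeoutSeconds bounds how long partially placed members are
	// held before the group is rejected and its resources are released.
	ScheduleTimeoutSeconds *int32
}

type PodGroup struct {
	Name string
	Spec PodGroupSpec
}

func main() {
	timeout := int32(60)
	pg := PodGroup{
		Name: "training-job",
		Spec: PodGroupSpec{MinMember: 4, ScheduleTimeoutSeconds: &timeout},
	}
	fmt.Printf("%+v\n", pg)
}
```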
A: Right, yes. As it currently stands, the KEP only proposes an API without an implementation, and it's actually a goal not to give an implementation.
A: Yes, and with that, I think the major concern from my point of view is that there are a few alternative implementations out there. Because if we later decide that we need...
B
It's
yeah,
I
agree
we
we
do
want
to
do
it
right
from
day
one,
but
on
the
other
hand,
I'm
seeking
a
way
to
do
the
things
into
iteratively
like
okay.
We
can't
get
this
implemented
and,
additionally
adding
the
support
for
reservation
for
back
filling
so
instead
of
well.
B
We
so
basically
I'm
not
that
convinced
that
we
need
to
get
everything
ready
and
then
start
to
implement
the
idea
so
like
if
the
code
scheduling
is
a
standalone
feature
that
can
be
implemented
right
now,
so
it's
better
to
have
the
feature
to
open
to
the
users
and
when
the
reservation,
api
and
backfilling
stuff
are
ready,
they
can
add.
On
top
of
that
feature,
that
is
how
the
principles
we
design
each
scheduling,
plugins
and
each
works.
A
But
this
proposal
is
also
adding
us
a
field
to
the
v1
pod
spec
and
that
might
be
more
problematic.
So
we
need
to
figure
out.
B
I
do
I'm
aware
of
the
reservation
requirement
that
can
not
not
a
guarantee
but
best
efforts
govern
so
the
better
best
efforts
to
reserve
some
resources
for
the
power
group
right,
but
the
resolution
can
be
only
very
core
screen,
I
mean
so,
even
if
you
reserve
this
kind
of
resource
there
might
be
some
like
interpod
constraints
out
there
that
you
are
pretty
hard
to
reserve
the
resource
on
each
knows
to
quite
satisfy.
B
Otherwise
you
have
to
make
the
knowledge
of
all
the
past
distribution
of
the
cluster
view,
so
they
can
make
a
very
guaranteed
reservation
right.
So
I
mean
my
point:
is
the
reservation?
Is
the
best
efforts
manner
right,
so
that
is
good
to
have
to
on
top
of
the
code
scheduling,
but
because
scheduling
doesn't
quite
depends
on
the.
B: Maybe we should have composed one single object, called something like scheduling constraints, and put everything inside there. But I mean, this is the current state and we are not breaking it.
A
Well,
I
specifically
mean
multiple
objects,
but
again
the
the
problem
is
that
the
the
size
of
the
database.
B
C
C
C: I guess if we have a second, I just want to call out that in the descheduler repo we started doing some of the refactoring that we were talking about, trying to make it more of a framework, and to get the code a little bit more stable, more adaptable, and customizable. So those changes are all going on in the repo.
C
We
talked
a
lot
about
it
on
these
calls
in
the
past,
calls
and
posted
the
design
docs.
For
so,
if
you,
if
anyone's
interested
or
wants
to
help
out
or
offer
input,
please
feel
free
to
join
in
on
any
of
the
for
us
that
are
opening.
I
know
young
has
opened
a
couple
already
and
we're
just
gonna
start.
You
know
migrating
some
of
the
internal
code
of
the
descheduler
into
more
of
a
you
know,
plug-in
design
like
the
scheduling
framework.
So
just
a
little
shout
out
to
that.
That's
all.
A: Kueue is a project sponsored by SIG Scheduling as well, just like the descheduler. And yes, it is. The question was whether it's a separate repo only temporarily, or whether it would be merged back into kube-scheduler eventually.
A: For example, or maybe just the APIs would be merged into Kubernetes, but that's something that needs to mature more before we can even consider it. And it kind of relates to the pod group API as well: right now there is a similar API in Kueue, so it might also be valuable to somehow see if we can merge the pod group with the Workload API, and they could share basically the same object in the future.
A: But yes, it's still an open discussion, a long-term discussion.
B: If we don't have anything else, I have a rough idea to improve the metric for measuring scheduling latency, but before I share it, I can give time to any other folks who have other topics.
B
If
not,
I
can
share
my
screen.
I
will
take
you
another
or
eight
minutes
bear
with
me.
So
basically
in
some
production
system,
especially
there's
a
a
lot
of
tenants
and
you
know
like
3k
or
5k
nails
cluster,
there
might
be
some
different
priority
parts
and
your
pos
scheduling.
B
May
your
scheduling,
scheduling
chances
may
be
head
of
blocked
by
other
parts,
but
right
now
we
don't
have
a
very
good
metrics
to
measure
that.
So,
basically,
if
you
look
at
this
picture
as
past
successful
scheduling
may
increase
several
attempts
and
each
attempts
in
include
three
kind
of
phases.
B
So
the
part
here
has
to
be
weighted
for
the
for
the
voice
chance
and
we
are
missing
the
time
that
the
party
is
waiting
in
active
queue
and
also
then
the
next
phase
is
that
okay,
the
father
has,
has
this
term
to
be
scheduled
to
be
popular,
then
it
enters
the
pure
scheduling
phase.
That
is
the
usually
the
regular
phase
we
are
talking
about,
like
pre-filter
filter,
pre-score
score.
All
the
things
happen
in
this
phase,
so
we
could.
B
We
can
call
it
either
pure
scheduling
phase
and
after
that,
if
it's
claimed
to
be
unscheduleable,
it
will
have
to
owner
the
global
back-off
timer
settings
so
that
it's
like
a
penalty
so
saying
that,
okay,
you
are
unscheduled.
You
have
to
be
sit
there
for
a
while
and
until
another
the
back
of
timer
is
is
up.
You
can
be
popped
out
back,
so
that
is
the
backup
queue
so
for
right.
B: So right now, within one scheduling attempt, and by attempt I mean it might be an unsuccessful attempt because no node can fit the pod, we are missing the orange part and the green part, and we only have the blue part. In the code, the local variable is called scheduling latency, but the metric name is scheduling_attempt_duration_seconds, and a legacy version was called e2e_scheduling, but that name is misleading, so we deprecated it a while back.
B: I think this metric is deprecated and should be replaced by this one. Okay, anyway. So what I want to propose is that we may want to record the duration of the orange block, or even the green block. So right now I did a PoC internally where I expose the orange block: I add a phase to the scheduling latency metric, and locally...
B: So this is just like the current metric we have. I want to check, in each attempt, the pure scheduling time period. So it's like: okay, I only spent 952 nanoseconds, which is pretty good because it's running in my local cluster, and only two milliseconds... but I do not know.
B: Then I execute this query, the 90th percentile, and the span is not a small number; it's almost 300 milliseconds, right? So if you want to break down some production issues, say a customer may be asking you why their pod stays pending.
B
They
are
painting
for
so
long
time
and
you
want
to
narrow
down.
Maybe
it's
a
custom
scheduling,
plugins
implementation
issues
which
cause
the
issue,
because
the
performance
issue,
or
it's
just
waiting
in
actual
too
long
or
waiting
back
off
queue
too
long.
So
that
is
the
idea
I
want
to
put
bruh.
I
just
want
to
get
you,
although
it's
pretty
preliminary
about
to
gather
some
feedback.
If
you
want
to
have
some
or
have
similar
requirements.
A
If
I
may
it,
this
seems
very
valuable,
but
I
think
we
probably
want
a
new
metric,
just
maybe
dedicated
to
cues
queuing
time
or
something
q,
q,
q
duration,
because
because
the
the
other,
the
the
metric
is
already
stable
right.
Yes,.
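For illustration, a minimal sketch of what a dedicated queue-duration histogram could look like, built with the Prometheus client library; the metric and label names here are purely illustrative, not the scheduler's actual metrics, and it stays separate from the stable scheduler_scheduling_attempt_duration_seconds:

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical metric: how long a pod sat in a scheduling queue before being
// popped, labeled by which queue it waited in.
var podQueueDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "scheduler_pod_queue_duration_seconds",
		Help:    "Time a pod spent waiting in a scheduling queue before being popped.",
		Buckets: prometheus.ExponentialBuckets(0.001, 2, 16),
	},
	[]string{"queue"}, // e.g. "active" or "backoff"; a "priority" label could be added later
)

func main() {
	prometheus.MustRegister(podQueueDuration)

	// Example observation: a pod waited roughly 300ms in the active queue.
	start := time.Now().Add(-300 * time.Millisecond)
	podQueueDuration.WithLabelValues("active").Observe(time.Since(start).Seconds())

	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":9090", nil)
}
```

A percentile query such as histogram_quantile(0.9, sum(rate(scheduler_pod_queue_duration_seconds_bucket[5m])) by (le, queue)) would then show whether pods spend their waiting time in the active queue or the back-off queue.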
B: And do you think recording the duration spent in the back-off queue is valuable? For now I didn't implement that; I just implemented the duration in the active queue.
A: Good. So for the active queue, you count from the time it enters the queue until it's popped?
B: But maybe we also have to introduce other label dimensions, like priority, to see each priority's pod durations. Yeah, but yeah, it's just some early PoC; we will try to draft a formal proposal.
A
Oh
there's
one
question
always
feel
free
to
to
talk,
but
if
not
I'm
happy
to
repeat
the
question
from
the
chat,
is
there
any
benchmark
to
test
the
throughput
of
all
of
all
or
nothing
scheduling?
Now.
A: I don't know if this was ever discussed, but there is some contention, at least on my side, with using the term gang scheduling. I prefer coscheduling, or all-or-nothing, which is even more expressive.
A
All
right
with
that,
I
think
we
can
close
this
session.
Yeah
just
remember
the
enhancements
freeze
is
next
week
and
yeah.
If
you
still
have
any
proposals
you
you
have
some
time,
but
it's
very
limited
so.