Kubernetes SIG Scheduling, 29 Jul 2021

From YouTube: Kubernetes SIG Scheduling Weekly Meeting for 20210729

A

Hi, everyone. Today is July 29, 2021, and welcome to this week's SIG Scheduling meeting. This meeting is being recorded, so be aware of what you're saying; it will be uploaded to YouTube later.

A

So, let's go over the agenda. The first item is that someone is proposing a customizable scheduling queue abstraction, or interface. The author talked to me offline at the very beginning. The background is that we do expose some internal mechanics of the scheduling queue, like an adjustable flushing interval and some other things, but the internal queue as a whole is not pluggable.

A

So if a user wants highly customizable scheduling queue behavior, not only the sorting, but they also want to fully manage the internals, like how to do backoff, how to do flushing, or maybe some advanced features like multiple sub-queues inside the scheduling queue interface, then maybe it's a good idea to expose the interface for them to implement. So this is the background of the KEP. But in terms of a KEP, right now a KEP is a very strict Kubernetes process.

A

If you raise a KEP, you have to follow all the templates and the full PR process that applies to everything in Kubernetes, and I think one critical issue there is that we may not have a long-run plan for how to refactor the scheduling queue, especially the internal scheduling queue implementation. Right now some things, like preemption and some other internal logic, are quite coupled with the internal queue implementation instead of being fully abstracted at the interface level. In the long run I do want the scheduling queue to be abstracted and refactored so that users don't need to implement a brand-new scheduling queue; that is too much of a burden for the end user.

A

But right now we don't have the time and effort to tease apart which parts should be abstracted, so we don't have a long-run plan, and that doesn't quite fit a KEP. I think Abdullah also mentioned here that maybe we can go with just the PR, with a document attached. That should be good enough for the initial phase. So, how do we move this forward?

B

I guess my point is that, as you mentioned, there is no clear plan. The proposal is really not clear or not deep enough. It simply says we want to be able to pass in an instance of the queue, instead of it always being instantiated implicitly as the internal one.

B

That is not a proposal, right? I mean, it doesn't really add much. And even if we go with that as planned, which right now is that we're going to allow you to instantiate a queue and pass it in, I don't think we want to canonically support this the way we do for plugins, because I don't feel that this is the right interface that we would want to have for a supported, customizable queue.

B

The current interface exposes a lot of the internal queue details, so I would like to see something nicer, more abstract, that allows you to do something similar to the framework, in a sense. Or, if we go with a much simpler approach, where it's just an interface that you implement, then that interface should again be abstract enough from the current internal implementation, and there should be examples of how different implementations can use that interface, how different implementations can apply to that interface.

B

Some sort of a survey, for example, of what others are doing in terms of queuing for schedulers and whatnot, would be interesting.

B

This one just seems like a hack, basically; what the proposal does is enable a hack, and I don't support supporting this long term, or even short term. I don't think we want to see this in the docs, something like "if you want a customizable queue, then go ahead and do this and that." I don't think we want to advocate for this at all. But if whoever is going to do this wants to take the risk, it's at their own risk. We will try to help them by doing this small refactoring, but I don't support doing anything more than that at this point without a clear proposal for how to support a customizable queue.

B

Okay, I don't know if everybody agrees or not.

A

Yeah, I got your point. So you do not like the current short-term way, which is to just put the very heavy interface there, let them do sort of a hack of an implementation, and then say that we support this, right? We don't think it's a good idea to claim we support this kind of custom queue. Rather, we should have a more detailed refactoring and a more detailed abstraction of the interface, and then claim we support it, right?

B

Yeah. Even if we do this, I don't support documenting it or saying that we support it the way we do for the framework and plugins and whatnot. And again, we're just doing it because it's a small refactoring. Sure, we can instantiate the queue at an earlier stage so that you can replace it if you like, and make it easier for anybody who is importing the scheduler code to change it, basically. But I don't think we want to go further than that at this point without a decent proposal for a customizable queue interface or framework, whatever that may be.
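The small refactoring discussed here, instantiating the queue early enough that someone importing the scheduler code can replace it, can be pictured with the sketch below. It is written in Python for brevity rather than the scheduler's Go, and every name in it (`SchedulingQueue`, `FifoQueue`, `PriorityFirstQueue`, `Scheduler`, `demo`) is hypothetical. It illustrates the injection pattern only and is not the kube-scheduler API.

```python
import heapq
from collections import deque
from typing import Optional, Protocol


class SchedulingQueue(Protocol):
    """Hypothetical minimal queue interface; not the real kube-scheduler one."""

    def add(self, pod: str, priority: int = 0) -> None: ...
    def pop(self) -> Optional[str]: ...


class FifoQueue:
    """Stand-in for the default internal queue: first in, first out."""

    def __init__(self) -> None:
        self._items: deque = deque()

    def add(self, pod: str, priority: int = 0) -> None:
        self._items.append(pod)

    def pop(self) -> Optional[str]:
        return self._items.popleft() if self._items else None


class PriorityFirstQueue:
    """A custom queue an importer might pass in: highest priority first."""

    def __init__(self) -> None:
        self._heap: list = []
        self._seq = 0  # tie-breaker keeps insertion order among equal priorities

    def add(self, pod: str, priority: int = 0) -> None:
        heapq.heappush(self._heap, (-priority, self._seq, pod))
        self._seq += 1

    def pop(self) -> Optional[str]:
        return heapq.heappop(self._heap)[2] if self._heap else None


class Scheduler:
    """Uses the injected queue if one is given, otherwise builds the default."""

    def __init__(self, queue: Optional[SchedulingQueue] = None) -> None:
        self.queue = queue if queue is not None else FifoQueue()

    def schedule_all(self) -> list:
        order = []
        while (pod := self.queue.pop()) is not None:
            order.append(pod)
        return order


def demo(queue: Optional[SchedulingQueue] = None) -> list:
    """Enqueue two pods and return the order in which they are dequeued."""
    scheduler = Scheduler(queue)
    scheduler.queue.add("low", priority=1)
    scheduler.queue.add("high", priority=10)
    return scheduler.schedule_all()
```

Calling `demo()` uses the default queue, while `demo(PriorityFirstQueue())` swaps in the custom one without touching `Scheduler`. The concern raised in the meeting still applies: a replaceable instance is only as good as the interface behind it, and this sketch says nothing about backoff, flushing, or preemption.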

A

All right. So maybe let's comment on the issue that we should first do a more detailed refactoring of the scheduling queue interface and tease apart the interdependencies between the internal implementation and the interface. Then we can have a better interface, and at that time we can really support a custom scheduling queue.

A

All right, does anyone else have comments on this one?

C

I'm Alex. I talked with Aldo offline, and we have some other ideas about queue management that I'd like to briefly introduce in this meeting.

D

Yeah, so this is the idea of extending suspend beyond just the Job API, right?

D

So I guess this goes beyond that; it's more about SIG Apps than SIG Scheduling, so maybe we can chat a little bit about it once we finish with the agenda topics.

C

I prepared a brief introduction. Can I share it in this meeting? Can I share my screen? Let me start my share.

C

Okay, here. Can you see my screen?

A

Yeah, I'm going to make you a co-host, just in case you cannot share.

A

Yeah, we can see your screen.

C

Yeah, thanks. Let me briefly introduce some of our ideas on queue management. We mainly want to achieve two goals.

C

The first one is to provide different queues for multiple tenants and support fair scheduling among tenants, and the second one is to manage job priority and quota through the queue.

C

The first page is the architecture of our system.

C

The main idea is to mark the job as suspended with an annotation, so that the operator stops creating pods. The jobs then queue in the new controller, and after a job is scheduled successfully, we update the queue status and the operator creates the pods. In the first step, we add the suspend annotation through a webhook.

C

In the second step, the operator finds that the job has the suspend annotation and stops creating the pods. The third step involves the two parts of the queue controller. The first part is the extension, which identifies the job's template and creates the queue unit. The queue unit is the unit of our queue.

C

The fourth is that we also provide some metrics for the queue, and the cluster autoscaler will use the metrics to find out whether there are jobs pending in the queue, and it will create more machines in advance.

C

After a job is scheduled successfully, we update the status of the queue unit. Then the extension removes the suspend annotation, and the operator creates the pods. This is the main idea of our system. We set up a project called kube-queue, and it is also used at Alibaba and [unclear] in production environments. The second page is the implementation of the queue controller.
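The suspend flow from the architecture slide can be sketched roughly as follows. This is a toy Python model, not the presenters' implementation: the annotation key `example.com/suspend` and the names `Job`, `webhook_admit`, `operator_reconcile`, and `QueueController` are all assumptions made for illustration, and real components would talk to the API server instead of mutating objects in memory.

```python
from dataclasses import dataclass, field

SUSPEND = "example.com/suspend"  # hypothetical annotation key


@dataclass
class Job:
    name: str
    replicas: int
    annotations: dict = field(default_factory=dict)
    pods: list = field(default_factory=list)


def webhook_admit(job: Job) -> None:
    """Step 1: the webhook marks every new job as suspended."""
    job.annotations[SUSPEND] = "true"


def operator_reconcile(job: Job) -> None:
    """Step 2: the operator creates pods only for jobs that are not suspended."""
    if job.annotations.get(SUSPEND) == "true" or job.pods:
        return
    job.pods = [f"{job.name}-{i}" for i in range(job.replicas)]


class QueueController:
    """Steps 3 and 5: hold one queue unit per job, release it when admitted."""

    def __init__(self) -> None:
        self.units: list = []  # queue units waiting for admission

    def enqueue(self, job: Job) -> None:
        self.units.append(job)

    def pending(self) -> int:
        """Step 4: a metric an autoscaler could read to add machines early."""
        return len(self.units)

    def admit_next(self) -> None:
        if self.units:
            job = self.units.pop(0)
            job.annotations.pop(SUSPEND, None)  # lifting suspend unblocks the operator
```

The design choice worth noticing is that the operator needs only one extra check, which is why the presenters describe it as adding a small piece of logic to existing operators rather than changing their APIs.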

C

It is similar to the scheduler, but the scheduling unit is the queue unit. A queue unit represents a job. It mainly consists of two parts. The first one is the multi-queue and the second is the quota system.

C

We have multiple queues for different tenants, and different queues have different strategies. Then the second part is the scheduling cycle.

C

We find the best queue using a filtering strategy and a multi-queue sort plugin, and we also use the quota system, like resource quota and cluster capacity, and the two quotas check whether the job can be scheduled successfully.

C

If we don't have enough quota or capacity, the job is put at the back of the queue and retried in the next second.
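One cycle of that quota check can be sketched as below. This is a simplified Python model under stated assumptions: quota and capacity are counted in pods, each tenant has a single FIFO queue, and the names `TenantQueue` and `cycle` are hypothetical. The real controller's filtering and sorting plugins are not modelled.

```python
from collections import deque
from typing import Optional


class TenantQueue:
    """One queue per tenant with a quota counted in pods (an assumption)."""

    def __init__(self, quota: int) -> None:
        self.quota = quota
        self.used = 0
        self.units: deque = deque()  # entries are (job name, pods requested)

    def add(self, name: str, pods: int) -> None:
        self.units.append((name, pods))

    def cycle(self, cluster_free: int) -> Optional[str]:
        """Try the head unit once; put it at the back if it does not fit."""
        if not self.units:
            return None
        name, pods = self.units.popleft()
        if self.used + pods <= self.quota and pods <= cluster_free:
            self.used += pods
            return name  # admitted
        self.units.append((name, pods))  # retried on a later cycle
        return None
```

With a quota of 4, a 5-pod job at the head is moved to the back on the first cycle, and a 2-pod job behind it is admitted on the next one. A job that can never fit keeps cycling, so a real controller would need some policy for that case.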

C

That was my brief introduction, and we will have a detailed design.

A

Yeah, I have some questions. If I understand correctly, the queue management part you mentioned is a controller. So it's not inside the scheduler, and it's also not a scheduler plugin. It's a totally independent component, right, on the second page? Okay. So how does that relate to the scheduling framework, like the queue sort, filter, and [unclear]?

A

So it's totally custom, user-defined; it's not the same thing as the scheduling framework.

C

It is not implemented with the scheduling framework; it is the queue controller. The scheduling unit is the job, at the job level, not the pod level.

A

I mean, so it's a new thing. It just has the same names, like filter and [unclear].

C

But the goal is the same as the queue issue we talked about before. We just solve the multi-tenant scheduling through the controller, not within the scheduling framework.

B

Have you seen our Kubernetes job management proposal? We were thinking along the same lines, and that's why we introduced the suspend flag to the core Job API.

C

Yeah, I talked with Aldo offline earlier this week. I think we want to solve the problem with a similar method. I think we can try to work together to implement it.

B

And what is your current status? Is this just a proposal? You said that you have it running at Alibaba, but then you said you will come up with a design, so I'm confused whether this is actually implemented or not.

C

Yeah, we have implemented this controller in our company and use it in production, but I will share the detailed design this week with you and Aldo.

B

How long has it been in production, if I may ask?

C

About seven months.

B

Seven months.

C

Yeah, we also use it in the public cloud, and it is used by some startups.

B

If I may ask, how many jobs does this controller handle? How long is each queue going to be, and how many queues do you have? Just for me to get a rough understanding of what scale you are targeting with this design.

C

The number of queues is not too large, fewer than about 10 [unclear] queues in the controller, but we schedule about 40,000 or 50,000 jobs using this.

B

40,000 jobs in what sense? Could they all be pending at one time?

B

So the queue holds, like, 40,000 jobs?

C

Yeah, maybe more than this. We will have more jobs and more systems using it.

B

And how big is each job? How many pods, on average?

C

It varies, but mainly fewer than 10 pods in a job.

B

Okay, yeah, that's all pretty great information. It just gives us a sense of the scale. Okay, that's great. As Aldo probably mentioned, we are working along the same lines, and we're interested in collaborating on this. If you are planning to open source this, we can take a look as well and contribute to it. If not, and you only want to discuss the design, then we can implement something in the open for the community based on that.

B

We have our own ideas as well. We want more things to be involved, for example: how do you define capacity for each queue? How do you define budgeting, and other things? So yeah, it's interesting to see how this is going to play out. In your design, once you unsuspend, you basically depend on the default scheduler to actually schedule the pods, right?

B

And you don't do anything related to where the pods are going to end up? Who's going to control the pod creation for these jobs?

C

The operators, like the TF operator or the PyTorch job operator, or the others in the Kubeflow community.

B

Okay, so basically you are focused on not just the Job API. You have a predefined number of job APIs, which are the TF operator, PyTorch, MPI, and...

C

We don't want to change the API of all the Kubeflow jobs. We just add the annotation to express the status, and we need to add the logic that when a job has the annotation, the operator stops creating the pods. We talked with the people from the Kubeflow community, and we will add this logic to the operators in Kubeflow. They have also set up a common operator for the jobs, and we have also talked with them about adding this logic to that operator.

D

What I'm thinking is that, early on, we can definitely agree on common APIs, and then the decision of whether we want to share the same controller can be left for later.

D

But another question is where we would host this. Where would Kubernetes host this: as part of SIG Apps, SIG Scheduling, or a new working group? I guess those are questions we might want to answer later.

B

I think this belongs more to SIG Scheduling. You could have another working group, but then we're just adding another layer, another organizational division. I'm not sure if SIG Apps as a whole SIG is invested in this, other than the Job API. SIG Scheduling is more invested in batch scheduling. We have a lot of people here trying to enhance the default scheduler to support batch cases.

B

As for hosting it, I mean, we have a clear option, which is something similar to the plugins one. But that's premature talk, I think. Let's see the proposed design first, and then maybe we can see how to proceed. Good.

C

Yeah, I've stopped my screen share.

A

Yeah, thank you. Thank you, Alex. Could you make the slides shareable and put the link either in the issue or under the agenda? So now we have gone through an alternative design that manages the pod creation, managing the pods that belong to a job, to a batch workload, at creation time instead of at scheduling time. So that is another angle on solving this problem.

A

Yeah. Could you do that, and put the link to your slides back here?

A

Yeah. Okay, thank you.

A

So the second item is something I discussed with Aldo offline: refactoring some of our behavior for handling internal scheduling failures. Internal failures include things like internal scheduling faults. Right now we don't distinguish internal errors from standard errors like filter errors, the fit error, or some other things. I think Aldo's idea is that with this kind of internal error, maybe by the next attempt the internal, transient error will be gone.

A

So we'd prefer to treat this kind of error as transient and make the pod retriable as soon as possible. That means we'd prefer to put it in the backoff queue instead of the unschedulable queue, because if a pod is in the unschedulable queue, you have to wait for a related event to come in and trigger that pod to be retried. That is the background of this discussion.

A

Yeah. Aldo, do you have anything to add?

D

Yeah, no, not really. The distinction is basically internal errors: maybe the API server was down, or the API request just failed because the API server is overwhelmed. This is different from an unschedulable pod. So...

D

So yeah, my idea is that we should move these pods to the backoff queue directly. And I think we've seen a similar problem when a pod has PVCs and a PVC is somehow not created yet, because it's all asynchronous, and then we should be retrying that.

B

But, right, the PVC, when it gets created, will generate an event, and that will move the pod back. I mean, the first example you gave is reasonable, if you have errors related to the API server. But I don't know in which plugins we face these errors; none of these plugins actually talk directly to the API server, other than the ones that try to...

D

But yeah, the point is that during binding, if the binding fails, we are putting the pod in the unschedulable queue. That's the problem, and my point is that that shouldn't be the case if there is a binding error.
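The routing change being argued for here can be sketched in a few lines. This is a Python illustration of the idea only, not the scheduler's actual Go code; the `Failure` enum and the `Queues` class are hypothetical names, and the real queues also track timestamps, attempts, and backoff durations.

```python
from enum import Enum


class Failure(Enum):
    UNSCHEDULABLE = "unschedulable"  # no node fits; wait for a cluster event
    INTERNAL_ERROR = "error"         # e.g. a failed bind or API request


class Queues:
    def __init__(self) -> None:
        self.backoff: list = []        # retried after a short delay
        self.unschedulable: list = []  # retried on a relevant event or a flush

    def requeue(self, pod: str, failure: Failure) -> None:
        """Route transient errors to backoff instead of the unschedulable queue."""
        if failure is Failure.INTERNAL_ERROR:
            self.backoff.append(pod)
        else:
            self.unschedulable.append(pod)
```

The behavior described in the meeting corresponds to always taking the `else` branch, which is why a pod whose bind failed can sit idle until an unrelated event or the periodic flush moves it.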

B

Yeah, binding is one good example.

D

I'm not sure if any other plugins would actually have different retriable errors. Most of the errors should be unexpected, right? They only happen if there is a weird bug somewhere.

D

So in those cases, I guess I don't know what to do. But for binding, I feel the pod should go directly to the backoff queue. So that's one topic; I don't know if we all agree. The side topic is that the whole retry logic, the whole recording of scheduling errors, and all of that is very [unclear]. The code is getting very complex, so we might need to find a better design just for that part, for errors and error handling.

B

I'm not sure if there's one already.

D

No, no, I didn't open one.

D

So what about the main question, though? Do you think that's reasonable, to just retry, or...

B

Right now, what will happen is that you're going to have the pod in the unschedulable queue, and only when that expires will it be moved back, unless there's a lucky event that moves it out, right? Correct, yeah. I mean, I'm not sure: did you have an experiment where you noticed that this is a problem?

D

No, not really. It came up because of these changes around Permit, where we're trying to handle this kind of error differently for each pod. If a permit fails, or if a permit is denied, then we have to decide what to do with that pod, and I just realized that we're not making these kinds of decisions. We're just always putting everything in the unschedulable queue.

D

I have a feeling that this triggers those questions of: why do we need a backoff queue? Why can I not just remove the backoff queue? And I feel like it's because these kinds of retries cause a delay on these pods when they shouldn't. So yeah, but I don't have a real-world example.

A

Yeah, so we can document these concerns first and see what's the best way to go forward.

A

Thank you. All right.

A

I think that's pretty much it for today's meeting. Abdullah, maybe sometime we can talk offline about the next release items. Does anyone have any other discussion to bring up? If not, I'll stop recording. Thank you for joining today's meeting. We'll see you in another two weeks, and thank you for your time. Bye.