From YouTube: Kubernetes SIG Scheduling Meeting - 2019-09-19
A: Okay, hi everyone. So, as you all know, this meeting is recorded and will be uploaded to YouTube, so I guess I can start with the items we have. I have two quick items on my side. The first one is just a reminder that the release schedule for 1.17 is going to be shorter than the previous cycles because of the holidays: enhancement freeze is mid next month, October 15, and then a month after that, November 14, we would have the code freeze.
A: So if you're working on a feature that would need an enhancement before the freeze, please submit it as soon as you can. Looking at the list of tasks that we have planned for 1.17, I don't think we have anything other than all those topology spreading items, and I might have something for the framework migration. For the framework migration, it's not going to be strictly an alpha/beta/GA feature; it's mostly restructuring and refactoring in general.
A: It's basically providing a new API in the component config to allow specifying a default value, probably for the topology keys used for default spreading. Right now it is fixed to zone, but if we want to make it configurable, we should have something in the component config to allow users to specify the default spreading topology.
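To make that concrete, here is a minimal sketch of what such a component-config addition could look like; the field and type names are hypothetical stand-ins, since the actual API shape was still being worked out at this point.

```go
// Hypothetical sketch of the component-config addition discussed above:
// a field letting operators override the topology keys used for default
// pod spreading (fixed to zone today). Names are illustrative, not the
// actual kube-scheduler API.
package config

// KubeSchedulerConfiguration is a trimmed stand-in for the real
// component-config type; only the hypothetical new field is shown.
type KubeSchedulerConfiguration struct {
	// DefaultSpreadingTopologyKeys lists the node label keys used when
	// applying default topology spreading to pods that do not set
	// topologySpreadConstraints themselves.
	// Example: ["topology.kubernetes.io/zone", "kubernetes.io/hostname"]
	DefaultSpreadingTopologyKeys []string
}
```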
A: That will allow me to create the issues that others can then address in parallel. What I'm trying to do right now is figure out how the configuration would work out, and basically the dependencies that we have. The configuration API is one dependency: we have the policy and the component config, and I'm just trying to figure out a way to merge them. The other one is the kubelet and other components that actually call our predicates directly, so this one is less critical.
D: So this is just a follow-up from last week; I had two action items. One was to see if there are any concerns. You know that Klaus had some comments; I responded, and I was wondering if there are any open concerns at this point. The one comment was regarding how to handle the race condition during resize. So we looked at it: potentially you could use the max of the resources and the resources allocated for the requests, and the status resources limits for the limits.
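A minimal sketch of the "take the max" accounting idea under discussion, assuming a hypothetical helper name and simplified inputs:

```go
// Sketch of the race-handling idea above: during a resize, account for
// the larger of the requested (spec) value and the value the kubelet
// has actually allocated, so a downward resize that has not landed yet
// is not double-counted. Helper name is illustrative, not the KEP API.
package scheduling

import "k8s.io/apimachinery/pkg/api/resource"

// effectiveRequest returns the quantity the scheduler should reserve
// for one resource while a resize may still be in flight.
func effectiveRequest(specRequest, allocated resource.Quantity) resource.Quantity {
	if specRequest.Cmp(allocated) > 0 {
		return specRequest
	}
	return allocated
}
```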
D: However, at the point of admission, we don't expect this to be like a long outage. There can be a temporary situation where the memory or the CPU is being reduced and the particular node on which that pod is running is offline. Then, overall, there might be a slight cluster resource oversubscription, but this shouldn't be a common thing.
D: In any case, I talked about this with SIG Node on Tuesday, and if this does become a problem, it's a few lines of code change to fix it. So we figured we'll go with the simple approach for now: we'll just use the resources field as it is currently done; it's just going to become mutable, for the limits as well.
D: For LimitRanger, I don't believe there are any issues. We just have to make sure that the min and max are respected, and we can either cap the request at the min/max levels or reject it if it is not compliant with the min/max. That's fine. So I just wanted to ask if there are any concerns besides that from the scheduling side.
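A sketch of the two LimitRanger options just mentioned, capping versus rejecting, using hypothetical helper names:

```go
// The two alternatives for LimitRanger handling of a resize: clamp the
// new request into [min, max], or reject it outright when it falls
// outside the range. Illustrative only.
package admission

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// clampToRange caps a resized request at the LimitRange min/max.
func clampToRange(req, min, max resource.Quantity) resource.Quantity {
	if req.Cmp(min) < 0 {
		return min
	}
	if req.Cmp(max) > 0 {
		return max
	}
	return req
}

// rejectIfOutOfRange is the stricter alternative: fail admission instead.
func rejectIfOutOfRange(req, min, max resource.Quantity) error {
	if req.Cmp(min) < 0 || req.Cmp(max) > 0 {
		return fmt.Errorf("resized request %s outside LimitRange [%s, %s]",
			req.String(), min.String(), max.String())
	}
	return nil
}
```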
A: So I guess, at the end of the day, whatever feature, whatever direction we end up going with, we need to experiment with it and have some data points on how common or not the different corner cases that you talked about are. So I'm fine with what the KEP is proposing, and my hope is that during alpha we will get some feedback on how often these race conditions happen and how they would actually impact...
A: ...you know, the cluster in general. My hope is that we will have a solution that brings the cluster into a good state eventually, basically eventual consistency. So it doesn't matter if we have a race condition at some point, as long as, for example, the kubelet rejects the pod and it eventually gets rescheduled somewhere else. I guess that's fine. Okay.
D: So the only other thing that I need here: in the KEP, for API review, I need to mention approvers from each SIG that is going to be impacted by this KEP. SIG Node is the biggest impact, SIG Scheduling is a fairly small impact, and SIG Autoscaling, which is a consumer, has almost zero impact, except that one of the proposed policies is getting pushed to them. So I just want to know who I can list in the approvers list from SIG Scheduling. Is it you or Klaus?
E: So what we want to do is add custom resource scheduling to kube-scheduler, which provides split-policy and priority-based atomic scheduling for CRD resources. We hope that users can describe the CRD resource as a whole and the scheduler schedules the CRD as a whole. Let me first introduce the background of this.
E: Actually, many companies use Kubernetes as the underlying platform for their deep learning platforms. Deep learning jobs usually use GPUs to train, and as the scale of the job increases, distributed training becomes inevitable, which means a single training job must use multiple pods. However, users often only care about how many resources, for example the exact number of GPUs, to use for a training job, rather than how they are distributed across the backend.
E: At the same time, since the resource configuration and current usage of the cluster are unknown to users, there may be some problems when resource allocation is determined by a user-defined grouping. The first one is unreasonable resource allocation: if the job is divided into multiple small-granularity pods, it will cause unnecessary interaction overhead between the pods. The second is job starvation.
E
If
the,
if
the
available
resources
of
the
class
cannot
satisfy
the
large
grant
job
request,
the
job
will
not
be
scheduled
for
a
long
time,
especially
when
small
grant
jobs
or
pods
are
scheduled,
and
the
last
problem
may
be
results
that
deadlock
when
some
when
some
part
of
a
job
are
scheduled,
the
remaining
part
cannot
be
scheduled
and
the
current
occupied
result
is
cannot
be
released.
So
the
result
is
that
the
Lord
may
happen.
E: In summary, we hope to implement the following functions. The first is that users only need to describe the overall resource requirements of the CRD, and the cluster should automatically complete the splitting of the CRD into pods according to the total amount of resources and the current cluster resource usage. The second is that all pods belonging to a single CRD are scheduled atomically, which means all or nothing. That's all for the background of our issue. Okay.
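A hypothetical sketch of the kind of CRD being proposed, with all type and field names invented for illustration; none of them come from an existing API.

```go
// Invented types illustrating the proposal: the user states only the
// total resources for the job plus an all-or-nothing flag, and the
// system decides at scheduling time how to split that into pods.
package v1alpha1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// TrainingJob describes a distributed job by its aggregate needs only.
type TrainingJob struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              TrainingJobSpec `json:"spec"`
}

type TrainingJobSpec struct {
	// TotalResources is the overall requirement, e.g. 16 GPUs; the
	// split into pods is decided at scheduling time, not by the user.
	TotalResources corev1.ResourceList `json:"totalResources"`
	// AtomicScheduling requests all-or-nothing placement of whatever
	// pods the CRD is split into, to avoid partial-schedule deadlock.
	AtomicScheduling bool `json:"atomicScheduling"`
}
```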
A: So if I get that correctly, you want to specify a new resource type, described as a custom resource, a CRD, and you want to specify high-level resource requirements, that being CPU, memory, or number of GPUs, and at scheduling time the scheduler should try to schedule that whole CRD on the cluster, or on a specific node.
A: Let's take a specific example. You create, let's call it, the CRD that you talked about; it's called a pod group. Okay, in the pod group spec you just specify the amount of resources that you want to allocate for that pod group, but you're not specifying the number of pods that this pod group splits into, right?
E: Yes.
A: So the scheduler was, you know, watching for that pod group and picked it up. Then what happens next? Do you want the scheduler to reserve resources and then call the split API to split the resources it already reserved? Or are you going to call this split first, then start the pods, and then the scheduler would just go ahead and schedule the pods normally, as it does right now?
E: Based on our earlier discussion, we propose the following principles. First, we need kube-scheduler to support the scheduling of CRDs, but it does not need to understand the CRDs, mainly because of the diversity of CRDs. The function that implements the understanding of a CRD is written by the user, according to the characteristics of that CRD, and the scheduler calls that function through the scheduler's extender mechanism. Which CRDs need to be scheduled depends on the configuration.
E: Based on the above design principles, the concept of the system is as below. First is kube-scheduler: this part is a function extension to the default scheduler, mainly including, first, determining which CRDs need to be scheduled according to the configuration; second, implementing management of CRDs and pods in the unified scheduling queue; and third, cooperating with the pod creator to complete the atomic scheduling of the CRD. Then we come to the pod creator: its main function is, in response to the scheduler's requests, to split the CRD and create its pods.
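For context, here is a minimal sketch of an extender-style HTTP endpoint of the kind being referenced: the scheduler calls out to a user-owned service during scheduling, and that service applies CRD-specific logic the core scheduler does not understand. The request/response shapes below are simplified stand-ins, not the actual extender wire types.

```go
// A toy scheduler-extender filter endpoint. The real extender API
// passes the pod plus candidate nodes and returns the filtered subset;
// these structs are simplified for illustration.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type filterRequest struct {
	PodName   string   `json:"podName"`
	NodeNames []string `json:"nodeNames"`
}

type filterResponse struct {
	NodeNames []string `json:"nodeNames"`
}

func main() {
	http.HandleFunc("/filter", func(w http.ResponseWriter, r *http.Request) {
		var req filterRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// CRD-specific placement logic would go here; this sketch
		// simply accepts every candidate node.
		json.NewEncoder(w).Encode(filterResponse{NodeNames: req.NodeNames})
	})
	log.Fatal(http.ListenAndServe(":8888", nil))
}
```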
A: So the idea is that sometimes, like for large training jobs, you have to split the job into multiple tasks, basically to do the training on different data points. A task is usually going to be a single process; even if it is multi-threaded, it's going to be confined to a single node. Or maybe, if it is designed to use a single GPU, then you're out of luck using multiple GPUs without splitting your training into multiple pods.
A: So at a higher level, what you're asking for is something like coscheduling, but you want it to be more flexible. Basically, you don't want to specify the group of pods that you want to be scheduled ahead of time; you want it to be decided at runtime, and that's why you're proposing this split API.
A: I mean, you've already... okay, so basically, the only thing that you're going to do is create an object, and that object you're basically using just for the scheduler to call you back, right? So the object contains the CRD that you mentioned, which contains data that is opaque to the scheduler, so the scheduler has really no knowledge of what's in it. But you still need some information from the scheduler.
A: Exactly. You can create an enhancement, maybe a KEP, so that we can comment on it. But I understand what you're trying to do, what you want to get done. I just want to make sure that we don't already have the means to do what you want to do, and that's why we have, for example, a...
A: If we decide it's feasible, then maybe you can submit a KEP with a more concrete, you know, proposal of what the API would look like, etc. We will take a look at the current issue and try to better understand how to solve your problem, and if there are no available solutions, we can discuss what you're proposing to add.
A: If you look at the scheduler, there are various ways we can schedule pods. There are priorities, and one of the priorities is to spread pods and use nodes that have lower utilization, and one of these priorities actually does the opposite: it basically tries to prioritize nodes with higher utilization, so that you try to co-locate as much as possible and don't spread.
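A rough sketch of the two opposite scoring behaviours just described, with an illustrative 0-100 score scale and invented helper names:

```go
// "Least requested" spreads pods by favouring idle nodes; "most
// requested" packs them by favouring busy nodes. Both score a node from
// its already-requested resources and its capacity.
package priorities

// leastRequestedScore is higher for nodes with more free capacity.
func leastRequestedScore(requested, capacity int64) int64 {
	if capacity == 0 {
		return 0
	}
	return (capacity - requested) * 100 / capacity
}

// mostRequestedScore inverts it, packing pods onto utilized nodes.
func mostRequestedScore(requested, capacity int64) int64 {
	if capacity == 0 {
		return 0
	}
	return requested * 100 / capacity
}
```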