From YouTube: Kubernetes SIG Scheduling Meetings 20170424
A
Okay, it says that it's recording; we'll see. I've never done this before, so I guess we should get started. I think the main items for today are talking about the proposal for priority and preemption and resource sharing between batch jobs, and then also talking about 1.7, among other things. So, Tim, you were the one who wanted to discuss it at this meeting, although I expected that we would have done it anyway. So do you want to start that discussion? Hopefully people have had a chance to read the proposal. It's kind of non-trivial.
B
I think the first question is that, you know, the document has an opinionated view of some of the batch mechanisms that you guys may have worked with, but I think there might be several implementations. It's not really clear, as a non-goal from the beginning of the document, that this is almost like an architectural perspective: this is one way we could possibly solve this type of problem, where the primary theme is priority and preemption, but the outlined specifics of this particular batch implementation are one of what could be many, yeah.
A
I mean, I think that relates to a comment that you, and I think at least one other person (I think Joe Beda made the same related comment), raised about whether this should have been separated into two documents. And, you know, I sent my opinion on this to the mailing list. I mean, I think it could have gone either way, and there are good arguments on both sides for separating the priority and preemption stuff from the batch scheduling and the resource arbitration among batch jobs stuff.
A
My argument was that they should be considered together, although definitely there need to be separate design docs in more detail for both things; that wasn't what this was trying to do. Like you said, this was trying to do a high-level architecture, and that probably was not made clear enough in the document. But the argument for combining them was just that, you know, we're talking about how to, well...
C
A
To share the resources in a cluster: that's too vague to be meaningful, but specifically, like, how to decide which workload is most important at any given time, and who should be made to wait. And that kind of issue comes up in both systems, the batch queueing model and also the priority and preemption stuff; they're both doing preemption of a sort and prioritizing stuff. And so that's why I wanted people to be thinking about them together.
C
A
Eric Tune and a number of the other people here had already been thinking about batch scheduling, this gang scheduling for batch stuff, because they needed something for Spark to set up multiple resources: both the ones that produce pods and the ones that produce other kinds of API resources, like secrets and things like that. And so they had been thinking about some kind of batch queuing and batch admission.
A
And I had been thinking about priority and preemption; I mean, there have been discussions going on for a long time. And then the third piece of the puzzle was Klaus from IBM, who had a proposal. I forgot what it's called; it was called something like "Multiple Applications Sharing a Kubernetes Cluster." He had written up a proposal, and there had been some discussion on that, and so I thought:
A
This is a good time to try to pull together those three different threads into a unified proposal. But I do understand that it's a lot of moving pieces, and that it's useful to decompose them as we go into more detail. I was curious what people think about the overall architecture that was proposed there.
C
It's nice for the authors to consider all the things together, but for consumption by other people, maybe it would have been better to have two or three separate docs, yeah.
A
And I think we will have to have two or three separate docs. I'm not sure I agree that it's only useful for the authors to consider these issues together. I mean, I think that, you know, we want the community to understand and evaluate whether this is a good idea, because we do want to get consensus on this kind of thing. But maybe we shouldn't waste time arguing about this.
A
B
I think it was just a matter of language in the beginning, honestly: to set out what this document actually is, the purpose and the intent behind it, and to explicitly state that it's a non-goal of this document to outline, you know, the actual batch system itself. This is an example of one, and the implementation details could be left for later.
B
A
You know, because there are all kinds of allocation policies out there in the real world. But I think there are certain pieces, and maybe that's part of the process of reviewing this document: understanding which are the pieces we want to be fixed, where we say "this is part of the architecture," and which are the pieces that are, just like you said, "here's an example, here's an illustration of this kind of thing you could do," yeah.
B
I like that last summary; your last sentence, actually, was concise, and it would help a lot. The other point I wanted to make as well: you use "gang scheduling" a number of times, but I wanted to make sure it's used in the context that most other people understand it, because gang scheduling is usually block scheduling done in one single allocation round, typically for the purposes of MPI jobs, where you have a coordinated start and the timing of resource startup is important.
A
I mean, that's a good question. I think we've been abusing the term, and I think you're completely right that we shouldn't be doing it. I mean, we were talking about gang scheduling with one aspect of it being scheduling things like secrets and volumes at the same time as you admit a job that's going to be producing pods and things like that, and that may not be the right term.
A
C
Here, I think it's just: if you have true gang scheduling, you'd need a way to say "I'm submitting this thing and I need a hundred things of this shape, for a hundred machines or whatever," adding them all at once. And we're saying you could instead submit, like, a pod that then goes and schedules a hundred things, and so we want some way to pre-reserve resources without requiring the exact shape of the thing that you want to run to be specified, in a very verbose configuration language, as part of queuing.
C
So what if you need, like, a hundred things like this, and 20 that look like that, and 47 that look like that? Like, I don't want to write that in a config file. I want to have a fairly good probability that now is the time to launch, and then launch a program that then goes and launches the gangs that I need, the collections of identical things that I need.
C
A
gives
a
desire
to
like
so
it's
three
things.
There
there's
like
a
desire
to
have
an
aggregate
resource
specification
for
a
deferred
like
collection
creation.
It's
like
there
should
be
this
creamy
about
this
mini
resourcing
for
you
and
think
about
starting
this.
It
still
might
not
be
enough.
If
there's
weird,
then
banking
problems
like
this
is
about
what
I'm
going
to
need.
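The shape being described here, claiming an aggregate amount up front and then launching heterogeneous gangs against it later, can be sketched as a tiny data model. All of the names and numbers below are hypothetical illustrations of the idea, not a proposed Kubernetes API:

```python
from dataclasses import dataclass

@dataclass
class Reservation:
    """Aggregate resources claimed up front, before pod shapes are known."""
    cpus: float
    memory_gb: float
    used_cpus: float = 0.0
    used_memory_gb: float = 0.0

    def launch(self, count: int, cpus_each: float, mem_each: float) -> bool:
        """Admit a gang of `count` identical members if the reservation covers it."""
        need_cpu = count * cpus_each
        need_mem = count * mem_each
        if (self.used_cpus + need_cpu <= self.cpus and
                self.used_memory_gb + need_mem <= self.memory_gb):
            self.used_cpus += need_cpu
            self.used_memory_gb += need_mem
            return True
        return False  # reservation too small: the "weird bin-packing" caveat

# "100 like this, 20 like that": launched later, without a verbose config up front
r = Reservation(cpus=500, memory_gb=1000)
assert r.launch(100, 2, 4)       # gang of 100 small members
assert r.launch(20, 10, 20)      # gang of 20 larger members
assert not r.launch(50, 10, 20)  # would exceed the aggregate claim
```

The point of the sketch is that the only thing written down up front is the aggregate claim; the exact shapes of the gangs are decided later by the program that drains the reservation.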
C
A
Yeah, I think we want to accommodate different kinds of jobs. For one kind, I've heard a good term used internally at Google, I think "crystalline" or something: jobs where, if any one instance dies... I mean, there are multiple definitions of it; one would be, unless...
A
Spark, with, like, you know... there's some mode in Spark, I think it's called dynamic allocation, where it can dynamically vary the number of tasks that are running, the number of executors, over time. And so it can adapt to the fact that there may be fewer or more resources, and it wouldn't want to be killed if it gets less. So I definitely think that we would want to accommodate both kinds of job.
A
D
A
So, I mean, that's a good point; there's a hybrid, I guess, where it's exactly what you said: there's "if I can't get at least this much, then don't fill me, but ideally I'd like more," and within that range it's okay to reduce my allocation, as long as you don't go below some certain value.
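That hybrid, a hard floor plus an elastic ceiling, is easy to state as a rule. A minimal sketch with made-up names (this is not an actual Kubernetes field):

```python
from typing import Optional

def allocate(available: int, minimum: int, desired: int) -> Optional[int]:
    """Give a job as much as possible within [minimum, desired].

    Returns None (don't admit, or kill) only when even the floor can't be met.
    """
    if available < minimum:
        return None          # "if I can't get at least this much, don't fill me"
    return min(available, desired)

assert allocate(available=3, minimum=5, desired=10) is None
assert allocate(available=7, minimum=5, desired=10) == 7    # shrunk, above the floor
assert allocate(available=50, minimum=5, desired=10) == 10  # capped at desired
```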
B
C
So the question for people that have worked heavily in other batch systems: is there, like, one correct, well-accepted model for queues, or is this something where everyone's got an opinion about it, and it's better to make sure it's pluggable so everyone can do their own thing? Or, if we come up with something reasonably flexible, will everyone converge on it? Do we want to talk to, like, YARN or some grid engine folks, or should we just... are there already thoughts on that?
B
I think that experimentation alone is worth its weight in gold, because what you'll probably find is people will create a novel implementation of some system, whether it be some stream processing engine or some other system, where they'll have their own weights and models based upon whatever problem they're trying to solve. And in other systems, what you'll find is that there's no canonical example, and sometimes it's more expressive than... it's a massively wide-open space. The history of batch processing goes back 30 years.
E
I would definitely say that in the YARN universe there are, originally three, now really two, major schedulers, because it was originally decided to make it pluggable at that level, and the fact that it was pluggable led to a bunch of innovation that the original Hadoop team at Yahoo did not come up with. So the pluggability was valuable from the project's perspective. But...
E
I have to imagine that, were somebody trying to make a system in YARN that also had continuously long-running applications, the two existing schedulers wouldn't work well for them, and so they'd have to explore a whole bunch of new designs there. So my suspicion is that the pluggability is pretty useful.
C
What I was thinking about is that the priority and preemption mechanisms would probably not be very pluggable, and the concept of having, like, a collection, a queued job, a queue of things to be run in the future, would be pre-architected into Kubernetes. But the policies you could use with queues, how queues cascade to each other, how you prioritize which queues to run, what resources are needed before you start something from a queue or preempt something from different queues: that would be highly pluggable. Does that make sense?
C
Then only batch jobs would use queues. Many batch jobs would need a pluggable policy, whereas long-running jobs would just be started directly using resource quota, so there would need to be some pluggable interaction between, like, long-running resource quota and batch quota. If you imagine a time dimension to quota: long-running quota is active in that it's constant over the entire timeline, versus batch quota, where you have an aggregate amount that you're then shuffling amongst different sub-uses. I don't know if that makes any sense, but that's how I was thinking about it. Yeah.
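One way to picture the time dimension mentioned here: long-running quota is a constant ceiling on concurrent usage, while batch quota is an aggregate pool whose shares slosh among whoever is active. A toy illustration, with invented class and method names rather than any real quota API:

```python
class LongRunningQuota:
    """Constant over the whole timeline: a fixed cap on concurrent usage."""
    def __init__(self, cap):
        self.cap = cap
        self.in_use = 0

    def acquire(self, amount):
        if self.in_use + amount > self.cap:
            return False
        self.in_use += amount
        return True

    def release(self, amount):
        self.in_use -= amount

class BatchQuota:
    """An aggregate pool, re-split evenly among whoever is active right now."""
    def __init__(self, total):
        self.total = total
        self.users = set()

    def join(self, user): self.users.add(user)
    def leave(self, user): self.users.discard(user)
    def share(self, user):
        return self.total / len(self.users) if user in self.users else 0

lr = LongRunningQuota(cap=10)
assert lr.acquire(8)
assert not lr.acquire(4)   # concurrent cap is fixed over the whole timeline
lr.release(8)
assert lr.acquire(4)       # freed capacity is immediately reusable

b = BatchQuota(total=12)
b.join("jobA"); b.join("jobB")
assert b.share("jobA") == 6    # sloshed evenly between two active sub-uses
b.leave("jobB")
assert b.share("jobA") == 12   # a departing sub-use's share flows to the rest
```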
A
B
The problem with a lot of systems, too, is that they want to rebalance dynamically. So even if you had a coordinating system, you might want to have an expressive policy by which you can change the evaluation. This was commonly done in existing legacy systems, including LSF and Condor, where it was much more expressive to do it on the fly, right. That way you could rebalance your whole cluster with a couple of knobs, yeah.
E
Sorry, got it. I'm wondering if there's some kind of approach where the primitive operations that are possible, like, you know, "preempt this thing," "launch this thing right now," "reserve this in a batch way," are the available operations, and there's a kind of pluggable piece of code, which maybe has a thread or two and runs in one active place at a moment, and that thing then uses those primitives, so that the decisions about what to preempt or whatever could be very pluggable.
A
E
A
E
To be clear, YARN has that; in YARN, the actions themselves are even more primitive, right. The actions aren't even "preempt this" or whatever; the actions are, like, messages that you send to the nodes. And I think they went a little too far; they could have had slightly higher-level operations that are a little safer to use. But the policy of "preempt this, don't preempt this, launch this in a batching way or whatever": that's not baked in.
E
C
The way I've been thinking about it is: when a pod is on a node, the policy there, like that preemption, that's not pluggable; that's up to the kubelet, and you really can't affect it. When you have multiple things that are pending and you need to schedule something now, that's not pluggable either, other than the fact that the scheduler itself is pluggable; but if you're using a stock scheduler, you can't really plug in how priority handles things.
C
But for pods that are pending, for, like, a group of pods to be created in the future, meaning, like, a replication controller or deployment or whatever: if you want to create that at a future time, and it's going to make more things like that, that collection of entities is, like, a queued job, and that is highly configurable. How you preempt that collection, I bet you...
E
...could make something like that work, right? The idea is that Kubernetes could have an API promise that, if you set up your priorities this way, then Kubernetes will choose to preempt in this way, you know, following this ordering or whatever. And then the thing that produces that setup, an intermediary in between the user and what Kubernetes knows about those things, is the pluggable thing. So the user may speak one language to the pluggable thing, and then that pluggable thing goes and says...
E
A
So, I mean, I think the way that would fit in with what we proposed in the doc is that, you know, you could have something like an admission controller, not a quota (this is unrelated to quota), but, like, an admission controller that maps properties of the pod, like maybe QoS class or something, to an internal priority, which is the thing that we would put on the pod. And then that priority would be used by the scheduler, and possibly other components, to decide whom to preempt.
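A minimal sketch of the mapping step described here: an admission-time function from observable pod properties to an internal priority that the scheduler consumes. The property names, priority values, and dict shape are all made up for illustration; this is not the proposal's actual schema:

```python
# Hypothetical mapping table: pod properties -> internal integer priority.
# In the idea above this lives in an admission controller, not in user input.
PRIORITY_BY_QOS = {
    "Guaranteed": 1000,
    "Burstable": 500,
    "BestEffort": 0,
}

def admit(pod: dict) -> dict:
    """Stamp an internal priority onto the pod before it reaches the scheduler."""
    qos = pod.get("qosClass", "BestEffort")
    pod["priority"] = PRIORITY_BY_QOS.get(qos, 0)
    return pod

pod = admit({"name": "web-1", "qosClass": "Guaranteed"})
assert pod["priority"] == 1000

# The scheduler (and possibly other components) then compares only the stamp:
victims = sorted([{"priority": 500}, {"priority": 0}], key=lambda p: p["priority"])
assert victims[0]["priority"] == 0   # lowest priority considered for preemption first
```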
A
...and to do the preemption. So, if I understood what you're suggesting correctly, it's kind of like adding a level of indirection between what the user specifies and the total ordering on the priorities that the system enforces. Yeah, I don't think that was explicitly mentioned in the doc, but I think that's a good idea. I mean, the one related thing that we mentioned in the doc was that the administrator should be able to map the names onto the total order.
C
A
C
That would make too many decisions... if the hierarchy is, like, physically based, it'd be gross to distribute that hierarchy to all the nodes. Sorry, I didn't hear what you said, Eric. If we had non-numerical priorities which were user-configurable, then you would have to distribute that knowledge to every node.
B
So kubelets can make independent decisions? Yes; it could just be delayed evaluation. If it's an expression that gets evaluated when the number actually comes in, then it could evaluate during a cycle, because usually there's periodic evaluation of expressions in other systems. So even on the kubelet, there are evaluation cycles that occur to reevaluate priority and preemption on the nodes, and if it's an expression, then that value would be re-evaluated every single time.
E
Sorry... ah, I see, okay. So that could be a scheme that could work, I mean, but I would suggest that maybe that scheme, having an operator say "these are my orderings" or whatever, is something that could be in the pluggable part of what we just imagined. And so the operator's interface, for their exact case, in that instantiation of the plug-in, might be this string ordering. I have some concerns about that.
C
Like, users aren't allowed to specify integers, because those have the escalation problem, like, "oh yeah, my number's bigger." But the cluster components, like the scheduler and the kubelet, should use integers, and the API server should be the thing that's responsible for mapping strings to integers, so that there's never a remapping problem. I know that, as...
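The division of labor suggested here, where users speak names, the API server resolves them once to integers, and node components only ever see integers, can be sketched like this. The class names and values are illustrative, not the eventual design:

```python
# Operator-defined ordering, registered once with the API server.
PRIORITY_CLASSES = {"critical": 2000, "normal": 1000, "best-effort": 100}

def resolve(pod_spec: dict) -> dict:
    """API-server side: replace the user-facing name with a resolved integer.

    Nodes and schedulers never see the string, so renaming or reordering
    classes later cannot leave stale comparisons on the node (no remapping
    problem).
    """
    name = pod_spec.pop("priorityClassName", "best-effort")
    if name not in PRIORITY_CLASSES:
        raise ValueError(f"unknown priority class: {name}")
    pod_spec["priority"] = PRIORITY_CLASSES[name]
    return pod_spec

p = resolve({"name": "db-0", "priorityClassName": "critical"})
assert p["priority"] == 2000
assert "priorityClassName" not in p   # only the integer travels to the kubelet
```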
E
As long as there's an atomic way to get the information from this pluggable scheduler-ish thing to any individual node, and that can be an atomic, consistent operation, you're probably good. Do you understand what I mean by atomic? If you need to update, let's say, the thing that's handing the integers to the nodes, which lets them decide what to preempt, and there are, you know, thirty pods on the node with varying integers for whatever things they are, you don't want to update four of them but not the other twenty-six.
C
E
But at a higher level, I think this idea of the operator-specified ordering should ideally be relegated to this pluggable thing, and is thus not a part of the core Kubernetes thing. It may be a reference implementation or something, but that kind of thing is an area where I would expect the community to innovate. I see.
E
C
E
You know, and then, within each business unit, they may have priorities of their own, but you can't ask business unit A and business unit B... like, an operator trying to compare all of A's and B's individual things doesn't know how to compare them. Only within the two individual units do they know how to do the comparison, and no global ordering would work.
A
B
F
A
C
A
C
The question, and I think there might be innovation here, is how you mix hierarchical group quota for long-running jobs with whole-group quota for batch computation; how that relates to priorities; how both of those are distributed; and how, when you submit a job, you express whether you're consuming batch or long-running quota, yeah.
F
C
A useful distinction we should think about in Kubernetes is separating consumption of long-running quota, which is, like, given to that group indefinitely because it has its own business need, from, like, batch quota, which is somewhat shared and sloshed around due to unpredictable future needs. And so, therefore, we'd want to separate those. I don't know if people agree with that.
A
The document kind of made a proposal in that regard: a way to have separate quota for the long-running stuff, which is called, I guess, "continuously running" in the doc, and the batch stuff, which could then be dynamically reallocated based on fair shares of hierarchical queues and things like that. And I think the one other piece of that was, you know, we're proposing that the batch controller would have some interface to the collections that it's managing.
A
So it could be kind of like the scale subresource, where it could adjust the size of the jobs that it's running to keep them within the quota, or whatever the resource allocation budget is. So that was sort of a preemption-like mechanism that was proposed for batch: something where the batch controller would be aware of the shares that each of the jobs should be using, and...
C
A
...would tell them to scale to the appropriate amount. And then the continuously running stuff would use kind of a more direct preemption mechanism, where the scheduler could preempt the lower-priority things, and you could somehow separate the quota for the long-running stuff and the batch stuff.
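The preemption-like mechanism described for batch, a controller that knows each job's fair share and resizes jobs through something like the scale subresource, might look like the following. This is a sketch of the idea only; the function and field names are invented:

```python
def rebalance(jobs: dict, capacity: float) -> dict:
    """Scale each job toward its weighted fair share of total capacity.

    `jobs` maps name -> {"weight": w, "size": current allocation}.
    Returns the new size per job; shrinking a job stands in for the batch
    controller telling it to scale down, rather than the scheduler evicting
    its pods directly.
    """
    total_weight = sum(j["weight"] for j in jobs.values())
    return {
        name: capacity * j["weight"] / total_weight
        for name, j in jobs.items()
    }

jobs = {
    "etl":   {"weight": 2, "size": 90.0},
    "train": {"weight": 1, "size": 10.0},
}
new_sizes = rebalance(jobs, capacity=90.0)
assert new_sizes == {"etl": 60.0, "train": 30.0}  # etl scaled down, train up
```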
B
So here's an interesting phenomenon that I don't think a lot of people have keyed into with Kubernetes: the fact that you can create your own batch system atop it, with the notion of pods and the shape of your pod. And, like, I've talked about this sporadically with some of the folks here, but typical batch systems take advantage of... they try to prevent you from going through scheduling again; that's actually the point of high throughput, right.
B
So if your goal is primarily throughput, and you have something that has the same dimensions or similar dimensions, you do what's called claim reuse: you start things up again on that same machine without going through this whole round-robin thing, and that individual scheduler can have its own priority and preemption model outside of this. So if you simplify the process, and you take all of the nomenclature that you have around batch and just take it out of the idea, and you provide a very simple priority...
B
...and preemption model, they can do their own schema internally and use Kubernetes itself, as they've landed on it there, to adjust their weights and shares, right. I think simplifying the model that you've created allows other people to experiment using the primitives that already exist inside the system. Timothy, are you saying that you...
C
B
I think it's a race... I think these algorithms already exist: they try to submit what they can, but they maintain their own queues, right. I'm not saying have a queue within the system; I'm saying that having a queue within each individual subsystem offloads that logic onto another system, so they maintain their queue entirely. Why put the queue inside the core?
C
B
A
...you'd have consistency across your model at the core, but if you wanted to do your own system, you could design that, right. So I think there's this slider bar of what belongs in core and what belongs outside, and I think, for the first pass, we can probably slide that bar down to pretty much just the minimal set of priority and preemption primitives in the core, and then over time we can slide it out, right, as patterns become proven.
B
A
I mean, I kind of agree with Eric's observation about the consistency thing. I mean, you're kind of proposing a two-level allocation scheme, and maybe one way to combine these ideas together is not two-level, but... yeah, I mean, if you're saying you have per-framework schedulers, somebody is deciding how many resources each framework should get, right.
A
So that was what I was referring to as, like, the first-level allocation: it would be how many resources the framework gets, and then the framework gets to decide among its jobs. Or maybe I misunderstood; before, you were suggesting...
B
...something similar to that idea, but it's still central in the core. I think what I'm trying to push back on, ever so gently, is the notion of pre-baking too many concepts into the core. To support pluggability, it should start with very simple atoms and then slowly grow over time. But what the proposal lands on is several core things, and they're all kind of interwoven, based upon some experience that you guys have, right. Yeah.
A
I mean, does that make sense? Just to finish the thought from before: the idea was that you would enqueue these batch objects, and those controllers could then manage resources within them, but it could have sort of a very single-level flavor, the thing that we proposed, and yeah, there are pros and cons of that.
C
I wonder, are there any Mesos users on the call, maybe who haven't talked yet, who could give a Mesos perspective? Maybe talking about long-running, like Marathon, and batch happening in the same cluster, or per-framework schedulers versus our whole-group quota idea. Anyone that wants to talk about that? I don't know...
G
One point, just kind of building on what Tim said before: I think the most important concept to bring in would be to make the queue public, kind of like you have in your system, where you have a first-class thing called the queue, and, you know, the pods that you need to run are easily inspectable from the API. There's one problem that happens when you're trying to gang schedule: eventually you see that you're starved, and you're wondering why.
C
C
H
A different type of batch job to consider is, for example, the background job, say a compaction job, which is not really controller-driven but more notification-driven. I mean, one mode is to say that you trigger it explicitly only through controllers; the other common mode with the systems deployed today, for example Ceph or some storage system, is essentially that the background job is automatically triggered based on, you know, available state, right. And the question is, then: do we need to tie that into the overall scheduling to get the maximal benefit?
H
H
You know what I meant, yes; I mean, there are two modes of running it. One is, for example, a background compaction, a storage background compaction, that you want to do on a time trigger; the other is automatic. And today the systems are deployed more in the automatic fashion: basically, when you start seeing "hey, you're running out of space," you automatically start the tied job internally there on that node, right. So the question is: this has an overall scheduling impact, and the question is how that is conveyed to the controller.
H
A
Yeah, I'm not sure I understand the distinction. I mean, I think the proposal here is that everything would go through some kind of controller, right, and so it would be making the decision about whether anything gets preempted or scaled down because the new work is higher priority. It wasn't clear to me whether the thing you're describing does or doesn't fit into that model.
H
It's more like a use case. The question is: if we're considering revamping the scheduler to consider all types of jobs, then how does this fit into that paradigm, right? Basically, it's a background job. It can be triggered through a controller, or it could be automatic, and deployed systems use the automated fashion more: when there is a trigger threshold, say you run out of space, or some other trigger...
H
...then you basically start that background job automatically on the node. But if you're doing that, and if you convey that information to, you know, the central scheduler, then the scheduler can do a better job of overall resource scheduling, right. So it's a slight variation of a batch job, which I'm bringing up, and the question is whether we want to start thinking about it at this point, or whether you consider it a separate extension effort, you know.
A
I would hope that we could come up with something that could accommodate that. I mean, if it's something that needs to run right away, then you would run it either in a high-priority queue, if it's a batch job, or at a high priority if it's a continuously running job. It's not a continuously running job, so, if you considered it a batch job, then...
A
...a high-priority queue or something like that. And we didn't have any concrete proposal here about deadlines; we weren't trying to do anything like, you know, deadline-based scheduling or time shifting. It was more like, you know, release the job once the necessary resources are available. It's all kind of real-time; it's not like reshuffling the ordering of pending jobs so that the ones that are due soon run sooner.
A
H
Absolutely, locally it will certainly work. But my question is, in such a model, if you're scheduling to multiple nodes, then, you know, is there something which can be done smarter with the scheduling, or is it overall just load balancing the traffic? Probably, I would think, dynamic load balancing of the traffic will be more applicable than scheduling; that's what I'm seeing now, thinking about it more and more, because, I mean, it depends on the time frame, basically how long it runs.
H
Then it may not be a scheduling impact. Suppose it's going to run for a minute; then maybe the scheduler can, in that window, do a better job of scheduling other resources, right. But if the time is, you know, on the order of seconds, then maybe this comes down to the dynamic load-balancing point. It just depends on the type of job.
A
A
C
So, Conor made a comment about the ability to inspect the things that are in the queue, and I think I kind of responded off the cuff today, and I want to go back and understand it better. Do you want to say any more about the use case for affecting resource allocations of things in the queue, yeah?
G
G
B
Yeah, there are multiple scheduler models, not the framework-based designs, that do have global priority sorting, but they do it through periodic evaluation, where all the schedulers push to a collector what their queues are, or what their individual queues are, and then that's globally sorted, right. The problem with those is that it's a different, two-phase model, but it...
A
The idea of how the queues would work that Eric and I were thinking of was, like, you know, the work that a batch job wanted to do would go into this queue, and then there would be some kind of controller that was responsible for managing the resource allocation toward it, and killing pods if it needs to scale down, and things like that. And so, I mean, what you're talking about is kind of how you get the demand signal from the framework into the centralized scheduling framework, and I...
A
E
I have a question about something that I saw in the doc. Why is it that we think that the scheduling of the long-running containers and the containers from the batch jobs, or pods, whatever, are totally separate? In particular, what I'm thinking about is the case where, you know, one sub-organization wants some amount of resource here, and a different sub-organization wants some amount of resource there.
E
You know, the high-level org may just want to say, "you guys get so much; you guys get that much," and each organization might figure out how much they want to spend on long-running and how much they want to spend on batch, and they don't have to keep going back to the high level to reevaluate that balance. And I'm worried that if we start with the separation being batch versus long-running, and carve it up that way, it makes that model hard.
B
A
We were trying to kind of shoehorn this into... I mean, "shoehorn" is kind of pejorative, but, like, into the quota model that we have today in Kubernetes. And, I mean, you know, one way that you can do that is, you know, you give quotas to namespaces, and then you can use your quota either for batch or long-running. And I don't know that we want to tie it to namespaces.
A
But I think that the idea that you can use quota flexibly is a good requirement to have. I mean, I think that the quota needs to be per priority level, just so you don't get an arms race where everyone, you know, figures out what maps to int max and sets their priority to that. But definitely being able to flexibly share the quota between the batch and serving stuff, at least at first thought.
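For context on what "quota per priority level" could look like: Kubernetes later added a ResourceQuota scope selector keyed on PriorityClass, which caps what pods at a given priority may consume. This is an illustrative sketch with made-up names, not part of the proposal under discussion:

```yaml
# Hypothetical example: cap the resources that pods submitted at the
# "batch" priority level may consume in this namespace, independently
# of quota granted to workloads at other priority levels.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: batch-priority-quota
  namespace: team-a
spec:
  hard:
    cpu: "100"
    memory: 200Gi
  scopeSelector:
    matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values: ["batch"]
```

With one such quota per priority level, raising a pod's priority does not buy it more resources, which is the arms-race concern raised above.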
A
C
I, just two things you might be saying, I guess. One would be: you should be able to use the cluster's resources flexibly between long-running and batch jobs. The other thing you might say is: should there be a difference between how I start something I expect to run basically forever, versus something that I know the duration of, or that has a finite duration?
C
A
E
What do you tell your quota? I don't think running it. I want to clarify, David, you said that this ship has sailed before, but I'm not sure if we were considering those differences, basically whether a job finishes at some point or continues to run indefinitely. Basically, something that finishes at some point, of course, is done and releases its resources, and those resources will become available so the scheduler can, you know, schedule more jobs in its place. So, by saying that we want to have a distinction between these, with adding the priority and preemption, are we saying that the scheduler is going to decide? For example, let's say that you have a job with the same priority as a continuously running service, and the scheduler is going to make a distinction between the two when deciding which one to schedule first? So both of them have the same priority, and you're up to try. I'm.
A
I think we probably need to define this distinction better, but they're kind of doing somewhat similar things. Although, the scheduler was always going to be involved in scheduling those kinds of jobs, like the default scheduler is the thing that's scheduling pods. I mean, one way to think of it is that the batch job controller is admitting whole jobs at a time, and then the default scheduler is scheduling the individual pods. I don't know if I answered your question or not, yeah.
C
B
Whatever's there, there is a document, but I think what I'd like to do is take a step back and refine the requirements such that we can all agree upon a very primitive set of requirements. Because if we do that, if we say, like, there's going to be some definition between service and job, and where we can define priority and preemption models across those, if we have a concrete, finite set of, like, five to ten requirements, that would help drive.
B
You know, potential implementations. But I think what we have is, we reversed it. We have a potential implementation, and we don't necessarily have the definitive set of requirements. Because what will happen is, I'm sure the IBM guys will go off and define what exactly they need, right, and, you know, folks at different companies want to do similar things. There's something we can put together. Timothy, I know you have a lot of experience with this. I can help with it, but I can't I.
A
F
I agree with you, Tim, on having a few simple requirements so that we can develop on that. And so, once we have those requirements, maybe we want to update the document, basically itemizing a list based on these requirements, and then we can work on those items and write a bit more in the documents on how the calculation should work.
B
A
Yeah, I think we have some process that will converge within a reasonable amount of time. I think maybe jumping to an implementation was the wrong thing. I mean, I tried to have some of the requirements in the doc, but it probably wasn't complete enough. And I definitely take Tim's comments to heart that the distinction between what is a fixed part of the architecture, and what is, you know, drop-in replaceable, was not clear, and maybe even not completely thought out. So I think that's, in terms of stuff I took away from Tim's comments.
A
We have a P0 request from the node SIG to have a priority mechanism that they can use on the kubelet to drive eviction ordering. And then also, just in general, people have always been asking for a preemption mechanism for continuously running jobs, not just for batch jobs. That's what they want. So I think, you know, like Tim alluded to, there's been decades of research in this area. I think we should try to not spend decades coming up with the design.
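For reference, the kind of priority mechanism being requested here eventually took the shape of the PriorityClass API in Kubernetes. The sketch below is illustrative (made-up names and values), not something decided in this meeting:

```yaml
# Hypothetical example: a named priority level that pods reference by name.
# The kubelet can use the resolved integer value to order evictions under
# node pressure, and the scheduler can use it to drive preemption.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-serving
value: 100000          # higher value = higher priority
globalDefault: false
description: "For user-facing serving workloads."
---
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  priorityClassName: critical-serving  # pod resolves to priority 100000
  containers:
    - name: app
      image: example.com/app:latest
```

Note that a single integer priority serves both consumers mentioned above: eviction ordering on the node, and preemption decisions in the scheduler.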
A
We should figure out how we can get this to converge relatively quickly, at least the high level. I mean, like I said, that was my goal: that we could figure out, you know, what the high-level architecture is, and then people could experiment with some more detailed, lower levels and write separate docs. So, yeah, I don't know, we're kind of out of time, but I mean, I don't know if people want to talk about it again at the next meeting, or maybe, Tim.
A
About the requirements offline, and in the meantime, I can try to clarify the document on some of the asks that people brought up here. I don't know if people have other suggestions.
B
Maybe this week we can define who can set up like an interim meeting or something, just to hash through some of the proposed requirements, and come to it with just a couple of them in mind. And then we can, you know, for all or a subset of folks who are interested, we could just work on that bit. Sure, we could also, yeah.