From YouTube: Kubernetes WG Batch Weekly Meeting for 20220414
A
Good morning, good afternoon, good evening, depending on where you are. Today is April 14th, and this is another Working Group Batch meeting. My name is Manche and I'll be your host. Today our agenda is packed with three topics, so let's start with Aldo and Abdullah and the Kueue 0.1.0 release.
B
Hello, hi. Yes, as you might know, in SIG Scheduling we started working on this controller for job queueing, and we released our first version. This version has native support for the core batch/v1 Job API, using the suspend flag that we worked on in 1.21 through 1.23.
B
And you can define resource sharing, to use the unused resources from one tenant in another tenant, and resource fungibility, which means you can define which models of VMs or resources you want to fall back to if there's no more space. For example, the canonical use case is spot VMs versus on-demand VMs.
B
So you can define a quota for spot, and if you run out of that quota, you can fall back to on-demand. All the documentation is available in the GitHub repository, and we are very happy to accept, you know, feature requests or, of course, bug reports.
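The spot-to-on-demand fallback described above can be modeled in a few lines. The manifest below is only an illustrative sketch: the field names approximate the early Kueue ClusterQueue shape and are not guaranteed to match the exact 0.1.0 schema, and `pick_flavor` is a toy model of the flavor-fallback decision, not Kueue's actual code.

```python
# Illustrative ClusterQueue-like object: flavors are listed in preference
# order, so "spot" is tried first and "on-demand" is the fallback.
cluster_queue = {
    "apiVersion": "kueue.x-k8s.io/v1alpha1",  # approximate group/version
    "kind": "ClusterQueue",
    "metadata": {"name": "team-a"},
    "spec": {
        "requestableResources": [
            {
                "name": "cpu",
                "flavors": [
                    {"name": "spot", "quota": {"guaranteed": 100}},
                    {"name": "on-demand", "quota": {"guaranteed": 50}},
                ],
            }
        ]
    },
}

def pick_flavor(queue, resource, requested, used):
    """Toy model of fungibility: return the first flavor whose remaining
    quota fits the request, or None if every flavor is exhausted."""
    for res in queue["spec"]["requestableResources"]:
        if res["name"] != resource:
            continue
        for flavor in res["flavors"]:
            quota = flavor["quota"]["guaranteed"]
            if used.get(flavor["name"], 0) + requested <= quota:
                return flavor["name"]
    return None
```

For example, a 30-core request when 90 spot cores are already used falls back to on-demand, exactly the behavior described in the meeting.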
B
We are building this with the community, so we want to hear from you about what's top of mind for you. I don't know if you want to add anything else, Abdullah.
C
Yeah, thank you, you covered it. Just one last note: we welcome any contributions, any suggestions on how we can improve. We've got really decent documentation, trying to detail the concepts and common tasks, and it would be nice if you could read through the documentation, make sure that it is well explained, and, if it fits your use case, try it out. If there are gaps, please file issues.
C
We're also happy to demo it at some point, maybe a couple of weeks, or four weeks, after KubeCon.
D
C
Yeah, everything documented should be working: everything documented on the Kueue website, on the GitHub, should be working. You can go through the tasks; those should cover most of the functionality that currently exists.
D
C
That would be great, yeah. Anything would be great: if you find that, okay, the way that we're defining specific things is confusing, or it's not working as expected, or it did not match your expectations.
C
Maybe it's working as we intended it, but it didn't match your expectation. Yeah, feel free to open any issues. Everything is welcome; everything will be triaged and, you know, categorized. Awesome, thank you.
C
E
C
I hope you got a chance to look at it and, if you didn't, I'm just going to give a quick summary of what we're proposing and what we're trying to achieve. The idea here is that we wanted to introduce a new API object that is exactly like a pod, but it doesn't really run: the container in it wouldn't be started by kubelet.
C
It wouldn't even create the, you know, cgroups or anything else. We're naming this API object Reservation. It can be created and would allow users to reserve, like, you know, some resources on the cluster that the scheduler and kubelet are aware of, and would allow a future created pod to run in that reservation. The motivation for this is twofold.
C
The first one is pre-provisioning of resources in anticipation of spikes. For example, if a service knows that every noon it's going to get a spike of traffic, it can create these reservations to scale up the cluster. So imagine that you're running in an autoscaled cluster, and autoscaling is sometimes expensive: creating the VMs and getting the node ready.
C
But you want to amortize that cost, and so you would try to trigger a scale-up. Currently, the way that people do it is they create pause pods with low priority, and then the actual workload pods that get created have higher priority and would basically preempt the lower-priority pods. The problem with this approach is that it's indiscriminate, right: any higher-priority pod can take that pause pod's space.
C
With this proposal, you'll be able to make a reservation and basically label it and say: okay, this reservation can only be taken by the pod that is selecting it. And so you'll be able to say: okay, I'm going to scale up the cluster by creating these reservations, and these reservations will already exist; but if some other service tries to schedule its pods, they will not be able to, like, you know, take up that space.
C
They will continue to be reserved for that specific, you know, spike that you're anticipating.
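The low-priority pause-pod pattern mentioned above looks roughly like this in practice; the manifests are written here as plain Python dicts, and the priority value, names, and resource sizes are illustrative choices, not a prescribed configuration.

```python
# A PriorityClass below the default (0), so any real workload preempts it.
low_priority = {
    "apiVersion": "scheduling.k8s.io/v1",
    "kind": "PriorityClass",
    "metadata": {"name": "balloon"},
    "value": -10,
    "globalDefault": False,
}

# A "balloon" pod: the pause container does nothing, but its resource
# requests hold node capacity and can trigger the cluster autoscaler.
balloon_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "balloon-0"},
    "spec": {
        "priorityClassName": "balloon",
        "containers": [{
            "name": "pause",
            "image": "registry.k8s.io/pause:3.9",
            "resources": {"requests": {"cpu": "4", "memory": "16Gi"}},
        }],
    },
}
```

The weakness the speaker points out is visible here: nothing ties this reserved capacity to a particular future workload; any higher-priority pod can preempt it.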
C
The second use case here is managing resource allocation of queued jobs, and this is basically similar to the case that we have in Kueue. So, in Kueue we are basically proposing to manage jobs, and we're doing that by managing when they should start. Now, if a job gets to the head of the queue, it gets started by basically unsuspending it.
C
There is still no guarantee that it will actually get the resources on the cluster, right, because this resource manager is not the scheduler; it's not going to guarantee that the space exists on the cluster. It's just a quota manager, again similar to even resource quotas, the normal ResourceQuota thing: if you have quota in the cluster to create x number of pods, it doesn't mean that those pods will actually get the resources that they want, right?
C
But with this approach, you could imagine that this, like, you know, controller would be able to create a reservation before starting the job, make sure that you get the actual resources on the cluster, and then start the job. That job would select those reservations. So those are the two high-level, like, use cases that we have in mind.
C
Others mentioned that it could be, like, useful, for example, if the reservation on the node would pre-pull the images as well, which is also, like, an optimization. Sometimes images are so huge that they are expensive to pull, adding to startup times. So again, if you're anticipating a spike in traffic, it's not only the cost of provisioning the resources and bringing up the VMs and the node, but also pre-pulling the images as well.
C
So the proposal here, the initial API, is fairly simple. As I mentioned, you've got the spec and the status, but the spec is really just a pod template, and it's basically describing what the future pod that will take the place of this reservation is going to look like. And in the pod spec we would have a reservation affinity: it's basically a label selector used when you create the pod.
C
You will say: this pod is schedulable in place of these reservations. And so the idea here is that kube-scheduler will be aware of the Reservation API.
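Putting the two pieces together, the proposed shapes might look like the sketch below. This is entirely hypothetical: `Reservation` and `reservationAffinity` are names from the proposal under discussion, not an existing Kubernetes API, and the group/version is a placeholder.

```python
# Hypothetical Reservation: the spec is essentially a pod template
# describing the future pod that will take this reservation's place.
reservation = {
    "apiVersion": "scheduling.k8s.io/v1alpha1",  # placeholder group/version
    "kind": "Reservation",
    "metadata": {"name": "spike-0", "labels": {"purpose": "noon-spike"}},
    "spec": {
        "template": {
            "spec": {
                "containers": [{
                    "name": "placeholder",
                    "resources": {"requests": {"cpu": "2", "memory": "4Gi"}},
                }],
            },
        },
    },
}

# Hypothetical pod spec fragment: the future pod opts in with a label
# selector that matches the reservation's labels.
workload_pod_spec = {
    "reservationAffinity": {"matchLabels": {"purpose": "noon-spike"}},
    "containers": [{"name": "app", "image": "example.com/app:latest"}],
}
```

The label selector, rather than a direct name reference, is what lets one spike controller own a whole group of reservations that many workload pods can consume.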
C
Just like it's aware of pods, it needs to schedule both of them. But one thing that is special, for pods with reservation affinity, is that they will...
C
So that is the high-level idea. At the pod level, it is a concept that's similar to alloc as well, as someone mentioned before, and I think it exists in other schedulers. Again, it's going to be an API that mirrors a pod. You can think of an alternative solution here; an alternative, you know, proposal is to have what we call, like, a fake pod: an actual pod, but you could add a mode to it so that this pod doesn't run, for example, but it would... yeah.
F
Hi, yes, I have a question. Maybe you explained that and I didn't understand. The reservation: is it ensured that the pod will run at some point, or can it stay reserved forever and never start? In that case, would there be some timeout or something like that? Because I understand that the resources anyway get kind of assigned to that pod, even if the pod is not running, so that the schedulers see that the resources are taken already by that pod. So how would that work?
C
So the resources will be set aside as long as the Reservation object exists and is assigned to a node. Think of it as a pod that is scheduled on a node; think of it as a pause pod, basically: a pod that is doing nothing, just standing still on the node, taking up resources that the scheduler, and kubelet as well, knows are not available on the node.
C
So a reservation is exactly the same thing. And so the idea here is that you create the reservation before you create the pods that will take the place of that reservation, and when you don't want that reservation, you delete it. You can also imagine reusing the reservation: you create the reservation, for example, for a job.
C
The job needs to complete 1000 instances, but you're doing it 100 at a time, so you could create 100 reservations. Those will stay there, and the job controller will, all the time, continue to make sure that you have 100 pods running.
C
When it completes one instance, it will recreate another pod to complete the next one. Those pods would basically be able to schedule in place of these same reservations, back to back. So you would create, like, 100 Reservation objects, and then the job controller would create the first 100 pods, which will schedule in place of those reservations; and then, as they complete at different points (it doesn't matter), a new pod will be created, and it will be able to take the place of that reservation.
C
Now, assume that you created only 50 reservations and you have 100 pods that select that reservation: only 50 will be scheduled, and the other 50 will not, because there are not enough reservations for the other 50 pods that select the reservation.
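That 50-reservations-for-100-pods behavior can be captured in a small toy model. This is only a sketch of the capacity rule just described, not the proposed scheduler logic: each reservation admits one pod at a time, so only `min(pods, reservations)` pods schedule.

```python
def schedule(pods, reservations):
    """Assign each pod to a free reservation whose labels match the pod's
    selector; return (scheduled, pending) pod-name lists."""
    free = list(reservations)
    scheduled, pending = [], []
    for pod in pods:
        match = next((r for r in free if r["labels"] == pod["selector"]), None)
        if match is not None:
            free.remove(match)          # a reservation holds one pod at a time
            scheduled.append(pod["name"])
        else:
            pending.append(pod["name"])  # no matching reservation left
    return scheduled, pending

pods = [{"name": f"pod-{i}", "selector": {"job": "demo"}} for i in range(100)]
reservations = [{"name": f"res-{i}", "labels": {"job": "demo"}} for i in range(50)]
scheduled, pending = schedule(pods, reservations)
# 50 pods land in the 50 reservations; the other 50 stay pending.
```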
F
C
No, that does exist. Like, the idea is that you create the reservation before creating the job, before creating the pods, right. Think of the reservation as a sub-node: it's basically, like, a set of resources on a node that you set aside and that can only be used by a future pod that selects that reservation.
C
F
And yeah, so the person who is creating the reservation has to delete it at some point. Otherwise, I mean, there is no automatic way to delete, or there shouldn't be any automatic way to delete, a reservation that's been there since, like, forever and is not allowing other jobs to run. So that's basically a hundred percent on the user.
C
Another mode of operation is, for example, use-once: like, if one pod uses that reservation, once that pod exists, then delete the reservation automatically.
F
Yeah, I mean, I think that would be very useful to have, because otherwise you reach points where the cluster cannot, or might not, be used completely, even if no pods are running. And I mean, I like the connection to, like, bursting scenarios, where you have, at some point in time...
F
You know that there can be a burst of pods. So even in that case, having some automation, or some way to specify a time frame where this should happen, would make it cool. Or at least that's my opinion on it.
C
Right. The idea here is, again, to split resource provisioning from application startup. Like, right now they are merged into one through the pod spec, right. With this approach, you are giving some flexibility to a resource manager, like job queuing, to basically manage resources, but not manage the application, which is what it wants, right: we want to be application agnostic.
C
It doesn't care about the kind of job that will be started, but it wants to manage the resources. And in the way that you manage resources, Kueue would basically create the reservation, and then, by doing so, you can also implement fungibility. Like, Kueue says: okay, this job should run on spot VMs, and so you would create a reservation.
C
You create a reservation with a, you know, with an affinity to spot VMs, right, and so you're basically guiding the job that will start, forcing it to land on a spot VM, without modifying the original job template to include these node affinities.
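The "inject the spot affinity via the reservation" idea could look like the fragment below. The node affinity syntax is standard Kubernetes; the surrounding `template` shape follows the hypothetical Reservation proposal, and the node label key is illustrative (real clusters use provider-specific keys, e.g. `cloud.google.com/gke-spot` on GKE).

```python
# Hypothetical reservation spec carrying a spot-VM node affinity, so the
# job template itself never has to mention spot.
spot_reservation_spec = {
    "template": {
        "spec": {
            "affinity": {
                "nodeAffinity": {
                    "requiredDuringSchedulingIgnoredDuringExecution": {
                        "nodeSelectorTerms": [{
                            "matchExpressions": [{
                                "key": "node.example.com/capacity-type",  # illustrative label
                                "operator": "In",
                                "values": ["spot"],
                            }],
                        }],
                    },
                },
            },
            "containers": [{
                "name": "placeholder",
                "resources": {"requests": {"cpu": "8"}},
            }],
        },
    },
}
```

A pod that later selects this reservation lands on a spot node by association, which is the point the speaker makes: the resource manager steers placement without rewriting the workload's template.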
C
It's kind of a powerful concept: when you split these two things, you're able to control your resources in the cluster better and give some guarantees, basically, around them. Yep, got it, thanks.
A
How can you ensure the full replaceability of the reservation pods with an actual workload pod in case, for example, the cluster is less busy? Because, obviously, when you have a fully packed cluster, the lower-priority pod will get evicted in favor of the new one; but in a cluster which is not that busy, where you have free resources, how would you ensure that the workload is actually replacing the reservation pod?
A
Well, yeah, because you're basically saying that you are reserving this many resources, and this applies strictly to the cases where you have pretty packed clusters. But let's assume a situation where we are at, I don't know, 80-90% usage of the cluster, so there's still 10-20% of free resources.
A
How can you ensure that, whenever an actual workload starts using its resources, the reservation, in parallel, will start removing the pods that were reserving the resources?
A
What would the interaction between, I don't know, let's say the job controller and the reservation controller look like, in the case when one is creating the actual pods and the other one should, in response, be limiting the reservations or replacing them?
C
So the job controller itself is not aware of the reservation controller; it has nothing to do with it. The only thing that it does is create the pods based on your job spec. Now, the only thing in the spec that the user could include is whether they want to use a reservation, and you do that by adding this label selector, the reservation affinity.
C
Now, we can have a number of modes of operation. And from there it's the scheduler's, you know, issue. Imagine you didn't have any reservations created: the scheduler sees this pod with a reservation affinity.
C
You could have that implemented as a filter, and so, if there is no reservation in the cluster that matches this label selector, the pod will continue to be unscheduled and pending until someone creates the reservation. Again, it could be, like, a cron job, a quota manager, a queue manager, whatever it is; their pod would continue to be pending.
C
Once the reservation is created, the reservation will first get scheduled by the scheduler. Again, the scheduler is not going to really do matching or anything; it will do the same thing it is doing right now, schedule one entity at a time, whether that is a reservation or a pod. It will schedule the reservation, and once this reservation is scheduled, it will pick up, like, the unscheduled pods, one of which is going to be this one that was previously unscheduled.
C
Now that filter, matching the label selector to an existing reservation, will succeed, and so the pod will be assigned to the node where the reservation was scheduled.
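The filter step just described can be sketched as a small function. This is a toy model of the proposed behavior, not scheduler-framework code: a pod with a reservation affinity passes the filter on a node only once a matching reservation is already scheduled there; otherwise it stays pending everywhere.

```python
def filter_node(pod, node, reservations):
    """Return True if `node` passes the reservation-affinity filter for `pod`."""
    selector = pod.get("reservationAffinity")
    if selector is None:
        return True  # no affinity: normal scheduling rules apply
    return any(
        r["node"] == node
        and all(r["labels"].get(k) == v for k, v in selector.items())
        for r in reservations
        if r["node"] is not None  # only already-scheduled reservations count
    )

pod = {"name": "worker", "reservationAffinity": {"purpose": "noon-spike"}}
# Once this reservation is scheduled onto node-a, node-a passes the filter.
res = {"labels": {"purpose": "noon-spike"}, "node": "node-a"}
```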
C
Now, I guess one thing here that I still didn't discuss a lot is who's going to create the reservation.
C
Maybe that is what makes things a little bit unclear. So one thing that I had in mind was, again: these queue controllers or quota managers can benefit from this, or the controller that, you know, manages these spikes. Like, it knows when the spike is going to happen; I don't know, you schedule, like, a cron job that creates this reservation. And if you want to manage a group of reservations (and this is what I'm, like, suggesting to do here), you could create what I'm calling, like, a reservation set.
C
It's basically like a ReplicaSet: it's a reservation controller that continues to make sure that you have an x number of reservations at a time, with these specific labels. But again, another question is who's going to create that reservation set itself, and who's going to delete it? That, again, I think, in my mind, comes down to how this whole concept will be used.
C
Typically, it will be used by, like, you know, a resource manager that doesn't want to replace the scheduler, that doesn't want to replace, you know, any parts of kubelet; it just wants to take control over part of the resources in the cluster in, how do I say it, an integrated way with the rest of the, you know, ecosystem and Kubernetes.
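The "reservation set" idea, a ReplicaSet-like controller for Reservation objects, reduces to a standard reconcile loop. Everything here is hypothetical (the ReservationSet shape and field names are the speaker's sketch, not an existing API); the function just computes what to create or delete to converge on the desired replica count.

```python
def reconcile(reservation_set, existing):
    """Return (to_create, to_delete) so that the number of reservations
    carrying the set's template labels converges on spec.replicas."""
    want = reservation_set["spec"]["replicas"]
    labels = reservation_set["spec"]["template"]["labels"]
    owned = [r for r in existing if r["labels"] == labels]
    to_create = [
        {"name": f'{reservation_set["name"]}-{i}', "labels": dict(labels)}
        for i in range(len(owned), want)   # empty when we already have enough
    ]
    to_delete = owned[want:]               # surplus reservations, if any
    return to_create, to_delete

rs = {
    "name": "spike",
    "spec": {"replicas": 3, "template": {"labels": {"purpose": "noon-spike"}}},
}
```

As the discussion notes, this only answers "how many"; who creates and deletes the ReservationSet itself remains the resource manager's job.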
C
H
Yeah, so I have a couple of questions. I think you sort of mentioned pod overhead, right? So do you think the existing pod overhead is going to have problems because it is statically assigned, and you want to make it dynamic? Is that why you want to introduce a new API for this?
H
I think the goal here is to make sure that the scheduler takes into consideration the extra amount of resources that are needed for a pod, and one way to represent that is using pod overhead, right? No? I mean...
H
I see. And, like, you do not want to wait till the pod has been created?
C
That is exactly it: you don't want to wait for the pod itself to be created; you want to reserve the resources before that.
C
I see. And, like, again, take the case that I mentioned: for example, now people use pause pods, right, in order to reserve resources. They create a dummy pod with a pause container, a container that sits there on the node doing nothing, but, you know, holding on to some resources that a future, higher-priority pod will take when it gets created, right. What the pause pod did is basically trigger the autoscaler to scale up resources.
C
And it basically made sure that there is an amount of resources that is set aside for a future workload. This is exactly the same, but you would have a lot more control over it, because you can control which future workload, when it is created, can take these resources that are already reserved on the node. And you don't have to go through preemption, like with the pause-pod approach, where the future pod will have to trigger a preemption, right.
H
Got it. So I think what you are trying to say here is: you do not want those pause pods to be created, and the pod overhead, or the resources that are associated with the pause pod, to be used. Rather, you want to have some sort of way to express that in the future I may need these many resources, and I would like to reserve them now. Exactly, right.
C
And another use case here: imagine, like, you have this quota controller, where you can express that you have x amount of resources, right. And a job goes through that quota controller, and it says: okay, I have 100 cores available, and I have a job coming in that needs 100 cores.
C
Okay, I have enough quota; I'm going to allow the job to start. But in reality, on the cluster itself, you can't make any, you know, guarantee that that job will actually get the resources, right, because you could have another pod being created outside that quota controller that can take up the resources on the cluster.
C
So that creates these, like, you know, race conditions. So, in order to harden the whole thing and make it, like, you know, more robust, and give more guarantees over, you know, the resource management of the cluster...
C
This reservation concept can give you that guarantee, because you don't need to give users the ability to create reservations. You just need to give them the permission to create pods, and them creating pods does not guarantee that those pods will get scheduled, because you could have a policy agent, for example, that says no pod can be created without a reservation affinity set, right. And so, as an admin, you can have these, you know, controllers that manage these resources, and you can provide better guarantees over who can use what, and at what time. Got it.
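The policy-agent rule just mentioned is only a decision function; in practice it would live in a validating admission webhook or a policy engine, and the `reservationAffinity` field it checks for is part of the hypothetical proposal, not a real pod field. A minimal sketch of that decision logic:

```python
def admit(pod_spec):
    """Admission-style check: reject any pod that does not opt in to a
    reservation, so only reservation-backed capacity can be consumed."""
    if "reservationAffinity" not in pod_spec:
        return False, "pods must set a reservationAffinity"
    return True, ""
```

With this in place, granting users `create pod` permission no longer lets them grab unreserved cluster capacity, which is the guarantee the speaker describes.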
H
C
Yeah, that is, that's a good question, I think.
C
That is what we discussed a little bit a few minutes ago, which is that you could have modes of operation. You could either say: okay, a reservation should always exist; it doesn't get deleted, because you could reuse it. And the example I gave was a job with a thousand completions but 100 parallelism, right, and so the job controller will eventually have to create 1000 pods, but only 100 in parallel.
C
You could also say: okay, a reservation has, exactly as you mentioned... like, imagine that you're trying to optimize the cluster usage. You have a queueing system where users must define, you know, the maximum runtime. And so you don't really need to go and delete the pods, or have a custom integration...
C
With the workload controller, to get the pods out you can basically delete the reservation, and so, even if the controller continues to create the pods, that job is not going to get scheduled, right, because there's no reservation.
C
So that's basically two things that the reservation can offer you here, as I mentioned before. First, by association, you can do scheduling, right: you can inject whatever affinities you want on the reservation, and so the future pod will just follow it, without that pod having that affinity. The second thing is preemption. And those two properties would allow us to manage...
C
Custom workload controllers, without, you know, explicit integrations with that controller, to introduce, for example, like, you know, preemption, or the ability to inject affinities into their pod spec, etc.
A
Ravi, yeah, I wanted to be mindful: we have eight minutes left. Swati has had her hand raised for a bit, so I want to give her three minutes for her question. I would still want to give Aldo a chance to speak about the last topic in the last five minutes.
G
E
Yeah, I think I have a quick one. So I see in the pod spec, for a pod that wants to use a specific reservation, we specify a match label, say, like, foo: bar. Is there a way to specify a particular reservation itself, like by its name, or is it always going to be by a label selector?
C
I guess the canonical way is to set the reservation label selector, but we do have that reservation name here. The assumption is that this would be set by the scheduler, so that you create that association, right, so that kubelet knows, okay, this pod and this reservation are equivalent, and the scheduler knows that these two things are one, so that we ensure that we don't double count. Right, right.
E
And then, in the case of, say, a reservation set, how would that be assigned? Again, the same way: the scheduler assigns it the reservation name?
C
Well, a reservation set is just a controller that makes sure that you have an x number of Reservation objects. The scheduler would not be aware of the reservation set, just like it's not aware of the ReplicaSet.
E
C
For a reservation set, we're proposing here that you could have a reservation... I'm calling it a bundle; think of it as, like, the pod group.
C
Basically, you could have, like, multiple templates, and for each template you create this number of replicas. I don't know if you can see my screen, but yeah, so this is the idea.
D
C
E
Okay, I was actually thinking, and maybe you've put some thought into this: we're doing it kind of, you know, analogous to the ReplicaSet kind of framework. Have you thought about, say, maybe reservation classes? So we create a class, and then pods could refer to those classes. As opposed to now, where at this point in time a reservation is tied to a pod and there's a one-to-one relationship, we could have, like, a class, and many pods could refer to those kinds of reservations.
C
I think this case is represented here, but can you please, like, comment with this idea, so that we can explore it more? I like the concept of a class in general; it groups things, and so it should simplify some of the things. But yeah, it would be great if you can comment.
E
C
Thank you so much for your questions. I just want to give the last five minutes back to Aldo, for the last item.
A
Yes, correct. Let me share my screen quickly, and let's go back to the agenda, and actually to the issue that Aldo brought up. Aldo, do you want to talk about it?
B
Yes, just wanted to bring awareness of this feature.
G
B
The issue is one-five-seven, and the use case here is that, if you have a very big job, it might just reach the backoff limit when some of the indexes completed and some of the indexes didn't, and you just want to, you know, finish up the job by re-running the indexes that failed.
B
One kind of open question is whether we want to include the failed indexes in the job status. Today we only include the completed indexes, and the use case would be: once we know the job failed, we just copy and paste the list of indexes that failed into the spec of the new job. There are some performance considerations, or rather limitations, in the storage, that we need to take into consideration.
B
So that's kind of, like, the contention there. And there is a slightly related feature request where we consider the backoff limit per index, instead of for the entire job. So yeah, a few...
A
A few feature requests are floating around about this feature, so I wanted to bring attention to it. Please comment if you see something that is useful to you, and if you're willing to volunteer to implement it, that's also welcome.
A
Yeah, before we actually proceed with an eventual implementation, I would like to hear more inputs with regards to what cases this solves. Specifically, I don't feel like we need to expose the failed indices, because, basically, whether they failed or did not complete, you can easily calculate that by looking at what completed and what is not in the completed set.
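That calculation is straightforward today: for an Indexed Job, `status.completedIndexes` is a compact string such as `"1,3-5"`, so the not-completed set can be derived without a new status field. A minimal sketch:

```python
def not_completed(completions, completed_indexes):
    """Return the sorted indexes in [0, completions) that are absent from a
    completedIndexes string like "1,3-5" (empty string means none done)."""
    done = set()
    if completed_indexes:
        for part in completed_indexes.split(","):
            if "-" in part:
                lo, hi = part.split("-")
                done.update(range(int(lo), int(hi) + 1))  # inclusive range
            else:
                done.add(int(part))
    return [i for i in range(completions) if i not in done]
```

For a job with `completions: 6` and `completedIndexes: "1,3-5"`, this yields `[0, 2]`: the indexes a follow-up job would need to re-run.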
A
I'll be very cautious with introducing something like retries that allow users to pick and choose those indices. I'll be more curious about digging through why the job did not succeed for them, or what we could improve to make the job controller, as it currently stands, more reliable, either, like you mentioned, by introducing the backoff per pod, or some kind of tweaks around this, rather than allowing users to pick and choose the indices that they want to run. I'll try to comment there. Right.
C
So, sorry, no: I think one use case here is that the failure might not be related to the job controller itself. Like, imagine you're running an indexed job, and some of the indices work on corrupt chunks that keep failing, and then the user will go ahead: okay, I fixed these things; maybe it was a wrong path to a chunk, or something like that. And one doesn't want to restart the whole job, just to, okay, continue with these indices.
A
Yeah, but then I would probably just try to figure out how we can identify those sooner rather than later, and be able to react, as in fix those ad-hoc problems while the job is running, even if it would take longer to finish the job. I don't know, but yeah, it's a valid use case, definitely. Okay, I don't want to take up anyone any longer; we're already one minute past the designated 45 minutes.
A
Thank you very much, all. That was a very nice talk, with all the topics that we had today. See you again next time.