A: Yeah, so welcome everyone to our bi-weekly meeting. Today we have Abdullah from Google presenting the work that has been done on supporting Kubernetes-native job queueing. Abdullah, maybe you can also briefly introduce the Batch Working Group and the work that has been put in there, because it's on topic. If you can say a couple of words; the link has been circulated, but just to put it in context as well. Otherwise I suggest we listen to Abdullah, and then we should have plenty of time for discussion. This should be a good one.
B: Okay, sorry, this camera... this is the one that's working. Thank you for having me. I'm Abdullah, I'm a contributor to Kubernetes, a co-chair of SIG Scheduling, and part of a recently formed working group within Kubernetes called the Batch Working Group.
B: I work for Google, as part of the GKE team, focused on batch as well. So, as Ricardo mentioned, a month or two ago we proposed to form a new working group within Kubernetes to focus on batch, to reduce the fragmentation of efforts related to batch workloads within core Kubernetes, and to try to make batch a first-class citizen of Kubernetes.
B: We feel that until recently, batch has been a guest on the platform rather than at home the way services are, and we want to push that use case forward. Based on the charter we've agreed on, the goal of the working group is threefold.
B: One is to come up with reasonable APIs to start jobs. We already have an API, the Job API, but we're looking to improve its capabilities, its reliability and scalability, and its applicability to various types of batch workloads, and to see how it can be reused to cut down the fragmentation we have in the community around building job APIs. For example, how can we run MPI workloads on top of the Job API? How can we run TensorFlow or reinforcement learning workloads on top of it? So that's one pillar.
B: The second pillar is job-level management. Most of the components within Kubernetes are pod-centric, whether that's the scheduler or the autoscaler.
B: Even quotas mostly work at the pod level, and this does not lend itself well to jobs and batch workloads in general, where most of the time you want to manage the whole job, not just a single pod. This is part of what we propose and what I'll be discussing in this presentation on Kueue. The third pillar is mostly focused on HPC, mostly at the node level.
B: That means enhancements to use special accelerators and special types of hardware, and how that works with scheduling, like NUMA-aware scheduling, or how we can better use FPGAs and so on. They have their own needs, like resetting the FPGA before a pod can use it, and all these kinds of hooks that allow special hardware to be better used within the Kubernetes ecosystem.
B: Do you have any questions about the working group? Thanks, Ricardo, for posting the PR for the working group; the charter has been merged. Once I'm done I will also post links to the mailing list, the Slack channel, and the poll where we're trying to decide what time the meeting is going to be. It's probably going to be on a weekly or bi-weekly basis, for one hour.
C: On that note, last time I checked in with Klaus, I think I was the only person who'd responded to the Doodle to try and pick the time. If people are interested in the conversation, there is a channel in Slack. What is it called, batch-wg or wg-batch?
C: Got it.
B
Yeah-
and
it
is
like
to
your
point,
one
of
the
basically
things
that
we're
planning
to
do
as
well
is
to
try
and
help
defragment
the
community
try
to
reach
out
to
cnc.
And
that's
what
we're
trying
to
do
here
as
well.
Presenting
what
we're
planning
to
discussing
within
core
kubernetes
to
the
larger
community
and
make
sure
we're
aligned
in
in
our
efforts
got.
B: I don't know about the CNCF one; I didn't read any charter for that group. But the Kubernetes one is focused on core Kubernetes enhancements: what do we do with the core Kubernetes feature set to make it easier to run batch workloads. The people working on it are leads in SIGs within Kubernetes. So the working group is going to set recommendations for how we can improve core Kubernetes to better execute batch workloads.
B
The
individual
sigs,
like
six
scheduling,
sig
apps,
auto
scaler,
sig
node,
will
take
on
the
execution
for
these
enhancements,
and
so
that
is
that
is
the
goal
of
the
kubernetes
working
group.
But
I
can't
speak
too
much
about
the
cncf
one.
A: Maybe I can say a couple of words. We had a few discussions about this in the TOC as well, and the goal is really to try to promote progress in this area as much as possible. That can happen in the Kubernetes core, like what Abdullah was describing, but there are quite a lot of initiatives in other CNCF projects that can have a different release cycle or focus than the Kubernetes core necessarily has. So the goal is really to see how these two groups go.
G: We are also declaring a lot of things that are very important for a broader batch approach as out of scope for the Kubernetes group. For example, there is no intent in Kubernetes to handle workflows, or many other aspects of job orchestration, or the experience of how the researcher or data scientist interacts with the whole stack. So we do indeed need to coordinate between the two working groups, but there is a lot of scope of work to be done.
G: That is way broader than Kubernetes, and there is a need for leadership and coordination to drive it, which is declared as out of scope for the Kubernetes working group. This working group wants to make sure that the Kubernetes primitives work very well with things outside. We do need to work on drawing the lines and making sure it's well coordinated, but I think it makes sense to have separate, very focused, and clearly defined work streams.
C: Hello, you'll be pleased to know that I added MCAD to the charter that we drew up for the higher-level CNCF working group. That's where we'll talk about things like Armada and Volcano and MCAD, and how we should all be working together on those pieces.
C: A huge part of working together will also be watching, reacting, and contributing to what the Kubernetes working group is doing, because that will play into all our eventual aims. But yeah, as we know, there's a whole bunch of other pieces to think about, like multi-cluster stuff and how we've all solved that. Those discussions can take place in the CNCF working group that we're trying to put together.
B: Sounds good. With that said, I'm here mostly to present Kueue. This is a new proposal that, again, focuses on the second pillar I mentioned for the Batch Working Group within Kubernetes, which is job-level management within core Kubernetes.
B: I wanted to start with definitions, but this audience is already aware of what a job is and how we define it. Quickly, though: we're thinking of jobs as computations that run to completion, basically a group of pods that run either independently, like Monte Carlo simulations, or collaboratively, like an MPI job, or even a reinforcement learning job where you have workers and drivers.
B: A job is sometimes flexible on when it could start, on location, like which zone it could run in, or even on the type of resources. Type of resources could mean the type of provisioning, for example spot versus on-demand, or even the type of accelerator, like whether it could run on GPU model X or Y.
B: On-prem clusters do have flexibilities in one way or another, but in the cloud this becomes an even bigger issue, because in the cloud we have way too many types of resources that users will look at to manage the trade-off between performance and cost. So this is a focus for us as well; it's a problem we want to solve. At a higher level, what is job queueing, or the type of job queueing we're looking at, and how do we define it?
B: Basically, what we're trying to do is have mechanics and mechanisms to manage access to a limited pool of resources shared by multiple tenants. What job queueing does is decide which jobs should wait and which can start now, based on a number of constraints.
B: And why do we need job queueing? On-prem this is clear: you have static and sometimes smaller-scale clusters. But in the cloud it's sometimes less clear why you need queueing, since people sometimes think of the cloud as infinitely scalable, able to absorb every single workload you have. That's not true. There are a number of aspects here. One is utilizing discounts.
B: Cloud providers offer discounts if you pre-declare how many resources you want. In Google, for example, we have something called committed use discounts: you can pay, for example, to use a number of cores over a three-year or one-year period. Now that you've paid for them, you always want to have them used, and you don't want to use more than that. So you've basically created your own static cluster within the cloud, and you want to manage access to those resources.
B: Another thing is that users have spending limits. They can't just keep executing every single job that gets created; they want to control their budgets, and so they have spending limits. They also want to introduce per-tenant limits: the users that run batch workloads are not individuals but big organizations with different research groups, etc., and you want to set limits per tenant, even in the cloud. And last but not least, we have cluster size limits; Kubernetes itself can't scale infinitely.
B: In GKE we do support up to 15,000 nodes, but in many other instances you can't scale to more than 5,000 or even 1,000 nodes, depending on your workload.
B: So what exactly do we think users want from queueing? Obviously, you want queueing: jobs that don't fit the existing capacity should basically wait and execute when capacity becomes available. You want knobs where users can decide on execution order, and knobs for fair sharing of the available capacity between multiple tenants.
B: Also, budgeting is not only about how many resources you can use at a specific point in time, but also over a period of time, plus the ability to set policies for who can use which types of resources and up to what limit. We have customers, for example, that open the tab for their users on preemptible or spot VMs.
B: There you can use as much as you want, run as many jobs as you can, but when using on-demand you have a specific limit, or you can't use it at all. Same thing with GPUs: those are expensive, scarce resources that you don't just hand to any tenant. And the last one is flexible placement, again across different resource types, locations, and time.
B: When your job is submitted to the queue, you want the ability to start the job based on what is available in your infrastructure and on the flexibility of the job as declared by the user.
B: Any questions? Do those requirements resonate? Do they capture the use cases that you have in mind for queueing?
B: I will get to the APIs in a second, but conceptually it's not tied to that. Initially we will implement it as a within-cluster controller, but I can imagine it being run in a nodeless cluster, for example, or running as a controller outside that manages multiple clusters and watches for jobs being created across them.
B: We need to fine-tune these concepts a little bit, but I don't see a problem having this controller running and watching over multiple API servers, trying to manage resources across multiple clusters.
I: So it's not tied to the single-cluster story, I guess. Is that a use case that you want to focus on for the first release, or does it come later?
B: The MVP is focused on running on the master of a single cluster; that is the MVP. The next step is how this can run outside to manage multiple clusters.
A: I'll add those links in the agenda as well, so that we can go back to them.
C: Thank you. Actually, one thing on what users want: I would add speed, or the scale of things.
C: I don't know that that's explicitly listed here, but when we tried to do this with just the regular Kubernetes scheduler, or by building a custom scheduler a couple of years ago, it just wasn't fast enough, especially when we scaled the cluster up to a really big size. So all of these things speak to us for sure, and then we have a few others.
B: Scaling to, I don't know, thousands of jobs or a million pods, that kind of scale?
B: You mentioned that this captures a subset of your requirements. If there are other requirements, please post them, maybe in the chat; we just want to make sure that we take them into consideration as we move forward.
B: So why a new controller? As you've noticed, plain Kubernetes doesn't really lend itself well to managing jobs with respect to queueing. In general, for anything you create on Kubernetes, the whole cluster is going to try to reconcile itself to create the pods, schedule the pods, and start the pods. There's no way to say: if there aren't enough resources, just don't do anything and wait until resources become available.
B: It will continuously attempt to do that, and it will work itself to death, especially when you have thousands or hundreds of thousands of jobs being created. And Kubernetes quotas are not dynamically enforced; enforcement basically happens at resource creation time. So it's a question of whether you are able to create the job in the first place or not, and if you don't have quota to create the job, then there's no place to park it until resources are available to run it.
B: Volcano is one of the most famous schedulers for gang scheduling. Our issue with Volcano is that it re-implements a number of existing functionalities. It is a scheduler, so it's a second scheduler running side by side with kube-scheduler, and that causes a number of issues related to race conditions, to the re-implementation of some features, and to how it can catch up with the features that we are actually pushing in upstream Kubernetes.
B: The second thing is that it has its own job APIs; it has a job lifecycle controller, so again it re-implements the Job API that we have in core Kubernetes. The other thing is that it lacks a clear integration with autoscaling. One important design aspect we have in Kueue is that it needs a clear integration with the cluster autoscaler, because that is an extremely important aspect of managing jobs: you want to allocate resources for the whole job.
B: How do we do that before the job actually starts? And how do you send it to a specific location, or a specific GPU model, or a specific CPU, or a provisioning type like spot versus on-demand? That's basically the last one: it lacks clear support for resource fungibility or flexibility. So these are the issues we have with Volcano. Here I also want to mention that GKE, Google Cloud, had a previous effort that has been decommissioned.
B: It was called Batch on GKE, a couple of years ago. It had similar issues: it reinvented scheduling, job lifecycle management, and autoscaling. The other thing is that it was closed source, so it was hard to meet customers' portability requirements. Customers want to run this on-prem; there are a ton of batch workloads that will continue to run on-prem, and we need to speak to those customers.
B: We want them to be able to manage their jobs on-prem and maybe sometimes spill into the cloud, or have a multi-cloud story, or an on-prem-plus-cloud hybrid story that's really well thought out. So our thought here is: let's try to come up with a proposal that, again, should be open source, driven by the community, that addresses the requirements we mentioned before, and that plays to the strengths of both the cloud and on-prem. The cloud has a ton of capabilities.
B: Those capabilities are exposed through autoscaling, and autoscaling should be a central piece of the design of any job management controller. That's, I guess, how we look at it. Any questions on this quick related-work review?
D: A question, sorry. Having too much focus on the autoscaler kind of leaves out the people running it on-prem, right? I would say that if the autoscaler is going to come first and foremost, then we don't care about bare metal, where you have a fixed set of machines.
D: It should be something where we care about the autoscaler, but we also care about people that are going to have a fixed set, because I feel like this story is being told as if batch is for autoscaling on a queueing system. You keep repeating that the autoscaler is the most important thing that we need to integrate, and yet I see people on this call, from academia and universities, who have fixed-size systems.
B: Existing systems were designed for fixed clusters, and I'm trying to emphasize that this is changing: we need to take the cluster autoscaler into consideration in an environment where you have a ton of elasticity and flexibility, and where batch workloads are migrating from on-prem into the cloud. I did not mean to dismiss fixed clusters; part of why Batch on GKE didn't succeed, for example, is exactly what you mentioned, and that's why we want to start from the position that it needs to be open source. Autoscaled environments have never been top of mind for those systems, and I was trying to emphasize that point, but maybe I overdid it.
I: A question: there is queueing and there is batch. In this proposal, are we going to combine both of them together, or are they still going to be two different entities?
B: Sorry, I didn't get that. What do you mean by batch and queue?
I: I mean, a queue can apply on top of the normal Kubernetes scheduler, right? You have a queue and you can put it in front of the normal Kubernetes scheduler, which schedules things one at a time. Now, batch is scheduling things together. So are we going to combine this batch scheduler and the queueing capability, or are they going to be two different things?
B: That's a great question, and it touches one of the main design principles that we're carrying here, which is: don't reinvent the wheel. When you say kube, we're talking about the kube-scheduler, am I right? (That's right, yeah.) This is exactly the point of this slide: we don't want to have a second pod-to-node scheduler running.
B
Package,
we
should
just
reuse
that
same
thing
with
job
lifecycle
management.
I
don't
want
to
propose
a
new
job
api.
We
just
need
to
manage
the
existing
job
api
and
have
hooks
to
manage
custom
workloads,
custom
jobs,
build
that
cannot
basically
reuse
the
job
api,
but
we
don't
want
to
force
like
we
don't
want
to
introduce
a
new
api
for
creating
jobs.
Basically,
so
so
yeah
we're
not
doing
that,
and
the
advantage
is
that
we're
using
significant
existing
functionality,
we're
not
concerned
about
functionality,
divergence.
B
It
enforces
separation
of
concerns
in
a
sense
that,
like
the
control
that
we're
proposing,
is
not
going
to
do
auto
scaling,
it's
not
going
to
assign
parts
to
nodes,
it's
not
going
to
create
the
parts
of
a
job.
All
of
that
is
the
existing
components.
It
will
only
decide
when
the
job
should
start
and
using
kate's
native
scheduling
directors
like
node
affinity,
chance
integration
to
direct
the
job
to
the
place
where
it
should
run
based
on
existing
capacity
of
the
cluster.
F: Yeah, so first, I'm 100 percent with you on separation of concerns: scheduling is separate, and this is a job meta-scheduler. I agree with you on not reinventing the wheel. The question I have is about the job representation and job lifecycle management, because I want something that is general enough that I can have a Spark job or a Ray job or whatever kind of job, however complex it is; it may include multiple deployments, etc.
F: I want to be able to say: this is my job, and queue it as one entity. So I'm not sure if the current Kubernetes Job specification is general enough to accommodate all those types of jobs.
B: This is a great point, and I will address it in the next slide. We want to support both. Again, we have users whose journeys are simple: they just want to run a batch job with the Job API, and we're trying to fix the Job API. But let me finish this slide and go to the next one to address that point. We do acknowledge that there are some cautions or limitations in this approach.
B: It creates two layers of resource management, so we need to make sure that we address that point. We have multiple components involved in starting the job, which may add extra latency and, again, could make things harder. All of these are things we need to make sure we address in how we design the controller and the UX.
B: Yeah, as I mentioned, we're trying to fix the Job API. For example, you mentioned array jobs, or Indexed Jobs: we introduced Indexed Jobs in the v1 Job API, and we fixed completion status tracking.
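As a side note, a minimal sketch of what an Indexed Job looks like with the batch/v1 Go types; the Indexed completion mode and the injected JOB_COMPLETION_INDEX environment variable are real parts of the feature, while the image and counts here are made up:

```go
package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/pointer"
)

// indexedJob builds an "array job" of 10 completions; the job controller
// gives each pod a stable index via the JOB_COMPLETION_INDEX env var.
func indexedJob() *batchv1.Job {
	mode := batchv1.IndexedCompletion
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "array-job"},
		Spec: batchv1.JobSpec{
			CompletionMode: &mode,
			Completions:    pointer.Int32(10),
			Parallelism:    pointer.Int32(10),
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "worker",
						Image: "example.com/worker:latest", // hypothetical image
					}},
				},
			},
		},
	}
}
```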
B: Completion tracking was pretty much broken. Tracking was based on pod objects: if a pod completes, the pod object itself needs to continue to exist in the API server for the job to be tracked as complete. That did not work in environments where you have, for example, spot VMs: when a spot VM gets preempted, any pod on the API server that had a node name assigned to that node gets garbage collected, even if it had completed, so you basically lose progress in the job.
B: So we fixed that as well. We also introduced some new status fields, like tracking ready pods in the job status, which is required to implement TensorFlow and MPI jobs on top of the Job API. The point is that we are trying to improve the Job API to address the simple use cases and make it usable for implementing more complex workloads. But we do acknowledge that there will always be a percentage of workloads that will not be able to use the Job API; that's absolutely true.
B: That's why we have a concept like this in the resource model for Kueue: the concept of a QueuedWorkload. It's basically an abstract representation of any job in the queue, and the idea is that this QueuedWorkload object and API will serve as a proxy between the actual job, whether that's, for example, a Spark job as you mentioned, and what Kueue is queueing.
B: We also have the concept of a resource claim here. It's maybe a bit early to introduce, but it's an API that we'll be introducing to the cluster autoscaler to ask for resources, and this is what I meant by needing that native integration with autoscaling. It's not strictly necessary to have, for example, in on-prem environments.
B: The whole thing could still work without the resource claim, but in the cloud it will be quite powerful, because before starting the job we want to ask for resources: we communicate with the cluster autoscaler, the autoscaler tells us, okay, I have these resources in zone X or Y, and then we start the workload by injecting affinities into it, to send it to the resources that the cluster autoscaler provisioned for us. The last two concepts are maybe not completely surprising.
B: There is the queue, which is basically an organizing concept for grouping, managing, and reasoning about closely related jobs, and then there is the concept of a capacity, which defines how many resources exist.
B: For different tenants, we are reusing the namespace as the tenant concept, which seems to be a well-accepted concept in Kubernetes now. In this case, you would basically model your teams as namespaces. You would create queues for them; these queues are namespaced, and they point to a capacity. The capacity is a cluster-wide resource, so usually the cluster admin or the batch admin would be the one managing, and creating, the capacity and queue resources.
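A rough sketch of the shapes being described, with type and field names paraphrased from the talk rather than taken from a final API: the namespaced Queue is little more than a pointer to a cluster-scoped Capacity that holds the actual quota.

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Queue lives in a team's namespace and points at the capacity it draws from.
type Queue struct {
	metav1.ObjectMeta
	Spec QueueSpec
}

type QueueSpec struct {
	// Capacity names the cluster-scoped Capacity object backing this queue.
	Capacity string
}

// Capacity is cluster-scoped and owned by the cluster or batch admin.
type Capacity struct {
	metav1.ObjectMeta
	Spec CapacitySpec
}

type CapacitySpec struct {
	// RequestableResources is the total quota available to all queues that
	// point here, e.g. cpu: 1000, nvidia.com/gpu: 8.
	RequestableResources corev1.ResourceList
}
```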
B: These are the personas we're focused on. The batch user basically just runs and monitors jobs: they create the job, and the admin sets up the queues and capacities that decide when the job will start and how many resources exist for each tenant.
B: So this is a quick slide on the theory of operation. Sorry, I'm not paying attention to the questions in the chat; I hope someone like Aldo or Maciej is answering them, or please interrupt me if there's something I need to clarify more.
A: Just in the interest of time: we have around 15, maybe 20 minutes if we overrun a bit. If it's not a lot more, I would suggest we go through it and take questions at the end, but I don't know what people prefer; or we can just interrupt and see where we get.
B: I think after this slide the story will become a little bit clearer, and then I can show a couple of use cases and we can have questions. Here I'm just trying to show how this Kueue controller is going to work. As I mentioned, we are reusing a lot of existing functionality: the red boxes are existing controllers that are part of Kubernetes, and we're introducing a new one called Kueue. So, at time zero:
B: In the top left corner here you can see that the batch admin creates the queue and capacity resources, and they could have gatekeeper-type policies to select who can submit where. Then the batch user starts the job; let's assume here, again, that we're using the v1 Job API. They set the name of the queue where the job should be queued, and the job starts in a suspended mode.
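A minimal sketch of that first step, using the real batch/v1 suspend field; the queue-name annotation key is an assumption for illustration, since the exact key isn't spelled out here:

```go
package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/pointer"
)

// queuedJob is what the batch user submits: a plain v1 Job that names its
// queue and starts suspended, so the job controller creates no pods yet.
func queuedJob() *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "simulation",
			Namespace: "team-a",
			// Hypothetical annotation key; Kueue matches the job to a Queue by name.
			Annotations: map[string]string{"kueue.x-k8s.io/queue-name": "team-a-queue"},
		},
		Spec: batchv1.JobSpec{
			Suspend:     pointer.Bool(true), // Kueue flips this once capacity is granted
			Parallelism: pointer.Int32(4),
			Completions: pointer.Int32(4),
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers:    []corev1.Container{{Name: "main", Image: "example.com/sim:latest"}},
				},
			},
		},
	}
}
```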
B: We're going to have, for example, a webhook, and we're also discussing ways of setting policies within the Kubernetes community, so that we can enforce that jobs start suspended. While suspended, the job controller is not going to act on the job; it just ignores it.
B: The second step is that Kueue, which is watching these jobs, will assign them to a capacity. It will create a resource claim, if it has an integration with the cluster autoscaler, to understand where the resources are going to come from. Once the cluster autoscaler fulfills this resource claim, Kueue, which is watching it, will unsuspend the job. Once the job is unsuspended, the rest works the same as it does today: the job controller creates the pods.
B: So how do we do job-level scheduling? As I mentioned before, we use Kubernetes-native scheduling directives. Based on where the resources were allocated via the resource claim, Kueue will inject affinities, or even tolerations, into the job, to send it to a specific place.
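A small sketch of that hand-off for a plain batch/v1 Job, assuming the autoscaler came back with a zone; the function is made up, but the topology label and the suspend field are standard Kubernetes:

```go
package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	"k8s.io/utils/pointer"
)

// admit directs a suspended Job at the capacity the autoscaler provisioned
// (here, a zone) and then unsuspends it so the job controller creates pods.
func admit(job *batchv1.Job, zone string) {
	if job.Spec.Template.Spec.NodeSelector == nil {
		job.Spec.Template.Spec.NodeSelector = map[string]string{}
	}
	// Standard topology label; tolerations could be injected the same way.
	job.Spec.Template.Spec.NodeSelector["topology.kubernetes.io/zone"] = zone
	job.Spec.Suspend = pointer.Bool(false)
}
```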
B: For custom workloads, the idea is that once the custom workload is created, we would need a controller that understands it and translates it by creating a QueuedWorkload resource. I don't have the spec for the QueuedWorkload here, but it's basically how many resources you need.
B: It's basically a pod template and a count, maybe even an array of those, because you could have a driver and, for example, workers, like in a Spark job. Kueue would be aware of these QueuedWorkloads, watching them and assigning them to capacity, and then it would mark the workload as fulfilled, and the controller here would be the one that starts the custom workload.
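That description suggests a shape roughly like the following; again, these names paraphrase the talk and are not a final spec:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// QueuedWorkload is the proxy object a workload-specific controller creates
// so Kueue can queue a custom job without understanding its API.
type QueuedWorkload struct {
	metav1.ObjectMeta
	Spec QueuedWorkloadSpec
}

type QueuedWorkloadSpec struct {
	// QueueName is the namespaced Queue this workload waits in.
	QueueName string
	// PodSets lets a single workload carry several pod shapes,
	// e.g. one driver plus N workers for a Spark-style job.
	PodSets []PodSet
}

type PodSet struct {
	Count    int32
	Template corev1.PodTemplateSpec // used to compute the resources to reserve
}
```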
B: The main requirement is for custom workloads to support suspension. Basically, a workload needs the ability to start in a suspended mode, and there has to be a way for us to start it by setting suspend to false. This provides an agnostic way of deciding when the job can start and when it should stop, meaning preempted, for example.
B: We're discussing introducing a suspend sub-resource, similar to the scale sub-resource, if you're aware of it, which allows the HPA, Horizontal Pod Autoscaling, to work agnostically across different types of deployments. We are thinking of suspend the same way.
F: Yeah, just to make sure I understand: in this case, for example, if I'm interested in, let's say, Spark jobs that are started by a Spark custom resource, I need to make sure that whatever Spark controller there is implements the suspend API.
B: Yeah, you would need the top-level resource object that represents the job to have the ability to be told: okay, suspend, or resume. So this is the integration point, and we feel that this is a really small surface area of integration, relatively speaking. The complexity here is that we're at a point where Kubernetes is extremely flexible and allows you to build anything you want, and yet you want to manage all these types of custom resources.
B: So that's the design that we came up with, and the integration surface seems to us reasonably small. Hopefully we will not be proven wrong; we'll see how it works.
F: Yeah, I mean, we're putting a requirement on everybody: whoever is implementing the Spark controller, the Ray controller, whatever you name, the TensorFlow job, the PyTorch job, everybody has to implement that suspend interface. I guess that's the uphill battle here.
J: Hi, so it's not that uphill. I'm already a contributor to Kubeflow, so I can do this for the MPI operator, and, for example, I think Alex is here; he has discussed this already with the maintainers of the training operator, and they are fine with the idea. We just need to implement the change, at least for Kubeflow. I think this battle is pretty simple; it's not really a battle.
J: And I'm pretty sure we can work with other communities to integrate it. As Abdullah said, it is a simple field that doesn't require much thought.
G: I would also add that, ideally, that should not even be necessary: we would love for most of these tools to use the Job API, so that we can actually consolidate the base job lifecycle on the core API in Kubernetes.
G: Now, not all jobs fit; Spark is a good example that probably has some specific requirements that the Job API will not be able to meet. But we are at least looking into curating a stack-ranked list of all of the tools that do need to have this integration.
G: If we see that this gets traction with early adopters and first users, one of the elements of work, and help we could use, would be a targeted effort down that stack rank, starting with the likes of Airflow, Argo, and Kubeflow, in their various flavors, making sure that we have this integration everywhere.
B: The other thing here is that this idea helps in addressing scaling concerns: we don't want the pods to be created right from the beginning. That will help us scale. If you have hundreds of thousands of jobs being created and you want to queue them, you don't want all of them to create pods, leaving you to manage a million pods when only a tenth of them will actually execute at a time.
B: I feel this could also encourage a shift to a new design pattern that should be more scalable moving forward.
B: As you mentioned, we don't have a lot of time. The controller design is a different beast that we'll leave for another day, but the design document is there. We created a repo, and we have a proof of concept that we're planning to open source next week, so hopefully the community can start looking at it and helping us ship it and improve it.
B: I don't think we'll have time for the APIs; let's see if we have more questions.
D: A question on that. I've been thinking, after reading the proposal: adding a new controller to Kubernetes feels like a really heavy thing to do, right? Do you see this as something that can actually happen? I feel that for the last few years Kubernetes has focused on stability, and maybe on edge cases like telcos and all of that, and then this is adding a new controller.
D: It is a use case for a lot of people, but not for, I would say, eighty percent of Kubernetes use cases. So adding a new controller will make Kubernetes heavier. How does the Kubernetes community feel about this?
B: That's a very good question. We're starting as a subproject, not in core; we want to prove the case, to prove that this works. We are planning to integrate this with Kubernetes; that's why we're designing it such that it integrates with existing controllers, so that's one rock we're trying to avoid from the beginning. The other thing is that we have the kube-controller-manager, so it's not going to be a new binary, a new executable on its own.
B: It would just be another controller that gets created within the kube-controller-manager's set of reconcilers. That's also why we formed the Batch Working Group: to convince the community that there's conviction around these ideas, that there's momentum, that there's a new type of workload that we need to open Kubernetes up for. I can't tell you that it will happen, but we're trying, and we're making decisions right now that will hopefully help us make the case for having it in core Kubernetes in the future.
K: Sure, so we run HPC systems here at PNNL. I was curious how these queues are going to interact with each other. Usually we give each project a namespace, so they would kind of have their own queue in this API, but on our HPC systems we have queues where each of the projects submits their jobs, and they can see where they are in the overall view of the system queue.
K: So they know, hey, it's going to be two days before their job starts, or whatever, and all the projects' jobs are fairly scheduled across the different projects, so one project doesn't dominate the whole system. How does having separate queues at the namespace level work for that use case?
B: Queues are simply, if you look at the API, a pointer to where the actual capacity is. Having the queue be namespaced solves a couple of problems. One is discoverability: users usually only have access to their namespace, where they can list the queues. The quota lives in the Capacity API, which is not a namespaced object, and it's something that multiple queues can point to. Even if you have multiple namespaces, you can group them using labels and say, okay, all of them point to the same capacity and share the same quota.
B: The other thing that the queue being namespaced helps us with is a case that someone brought up while discussing this in open source. Consider the case where a user wants to run an experiment, say thousands of jobs, but they don't want it to use more than, for example, 8 GPUs, for whatever reason. The queue gives users themselves the ability to create a queue in their own namespace and set limits within it.
B: Those queue limits don't give you any promise on whether you will get the capacity or not, but they cap the maximum amount of resources that your experiment is going to use. So I would imagine users creating a queue even when running large-scale experiments themselves, and setting those limits for that specific experiment.
K: So jobs are assigned to queues, but there's kind of a scheduler-level queue that aggregates all of the Queue API objects into the capacity, and it's looking at when the various jobs were submitted and still scheduling them evenly?
B
Exactly
yeah
like
at
the
end
of
the
day,
your
actual
key
is
going
to
be
the
capacity
where,
like
basically
we're,
gonna
decide,
okay,
which
one
is
going
to
get
executed
first
or
not
like
they
will
all
be
basically
dependent
on
the
capacity.
B: Does it open okay? So we will have everything there: this is the repo, we will upload the code there, and we'll have the links to the design documents and the API, etc. If you have specific suggestions, please create issues to help us better shape this project. Right now it's just a template, there's nothing in the repo, but we should upload something this week or next week.
A: And I guess people can have a look at the proposal in the Google Doc as well and put as much input as they can there. There was a lot of discussion going on while we had some time, but I see that people have a lot more feedback to give, so I think we can interact there; there's also the mailing list.
A: I also linked that in the agenda, so I suggest everyone checks those links. And let's say we sync again in a couple of months; we'll make sure there's a slot for this. It's been great, it's been super nice.
A: I saw a couple of new faces, a first-timer as well, so I hope you're here again in two weeks so that we can have a proper introduction. Otherwise, does anyone else have anything else to raise?
A
If
not
like,
thanks
again
to
abdullah
mache
canaldo
for
for
the
really
nice
presentation
and
we
meet
again
in
two
weeks
march,
2nd
in
principle,
the
topic
will
be
air
gap
solutions
and
we
stick
with
the
topic
for
now,
but
we'll
we'll
send
the
reminders
as
usual.