From YouTube: Kubernetes SIG Apps 20220207
A
Okay, welcome everyone to the SIG Apps meeting. I'm your host today, Janet, and I have our co-host Maciej here with us, and we have one item on the agenda, which is the proposal for job queueing.
B
All right, yeah, thank you for having me. So this is a proposal for Kubernetes-native job queuing. What I'm trying to get from this presentation today is mostly awareness; it's very early stages in the proposal. We have a detailed proposal in the document.
B
I would love any feedback that we can get on these APIs, our assumptions about how it could operate, and the use cases that we're covering. The proposal has been public for a month now.
B
We got a lot of feedback already, but I thought it's time to go around the relevant SIGs to present it, and then, once the batch working group gets approved, we will hopefully be presenting it there and trying to push it forward within the working group, again in collaboration with the community. So the agenda: I will discuss the problem, the proposal, the APIs and example use cases.
B
So what is a job? What we're trying to do here is focus on the types of computations that run to completion. Basically, they start and then they finish; they don't continue to run forever like services. We classify them at a high level into two categories: there are jobs whose multiple tasks run independently, like Monte Carlo simulations, and there are types of jobs that run collaboratively, where each job has a bunch of tasks that communicate with each other
B
to perform a specific functionality. Examples include MPI, even reinforcement learning, and whatnot. One important characteristic of batch jobs is that they are often flexible on multiple dimensions. Sometimes they are flexible on time: they could run now or, if the resources are not available, they could wait and run a little bit in the future. Sometimes they're flexible on location.
B
They are also sometimes flexible on the type of resources. For example, they could run on any GPU model: if an A100 is available, they could run on it; if it is not, they could run on, for example, a K80 NVIDIA GPU.
B
Another example is that they could run on standard, on-demand VMs, but if there is no quota available, they could run on Spot, because Spot VMs are cheaper.
B
So what is the job queuing that we're bringing up here? At a high level, what we're trying to propose is mechanisms to manage access to a limited pool of resources shared by multiple tenants. There are multiple tenants; they share a cluster with resources. Those resources could be autoscaled or not; it doesn't matter.
B
We want a mechanism by which those tenants are able to create these jobs, and those jobs will wait until they get a chance to run based on several criteria: the availability of resources, priority, or the type of resources that they actually want becoming available.
B
And so we need a mechanism that allows the management of the jobs at the job level, not pod management but job management. So why do you need job queueing? Again, in many cases tenants have lots of jobs but a limited amount of resources, and so you want a place where the tenant is able to create a job and just forget about it; it runs when resources become available and whatnot. On-prem, clusters are usually small in scale and static.
B
You can't autoscale them, and so it's obvious that you would need queuing there. On the cloud it's a little bit less obvious, but you still have limitations. For example, some customers want to utilize discounts. Basically, they purchase a committed use discount: they buy from the cloud provider a specific amount of resources at a discount, and they don't want to use beyond those resources because only those are discounted. For example, they buy 1,000 cores per month, and that's it.
B
They don't want to use more than that, and they want their jobs to only use that amount of resources. Also, some customers have spend limits: to avoid cost overruns, they have a specific dollar budget they don't want to go over. You also have per-tenant limits (some tenants, for example, are higher priority than others), and then there are also cluster size limits, just scalability. The cloud is sometimes thought of as this utopia of infinite scalability, but in practice that's actually not true.
B
Sounds good. So what do users actually want from job queueing? Obviously they want queuing: they want to be able to create jobs and have those jobs wait until capacity becomes available. They want the ability to express execution order: a higher-priority job should run before lower-priority jobs.
B
Third, they want to be able to specify budgeting, the ability to say, okay, I want to use this amount of cores per week, for example; the ability to set policies where you're able to say who can use what and to what limit; and, last but not least, flexible placement across resource types, location and time. So why are we talking about a new controller? Why, for example, is plain Kubernetes not good enough?
B
Well, the reason is that plain Kubernetes does not manage jobs at the job level. For any job that you create, whether it's a custom job or a v1 Job, the control plane will continuously attempt to start the workload: basically it creates the pods, and the scheduler continuously tries to schedule those pods. There is no concept of fair sharing. If there are no available resources, it will basically just keep trying to schedule those pods.
B
They will not be able to schedule, and all the controllers would continue to chug along doing nothing, basically. The other thing is that Kubernetes quotas are not really suitable: they are enforced at resource creation. They are not meant to do resource queuing; they are meant to provide a mechanism that protects the cluster from collapsing by setting limits on how many resources you can create in a cluster. So there are other open source projects; the most prominent one is Volcano.
B
The problem with Volcano that we see is that it reinvents scheduling and it reinvents job lifecycle management; they have their own definition of a job API. It lacks proper integration with the cluster autoscaler. Its production readiness is questionable: it's still in alpha, and I don't know for how long. And it has been led by a single company so far, so it doesn't have contributor diversity. It did not really get the traction to become a core community project.
B
We also had a project in GKE called Batch on GKE. This is a decommissioned project; the API, I think, is still available in the Google repo. It has similar problems as Volcano: it reinvented scheduling (it is a scheduler by itself), and it has its own APIs. The other thing is, it was initially closed source, and it was hard to meet customers' requirements for portability.
B
So what are we proposing here? We're proposing a new controller, called Kueue, with the following design principles. First, don't reinvent the wheel: no re-implementation of cluster autoscaling, pod-to-node scheduling, or job lifecycle management.
B
It should have native support for the batch/v1 Job API, as well as the ability to hook in custom workloads like MPI jobs. The advantage of applying this principle is that we reuse significant existing functionality. For example, if there are new scheduling features, we don't need to continuously chase what the core Kubernetes community is doing.
B
It gives a simpler customer adoption path: they could start by using, for example, the v1 Job, and then, if they become big enough and require queueing, voila, that's it: they install Kueue, and then they are able to set up their cluster such that jobs are queued, with quotas and whatnot.
B
There are no concerns about functionality divergence, as I mentioned, related to new functions added to the cluster autoscaler or the scheduler, and it also enforces separation of concerns. Again, it is not going to be a pod-to-node scheduler or a provisioner; it's just a job-level manager.
B
There are a number of concerns related to this model. The first is that it creates two layers of resource management: there's the job manager that does resource management at the job level, and then you have the scheduler, which manages resources at the pod level. We will discuss how we can resolve that. There are also multiple components involved in starting a job, and so it adds extra latency, but we believe that
B
that is acceptable in the grand scheme of things, where you have a lot of jobs running and you have this huge pipeline of jobs that continue to finish and new jobs starting. Before we get into the high-level design,
B
we want to mention here that we have tried to improve the v1 Job API. We fixed and extended the existing API: we fixed completion status tracking, which had prevented users from using spot VMs with the Job API; we introduced Indexed Job; we added suspended Jobs and tracking of ready pods in the Job status; and many other features, trying to continue to improve the v1 Job API.
B
We're also working on extending that in 1.24 and 1.25 with new features related to non-retriable container exit codes, and on defining a roadmap for scalable batch: v1 Jobs that run millions of completions with thousands of pods running in parallel.
B
So here's our resource model. Before we get to the design, I want to describe the resource model we're proposing. We're introducing three new resource types. The first one is called QueuedWorkload, which basically abstracts the representation of any job in Kueue. As I mentioned, we want to support the v1 Job, and Kueue will support that natively, but we also want hooks to support MPIJob or the TF operator or the Spark operator. Those jobs have their own controllers; they create the pods themselves, etc.
B
They don't model their workloads as a v1 Job. We tried to convince the community to do that, but there are always limitations to using the v1 Job, and so we still needed a way to abstract the queueing of a job inside Kueue. So we're proposing a new resource called QueuedWorkload, where you represent the amount of resources needed for a job.
B
We also have a new concept called ResourceClaim. Again, all of these are proposals. Think of it as an API to ask for resources from the cluster autoscaler, and it's not necessarily the cluster autoscaler; it could be an on-prem resource provisioner, anyone basically. We create a ResourceClaim to say: I want this amount of resources. Then the cluster autoscaler would provision them and mark the ResourceClaim as satisfied, based on which we can use the resources. We also have two main resources related to managing queues.
B
The first one is, surprise, it's Queue, which is a namespaced resource. It's an organizing concept for grouping, managing and reasoning about closely related tenant jobs. The other one is Capacity, which governs a pool of resources and defines cluster-wide usage limits and the boundaries of fair sharing; this is a cluster-scoped resource. So here I show the dependency
B
model. Basically, a batch workload would create a QueuedWorkload, which the Queue would be watching for, start, and map to a Capacity. The QueuedWorkload could be associated with a ResourceClaim resource that gets created to provision resources before actually starting the job.
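To make the shape of these objects concrete, here is a minimal Go sketch of the resource model described above. All type and field names are assumptions drawn from the talk, not the final proposed API.

```go
// Package batchqueue is an illustrative sketch of the resource model
// described in the talk; names and fields are assumptions, not the
// actual proposed API.
package batchqueue

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// QueuedWorkload represents any job waiting in a queue, whether it is
// a batch/v1 Job or a custom workload such as an MPIJob.
type QueuedWorkload struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec QueuedWorkloadSpec `json:"spec"`
}

type QueuedWorkloadSpec struct {
	// QueueName is the namespaced Queue the workload was submitted to.
	QueueName string `json:"queueName"`
	// Requests is the total amount of resources the job needs to start.
	Requests corev1.ResourceList `json:"requests"`
}

// Queue is a namespaced handle that tenants submit jobs to; it only
// points at the cluster-scoped Capacity that resources are drawn from.
type Queue struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec QueueSpec `json:"spec"`
}

type QueueSpec struct {
	Capacity string `json:"capacity"`
}

// Capacity is cluster scoped; it governs a pool of resources, defines
// usage limits per resource flavor, and sets fair-sharing boundaries.
type Capacity struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec CapacitySpec `json:"spec"`
}

type CapacitySpec struct {
	// Cohort groups Capacities that may borrow unused resources from
	// each other; Weight controls how borrowed capacity is shared.
	Cohort string `json:"cohort,omitempty"`
	Weight int32  `json:"weight,omitempty"`
	// Resources lists quotas per resource name and flavor, where a
	// flavor (for example spot versus on-demand) is identified by
	// node labels.
	Resources []Resource `json:"resources"`
}

type Resource struct {
	Name    corev1.ResourceName `json:"name"`
	Flavors []Flavor            `json:"flavors"`
}

type Flavor struct {
	Labels map[string]string `json:"labels,omitempty"`
	Quota  resource.Quantity `json:"quota"`
}
```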
B
We're thinking about two high-level personas for Kueue. The first one is the batch admin; that's the one that sets up the tenants. Basically, a tenant here is a namespace.
B
So how does this whole thing work? I mentioned that we're not looking at recreating anything; we're just going to add a new controller, called Kueue, that does job management, and in this flow I'm going to describe how it works. At the top right corner we have the user creating, for example (let's start with the v1 Job; we're not looking at custom workloads for the moment), so they create the v1 Job.
B
They assign a queue name; for example, at the beginning, let's use an annotation, so they can express that through an annotation. And then we would have a webhook that forces every job that's created to be created in a suspended state.
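For illustration, the batch user's side of this flow could look like the following Go sketch: an ordinary batch/v1 Job carrying a queue name in an annotation and created with spec.suspend set to true. The annotation key is an assumption made up for this example; the suspend field is the existing batch/v1 field.

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func boolPtr(b bool) *bool    { return &b }
func int32Ptr(i int32) *int32 { return &i }

func main() {
	job := batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "sample",
			Namespace: "tenant-a1",
			Annotations: map[string]string{
				// Assumed annotation key; the real key may differ.
				"scheduling.x-k8s.io/queue-name": "main",
			},
		},
		Spec: batchv1.JobSpec{
			// Created suspended; the queue controller flips this to
			// false once the job is admitted and capacity exists.
			Suspend:     boolPtr(true),
			Parallelism: int32Ptr(3),
			Completions: int32Ptr(3),
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "worker",
						Image: "busybox",
						Args:  []string{"sleep", "60"},
						Resources: corev1.ResourceRequirements{
							Requests: corev1.ResourceList{
								corev1.ResourceCPU: resource.MustParse("1"),
							},
						},
					}},
				},
			},
		},
	}
	fmt.Println("would submit job:", job.Namespace+"/"+job.Name)
}
```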
B
The pods will be scheduled by the kube-scheduler, and basically that's it, once the job controller creates them.
B
Are there any questions about this model, or is there anything that is not clear or ambiguous?
B
As I mentioned, we have this API called ResourceClaim: Kueue creates a ResourceClaim object that the cluster autoscaler watches for; it provisions resources based on the claim spec and then marks it as fulfilled when the resources become available, after which Kueue will be able to unsuspend the job. This flow is a key enabler for location flexibility: Kueue or the scheduler does not otherwise know about future resources that will be scaled up by the cluster autoscaler.
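A rough Go sketch of what such a claim object could look like follows; the type and fields here are assumptions based on the talk, not an existing Kubernetes API.

```go
// Package batchqueue sketches the provisioning-request API described
// in the talk; the names and fields are assumptions, not an existing
// Kubernetes API.
package batchqueue

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ResourceClaim asks a provisioner (for example the cluster
// autoscaler) to make capacity available before a job is unsuspended.
type ResourceClaim struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ResourceClaimSpec   `json:"spec"`
	Status ResourceClaimStatus `json:"status,omitempty"`
}

type ResourceClaimSpec struct {
	// Requests is the capacity being asked for; the provisioner may
	// satisfy it from existing nodes or by scaling up.
	Requests corev1.ResourceList `json:"requests"`
	// NodeSelector constrains where the capacity should exist, for
	// example pinning a tightly coupled job to a single zone.
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`
}

type ResourceClaimStatus struct {
	// Conditions carries a Fulfilled condition that the queue
	// controller waits for before unsuspending the job.
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}
```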
B
It also handles stockouts. In the cloud, as you may know, we run into these stockout problems, and so, if the job was created to run in a single zone and the autoscaler sees a stockout, it will basically scale up a zone where resources are available. It also allows us to handle both autoscaled and non-autoscaled environments.
B
The cluster autoscaler has an integration where it basically imports the scheduler code, and if there is existing capacity in the cluster, it will be able to fulfill the resource claim from existing capacity. So a resource claim does not mean that there will be newly provisioned capacity;
B
it basically means that the cluster as a whole has resources to fulfill this claim. And so we could, for example, deploy the cluster autoscaler in an on-prem cluster and have it do a similar job to what it's doing on the cloud, modulo the autoscaling part.
B
This is what Kueue will be watching for. What the QueuedWorkload will include is basically a reference to the actual workload, and the actual workload should support suspend and resume of the workload itself. The idea here is that we would introduce that as a sub-resource, similar to the scale sub-resource, so that Kueue would be able to suspend and unsuspend any workload without actually knowing the details of that CRD.
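No such sub-resource exists today; as a sketch of the idea, here is what the hook could look like in Go for the one built-in case, batch/v1 Job, where suspension is simply the spec.suspend field. The Suspender interface is an assumption; the client-go calls are the real typed client API.

```go
// Package suspend sketches the generic suspend/resume hook described
// in the talk. The Suspender interface is an assumption; the batch/v1
// Job implementation below uses the existing spec.suspend field.
package suspend

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Suspender is what a queue controller would need from any workload
// kind: a way to pause and resume it without knowing its full schema.
type Suspender interface {
	SetSuspended(ctx context.Context, namespace, name string, suspended bool) error
}

// jobSuspender implements Suspender for batch/v1 Jobs.
type jobSuspender struct {
	client kubernetes.Interface
}

func (s jobSuspender) SetSuspended(ctx context.Context, namespace, name string, suspended bool) error {
	jobs := s.client.BatchV1().Jobs(namespace)
	job, err := jobs.Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	job.Spec.Suspend = &suspended
	_, err = jobs.Update(ctx, job, metav1.UpdateOptions{})
	return err
}
```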
C
We don't have any additional topics, so it depends on how far you want to go. The overall call runs for an additional 30 minutes, so if you want to leave some time for questions, I think that'll be reasonable.
B
Yeah, I'll start finishing in 10 minutes. So, as I mentioned, first we have the Job API. Everybody knows the Job API that we have right now; again, we're not adding anything to it other than the ability to specify the queue name. Hopefully we will have this as part of the spec in the future. So the batch user journey is going to be extremely simple.
B
The only thing that you need to do is add the queue name where the job should be queued to get the resources. The Queue API is also simple. It's basically, oh sorry,
B
so the Queue API, again, is a namespaced resource. It only points to the Capacity from which the resources need to be allocated. The nice thing about this being a namespaced resource is that the namespace is the most natural way for users to discover which queues they can submit
B
jobs to. And so creating a queue could be something that only the batch admin can do: they create the queues in the namespaces of the tenants, and then the tenants discover their queues basically by listing whatever queues are available in the namespace.
B
We're also planning to add some limits here at the per-queue level, but I'll leave that to the detailed document, where we discuss more features related to the Queue API itself.
B
You could have, for example (here we call them something like c2-standard) on-demand resources, and then you have Spot, and the way that you would be able to specify that this one is standard and this one is Spot is by using labels and taints, the same way that we're labeling nodes today, just to differentiate between Spot VMs and standard VMs.
B
At the top there are two fields, called the borrowing cohort and the borrowing weight. Those represent knobs to control fair sharing. Think of each Capacity as a pool of resources; you could have multiple capacities within a cluster, each representing a pool of resources for, for example, a business unit. The borrowing cohort is basically an identifier that defines a group of capacities that can share unused resources between each other.
B
The weight basically determines, for each capacity, what its priority is in utilizing unused capacity. For example, if you have capacities A, B and C, and A and B are using their full capacity while C is using half of it, that unused half will basically be split between A and B based on the weight.
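The arithmetic behind that example is simple; the following small Go sketch shows the weighted split (a simplification for illustration, not the proposal's exact algorithm, and the 50 unused cores and weights below are made-up numbers).

```go
package main

import "fmt"

// splitUnused divides capacity left unused by one pool among the
// borrowers in its cohort, in proportion to their borrowing weights.
func splitUnused(unused float64, weights map[string]float64) map[string]float64 {
	var total float64
	for _, w := range weights {
		total += w
	}
	shares := make(map[string]float64, len(weights))
	for name, w := range weights {
		shares[name] = unused * w / total
	}
	return shares
}

func main() {
	// Example: C leaves 50 cores unused; A and B are both full and
	// have borrowing weights 2 and 1, so A may borrow about 33 cores
	// and B about 17.
	fmt.Println(splitUnused(50, map[string]float64{"A": 2, "B": 1}))
}
```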
B
This is a typical model; you can see it, for example, in Slurm or YARN, with slightly different ways of controlling it: sometimes they use weights, sometimes they use explicit ceilings on how much you can use from others.
B
Now, when Kueue starts a job by assigning it, for example, a specific amount of resources from a specific flavor, it needs to make sure that the job will actually use that flavor, and the way that we're doing it is by converting these labels into affinities that we inject on the job. In the v1 Job API this is possible because we now have mutable scheduling primitives on the pod template. For example, as you see here, we have two types of CPUs.
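A minimal Go sketch of that injection step follows, assuming the flavor is identified by a single node label; the label key and value are made up for illustration.

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
)

// assignFlavor pins a still-suspended Job to the flavor it was
// admitted with by copying the flavor's node labels into the pod
// template's node selector, then unsuspends it. This relies on
// scheduling directives of a suspended Job's template being mutable.
func assignFlavor(job *batchv1.Job, flavorLabels map[string]string) {
	podSpec := &job.Spec.Template.Spec
	if podSpec.NodeSelector == nil {
		podSpec.NodeSelector = map[string]string{}
	}
	for k, v := range flavorLabels {
		podSpec.NodeSelector[k] = v
	}
	suspend := false
	job.Spec.Suspend = &suspend // the job controller now creates the pods
}

func main() {
	job := &batchv1.Job{}
	// Hypothetical label marking the spot flavor.
	assignFlavor(job, map[string]string{"node.example.com/vm-type": "spot"})
	fmt.Println(job.Spec.Template.Spec.NodeSelector)
}
```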
B
I've got a bunch of use cases here that I'm going to present through the actual design document; I'm just going to show one example and end the presentation here. So again, here you could have multiple namespaces, each namespace representing a tenant: tenant A could have two tenants, tenant a1 and tenant a2.
B
You could represent them using labels, for example, if those two tenants belong to the same high-level group. Again, the queues are namespaced resources. You create these queues and they point to their own Capacity; notice that tenant a1 and tenant a2 point to the same tenant-a Capacity.
B
So that is the simplest presentation to address the use case where the admin wants to set quotas per tenant for how many resources they can use, and the resources could be CPU, memory, GPUs, whatever scalar resource the scheduler is aware of. An extension to this, as I mentioned, is that
B
if the admin sets the cohort as a borrowing cohort, it basically means that tenant A and tenant B can borrow and use resources from each other. If tenant A, for example, used all the resources available to it, it can borrow the unused resources of tenant B. And then we're planning other knobs, for example, to say what should happen if tenant B starts ramping up:
B
should we preempt the tenant A jobs that are borrowing resources, or should we just wait until they finish and then the resources get reclaimed, etc.? Those are all part of the discussion that we're having on the document.
B
So that's it; sorry, it was longer than I expected. There are a ton of details; really, what I wanted to do here is bring more attention to this proposal that we are working on.
B
Just to summarize again: there is a proof of concept in progress that we're planning to open source soon, and that we're hoping to iterate on with the community through the batch working group, to refine the APIs and make sure that we cover all the use cases that the community has in mind.
B
It will likely be open source as well, or at least every API is open source. We'll try to standardize a way to request resources from autoscalers. To make the experience for admins easy, we're planning a kubectl extension for creating queues and looking at queues and capacities and whatnot, and our north star is to make it part of the Kubernetes core APIs. Our approach here is to make sure that it works well with all the existing functionality that core Kubernetes has; we don't want to re-implement anything.
B
We want it to just work with the existing architecture and controllers.
B
With that, yeah, thank you for listening. I'm happy to answer any questions, and if there was anything unclear, please take a look at the document; there are a lot more details there that can probably give better context than I gave.
D
I had a question about one thing that wasn't clear to me: do you want to do all of this work in-tree, or were you looking at doing it as CRDs and then bringing it in?
B
Yeah, we're looking to build the proof of concept as a subproject. As I was discussing with Tammy, a proof of concept is worth a hundred KEPs: show that it works, show that it scales, show that the APIs are actually able to satisfy the community's use cases reasonably well, and then we can start the process of trying to make it available in core Kubernetes. Again, as I mentioned, that's the north star, but it's not where we are starting.
D
Okay, I guess here's the thing, right: you called out Volcano and said that it's basically a single-vendor solution that's being driven by one company, and that it hasn't gotten the kind of multi-company community support that the rest of the ecosystem has. By introducing this, isn't it just creating additional fragmentation? Isn't it one more thing that's also supported primarily by Google?
D
How do we avoid that? I guess that is my question, right? If your goal is to unify the community around a standard framework for doing batch workloads on Kubernetes that adds value over what batch/v1 does today, that sounds like a great thing to me, but the plan for community engagement is kind of the hard part here, right? So how do you think we could do that?
B
Right, that's a great question. So I think, as you already noted, we are starting from the core Kubernetes APIs. That is something that the community has already grouped around, right: the scheduler, the Job API, etc. So we want to meet the community where they are.
B
The document has already gone through some revisions based on the community's feedback; for example, moving the Queue from a cluster-wide API to a namespaced one, because we wanted to address specific use cases that we didn't have in mind. We are starting the batch working group in the same vein, and that's why we reached out to SIG Apps, for example, to Maciej, and to SIG Node,
B
to SIG Scheduling, and to other scheduling folks; Wei, for example, he's with Apple right now. It is, like, how do I say it?
B
The main success criterion here is to make sure that we're presenting this and we have some community agreement around it before actually taking it to core Kubernetes and proposing it as a KEP. We don't want to say, okay, this is the thing, it is working,
B
you have to adopt it the way it is, and then force it into core Kubernetes by creating KEPs and basically pushing what we've already developed and made available in production onto the community. Again, this is not something that is in production in Google or GKE. It is a very early PoC, and we're hoping that the community will help us shape it so that it works for their use cases as well.
B
We want things to work with the rest of the Kubernetes controllers seamlessly, and so we believe that this could be an enabler, basically. And yeah, that's, I guess, my thinking here.
D
So the other thing is, you said that Volcano was in alpha status, but my understanding is that it's been in beta for a while now, and that it's not really an early-adopter thing: it's used in production fairly broadly, at various large companies across the industry.
D
So I mean, is there really that much of a concern, in terms of building another thing for batch scheduling, around the stability, future, or, you know, scalability of Volcano?
B
Yeah, I mean, it doesn't have proper integration. It re-implements the scheduler, so it's basically a second scheduler that you're deploying on the cluster, and so it's really not easy to use if you have a cluster that runs multiple types of workloads. For example, it doesn't integrate with the cluster autoscaler.
B
You know, the cluster autoscaler, for example, imports the scheduler code, and so they need to be in sync to be able to make proper decisions. Maybe you can say this is a little bit tangential, but still, the cluster autoscaler is under SIG Autoscaling; it's part of the overall Kubernetes project.
B
My concern with Volcano, again, is this: I'm sure that if people put in the time and effort, they will make it production-ready; I have no doubt about that. I don't think that the model itself is not scalable; I don't think the model itself is non-production-grade. But my concern is that it re-implements things, so it's going to face issues with existing functionality; it's harder to adopt, from my perspective.
C
So I'm thinking, and it's somewhat related to what Ken was basically asking, maybe we should, or we could... There are two parts to my question. One is how we can figure out and make the job extensions, because you did mention that, ideally, the queue would be part of the native batch/v1 Job.
C
Maybe the question is: can we shape the queue and batch work in such a way that, I don't know, maybe your proposal is one possible implementation, and we focus on creating interfaces which we could embed in Kubernetes, with the implementation details left open so that everyone can pick their own, or
C
reuse their current solutions. If there are people who are using Volcano, they should easily be able to plug it into that solution. Similarly, folks using, I don't know, TensorFlow, if they could easily plug in to that solution. That's something that I was thinking while listening.
B
Yeah, kube-batch is Volcano, in a sense.
B
And kube-batch, the one that's in kubernetes-sigs, I think, is discontinued; I don't think there are any commits to it. It has moved to be part of the Volcano package, and the development continued to be done there. It is the core scheduler for Volcano. Volcano also includes, you know, a job controller, right? So that's the Volcano project: there are multiple things in it. One is the kube-batch scheduler, and then there is their own job API as well.
C
My approach, basically building on Ken's question, is to ask what we can do towards making this generic, because there is the concern of fragmentation already, and people have already started building their own solutions that are very targeted at their specific use cases.
C
Basically, when we were discussing, several years ago, the idea of adding workflow support: after the initial part, where we were done with jobs and scheduling jobs, we started building the next layer on top of jobs, which was basically some kind of workflow DAG support or whatever, and after the initial brainstorming and a couple of meetings we even had a proposal that was actually merged into Kubernetes.
C
So I'm worried about something similar with this kind of approach. I'm not saying that it's bad; I'm asking how we could maybe try to generalize this if we want to include it as a default component eventually. It would be nice if we could gather all the interested parties' input and somehow provide a generic solution, and then everyone could inject their own additional functionality and build on top of that. Maybe that would be something much more valuable for everyone, right?
B
So that is why we're taking this approach, one step at a time. We're starting with a proof of concept as a sub-project; we're starting with the APIs being available to the public from the very early stages, explicitly mentioning the whole set of use cases that we have in mind and are targeting. My plan is to present this to the research user group in CNCF, which is, I think, where Volcano is mostly discussed, so we're presenting this there as well. We have KubeCon;
B
I proposed a presentation about this at KubeCon, and hopefully it will get accepted. I'll try to get as much exposure as I can at the very early stages of this work, to make sure either that it is extensible enough, or that we get to the point where it is not extensible and it's not going to address specific use cases, and then the decision would be: okay, this could continue as a sub-project, and it's hard to make it a core Kubernetes API, which is what I would like it to become.
C
There's nothing stopping us from extending the current built-in types with those atoms that allow those external products, external projects, to leverage the built-ins as much as possible. The question is: what are those missing bits that we need to decide on?
B
And this is what we tried to do at the beginning, if you notice, since our engagement with SIG Apps a year ago: okay, we figured Indexed Job is generic; it's going to solve a lot and open Kubernetes to more types of batch workloads. The idea of suspend, for example, just controlling jobs at a higher level, maybe trying to represent that as a sub-resource. Again, we're trying to add these abstractions as we go, and it's a discovery process,
B
to be honest. I mean, the community has solved this for services and continues to do that: there are many ways of creating services in Kubernetes right now, built at the very low level on the Service API, and they got to the next, you know, v2, I guess, after a while, after experiments with the native APIs, et cetera. We're trying to follow a similar path of some sort: I do want to depend on core Kubernetes APIs; I want to build on that.
D
My question is: how realistic is that, given the type of workloads that you want to support? Like MPI: that's an HPC workload, there's nothing simple about that, and even supporting that on top of commodity cloud can be extremely complicated, given that, I mean, usually, if you're doing this at scale, you're using InfiniBand, right?
D
Ethernet is something you can't use, right? So I'm not sure. If someone wants to run MPI on Kubernetes, okay, that makes sense, but they're not necessarily the typical user, right? It is a target community, and it's an important community for Kubernetes, I would say, because HPC workloads do get a lot of core-hours, right. But the other thing is,
D
it's not a super simple workload. So reducing the barrier to entry for people who want to run more complicated workloads on top of Kubernetes has, I think, always been a goal, and maybe you feel this is the right abstraction for them to do that, but I don't think the goal would be for a generalist software engineer who works in product development at a tech company to come in and just turn up MPI.
B
On the same rack, you will be able to do this via Kueue, because we could extend the QueuedWorkload to have these provisioning requirements that can be relayed to the cluster autoscaler, to provision that for you and then send the job there. So these mechanics will be enabled by Kueue, and that's where you simplify a significant part of that journey.
D
So, but I guess, I mean, the thing is, when I'm looking at the workloads that are already supported by Volcano, or at least that they claim are well supported, they are MPI, TensorFlow, Kubeflow, right? So I mean, if your thesis is that the abstraction there is not good because it doesn't leverage native Kubernetes, or that we could do better there, you know, I definitely support that conclusion, or at least I don't reject that conclusion, right?
D
Like, you know, I'm not a customer that uses that personally, so I don't know how bad that friction actually is, but if you think it's bad, I mean, I definitely wouldn't push back on taking that to the working group and saying, okay, well, maybe, as Maciej suggested, what we can do here is build a better set of abstractions leveraging native Kubernetes constructs that we can all use throughout the community, and if you want to use Volcano, that's great.
D
If you want to use our primitives, that's great too. That would be a good way to probably foster collaboration and adoption, as opposed to creating fragmentation, too. Yeah.
D
It is, but you have to look at it from the scheduler-side perspective, because, you know, one of the things, prior to Volcano and kube-batch, that people were talking about was: maybe we should modify the internal Kubernetes scheduler to make it more batch friendly. But the reason they allowed for multiple schedulers (like Mesos had multiple schedulers) was because it was meant to be hierarchical: you had a resource orchestrator, and then you had a specific scheduler per workload.
D
Kubernetes definitely wanted to be more user friendly, so the default scheduler works very well for most workloads that people run, but for complicated batch workloads it's not great. So the decision was made: we're not going to try to build that into the default scheduler; we would rather, for more complicated workloads, have a second scheduler that just does that well, right? So I mean, it wasn't an accident, it was somewhat intentional. But if your thesis is that you think you can achieve very good results leveraging the default scheduler without modifications...
B
Exactly, exactly. What I'm saying here, what I think here, is that we could have two levels of scheduling: okay, there's a pod-to-node scheduler, and there's a job scheduler. What we're trying to build here with Kueue is a job-level scheduler: it basically decides for the whole job where it should land, and it uses the low-level scheduling primitives to influence where that job is going to land, by injecting the affinities and whatnot.
B
So I think that this model is going to be more friendly and easier to use, and more extensible, because again you build on existing features that Kubernetes has added to the kube-scheduler. The problem with multiple schedulers, again, is that there are a number of issues there with race conditions and with integration with autoscalers that are extremely hard to solve and hard to debug, and I feel this will actually reasonably solve a good chunk of use cases for batch workloads:
B
this high-level model of having a controller that manages the whole job, and having APIs to request resources before starting these jobs and then sending these jobs to those provisioned resources.
B
I can't force Volcano to adopt anything we propose, but what I can do is try to make the argument for what we have here, to (how do I say it) find areas of collaboration, or where we can reduce that fragmentation.
B
Again, I don't know how popular Volcano is. It's certainly, from my perspective, from our perspective, not popular in Google Cloud. I don't know about other clouds, but this is my only data point: I can look at what our customers are doing, and they're not using it. I don't know about others; maybe there are, and I'm pretty sure other companies do use it, but I don't know if it is used at scale.
D
I mean, I know Huawei uses it; I can't speak to what everyone does internally. Again, you know, I don't think there's going to be a lot of friction if you want to do this work and you think that it's going to be valuable to the community; we're just trying to give advice: try to collaborate as much as possible with those projects in the ecosystem that already are live and already are kind of providing value.
D
The other thing is, you know, when batch containerized workloads were first coming out, the idea was (and this was like the Mesos kind of promise, right) that you're going to be able to co-locate your batch workloads and your serving workloads inside of your do-it-yourself data center, so you can get better utilization by filling holes in your machine shapes with batch work when you're not utilizing them, right?
D
So that is one reason why maybe Volcano and other batch workload schedulers would be co-located in the same cluster with, you know, a serving workload scheduler. I, today, don't think about the problem that way, right? With commodity cloud, and especially commodity cloud that offers me arbitrary machine shapes, I can totally co-locate all of my serving workloads in one cluster, and if I want batch workloads, I can just get another cluster with an autoscaler,
D
stuff it full of batch work and then let it scale back down when I don't need it anymore, scale to zero if I'm not doing any, right? So I don't get the same benefit in increased cluster utilization by co-locating batch with serving workloads as I did, you know, 10 years ago. This is anecdotal, just for me. So the utilization thing: it could be that we don't see it a lot because it's just not mature.
D
It could just be that people are using different GCP services in order to orchestrate their batch work, because they can, you know, get scale-to-zero if they're mostly running Spark jobs. If I remember correctly, GCP has some managed services that allow you to stand up a standalone Spark cluster, scale it up, pull your data down from Google Cloud Storage, spin up your batch work on it, and turn the whole thing down when you're done churning your data.
B
I completely agree, but I have two comments here. One: we are seeing more and more of a shift towards deploying batch workloads on Kubernetes, so that is actually a thing. I think people have realized, okay, we've converged on Kubernetes as the orchestrator for containers.
B
There are many managed services, as you mentioned; for example, Databricks has their managed services available on all clouds, and we have our own as well. But still, based on our user research, there are people who have these large clusters, who want to deploy multiple types of workloads, and who only want to have one managed service, which is a Kubernetes cluster, whether it's managed like GKE or on-prem.
D
I wasn't saying they wouldn't want to use Kubernetes, not to be confused. I was saying that the complexity cost of running multiple schedulers simultaneously for different workloads in the same cluster might not lead to the same benefit that it did 10 years ago, and now, even if you want to use Kubernetes, if you wanted to use Kubernetes and Volcano, spinning up a separate cluster just to run your batch workloads and scaling it up to meet the demands of whatever you're trying to run on it...
B
Otherwise you've paid for something that you're not using. The other thing is, there are cases where you actually provision more than what you need, just in anticipation of spikes, and we do see that; there are cases of our customers doing that, where there is provisioned capacity just sitting there because there's a flash sale or whatever that is going to spike traffic instantly, and so they want that provisioned capacity available, because provisioning resources is much more expensive than preempting a batch workload,
B
which is just like kicking out the pod. So there are still cases in the cloud that have that model. But yeah, I don't want to keep picking on Volcano; I just don't think it has the right APIs or supports all the cases that we're looking at, for example, supporting flexibility in the cloud. As you mentioned, you have all these types of VM shapes and resource types that you don't usually see on-prem.
B
Being able to express these types of resources in an easy way: that feature, for example, is not implemented either; Volcano doesn't have the right APIs for it. So that's my pitch. Again, as I mentioned, we're trying to do a proof of concept, and we want to engage the community from the very early stages.
B
I'm planning to present this to the research user group in CNCF, which Volcano is part of. We will try to do all this. I want to say, like, I'm just the one presenting here, but I honestly need to have leads from the community to shape this with us. Again, our goal is to have something portable; I don't want something that only works for Google, just like we want Kubernetes to be portable.
B
Well, that's wonderful, and it's not complicated; we're only implementing the very basic use case for it, just to show how it's going to work. Again, we don't want to shape anything or make anything opinionated. I just want to have something: okay, look, this theory of operation that we're proposing, we implemented it; let's see how we do other things. I'm happy if you're interested in contributing to it as well.
A
Thanks. I think we're at time, so I'm going to end this call, and then, if you have questions, we can follow up offline.