From YouTube: Kubernetes WG Batch Weekly Meeting 20220818
A
Okay, good morning, good evening, good afternoon, depending on where you are. Today is August 18th, if I didn't miss the date, and we have a single item on our agenda: it's about Kueue. Aldo and Abdullah will be presenting a Kueue overview, fitting into the series of presentations on the various batch frameworks that exist in the Kubernetes ecosystem.
B
Okay, and let's hope the slideshow works. Yes, it works, all right, perfect. So hello everyone, my name is Aldo, and Abdullah is joining me to present Kueue. As you might know, Kueue is a project sponsored by SIG Scheduling. Here we're going to go through the motivations, what led us to build the system, and the APIs, and if there's time we have a short demo for you. So, the problem, of course, is why we are here: we are talking about job queueing. We have a limited set of resources in our clusters, and we have multiple tenants sending jobs to the cluster, so we need to decide which jobs should wait and which can start at a given time. Why is it necessary to do job queueing?
B
So first, what was wrong with plain Kubernetes? In Kubernetes, if we consider the kube-scheduler, it will continuously attempt to start pods, and if these pods have dependencies between each other, we can have pods that basically starve the cluster, and you could have pending pods queuing up in the memory of the scheduler. Everything starts to go slow. On the other hand, Kubernetes offers the concept of quotas, but quotas are enforced at resource creation. So you could only say: I have this limit of resources for my pods, but I cannot enforce the quota at the job level, for example, and once my pod failed to be created, there is no place where it's queued.
B
I have to retry, and it's not possible to express order or priority for these pods to be accepted. These are the two main problems in plain Kubernetes. Now, of course, there are existing custom schedulers that also try to solve these problems, but what we found with them is that, first, a lot of them re-implemented a lot of existing functionality: a lot of them replace kube-scheduler, for example, a lot of them replace the job controller or introduce a new job API, which makes them hard to maintain.
B
More importantly, these schedulers didn't have a clear integration with autoscaling, with the cluster autoscaler, and as such it's very hard to use these schedulers in the cloud. An additional problem, which came from customer research, is that customers want resource fungibility, or flexibility. For example, in the cloud you could have VMs on demand, or you could use spot VMs. A job might want to use spot VMs whenever they are available, but if we run out of spot VMs and the job is high priority, we still want to be able to jump to another VM type or another resource model. If I'm talking about GPUs, I could opt in to a slower GPU that is more available.
B
So here is the proposal that Kueue is bringing. Kueue is an operator, so you can install it on top of your existing cluster, and it plays well with the existing kernel of Kubernetes. The implementation is slim: because it doesn't re-implement kube-scheduler, the cluster autoscaler, or the job controller, the implementation is rather small, while still providing all the functionality needed for job queueing.
B
Again, we aim to reuse the maximum amount of APIs and components we can from core Kubernetes, and this leads to full compatibility with the ecosystem.
B
So now I want to talk a little bit about the APIs, the resource model. We decided to go with the namespace as the tenant abstraction; it's already a canonical container abstraction, so we started from there. And then there are two closely related resources: one is a Queue, the other one is a ClusterQueue.
B
And
then
we
have
a
cluster
queue
which
is
actually
what
governs
the
the
pool
of
resources
so
in
in
the
cluster
queue
you
can
define.
You
can
see
here
the
quota.
This
is
a
simplified
view
of
the
api,
but
you
can,
you
can
say
the
how
many
cores
you
want.
You
can
have
a
minimum
or
or
you
can
spill
over
to
200,
and
then
it
defines
some
boundaries
for
fair
sharing.
So
here
you
can
see
this.
This
field
called
cohort.
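To ground this, here is a minimal sketch of a ClusterQueue of the shape being described, assuming the alpha-era kueue.x-k8s.io/v1alpha2 API of the time; the queue name, cohort name, and numbers are illustrative, not from the slides:

  apiVersion: kueue.x-k8s.io/v1alpha2
  kind: ClusterQueue
  metadata:
    name: team-a-cq              # illustrative name
  spec:
    cohort: all                  # cluster queues sharing a cohort can lend unused quota
    namespaceSelector: {}        # which namespaces may submit through this queue
    resources:
    - name: "cpu"
      flavors:
      - name: on-demand
        quota:
          min: 100               # guaranteed capacity
          max: 200               # ceiling, counting quota borrowed from the cohort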
B
We're going to see a little bit more about it in a bit. And here we are highlighting existing resources: the job is the Kubernetes Job API.
B
Now, if we have multiple namespaces, each namespace has its queue. Researchers send jobs to their queue, then the queue sends the jobs to the cluster queue, and this is where the fair sharing happens.
B
And
you
could
have
more
complex
scenarios,
so
this
this
is.
This
is
where
the
cohort
comes
in
place
you
have
so
the
idea
is
that
if
this
cluster
queue
is
not
using
all
of
its
resources,
then
the
the
robotics
pool
cluster
queue
can
spill
over
and
use
the
the
quota.
B
We have Diana, who has questions.
D
Sorry, I did. Can you hear me? Okay, great. So just a couple of questions. When you were talking about some of the motivations, where the scheduler could get overwhelmed with a lot of requests that just aren't satisfiable because of the limitation of the resources, and you don't have to mention this now, but have you compared the overall performance with and without Kueue?
B
The reason why Kueue wouldn't cause this kind of problem is that Kueue restricts the pod creation, so the scheduler itself doesn't have pods to process before Kueue has the chance to allow these jobs to proceed.
D
Yeah,
that's
great
yeah
and
I'm
interested
in
knowing
how
this
other
prof,
I
guess
the
testing
with
regards
to
the
enablement
of
q,
now
you've
reduced
the
overhead
on
the
control
plane,
which
is
which
I
think
will
show
a
lot
of
value
of
being
able
to
have
this
queuing
system
and
reducing
the
overhead
so
yeah.
That
was
one
thing
and
then
the
other
question
I
had,
and
you
talked
a
little
bit
about
borrowing
from
the
quota.
D
Is
this
borrowing
also
enabling
preemption
as
well?
In
other
words,
there
was
borrowing,
but
then
the
associated
queue
wants
to
have
some
of
its
resources
back
or
is
it
once
you
borrow,
then
you
have
to
wait
for
the
release.
The
release
of
resources.
B
If there are no more questions, I would like to go through this before I go through the APIs, and then what Diana asked might become more clear. So this is how the operation of Kueue happens. First we have the batch administrator, the administrator of the cluster.
B
This
this
persona
creates
the
name.
Spaces
creates
the
queues,
creates
the
cluster
cues,
to
set
up
the
quotas
and
for
sharing
etc.
Now
the
batch
user
or
like,
for
example,
the
researcher,
the
the
the
only
thing
they
need
to
do,
is
they
have
to
create
a
v1
job
and
they
then
say
they
set
the
the
queue
name.
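As a minimal sketch of that step (the annotation key is the one Kueue uses, kueue.x-k8s.io/queue-name; the job name, queue name, and image are illustrative):

  apiVersion: batch/v1
  kind: Job
  metadata:
    name: sample-job
    annotations:
      kueue.x-k8s.io/queue-name: team-a-queue   # which Queue to submit through
  spec:
    suspend: true                # created suspended; no pods until Kueue admits it
    parallelism: 3
    completions: 3
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: main
          image: registry.example/sample:latest # illustrative image
          resources:
            requests:
              cpu: "1"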
B
Then Kueue comes into play. Based on the quota and the order, priorities, etc., it will admit the job, and during admission it will inject node affinities. This is where the resource fungibility happens: the node affinities, for example for spot VMs or a particular GPU model, will get injected into the job, and then the job is unsuspended.
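To make the injection concrete: assuming a flavor whose labels include a hypothetical instance-type node label, the admitted job's pod template would gain a selector along these lines (a sketch, not exact output):

  # pod template of the Job after Kueue admits it
  spec:
    template:
      spec:
        nodeSelector:
          instance-type: spot    # injected from the assigned flavor's labels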
B
Then
the
job
controller
would
observe
that
the
job
is
unsuspended
and
they
will
start
creating
the
pods
and
then
the
keep
scheduler
would
schedule
the
pots
so
yeah.
This
is
this
is
what
we
mean
by
slim
implementation,
this
job
controller,
this
skill
controller
is,
is,
is
acting
in
in
this
pipeline
and
then
the
rest
of
the
system
works
as
as
intended.
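The suspend semantics this pipeline relies on live in the batch/v1 Job API itself, so they can be exercised even without Kueue; for example (job name illustrative):

  # pause: the job controller deletes running pods and creates no new ones
  kubectl patch job/sample-job --type=merge -p '{"spec":{"suspend":true}}'
  # resume: the job controller starts creating pods again
  kubectl patch job/sample-job --type=merge -p '{"spec":{"suspend":false}}'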
B
So
with
that,
I
think
I
can.
I
can
get
back
to
the
apis.
B
Right, yes, we already did that for the v1 Job, but if you want to add support for a custom job, it just has to have the semantics of suspension.
B
I can go over that in a bit as well. So, back to the job: pretty much nothing changes, you just add an annotation with the queue name. Internally, we create a separate entity for this, which has the queue name and other things. The Queue API is just a namespaced object, so it lives in your namespace, and it points to a cluster queue. In the future we might want to add some more fields so that you can control quota even for yourself as a researcher; it might come in handy. And then the most important object is the ClusterQueue.
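A sketch of that namespaced object, again assuming the v1alpha2 API of the time (in later Kueue releases this kind was renamed LocalQueue; names here are illustrative):

  apiVersion: kueue.x-k8s.io/v1alpha2
  kind: Queue
  metadata:
    name: team-a-queue
    namespace: team-a            # lives in the tenant's namespace
  spec:
    clusterQueue: team-a-cq      # points to the cluster-scoped pool of quota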
B
This
is
where
most
of
these
things
are
defined,
so
we
already
talked
about
the
cohort,
so
this
is
where
we
define,
which
other
resources
can
which
other
cluster
queues
can
train
or
use
resources
from
this
one,
and
then
you
define
which
resources
are
this
cluster
queue
is
governing
so,
for
example,
for
cpu
we
have
defined
a
couple
of
flavors,
so
the
flavor
on
demand
and
the
flavor
spot.
B
Now
these
for
a
flavor,
you
can
define
labels
right.
So
these
these
labels
are
going
to
be
injected.
So
if
you
will
observe
the
quota,
if
there
is
quota
for
on-demand,
it
will
insert
the
labels
for
on-demand
and
then
your
workloads,
your
pots,
will,
since
they
already
have
the
affinity
they
would,
they
would
start
on
this
particular
nodes.
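The flavor names referenced from a ClusterQueue resolve to ResourceFlavor objects carrying those labels; a sketch, assuming the alpha-era layout where labels sit at the top level, with a hypothetical instance-type node label:

  apiVersion: kueue.x-k8s.io/v1alpha2
  kind: ResourceFlavor
  metadata:
    name: spot
  labels:
    instance-type: spot          # copied onto admitted pod templates as a node selector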
B
These two values, we had a hard time naming them; we call them min and max for now. Min means that this is kind of like your guaranteed capacity, the capacity that your cluster queue really has, and then max is the capacity including what can be borrowed from the rest of the cohort. For example, with min 10 and max 20 CPUs, a cluster queue is guaranteed 10 CPUs and may use up to 10 more, borrowed from unused quota elsewhere in its cohort.
D
I had another question; I'm not sure I understand this part where there is an injection of the node affinity. Can you help me understand a little bit more? Is it that the request is checked against the associated quota and then, once it passes, it's allowed to allocate some of those quotas? How does the node affinity change the way the job gets dispatched or released?
B
So
the
node
affinity
gets
inserted
into
the
bot
spec
template.
Okay
of
the
job.
B
Okay, so you can think of the flavor in terms of models. Here we are talking about CPUs, and this CPU is kind of wrapping the entire VM. In the cloud you can have certain types of VMs, and these VMs could be spot, sorry, could be preemptible or not preemptible. Another example would be: maybe I have ARM and Intel, and then I have different cores for ARM and different cores for Intel.
B
It's the flavor of the resource. And in the case of GPUs, you could have different models or maybe even different brands. So the idea is that you don't have to define the node affinity for your job: your job might be capable of running on different hardware, and then Kueue decides which one is the best hardware to use at the moment.
B
And then the difference with the default scheduler is that now it is guaranteed that all the pods your job runs on are on the same model.
C
So my first question is about suspension. If a job is suspended, what is the effect on quota? Is it returned to the pool or not?
B
Yes, we haven't implemented preemption yet, but the idea is that once a job is suspended, the quota comes back.
C
All
right-
and
my
other
question
is
what
what
is
the
interaction
of
you
know
the
cluster
cues
quota
with
you
know
the
traditional
resource
quota
say:
you
know
you
have
resource,
supported,
defined
on
a
namespace
and
you
have
adequate.
You
know
cluster
queue
quota,
but
maybe
not
enough
resource
quota
in
the
name
space
where
it's
scheduled.
B
Right
so
the
quota,
since
it
works
at
the
pod
level,
we
wouldn't
recommend
using
it
because
yeah
you
would,
you
could
have
you
could
end
up
with
with
partial
jobs,
so
we
we
would
recommend
that
you
only
use
the
the
quota
model
from
q.
B
Eventually,
we
would
like
to
somehow
make
them
work
together,
but
that's
gonna
be
very
far
in
the
in
the
in
the
road
map.
B
We do some admission; the most important webhooks, the mutating admission controllers, we mimic the behavior of, but we wouldn't support a custom admission controller.
E
I just want to add here that an admission controller like Gatekeeper could be used to complement Kueue, for example if you want to implement specific policies about who can use which queue or which cluster queue, something more advanced than what we have built into Kueue. They would complement each other, and this is one of the design goals of Kueue: not to re-implement existing functionality. Regarding ResourceQuota:
E
Usually
it's
it's
used
to
protect
the
cluster
from
failing
over
because
they're
not
too
flexible
like
if
you
they
basically
prevent
you
from
creating
the
resource
completely,
and
so
you
still
need,
like
a
you,
know,
a
control
plane
of
your
own
to
retry
that
when
the
quota
is
available,
it's
not
they're
not
going
to
retry
it
for
you
right
like
if
you
don't
have
color
to
create
pods,
that's
it
right,
and
if
you
continue
to
do
that,
like,
for
example,
if
you
have
a
deployment
department
continues
to
try
creating
the
pods
and
the
pods
keep
getting
like
blocked
by
resource
quota.
E
This
is
not
a
nice
model
right,
like
you're,
gonna
kill
the
control
plane
as
well,
like
the
like,
so
so,
usually
resource
coder
kicks
in
like
at
the
limits.
So
you
don't
you
prevent
and
protect
the
cluster,
not
in
the
case
where
oh,
I
have
limited
resources.
E
I
want
to
make
sure
that
I
maximize
their
usage,
and
so
I
have
some
sort
of
like
queuing
system
that
releases
these
resources
and
creates
them
at
the
right
time
is
that
is
that,
like
that
make
sense,
I
think
this
is
how
we
view
how
those
like
admission
controllers
resource
quota,
interact
with
what
q
is
doing,
which
is
basically
dynamic
resource
code
in
a
sense.
B
For now we go through the flavors in order. So if you set this order, you can always swap them: we go in this order and try to use this one first. Now, you could think of different strategies; for example, you might want to prioritize.
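In other words, with the current behavior, preference is expressed purely by list order inside the ClusterQueue; a sketch with illustrative numbers, preferring spot and falling back to on-demand:

  resources:
  - name: "cpu"
    flavors:                     # flavors are tried in the order listed
    - name: spot                 # preferred while its quota lasts
      quota:
        min: 50
    - name: on-demand            # fallback once spot quota is exhausted
      quota:
        min: 100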
B
So, going through this: what do we ask from the ecosystem, from Kubernetes? From Kubernetes, what we are asking is that the Job API advances to support more use cases. Of course, we ourselves have been working on a lot of these; we have, for example, failure policies, failure per index, or we are thinking about whether we should add multiple pod templates in the Job API, or an alternative job.
B
This
is
the
work
that
they,
the
working
group
itself,
is
doing
and
minor
details,
maybe
add
a
cue
name
to
the
job
api.
Currently,
it's
hard
to
justify,
because
q
is
the
only
consumer
and
q
is
not
part
of
kubernetes.
B
So
it's
a
little
bit
hard
to
justify
at
the
moment,
and
then
this
is
more
of
a
longer
request
to
add
some
kind
of
suspend
of
cueing
or
queueing
subresource
to
to
the
apis,
so
that
you
can
can
suspend
and
suspend
any
arbitrary
custom
workload
in
an
arbitrary
in
a
agnostic
way.
B
You
might
be
familiar
with
some
resources
from,
for
example,
the
scale
some
resource
with
the
skeletal
resource.
Horizontal
portal
test
header
can
scale
up
or
down
any
custom.
B
Custom
resource
and
yeah,
so
the
idea
would
be
to
have
the
same
four
for
for
suspending
or
queueing
and
what
we
asked
to
to
the
ecosystem
in
general,
like
keyflow,
argo,
etc.
So,
ideally,
we
would
like
everybody
to
you
to
use
the
job
api.
B
Not
we
don't
mean
that
you,
you
would
get
rid
of
your
resource
definitions,
but
you
could
still
use
a
job
as
an
underlying
resource
that
that
deals
with
the
pod
creation.
Of
course,
this
is
a
very
long
term
ideal.
B
So,
for
the
time
being
we
we
would
like
the
ecosystem
to
also
support
suspense
semantics
and
non-affinity
injection.
So
yeah,
that's
that's
the
idea
with
that.
I
could
go
into
a
quick
demo
live
demo.
Of
course
that's
gonna
go
well.
A
Yeah,
so
I
think
I
heard
horizontal
pod
auto
scaler
and
I
mean
frankly,
I've
been
reading
more
about
cluster
auto
scaler
and
they
they
react
to
pending
pods,
and
so
how,
if
you
decide
to
scale,
then
will
q
react
to
pending
boards.
B
Trying
to
cancel
so
there
again,
the
since
q,
doesn't
q
would
prevent
the
the
creation
of
pots,
and
once
these
spots
are
created,
everything
starts
doing
what
they
are
supposed
to
do.
The
keep
scheduler
would
start
scheduling
the
pots
that
are
really
fit
in
existing
nodes
and
once
the
node,
the
pods
don't
fit,
are
marcus
pending
the
cluster
of
the
scatter.
We
would
kick
in
and
create
the
notes,
so
it
it's.
B
Basically,
everything
is
behaving
the
same
way,
except
that
q
is
coming
in
the
middle
just
to
throttle
to
throttle
the
demands
and
then
so
that
workloads
don't
compete
with
each
other
for
for
the
resources
when
they
don't
need
to
so
that
in
the
future
we
would
like
to
have
some
more
direct
integration
with
autoscaler.
B
That's
that's
also
in
the
roadmap,
so
that's
the
so
the
queue
can
say.
Oh
I
have
all
of
these
things
pending.
I
can
start
already
asking
the
cluster
autoscaler
to
scale
up,
but
that's
still
a
normal
way.
B
Okay,
I
have
too
many
so
here
I
have
a
system
with
with
two
cluster
cues.
You
can
see
them
at
the
bottom.
I
have
alpha
and
beta
there.
They
are
running
my
workloads
and
they
are
part.
B
They
are
part
of
the
same
cohort,
so
cohort
the
cohort
all
so
in
theory,
they
they
should
be
able
to
share
resources.
Now
the
problem
is
that
there
are
too
many
pending
workloads
in
d
in
beta,
and
once
this
this
all
these
workloads
complete,
you
will
see
that
alpha
will
start
borrowing
resources
from
from
beta.
B
Now, this is an autoscaled cluster; at the moment it's at its maximum size, because all of the resources are being used for the given quota. So if I show...
B
So
if
I
describe
this
cluster
cue,
you
can
see
that
well,
I
have
defined
for
the
flavor
on
the
map.
I
have
a
quota
of
10,
10,
cpus
and
36
gigabytes
of
gigabytes
of
memory.
So
that's
that's
what
we're
seeing
here
well
10
from
alpha
and
10
from
beta
okay!
So
now
we're
getting
to
the
point
here
down
here
in
beta,
where
we
no
longer
have
pending
workloads,
but
we
still
have
pending
workloads
in
in
alpha.
So
we
we
can
see
that
now.
Alpha
is
starting
to
borrow
resources
from
from
beta.
B
Now I'm going to exhaust the system, let's say. So now both cluster queues are being overwhelmed, so I need more resources.
B
So I have this extra cluster queue, which is currently in a different cohort, so I'm going to put it in the same cohort, and this means that the other cluster queues can start borrowing resources. So now we can see that all of these workloads have been admitted, and let me just make this faster.
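The cohort change in the demo amounts to a one-field edit on the extra ClusterQueue; something like the following, where the queue name gamma is illustrative and the cohort matches the demo's "all":

  kubectl patch clusterqueue gamma --type=merge -p '{"spec":{"cohort":"all"}}'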
B
This
is
just
creating
jobs,
so
let's
give
it
a
couple
of
minutes.
This
is
an
autoscale
environment.
So
at
the
moment
the
cluster
autoscaler
is
working
on
setting
these
these
nodes
up.
So
we
should
see
them
in
a
bit,
but
basically
that's
the
idea.
Once
the
workloads
are
consumed.
B
Oh
there
you
go
so
this
is
yeah
the
cluster
of
the
scalar
started.
So
now
all
of
these
pending
workloads
are
being
processed
in
these
new
nodes,
and
that's
it
that's
the
does
the
demo,
so
you
can
see
here
in
action,
a
different
flavors
and
borrowing
and
very
sharing.
E
Just
wanted
that
part
of
the
design
for
the
api
is
trying
to
fill
gaps
related
to
queueing
in
a
cloud
environment
like
all
these
new,
seemingly
new
concepts
of
like
flavor
fallback,
adding
a
new
cluster
eq
on
the
fly.
E
All
of
this
usually
doesn't
happen
in
a
an
on-prem
cluster
where
you
already
beforehand,
you
know
how
much
resources
you
have
it's
fixed
size,
but
in
a
cloud
environment
like
the
example
like
aldo,
showed
that
oh,
I
have
new
cluster
queue,
it's
not
that
like
in
in
an
on-prem
environment,
it's
not
like
you
got
a
new
shipment
and
oh,
I
installed
the
new
server
in
the
cloud.
This
can
happen
in
an
instant
right.
Like
you
say
you
decide.
E
Oh,
I
purchased
this
new
reservation
or
I
decided
to
use
more
spot
vms
because
now
they
are
cheap,
so
I
can
spend
more
money
on
them,
and
so
you
want
all
of
that
flexibility
to
be
expressed
in
the
api
to
use
the
power
of
the
fungibility
of
the
cloud
and
extensibility
right.
So
you
can
at
any
moment
decide
you
scale
up
or
down,
and
so
you
want
to
be
able
to
express
those.
So
hopefully
you
look
at
the
api
with
that
lens.
E
It
might
look
at
the
beginning
as
like
a
little
bit
confusing
or
the
concepts
might
not
be
too
clear,
but
if
you
think
of
it
in
the
context
of
the
cloud
it
might,
it
might
make
more
sense,
especially
the
example
that
aldo
gave,
for
example,
on
demand
to
spot,
etc
or
various
gpu
models.
E
Okay,
that's
one
thing,
but
the
other
thing
is
the
various
types
of
resources
like
in
an
on-prem
cluster.
You
usually
have
one
type
right
in
in
the
cloud
you
have
on
demand.
You
have
spot.
You
have
various
machine
families
like
in
in
gp
in
in
the
in
google.
You
have
c2s
and
and
a2s
like
you
have
multiple
types
of
machines.
You
have
memory
optimized
like
all
of
these
dimensions.