From YouTube: Kubernetes WG Batch Bi-Weekly Meeting for 20230216

A
Good morning, good evening, good afternoon, depending on where you are. Today is February 16th, and this is another of our bi-weekly WG Batch calls. My name is Matthew and I'll be your host today. I see that we have topics on the agenda already from Aldo, so I guess, Aldo, you can take it away from here.

B
Okay, so today the developers of Kueue wanted to share a status report of how we're doing towards the next release of Kueue, the 0.3 release. My name is Aldo; I'm joined by Marcin from GKE and Kante from DaoCloud.

B
One of the big updates is that we're going for our first beta APIs, and this is going to be quite breaking. I'm going to give a summary of the changes, but the whole point of going to beta is that, from now on, we will have backwards compatibility guarantees. So from now on, you will have time to migrate, and proper tooling for migration. And an important feature that was highly requested from the beginning of the project is preemption, and I'm going to talk a little bit about how it works.

B
So it has helped us spot some problems before going to master, or to the main branch, and we had some performance tests written early on, and Marcin is going to show us a few results we got from running those load tests.

B
Also, we have other ongoing efforts. I'm happy to announce that we have a new website under the sigs.k8s.io domain, so that's our new thing; let me quickly share. It's still, you know, in a way a work in progress. We applied to the CNCF to help us with a logo, so that's coming next. But if you want to collaborate and you have some web experience, please don't hesitate to help, or if you can just do design and so on. We already migrated the docs, so now they look pretty nice, pretty neat.

B
Kante here is also going to talk a little bit about our plans for a library to be able to integrate custom jobs. So far, Kueue has only supported the Job API by default. While it is already possible to add a controller that integrates with Kueue, we want to offer a library so that it's very easy to write these controllers. A parallel effort to that is that we are already adding support for Kubeflow's MPIJob, by Mikhail and Yuki.
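
As a rough illustration of the default integration mentioned above: submitting a Job to Kueue means creating it suspended and pointing it at a LocalQueue. This is a minimal sketch; the queue name `user-queue` is a placeholder, and whether `kueue.x-k8s.io/queue-name` goes in annotations or labels changed across early releases, so check the docs for your version:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: sample-job-
  annotations:
    kueue.x-k8s.io/queue-name: user-queue  # target LocalQueue; a label in newer releases
spec:
  suspend: true            # Kueue flips this to false once the workload is admitted
  parallelism: 3
  completions: 3
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox:1.36
        command: ["sleep", "60"]
        resources:
          requests:
            cpu: "1"
```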

B
So that's already ongoing. Anyway, I wanted to highlight the changes we plan for beta. If you look at the slides later, you'll find all the links to the docs or PRs and whatnot, but I wanted to highlight a few changes, in particular this one. At GKE we actually ran some UX research studies, and also from feedback we got from early users, we learned that people didn't understand, and got confused by, the two terms that we used for quota: min and max. So we did a bunch of brainstorming to figure out how to properly name these fields, and we ended up with these two names, which we hope are more understandable. So here we're saying that the CPU resource has a flavor of spot, and we have 40 CPUs available as nominal quota.

B
Now, the next field is the borrowing limit. Once you have reached your quota, if your cluster queue belongs to a cohort, then you can borrow quota from the other cluster queues in the same cohort, and this is the limit of how much you are able to borrow from all those other cluster queues. I guess the only thing to note here is that now, instead of a max, we are specifying the extra, which is 60; so then 60 plus 40 is the old max.
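
Put together, the renamed fields look roughly like this inside a ClusterQueue flavor; a minimal sketch using the numbers from the slide, with field spellings as in the beta proposal:

```yaml
flavors:
- name: spot
  resources:
  - name: cpu
    nominalQuota: 40     # formerly "min": the quota guaranteed to this cluster queue
    borrowingLimit: 60   # extra quota borrowable from the cohort; old "max" = 40 + 60 = 100
```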

B
Another thing to note is that these two terms match the API fields in the API server's Priority and Fairness feature, if you're familiar with it, which also deals with quotas and sharing and so on. They had already decided on these two terms, and we found out that they fit nicely with our semantics as well, so we are doing that. With that, I also wanted to give a shout-out to Jordan Liggitt, who gave us quite a few hours of review of the API.

B
We did all of this while working on this proposal for a beta API. Now, still in the same object, we went a bit further. You can see here on the left the updated spec for the cluster queue. Basically, you can see that we specify multiple resources, CPU and memory, and then for each resource we specify flavors, and for each flavor we specify the quota.

B
Now, what's the problem here? Usually things like CPU and memory are very deeply coupled, because they are tied to the node. So usually you want the same flavors to be listed for CPU and memory, and of course you cannot assign a flavor of spot to CPU and then a flavor of on-demand to memory; that would be incompatible. So in a way, we actually want to define a flavor instead, such that the flavor gives you CPU and memory. That's the change.
B
We
theory
was
needed
to
give
a
contour
example
here
we
have,
for
example,
a
license
which
a
license
probably
is
not
tied
to
the
VM
right,
so
it
it
doesn't
have
the
same
flavors
as
the
other
resources.
So
now
you
have
here,
you
have
well,
we
just
put
a
some
boilerplate
here,
but
let's
say
you
have
license
for
the
operating
system,
so
you
have
only
disorder
for
Windows
and
this
much
quota
for
Linux
and
then
this
is
not
tied
to
do
the
other
resources
anyways.

B
So, since we have these semantics of resources that are tied to each other, we made it explicit. We are now defining what we call a resource group. In this case, the resource group is composed of CPU and memory, and for this group you have flavors: you have a flavor spot and a flavor on-demand, and for each flavor you have the different resources, and the quotas for each. Here I left out the borrowing limit just to simplify the view, but of course you can have that as well. And then the license would be a separate resource group, and again you have the flavors for that resource group. So those are the changes to the API.
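
A sketch of what such a ClusterQueue could look like with resource groups; the flavor names, quantities, and the license resource name are illustrative placeholders, not values from the slides:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a
spec:
  namespaceSelector: {}
  cohort: team-cohort
  resourceGroups:
  - coveredResources: ["cpu", "memory"]   # tightly coupled: the same flavors cover both
    flavors:
    - name: spot
      resources:
      - name: cpu
        nominalQuota: 40
      - name: memory
        nominalQuota: 128Gi
    - name: on-demand
      resources:
      - name: cpu
        nominalQuota: 20
      - name: memory
        nominalQuota: 64Gi
  - coveredResources: ["example.com/license"]  # not tied to the VM, so a separate group
    flavors:
    - name: windows
      resources:
      - name: example.com/license
        nominalQuota: 5
    - name: linux
      resources:
      - name: example.com/license
        nominalQuota: 10
```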

B
We hope it's going to be more explicit and also easier to understand from the beginning, and of course we are pairing it with updated documentation.

B
Actually, I don't know if there is a hand up or anything; I don't see the screen, I only see my slides at the moment.

B
Okay, sounds good, sounds good. So the next feature that we have been working on is preemption, and I wanted to explain preemption with this graphic here. Let's say you have this nominal quota for Team A on the left, in green, and this nominal quota for Team B, in orange. These two cluster queues are tied together, meaning that they can share resources; they can borrow resources from each other. Now let's say Team A is currently using extra quota that Team B is not using: Team B, on the right, is using less than its nominal quota, so Team A can borrow the quota that B is not using. But now we have a new job coming for B, so B wants to recover its quota, because it's part of its nominal quota. What should happen here is that Kueue needs to preempt some workloads that are running in this quota, to accommodate the new workload from Team B. So that's one scenario of preemption: you want to recover quota from cluster queues that are currently borrowing from you.

B
So those are the semantics. You can see that the API is pretty simple: there is a preemption field, and then we can say what happens when reclaiming from within the cohort. Here there are some options: you can disable it by saying never, you can say always, or you can say I only want to preempt workloads that have lower priority than mine. And just as a reminder, the cluster queue is an object that administrators set up, not end users. So this is setting policies from the organization; whether, you know, preemption should respect priority or not might depend on your needs.

B
Another scenario for preemption is within the cluster queue. Let's say there is no more space to borrow, Team B is using everything, and you have a high-priority workload coming for A. In that case, you might actually want to preempt some workloads that are running within the quota of A. That's the second scenario, and it's what the second field, preemption within the cluster queue, is controlling. Again, you can say never, or you can say lower priority. You cannot say always, because that wouldn't make sense: you wouldn't want to preempt a workload that has higher priority to accommodate a lower-priority one. But you can disable this too. So that's preemption; this is already merged.
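
In ClusterQueue terms, the two policies described above sit under a single preemption stanza, roughly like this; the exact enum spellings are worth double-checking in the docs (in particular, the "always" option is spelled `Any` in the API as far as I can tell):

```yaml
spec:
  preemption:
    # Evict borrowing workloads to recover this queue's nominal quota:
    # Never | LowerPriority | Any ("always")
    reclaimWithinCohort: Any
    # Evict workloads inside this same cluster queue:
    # Never | LowerPriority (no "always": higher priority is never preempted)
    withinClusterQueue: LowerPriority
```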

B
Any questions about preemption?

B
Next up is a configuration field; you can check the documentation. Actually, this link might be outdated; no, it's the new one, yes. So this is a configuration field for the feature "wait for pods ready": you can enable and disable it, and you can set up a timeout for how long you're willing to wait for pods to be ready. What's this for? When you enable this feature, Kueue will wait for the pods of a job to be ready.
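
In the Kueue manager configuration, the field looks roughly like this; a minimal sketch, with the apiVersion and the 10-minute timeout as assumptions rather than values from the talk:

```yaml
apiVersion: config.kueue.x-k8s.io/v1beta1
kind: Configuration
waitForPodsReady:
  enable: true   # gate admission of the next workload on the previous one's pods being ready
  timeout: 10m   # how long to wait for a workload's pods to become ready
```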

B
If you follow the progress in the Job API, you might know that there is a relatively new field called ready in the Job status that tells you whether the pods are scheduled and already running. We use that field within Kueue to, you know, gate the jobs: we wait for a job to schedule and start up, and only then do we schedule the next job. This offers all-or-nothing guarantees.
B
When
either
you
have
a
very
fragmented,
a
cluster
that
could
potentially
get
fragmented
easily,
namely
fixed
size,
cluster
and
or
if
you
are
prone
to
stockouts,
which
are
possible
in
it's
possible
when
you're
like
requesting
way
too
many
gpus
in
the
cloud
or
some
other
uncommon
or
highly
demanded
piece
of
hardware
so
yeah.
This
is
a
kind
of
a
first
first
attempt
at
offering
All
or
Nothing,
but
keep
in
mind
that
this
is.
This
is
not
what
we
intend
to
have
long-term
long
term.
B
We
want
to
have
deeper
integration
with
cluster
autoscaler
so
that
cluster
Auto
scalar
can
give
a
feedback,
a
feedback.
We
can
have
a
feedback
loop
between
q
and
cluster
Auto,
scalar
or
the
scheduler,
or
both
to
inform
you
that
a
scalab
was
not
possible
or
scheduling,
All
or
Nothing
scheduling
was
not
possible,
so
Q
can
back
off
the
next
jobs.
B
So
that's
that's
the
the
long-term
idea.

C
So we ran a couple of performance and scalability tests on Kueue. While we don't have any specific scenarios, we thought that a couple of them would be quite interesting. The first one focuses on comparing how much overhead Kueue brings over the job controller. The creation times were 14 and 13 seconds respectively, and the times needed for the jobs to start were also similar. We also tried creating a large queue inside of Kueue: we submitted 7,000 jobs, Kueue didn't crash, and the memory consumption was at a reasonable level of 500 megabytes.

B
Yeah, two things. Victor, who originally wrote the scale tests, did it through ClusterLoader, which is a highly configurable framework for building load tests. So it's super easy to set up new scenarios, you know, with the jobs and the number of pods your organization might have per job; you can tweak all the parameters, and then it's very easy to run the tests themselves. Another thing I wanted to highlight about Kueue, and why the memory consumption is low, is that Kueue takes decisions at the job level.

B
No, this is using the default configuration, which doesn't use waitForPodsReady. That could easily be added, but we haven't tested it yet.

E
Can you imagine how deep the queues can actually get with the Kueue system?

B
How deep, as in pending? Yes, exactly, yeah. So this is the second scenario we talked about: we just basically created, like, 100 jobs per second, or something like that, and we just let it hold, and the memory consumption was 500 megabytes for 7,000 jobs.

B
So that's the memory consumption; but latency-wise, 7,000 is still, you know, not nothing, and we just used one core, and that was enough.

B
Right, so Kueue holds the jobs, so there are actually no pods until Kueue says we have quota for this job; then the job controller is able to create the pods, and only then would kube-scheduler take over. So, in a sense, Kueue will reduce the memory consumption of the entire cluster when you have way too many jobs, because there are fewer pods overall to process.

G
Okay, thank you. So remember that all these jobs are actually created in the API server, so there is a dependency on how scalable the API server is as well. It's not like Kueue is, you know, the front end that receives and stores these objects; they are all on the API server.

B
Yeah, one thing we did notice is that we sometimes run into the limitations of the API server itself: once you have too many pods there, each of the kubelets is sending status updates, so the entire API server is slow. You need a beefy control plane, or a beefy API server, to handle those kinds of loads, and Kueue is highly dependent on that as well.

B
Anyway, if there are no more questions about what we already talked about, I'll leave it to Kante.

F
We want to support more apps beyond the Job API, so we can offer this as a scaffold, so that you don't need to build your controller from scratch. To achieve this, we defined a new interface, which we named GenericJob, and it shapes the default behavior we need to integrate with Kueue. Besides this, we hope to provide a full controller.

B
I had one thought about this, but I forgot; let's just continue. So, as usual, we are always looking for contributors, not just contributions in code but documentation, or now the website, and just general feedback. Oh yes, for example, I left in the notes for today's meeting the links to some of the PRs and whatnot, for example the custom job library.

B
Ideally you want 10 pods, but you're okay with five; those kinds of scenarios. Also integrating with profiling, so that we can measure the performance of Kueue at a higher granularity, and some user-experience improvements, such as a kubectl plugin, and perhaps even Grafana samples to build dashboards: we already have metrics, but we don't have ready-to-use Grafana dashboards, which would be nice to provide. And there are some people asking about Helm and other deployment mechanisms. So if you have any experience with any of those, you're welcome to contribute. And I think that's all we have for today, so I would like to leave the last 10 minutes for any questions.

B
Good question. So we are finishing up; I think the biggest change that we're still implementing is the API.

B
It's touching every piece of Kueue, so once that's finished we just need to update the documentation, and we have a tech writer helping us with those changes, to really bring the best quality to the presentation. I would expect, you know, maybe around four weeks to the next release.

B
I suspect everybody knows, but I didn't say what Kueue is; I assumed everybody knew. If you don't know, please ask now and we can clarify.