Kubernetes Batch Working Group Weekly, 23 Jun 2022

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Kubernetes WG Batch Bi-Weekly Meeting for 20220623

Description

No description was provided for this meeting.
If this is YOUR meeting, an easy way to fix this is to add a description to your video, wherever mtngs.io found it (probably YouTube).

A

Good morning, good evening, good afternoon, depending on where you are today is june 23rd, uh my name is mati and I'll be your host today, and this is another of our bi-weekly worker batch- calls quick announcement.

A

The cap freeze is later today, so if you have any last minute, caps make sure to grab your reviewers and prr approvers and get them done within the next couple of hours. I can't remember the exact hour, but I remember it's halfway through my polish night.

A

So with that out of the way uh we have william, uh who will be giving us a volcano overview so take it away from uh here, william.

B

A

B

A

Yep loud and clear.

B

Yeah sounds great. Let me share my screen.

B

B

So I prepared a material to show over you about the volcano. So let's, let's get started so my name is william wong and I'm volcano maintainer and it's my pleasure to share my product share. My practice in the last few years on clonitive batch system here is the agenda.

B

uh I first introduced challenges we were facing on kubernetes patch and then I'd like to talk about volcano and later I'm gonna show some some use cases of being of customers and last one is about the community.

B

So here is a background about why we would like to have volcano project. At the very beginning. We built a batch system for traditional hpc workload, and then we had a big data platform to handle the data intensive analysis workload and after that we leverage cognitive technology for ai workload.

B

But the major problem is that all these platforms are using different tech, technical stacks building, different ecosystems, which make it hard to share resources between these different kinds of workloads and our resource utilization is lower than expected and maintaining multiple technical stacks at the same time takes a lot of costs and effort. So more and more organizations were leveraging collaborative technology to build a unified platform for our workloads, but there are still some gaps in clinicic system to to support a batch.

B

So we built cool batch in 2017 and to handle the scheduled scheduling gap, and then we built the volcano based on the cool badge to deal with other gaps like performance and the job management.

B

So here are some major gaps in cloning develop system for batch workload, so the first one is is about the job management.

B

It's a common requirement to have different approach, templates and fingering, the lifecycle management job and the second gap is about the scheduling such as the job-based priority. The plan preemption resource reservation, backfield, topology and and so on, and the third gap is both the is is is is about the crd, so support different workloads in in common job of crd, to reduce the complexity and the maintenance effort.

B

As we know, there's there's there. There are too too many uh crd so far, the uh such as the mpl printer pet touch operator, 10 kf operator.

B

It's it is a bit it's a big effort for customers to maintain so many operators and each operator have its is unique, have its characters, which is different from each other and last one is both the is about the performance and the dynamic resource sharing.

B

So most of batch workload requires hair is throughput demanding and, for example, we got a requirement from of customers to dispatch 8000 posts per second for spark job, but currently uh the the the uh for the default scheduler. It's it only can dispatch about 100 parts per second.

B

So so that's that's why we started the volcano project several years ago, and here is the overview of volcano volcano. In includes several components for multi-cloud scenario. We have federation sub sub project in pipeline to balance results between different clusters and in each individual cluster.

B

We introduce several crds, such as job for common batch workload and the queue for resource sharing and the controller to help management to the manage the lifecycle of these crds and the volcano. Scheduler vc scheduler provides rich advanced, scheduler scheduling, policies and performance enhancement, especially to have those batch workload better running on top of kubernetes and from the graph we can see. A volcano engaged deeply with upstream computing framework, like spark flink tension flow.

B

So far, we have supported almost all of the popular mainstream framework.

B

So volcano is open, source or open source at 2019 and then donated to cncf in 2020 this year it was promoted, promoted to incubator project, and currently there are more than 350 contributors in the community and more than 50 enterprise users adopt volcano in their production environment.

B

So, as we know in batch system, there are several concepts which are important to the high level design. The first concept is the job for batch system is usually have common job specification for all kinds of workloads such as such as ai, big data and so on, include the native x system is required to introduce multiple code templates and fingering. The error handling.

B

The second one is is about the tenants. Traditionally badge system introduced the user as a tenant and kubernetes use namespace as tenant, based on my my understand and in order to avoid system overloading since resource limits is required to the batch system. For example, the slurm has job queue, quality of service and kubernetes supports quota to control. How many resources can be created in in system in system for each tenant?

B

So last one is is about the queue. Q is common concept in batch system and it is widely used in batch systems such as learn rsf. It has administrator to manage, manage resources in cluster and share resources between these different tenants. In addition, some seems some system support, different scheduling, scheduling configuration for each queue. It is useful feature for for users, so in the following class, I'm going to introduce the detail of keno. The first section is about the job management in volcano.

B

We introduced a common job crd with multiple code templates as common specification to support all kinds of workloads. Currently, we already verified the server type of workload in production environment such as mpi, tensorflow, patch hardware, chrome well, and here is an example.

B

Example about the mpi, the mpi master and the work are using different code templates with different command line and different resource requests.

B

We can also support fingering the lifecycle management, for example.

B

We can restart the whole job when one task in the job fields. We also build job plugin to help a user to customize enhancement. For example, the ssh plugin is used to configure ssh automatically for mti job and the service plugin is used to create a headline service for communication between the posts in each job.

B

As we know, q is common concept in batch system. It is used to share results between different tenants in volcano. We follow this practice to make a q in cluster level, so q share resources between my students and the quota is considered as a resource limit of tenants. Currently, a volcano supports the first in first first out fairness, priority algorithms, false queue and the configuration is global for all oculus.

B

So we are also going to support configuration per q for for users.

B

Clear is another example showing how to share resources between queues. Suppose we have six cpu in in the cluster and two queues. One queue is q1, the other one is q2, which is mapping two teams, the the the width are two and one accordingly. At the beginning, in the q1 and q2 at the beginning, we submit the one job with with six code in q1 and the q2 is empty.

B

According to the proportion algorithm, the port in q1 is able to borrow resources from q2 to get all the code running, and then we submit a new job to q2. The scheduler will reclaim two cpu from q1 based on the weight of them, so so the the new job in q2 gather two cpus and and get the running.

B

In addition, we support guaranteeing the results in queue for for high priority jobs and supports the capacity to avoid the system overloading.

B

The next one is about the fair share scenario: fair share is a common requirement for elastic or streaming jobs like spark flink. However, in kubernetes the mobile submitted the more possibility to get more resources there, there's no fair share and volcano prod, plus the fair share between jobs and namespace for user.

B

So we can see from the graph that user1 and user 2 submit smart job and big job to the sim queue. The smart job maps get starving without fair fairness, getting fairness scheduling, so can you ensure insurance, big job and smart job? Get resources fairly with drf agrees, but only only double fair share is not enough. Suppose, one namespace oversubmits a lot of jobs to skill compared to other name services. It would pos it would possibly occupy most of the queue resources.

B

So we add the namespace level fair share in volcano, as the graph shows also the namespaces 3 submits more jobs than namespace 2, but we get the same amount of resources immediately.

B

This is important in a multi-tenance environment, for customer, and here are here- are some some scheduling policies in volcano, uh and I will have you because uh I will show some part of them in in the in in the slides.

B

So the one is about the elastic elastic training scenario, a machine learning workload has higher demand for gpu compared to traditional workloads. So gpu is expensive resource. How to improve the gpu utilization is a hot topic and have great value. Elastic elastic training can dynamically adjust the number of instances involved in in the training and so greatly improving the utilization of gpu resources, especially on the public cloud. It can work with sport instance to get a lower cost and improve the training progress.

B

Firstly, let's see what elastic job is like the last, the left figure shows a volcano job. The main available shows in refers to the job which has 5 port at least and the replicas in first to the pod. The job has 10 temples at most, the job gets running where failed posts get allocated, and then the job will extend more more posts if there's more free, gpu resources.

B

Here is another scenario as used. As you know, the inference service always have a lower gpu utilization compared to the training workload, as people tends to collocated colloquiate the inference service, with the training workload to improve the overall unit utilization. The the red figure shows an example: the influence job 2.

B

These high priority preempt elastic code from train job 1, to ensure it's sre. Whenever there's free resource available, the job 1 will extend the mob post to accelerate the training.

B

Clear is another useful uh policy for for for user uh it. This page shows how the task topology and the l awareness helps the distributing training for some gpu training thesis. The data exchange between tasks cost a lot of time and and becomes the bottom line of the training in our performance test. If the time of the data exchange could be reduced, the training performance can be improved.

B

The task topology would schedule the ps and worker on the sim node to reduce the data exchange latency.

B

We also had the test with the default scheduler and there are three nodes in the test cluster and we submit training job with two ps and four workers. We got three different three different placements at last. The results are random. With default scheduler, as you know, the group c is the best placement that we want with the task topology scheduling. We are able to get stable results as group c.

B

As far as I know, some users use the affinity and anti-affinity features to achieve this goal. However, in test the complex, the complexity increases with this cluster skill, and the performance is not so good enough. We also do some research on the io awareness scheduling with the task topology info, and I o information. We can minimal minimal lights, minimalize the max data transfer latency and get even better performance for some kind of models. This figure shows the vgg 16 model, training results with default, scheduler task, topology and io awareness scheduling.

B

The lower knee scheduling gets us. 30 percent performance increase compared with default scheduler and the result depends. uh The results depends on the data exchange and the the the mode models.

B

So in a real production cluster we user often submit much multiple kind of overclock, for example the smart job and the big job. How to avoid the big job or small job guide. Starving is very important, so the traditionally traditional hpc system usually supports this kind of feature, but in coordinates there's no kind of capability. The left, the left figure, shows an example at the moment of t1 users submit a big job with gun and the smart job too.

B

The smart job gets allocated and job and big job 1 keep pending due to the insufficient resource at the at the moment. T2, a new small jobs theory was submitted and got allocated the big job. One keep pending with the time goes on. The big job will get starving if the release the resources always cannot satisfy its gun and the user submits some small job continuously.

B

So the sre scheduling algorithm allows user to configure the job so that it is completed on time to reduce the risk of missed deadlines.

B

Sara waiting time is a maximal time that one job should stay in pending when sra waiting time reached. The sle plugin in volcano moves the pending job to the next state and start to resolve or hold the resources for this job until the jobs request is satisfied.

B

Spark starts to pro provide support in spark 2.3 version in 2017 and later spa computer provides another vid to help run spark on kubernetes as well. However, in a lot in a very long time, survival companies lack of batch scheduling, abilities so late. Last year we started we started to work with spa contributor to support custom customer matched.

B

The batch scheduler for spark and spark with volcano provides the the common batch scheduling abilities like job-based priority queue, fair share resource resolution and so on, and it has, it has been released in spark 3.3 uh this week and and and user user user can use spark 3.3 with volcano to to enable the batch schedule batch scheduling policies.

B

As this is another, this is another use case in uh in huawei cloud uh as the amount of data it continues to grow and the capacity of business increase.

B

Our users required coordinated clusters support larger skill. So we also spend a lot a lot of effort to support 10 000 nodes and mainly cost 1 million cluster. By optimizing, the container network scheduling, container engine etcd and api server, the red part leads to some specific approaches. Take a schedule. For example, we improve the scheduling throughput to 101.5 thousand ports per second by adopt the the cache batch binding and isenc async bonding and the code level uh pro code level of team. Optimizing.

B

So here I will have you, I will show several use cases for kubernetes and volcano. So is one of the top social media, media and e-commerce company in china. Many people use their apps in mobile phone and they have one million active users per month and the main workload they are running to is to provide the recom recommendation for end users. They need to refresh the module every every s, every several minutes and see, and they have some online service to do immediately reaction, when user refresh then refresh their their notes.

B

So the challenge is, they have large cluster with nearly with nearly 2000 nodes and the model have the model have 100 million parameters, one tencent fluid one. Tensorflow job has more than hundreds of ps and a worker code. Actually they want the best resource allocation and the utilization, so the user adopts the task, topology scheduling and again the 20 about around 20 performance increased, and he also used the see the sre based scheduling to prevent large. That job from starting another case is about the financial sector user.

B

They initially used the hadoop pr to schedule the batch jobs and the company groups. The environmental policy also changed different research. Research team require container deployment to avoid the environmental, confliction and dependency. So, however, the quantities default scheduler like lacks fair, share scheduling between much multiple teams. So we have requirements of using different frameworks such as the also they also have the requirements of using different framework such as tensorflow, pathage and mpr, that that requirement.

B

That requires to install different kinds of printers in their environment, which leading higher cost maintenance and the learning because they are faster.

B

They are a small fast fast starting company, and the user looks for the solution and found a volcano can satisfy their environment, their requirements and offer diverse scheduling, abilities and what's more and they could use a volcano job to unify the tension, flow path, touch and ampli jobs individually. They decide to migrate it from the yard to kubernetes and the volcano.

B

Now now volcano supports about 300 thousand posts increase per day in their production environment.

B

So for the public adoption we do get a lot of users adoption, especially for people who are running the arabic data. Gene transcoding workload on kubernetes here are some parts of the adopters using volcano in production environment for the code diversity. In recent years you can find that we get.

B

We get a good diversity in community development, almost almost more than 50 percent contributors are from the community members.

B

So here's here is the release journey of volcano. We have released more than more than 16 major versions since open sourced at the early age. We did we developed a set of scheduling policies to support batch workload and then integrate in integrate, integrate with ecosystem such as cube flow spark operator, flink operator, argo, manuspor, pad pedo and so on, and then we found that there's a lot of gaps in the job management. So we spent quite a lot of time to enhance the job management to deeply support upstream computing framework and in.

A

B

A

Yeah sorry to interrupt you, uh we are halfway well two-thirds uh through our uh reserve time, and I would like to use the remaining 15 minutes uh for the questions. Oh.

B

A

Do you have many more slides that you want to go through.

B

Okay, this is the last last one: okay, okay, last page yeah, perfect.

A

B

A

Okay, if you, if you don't, have any further uh closing comments, abdullah, you had your hands raised for quite a while go ahead.

C

um uh Thanks uh this was like really great. uh You know rundown of uh the volcano ecosystem. um I have a couple of questions. I'll start with one related to the uh the job api. I mentioned that um you have multiple. You know components in volcano, which is like the scheduler, the job uh orchestra, like management um and whatnot.

C

I'm wondering like if volcano has been like started. Like I don't know in 2017 three four years ago, I was wondering why didn't you think about improving the existing job api uh to satisfy your needs like within the next like in three four years? I would imagine by now, um like the job aware, could have evolved to fix all these gaps that you've noted earlier in the journey.

C

uh It's easy to ask this question now into this big, but I wonder if you have any thoughts that you still believe like, for example, the core kubernetes job api, wouldn't you know, evolve to fix all these gaps.

B

B

So for the job management, uh I think this is the biggest gap is the the multiple template because, because for the batch workload, the all most of the batch workloads have have several rules. Several rows in one job.

B

uh Sp, for example, this bar because you have driver and executor for the flinkers, you have job management, support and task management port for the tenant, follow the lps for the worker code, so port template is is very important.

B

Without the multiple template we we, we cannot consolidate to all those different kind of workload.

C

Right but like was this proposed, for example, to the kate's community and was rejected like, for example, I imagine like a v2 job api that includes multiple templates um like to to fix this gap. um It's just like my concern here is that again, like with it introducing a new job api like if you had invested in in the the core combination, you might have gotten like more traction um with with volcano, because it will be easier for users to you know, migrate to use um volcano without changing too much and without being concerned about.

C

You know fitting their workload for a yet another custom api, um but I, but I totally get like if, if you have a concern about like uh velocity, how fast we can iterate over it, um it's just like. I was like because you you started like three four years ago.

C

It's tempting to ask this question. Oh, if we had started three four years ago, improving the job api. By this time, we would have had all these gaps fixed, um but uh yeah.

B

Yes, it's a very important to.

B

It's it's very important uh so far. I have no idea about the decision. I'm I'm I'm I don't know see the upstream, the the coordinates, the the strategy with the the other crd. So in the early time we we want to coordinate, to support the batch job api and at that time maybe.

B

Maybe in kubernetes z1 most of people want to see the java api would be of it and and the lcrd ucrd to describe a complex, complex, the job so so far I have no idea about whether it's right to, or is a good choice, to support all kind of this. This.

B

This ability, like multiple templates in kubernetes java, v2 job api, 2.

C

That is okay. I think I just hope that if we manage to evolve the job api to stage where it fixes these caps, maybe volcano can migrate to use that common api. This could be like one way uh moving forward with this, but my second high level question about the schedule like it is: um how do you keep updated with the new kids features like, for example, uh wait for first consumer volumes, those are like baked in into the core kubernetes scheduler, or this new proposal.

C

I don't know if you've been following it in kubernetes, which is the dynamic resource allocation. um This is a really complex proposal that involves many components, including the scheduler. Do you import any code from the code kubernetes scheduler to achieve these? Like you know, requirements uh similar things with like affinity and whatnot? I don't know if you reimplemented them yourself or like volcano, I mean, or you just import them.

B

From oh, that's a good question: uh the kubernetes support multiple scheduler, so in volcano uh we cannot just import the defaults, the scheduler framework package and use the and use this use the input to import the the default scheduler ability in volcano to support the affinity, the the toleration and all these all these policies.

C

All right thanks.

A

D

A

Else have any questions for william.

E

Hi, so I have a question so you mentioned federation and sort of sharing workload between multiple kubernetes clusters. Can you please elaborate on this slightly.

B

Yeah uh yeah the uh we we we have planned for the volcano federation and to balance resources among different clusters, and this feature will be. We will be ready in the this year and we also found that there are. There are already several multiple cluster projects like commander, like other other other project, other projects, but all these this project are aims to resolve aims to provide the dha for for the micro service, so our user, our community users, uh always have multiple class clusters in their environment. So a lot of users require this.

B

The volcano federation.

B

E

So perhaps you could also comment on a relationship between volcano and something like mother.

B

Amanda yeah.

B

Currently, I I just uh I I know little about the amanda amanda, some of the users, you, you use user amanda.

B

Maybe I I have, I haven't, have a deeper uh investigation and compare comparison so far, uh maybe a in a little time of uh I I will maybe after the investigation I will.

B

I will show you about the the difference.

E

Okay, so I have one more question, so do you uh so one of the worries with something like volcano to me to me at least, would be uh load on xcd.

E

It seems like you, have lots of jobs and you have a system that sort of fundamentally has a lot of interaction with etsy by creating these custom resource definitions and so on. I'd be a bit concerned about a tv falling over when you have high throughput. Is this something that you had issues with.

E

um So I was just worried wondering if you could comment on potential issues with high load on xcd from a system like okay,.

B

E

And no worries, if you don't uh haven't, I just curious.

A

Okay, although you want to you, want to have a question.

F

Yes, um so a common, a common um complaint, I've I've heard from customers about about volcano is about its interactions with with a cluster autoscaler. um Do you have any any plans to to? You know solve this issue or, like do you plan to also fork a cluster outer scaler?

F

What are your thoughts on of the scaling.

B

F

I I'm wondering what were your long-term plans for auto scaling cluster.

B

Skating yeah, auto scaling is, is, is a big problem for us. Our basic idea is we we we have. We have planned to because the odd scaling and the schedulers have the higher relationship to to achieve a better, a better, a better result, but but not right. Now they are separated.

B

So we have a plan and a a basic idea, but it is not. This is not material. The basic idea is we. We might create a crt to to to to to connect the the sk out scaling and the scheduler. We use the reuse, the crd, to communicate the the detailed information to make this scheduling more intelligent, but so far I have no detailed idea.

F

It would be good if that you, you can bring that idea once you have it, uh because I I I think it would be a very bad thing if, if you end up forking, let's say cluster autoscaler, um because then we you know it's harder for people to to migrate, to maintain two systems and so on and so forth.

F

So yeah. That will be my my request.

B

Yes, so far, we focus uh autoscaler and do some enhancements internally.

B

We ever submit some patch to to to upstream, uh but it's sometimes it it can't.

B

It has a a long on duration to emerge cpr. So just right now we just fork of scalar.

A

What do you have.

D

Yeah, uh so I in the presentation, I noticed that you mentioned um volcano, supports scheduling policies and one of them is no awareness, I'm just curious. How is that support enabled for batch workloads in say, kubernetes uh and one of the the reason I'm asking? That is because we've been working on uh enabling new awareness uh in kubernetes so like? Is there something we need to converge? Those efforts is there dependency? Is there uh reuse of some of the work that we have been doing or is there something we should be doing to reuse?

D

Some of the work that you've been doing in this area.

B

Yes, yes, this is a good question. uh The first new awareness scheduling uh for in the last year. We are one of our customer wants this feature, so we communicate with the is the the community. According to community a the developer said the is a long-term feature and cannot meet of the of of that line. So we start the support uh study to support pneumonia during in volcano.

B

So currently it's a good idea to whether we can to merge or do something to keep keep the coordinates. Pneumatic scheduling and the volcano aware scheduling to reuse to to have a good good user experience for for for our user.

D

um In terms of like, if you can point us to um like some of the code that has been done to implement that and volcano, we can take a look because I'd be curious to see like how the apis look like and if there's something that maybe we need to take into consideration when we are enabling no awareness in kubernetes natively.

B

Yeah sure sure we can we we we can, we can be committing and and and and talk talk in that meeting too for some detail about the the new aware.

D

Cool sounds good thanks. Video.

A

G

I I do have a question uh so for volcano: did you do any kind of benchmarking, uh which means that published number hey? uh This can support this many job burst or for this cluster size so that we can have an idea. What is the scalability limit and yeah.

B

um Benchmark we we we test for for some, we task for the scheduling policies, uh several zero, zero scheduling policies, just like the the minimal resource reservation for spark for the performance increase and for we test for each important uh scheduling, policies for the the performance improved and- and they also see the scheduling through throughput, but we don't have a a complete benchmark to test uh uh all the. uh Oh all these. What you said what you just said.

G

So the job burst and then all that another question is that you mentioned in one of the sliders that the mac cluster is 10k nodes. So is that 10k node, like 10 000 nodes are in one cluster or they are spread across multiple cluster and you federated them.

B

uh Sorry, I I didn't. I didn't catch you, okay,.

G

So in in one of the slides, I remember seeing that the the max node tested at 10 000 nodes and 1 million pods, so I was just wondering- are those 10 000 nodes are part of one kubernetes cluster or multiple kubernetes cluster.

B

uh uh uh It's it's it's! It's it's one kubernetes cluster and uh in in huawei huawei cluster.

B

Okay, yeah yeah. We.

A

B

Do some we do some, we do some enhancements some of the enhancements we want to. We want to submit the pr to kubernetes and it's. It is in our pipeline to to submit cpr, for example, the the badge api batch band api. It is very useful to improve the scheduling, throughput.

G

A

Okay, thank you very much all uh sorry for all uh for taking extra five minutes uh above the usual time. uh Thank you very much william for presenting the volcano and see you all in two weeks. Bye all.

A