Cloud Native Computing Foundation KubeCon + CloudNativeCon North America 2022, 12 Nov 2022

From YouTube: Efficient Scheduling Of High Performance Batch Computing For... Krzysztof Adamski & Tinco Boekestijn

Description

Don’t miss out! Join us at our upcoming event: KubeCon + CloudNativeCon Europe 2023 in Amsterdam, The Netherlands from April 17-21. Learn more at https://kubecon.io. The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.

Efficient Scheduling Of High Performance Batch Computing For Analytics Workloads With Volcano - Krzysztof Adamski & Tinco Boekestijn, ING

Three years ago the ING Wholesale Banking Advanced Analytics team set an ambitious goal: to gather in one place a curated portfolio of internal data sources together with a large-scale compute platform. At its core is the idea of giving internal projects access to a rich toolset of open source and industry-standard frameworks and preprocessed data, so they can validate business ideas in a secure exploration environment. Extensive growth, with over 300 internal projects so far and more than 2000 internal users, proves that advanced analytics capabilities (ML, AI, NLP) should become easily consumable not only by specialized, dedicated teams but also by subject matter experts. In this session we would like to shed more light on how a specialized cloud native Kubernetes scheduler (Volcano) enables us to deliver multi-tenant, large-scale processing capabilities. Optimal resource usage and the stability of core services are key for our cloud native platform. To enable dynamic allocation and HDRF (hierarchical dominant resource fairness) we have created an extension to the Apache Spark binaries. This allows users to use Volcano with Spark interactive mode in a Jupyter notebook. Additionally, we have created interfaces to visualize all the scheduling metrics, like the YARN UI.

A

Hello, welcome to our talk. My name is Chris and, together with Tinco, we're going to take you on a short journey: how we adopted cloud native tools to build our data analytics platform, and how the recent addition of advanced schedulers to that stack helps us achieve a bigger scale and operate in a multi-tenant environment.

A

So that's going to be our story, but to begin with we're going to introduce you to our company. ING is a financial institution that operates globally from a European base, with over 52,000 employees working for us. Our mission is to empower people to stay ahead in life and in business, and the roots of the company go back to the deep economic crisis of the 1920s.

A

Our mission right now is also to help transition our world into a new era, with the shift to a low-carbon future, and to speed up innovation in finance. Having been in the banking industry for over 10 years, I find it a fascinating area with a lot of regulation, so adopting a new technology is never easy. One of the most important things in a bank is trust: we want to handle data only in the ways that are required, and to respect your rights in that regard.

A

That is the mission for the company, but the mission for our platform is to become data driven. What we want to emphasize here is supporting our employees with a self-service platform. We are in a platform economy, and the mission of what we built is to empower you to build on top of that platform and to solve business needs. As for the growth factor: the platform's roots are in 2013, when we started adopting open source technology to bootstrap the product initiatives. In 2018 we started to look into a new version of the platform, when we finally saw that the transition towards cloud native tools would help us build the platform on a new foundation at the infrastructure layer. The numbers may not be impressive, but we have had stable growth since then. The adoption rate is growing and over 400 projects exist on our data analytics platform, so the self-service paradigm really helped with adoption.

A

What we built is essentially a product. The product-driven mindset is embedded in how we work, and we build on modern open source technology. What you see here is an entry point to our platform that gives users significant compute power with security and compliance configuration embedded inside, supporting not only the global but also the local needs of our users.

A

The three foundational pillars of the self-service platform are scalability, seamlessness and security. We want to emphasize the engineering capabilities of our platform: engineers can start their data analytics journey using predefined pipelines, then build, share, test and deploy, and create new insights for the business. Data is stored in a secure way. One of the most important aspects of the platform is that sharing data resources with the products is based on predefined roles.

A

The toolset is also kept up to date, so the latest tools are available to the users. We're looking for the ultimate answer, and in my perfect world the ultimate answer would be a search box where users type whatever they want to do next, everything is taken care of for them, and we guide them through the next steps of that user journey. The architecture is a little bit complex.

A

Obviously you have pieces of the puzzle that you need to combine to make sure that your platform delivers on its promise. We think of this as a set of layers that you stack up to deliver the end product, tailored to the users' needs.

A

On that journey we prepared a couple of interfaces that our users can use on the platform. What we saw in the previous picture, the landing page, is the starting point where users can choose a product from the platform's portfolio. We have our opinionated data science toolbox, which obviously comes with an enterprise Jupyter notebook setup.

A

We recently switched from our custom-made build environment to buildpacks, which help us build this environment. Data discovery is supported as well: the cooperation with the Lyft folks helped us enable the data discovery and metadata engine. On top of that platform, we have Superset for our BI and analytics needs.

A

The recent change for us, and something that made all this possible, is the transition to cloud native tools. We switched from the Hadoop ecosystem to object storage and Kubernetes, with the addition of a caching layer. That caching layer, which some talks here yesterday also covered, helps us achieve proper performance on top of the object storage implementation.

A

The key challenges that we have seen so far with the cloud native tools are, as you can see, job management, scheduling and multi-framework support. Hadoop was still there for us. We were leveraging cloud native technologies like Kubernetes for stateless applications, and it was even possible to set up stateful applications, but for data analytics workloads we were still missing a mature scheduler like YARN to let those jobs run concurrently in a multi-tenant environment.

A

What YARN was missing, on the other hand, was support for the new frameworks that are emerging, like TensorFlow and PyTorch. So that's something we're also looking at: what's going to be the next big thing for our platform. With that, a short transition towards Volcano, and I'll hand over to Tinco to give you a little more detail about our implementation.

B

Okay, so hello, everyone. On the trail to our solution: Chris just told us a little bit about it, and I hope you can keep this image in mind. For us, on-premise means that we have a fixed cluster available, and on our previous cluster the scheduling was split between Hadoop and Kubernetes.

B

Essentially, all the big Spark jobs that we ran were happening on our Hadoop YARN cluster, for algorithms and all these different kinds of tasks, and the rest was fully Kubernetes. But like the rest of the world, we actually aim to have only Kubernetes, since this makes the whole paradigm much simpler.

B

So if you look at it from a resource consumption perspective, you can see that we have a certain split in workloads between Kubernetes and Hadoop. During normal office hours, you might see that the Kubernetes part is used somewhat but not fully, while the Hadoop part is busy, since Spark tries to use as many resources as possible and we have many jobs running. During the night, when we run all the big batch jobs that happen in our bank, the batch part is fully utilized but the Kubernetes part is barely using anything.

B

At peak capacity, though, you can see that all the different parts are being used.

B

You might see that this is not an optimal way of allocating resources. What is also important to us is the distinction between batch and interactive use: interactive users are the people using the platform right now, while batch jobs are the big processes that have to finish on time, so to say, for business-critical applications.

B

If we instead have a full Kubernetes cluster and run all the Spark jobs on top of it, then the resource load might look like this: the pods and core services are running, a little spare capacity is kept for the rest of the deployments, and essentially all the space that the pods and core services aren't using is available for Spark.

B

So this means that, for example, during the night our batch processes have many more resources to use.

B

So essentially, this is where Volcano comes in.

B

There are many different approaches, but essentially we need a job scheduler for Kubernetes, currently specifically for Spark, but in the future for different kinds of technologies. What Volcano offers is job queues with weighted priorities. Queues are essentially the way you divide the cluster up: if there are four users running on the cluster, then everybody gets one fourth of the cluster, so their jobs can complete as fast as possible. It also has the ability to overcommit above queue limits.

B

So, for example, if two users aren't using anything, then the other users may use their resources. It's a system in which you try to claim as many resources as possible. It also has the ability to preempt pods when more pods come in. Lastly, it has configurable strategies to deal with competing workloads from a task scheduling perspective. With Spark, for example, you can preempt machines, you can kill machines, and the job will still continue computing; for TensorFlow this might not be the case. All these features already exist, but only in YARN and not in the Kubernetes ecosystem.
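
A minimal Python sketch of the queue behaviour described here: each queue is offered a part of the cluster in proportion to its weight, and whatever an idle queue leaves unused is offered to the queues that still want more. The function name, queue names and numbers are illustrative assumptions. This is not Volcano's actual implementation, and preemption is not modelled.

```python
def deserved_shares(capacity, queues):
    """Split `capacity` over queues by weight, handing unused share to busy queues.

    queues maps a queue name to a (weight, demand) pair.
    Returns a dict mapping each queue name to its allocation.
    """
    alloc = {name: 0.0 for name in queues}
    # Only queues that still want resources take part in a round.
    active = {name for name, (_, demand) in queues.items() if demand > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_weight = sum(queues[name][0] for name in active)
        satisfied = set()
        handed_out = 0.0
        for name in active:
            weight, demand = queues[name]
            offer = remaining * weight / total_weight
            take = min(offer, demand - alloc[name])
            alloc[name] += take
            handed_out += take
            if demand - alloc[name] <= 1e-9:
                satisfied.add(name)
        remaining -= handed_out
        if not satisfied:
            break  # everyone took their full offer, nothing left to redistribute
        active -= satisfied
    return alloc


# Four equally weighted users on a hypothetical 40-CPU cluster.
all_busy = deserved_shares(40, {u: (1, 40) for u in ("u1", "u2", "u3", "u4")})
# Two of the four users are idle, so the other two may use their share.
two_idle = deserved_shares(40, {"u1": (1, 40), "u2": (1, 40), "u3": (1, 0), "u4": (1, 0)})
print(all_busy)
print(two_idle)
```

With all four users busy each one ends up with a quarter of the cluster; with two of them idle the remaining two split everything between them.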

B

Since the release of 3.3.0, Spark officially has support for Volcano. This was contributed by the Volcano community itself. There is also another batch scheduler, called YuniKorn, which is now supported in the latest release, 3.3.1.

B

So I will tell you more about Volcano along the way, but Volcano is essentially a generic task scheduler.

B

As we see it, you have service scheduling and you have task scheduling. Tasks are all the different kinds of work that have a certain predefined moment at which they stop: TensorFlow, PyTorch, Spark and Kubeflow, and you can think of any other application on top of that. The configuration is, I wouldn't say simple or easy, but it tries to be as simple as possible.

B

So you have a Job object, which is a small abstraction on top of pods, in which you can select the number of replicas that you want to schedule. You can also add policies: for example, if a certain pod has completed, then I want the whole job to be marked as completed; or, if a certain job pod doesn't work, then it should restart up to five times. It has a pluggable architecture in which you can select the different plugins you want to use, and there are different kinds of actions that can happen within Volcano itself. Then you also have the Queue part, which is specifically needed for our Spark setup and which I will demonstrate later.

B

So it's just a relatively simple overlay on top of the Kubernetes API.
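
To make the Job object a bit more concrete, here is a hedged Python sketch that builds such a manifest as a plain dictionary and prints it as JSON. The field names follow Volcano's `batch.volcano.sh/v1alpha1` Job API as I understand it and should be checked against the Volcano documentation; the job name, queue, image and command are purely illustrative and do not come from the talk.

```python
import json


def volcano_job(name, queue, replicas, image, command):
    """Build a Volcano Job manifest as a dict (illustrative values only)."""
    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "schedulerName": "volcano",
            "queue": queue,
            # Gang scheduling: do not start until this many pods can be placed.
            "minAvailable": replicas,
            # Give up after the job has been restarted this many times.
            "maxRetry": 5,
            "tasks": [
                {
                    "name": "worker",
                    "replicas": replicas,
                    # When this task completes, mark the whole job as completed.
                    "policies": [{"event": "TaskCompleted", "action": "CompleteJob"}],
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [
                                {"name": "worker", "image": image, "command": command}
                            ],
                        }
                    },
                }
            ],
        },
    }


# Hypothetical example values; kubectl accepts JSON manifests as well as YAML.
job = volcano_job("demo-job", "default", 4, "busybox:1.36", ["sh", "-c", "echo done"])
print(json.dumps(job, indent=2))
```

The replica count and the two policies from the talk (complete the job when a task completes, retry up to five times) each map to a single field, which is the sense in which the Job is a thin layer over ordinary pod templates.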

B

If you look at Volcano itself, it actually consists of three different services. You have the admission service, which checks whether everything is correct, then the controller manager, and then the scheduler. The scheduler is the main one, making the decisions about where to allocate tasks, etc.

B

So let's get into the balancing situation. The main point I want to put forward is that there are quite a few differences in strategy between scheduling pods from a service perspective and from a job perspective. For example, if you have services running in Kubernetes, you want them to be spread out as much as possible, so that if a certain node fails, the rest of the services keep running and you have a lot of redundancy.

B

But this is not what you want for high-performance jobs. There you essentially want to put the pods of a job as close together as possible, so that you have less network traffic and the job can complete faster. You might also want the applications of different users to be spread out as much as possible, so that you don't get competing workloads on one node.

B

So what you can see here is that, while scheduling in Kubernetes is mostly about spreading all the pods, Volcano has many different plugins attached to it. You can have DRF, which is the dominant resource fairness algorithm, which I will get into. You can have gang scheduling, where you say: I only want to deploy the four pods when the resources are available for all four pods, so that no pod is left hanging.

B

I want to add priorities, I want to add resource quotas, I want to add SLAs, so that, for example, a certain pod doesn't wait too long when there isn't space available. So there are many different kinds of algorithms you could think of when you approach this from a task scheduling perspective, and for us this means that we can further optimize the traffic of Spark in the Kubernetes cluster.
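
The gang scheduling idea mentioned above (place all pods of a job or none of them) can be sketched in a few lines of Python. This is a simplified first-fit illustration on CPU only, with made-up node names and sizes; it is not how the Volcano gang plugin is implemented.

```python
def gang_schedule(pod_cpu_requests, free_cpu_by_node):
    """Return one node per pod if every pod fits, otherwise None.

    The caller's view of free capacity is never modified, so a failed
    attempt leaves no half-placed job behind.
    """
    free = dict(free_cpu_by_node)  # work on a copy
    placement = []
    for request in pod_cpu_requests:
        # First-fit: take the first node that still has room for this pod.
        node = next((n for n, cpu in free.items() if cpu >= request), None)
        if node is None:
            return None  # one pod does not fit, so nothing is placed
        free[node] -= request
        placement.append(node)
    return placement


# Four pods of 2 CPUs each fit on two hypothetical 4-CPU nodes.
fits = gang_schedule([2, 2, 2, 2], {"node-a": 4, "node-b": 4})
# With less room on node-b the fourth pod cannot be placed, so no pod is.
rejected = gang_schedule([2, 2, 2, 2], {"node-a": 4, "node-b": 2})
print(fits, rejected)
```

Without the all-or-nothing rule the second case would leave three pods running and holding resources while the fourth waits, which is the "hanging pod" situation described in the talk.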

B

This is very important to us because the old YARN cluster was highly dedicated to Spark jobs, so it had been optimized through many years of experience. In Kubernetes, we essentially still have to do that ourselves.

B

The main feature that was important for us, and the reason we essentially selected Volcano, is that we needed dominant resource fairness, which it has enabled.

B

For example, let's assume you have two users and a cluster with 18 CPUs on one side and, let's say, 24 times 3, so 72 gigabytes of memory on the other. You would like both users to be able to use as many CPUs as possible, so that neither of their jobs is unable to execute.

B

So you want a situation where one user can have nine CPUs and the other user also gets nine CPUs. But if one user is using less, that might mean that one user uses 12 CPUs and the other uses six.

B

So this is the part where you have a weighted claim, and this is done with Volcano queues. You can overcommit onto resources that another user is not using. In dominant resource fairness, the calculations are done on the most dominant resource being used. In this case that is CPU, because memory is used far less and there is even some space available.

B

That means that in Volcano resources are preempted once, for example, user one wants to use more resources: pods from user two will be deleted to make space for user one.
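
The 18-CPU, 72 GB example can be worked through with a small Python sketch of dominant resource fairness using progressive filling: the next task always goes to the user whose dominant share is currently lowest. The per-task sizes (1 CPU and 2 GB) and the task caps are assumptions chosen to reproduce the numbers in the talk; preemption of running pods is not modelled, and this is not Volcano's own code.

```python
def drf_allocate(capacity, demands, max_tasks):
    """Progressive-filling DRF.

    capacity:  resource name -> total amount in the cluster
    demands:   user -> resource name -> amount needed per task
    max_tasks: user -> most tasks that user wants to run
    Returns user -> number of tasks granted.
    """
    used = {resource: 0 for resource in capacity}
    tasks = {user: 0 for user in demands}

    def dominant_share(user):
        # Largest fraction of any single resource this user currently holds.
        return max(tasks[user] * demands[user][r] / capacity[r] for r in capacity)

    def fits(user):
        return tasks[user] < max_tasks[user] and all(
            used[r] + demands[user][r] <= capacity[r] for r in capacity
        )

    while True:
        candidates = [user for user in demands if fits(user)]
        if not candidates:
            return tasks
        user = min(candidates, key=dominant_share)
        tasks[user] += 1
        for r in capacity:
            used[r] += demands[user][r]


cluster = {"cpu": 18, "memory_gb": 72}
per_task = {"user1": {"cpu": 1, "memory_gb": 2}, "user2": {"cpu": 1, "memory_gb": 2}}

# Both users want as much as they can get: CPU is dominant and is split 9 / 9.
both_greedy = drf_allocate(cluster, per_task, {"user1": 100, "user2": 100})
# user2 only needs 6 tasks, so user1 can use the CPUs that user2 leaves idle.
one_small = drf_allocate(cluster, per_task, {"user1": 100, "user2": 6})
print(both_greedy, one_small)
```

In both cases only 36 of the 72 GB of memory end up in use, which is why CPU, not memory, is the resource the fairness calculation is driven by here.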

B

If we go to the stacked-up view, there is also the matter of resource starvation. For example, if you have two nodes available, then you want to have long-running services on them. For us that is Presto/Trino, or a cache like Alluxio, which we run on every server. So we have different kinds of compute options, and we also have a caching layer so we can access data as fast as possible.

B

That means that the rest of the space is essentially available for tasks. But if we use everything for tasks, that might mean that we cannot do any deployments anymore or make any changes. So, for example, we added some spare pods, so that we always have some spare capacity in the cluster and can do deployments without any issues.

B

Next, the Spark side. We have actually made some custom changes on top of Spark and in Volcano itself, but for us it looks like this: in my Spark config I say that I want to use Volcano.

B

Then you can define: I want to use root, user one, that is my queue name, and I want to use this pod group name. The pod group is the abstraction over all these pods. In our setup Spark itself creates these queues and pod groups automatically, and the pods that start up are automatically assigned to the Volcano pod group that you have declared.

B

We have owner references and a driver heartbeat for garbage collection. That means that if somebody's Spark session, the Spark driver, stops, then the executors go down by themselves. This is how it's done in Spark. Then we have max pending pods, an option to limit the amount of allocation.

B

Spark tries to ask the Volcano scheduler for as many resources as possible, and if you limit this, it will ask for less. But the main part is that we use dynamic allocation. For the user, the resource requirements are hidden away as much as possible: they just get a Spark pod in which they can run their query.

B

If they need more resources, Spark automatically asks Volcano: can I have more pods? Can I have more pods? Can I have more pods, to execute the job as fast as possible?
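
As a rough illustration of the configuration described above, the following Python sketch collects the relevant settings and renders them as `spark-submit` arguments. The property names are the ones I know from the upstream Spark 3.3 "Running on Kubernetes" and dynamic allocation documentation; the custom ING extensions from the talk (automatic queue and pod group creation) are not public, so they are not shown. The template path and the two limits are hypothetical values.

```python
def volcano_spark_conf(pod_group_template, max_executors, max_pending_pods):
    """Spark properties for running on Kubernetes with Volcano as the scheduler."""
    step = "org.apache.spark.deploy.k8s.features.VolcanoFeatureStep"
    return {
        # Hand driver and executor pods to the Volcano scheduler.
        "spark.kubernetes.scheduler.name": "volcano",
        "spark.kubernetes.driver.pod.featureSteps": step,
        "spark.kubernetes.executor.pod.featureSteps": step,
        # PodGroup template file; upstream, the target queue is set in this file.
        "spark.kubernetes.scheduler.volcano.podGroupTemplateFile": pod_group_template,
        # Let Spark grow and shrink the number of executors with demand.
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Cap how many executor pods may be waiting for placement at once.
        "spark.kubernetes.allocation.maxPendingPods": str(max_pending_pods),
    }


def to_submit_args(conf):
    """Render a property dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical values for illustration.
conf = volcano_spark_conf("/path/to/podgroup-template.yaml", 34, 5)
print(" ".join(to_submit_args(conf)))
```

Dynamic allocation is what produces the repeated "can I have more pods?" requests, while the pending-pod cap limits how many of those requests can be outstanding at the same time.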

B

Then it comes to a small demo which I wanted to show. Here comes the part where I have to say that we work in a highly regulated bank, so I cannot show you our Kubernetes commands, etc., because there might be some sensitive information lying around. But I can show you the Grafana dashboards, which I have updated a little bit, while we run Spark, and I think this will immediately make clear how it works.

B

For example, when I add one user, Spark automatically asks for as many executors as possible and tries to fill up the cluster as much as it can. That will be around 34 executors.

B

If suddenly a second user is added, the first user is automatically scaled down: his pods get killed off to make room for the second user to come in. Then it balances itself again, and both users have 17 executors at their disposal while their jobs are still running in the background. If you add a third user, it automatically scales itself back so that everybody has around, I think, nine executors, sorry, 12 executors, and there you can see the amount of CPU and memory being used.

B

In this situation you also see in the right-hand part that we tend to use as many of the resources as possible. And, let's go back a little bit, if we remove user three, then we can essentially divide the cluster between two people again, so they get more resources back if needed.

A

In this situation, you have a static allocation of the cluster, but you manipulate how many resources a particular job or user gets. It helps you keep the cost at bay while giving as much power as possible to specific jobs. If the cluster is empty, the user gets the full capacity of the cluster; if new users arrive, they share the resources. This was something really missing in Kubernetes for bringing data analytics computation to it.

B

Exactly. For example, if we didn't have Volcano, then for every user we would have to limit the namespace and the resource requirements, and because we would have to declare that statically, each user could only use far fewer resources than are available in the cluster. So from a resource consumption standpoint this is very much needed. And if we left it unbounded, they would essentially fill up the cluster and we couldn't do any deployments anymore.

B

Back to the demo: if all the Spark processes are killed, you will see in the figures that everything goes down. This is all being run using Spark interactive mode, so that's quite nice. Users just have these commands and can do anything on top of that, and we provide this dashboard to them.

B

But essentially most of this is hidden away. They might complain about performance if many users are in the system and they get less space, so to say, but they will always get the option to run their Spark job.

B

Then a little bit on workload monitoring. I also want to show you something that is still not done: a DRF dashboard in which we want to show things in a more fine-grained way for more hierarchical situations. We have one root queue, an interactive queue with the interactive users, and a project queue with all the big projects that we run. There, for example, we can give the project queue more priority over the rest, so that our business-critical Spark applications always get more resources at their disposal.
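
The hierarchy described here (a root queue with an interactive queue and a project queue underneath) can be illustrated with a short Python sketch that turns per-level weights into a guaranteed fraction of the cluster for each leaf queue. The weights and user names are invented for the example, and real hierarchical DRF also takes actual usage and dominant shares into account, which this static calculation leaves out.

```python
def leaf_shares(tree, share=1.0, prefix="root"):
    """Compute the cluster fraction of every leaf queue.

    tree maps a child queue name to (weight, subtree), where subtree is
    another such dict for an inner queue or None for a leaf queue.
    """
    total_weight = sum(weight for weight, _ in tree.values())
    shares = {}
    for name, (weight, subtree) in tree.items():
        path = f"{prefix}/{name}"
        child_share = share * weight / total_weight
        if subtree:
            shares.update(leaf_shares(subtree, child_share, path))
        else:
            shares[path] = child_share
    return shares


# Hypothetical weights: projects get three times the weight of interactive use,
# and the interactive part is split evenly between two users.
queues = {
    "project": (3, None),
    "interactive": (1, {"user1": (1, None), "user2": (1, None)}),
}
shares = leaf_shares(queues)
print(shares)
```

With these assumed weights the project queue is entitled to three quarters of the cluster and each interactive user to one eighth, while the overcommit behaviour discussed earlier would still let a queue borrow whatever the others leave unused.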

A

If you have batch jobs, ETL jobs, they need a higher priority, because they need to finish to bring new data to the cluster. The rest is then divided among the interactive queues for interactive user sessions. Okay.

B

And then we also essentially want to avoid traffic jams. We were thinking about adding a cluster rush hour view in which we can give users a little more context about when the cluster is more heavily used. That's more from a self-service UX perspective, because we essentially want to keep this hidden away as much as possible.

A

As you see, we are eliminating the toil for data engineers and data analysts. They start a session, they do not care about the configuration of the cluster needed to run their job, and they get the best performance possible. What we also want to make visible to them is the best time during the day to run a job, because the cluster is less busy.

B

Exactly. Since I'm not a Volcano maintainer, I just want to give all the love to the Volcano folks who essentially made this terrific scheduler. That's a really great job. I personally also think that this is the best way to go, because they have a nice abstraction for making task scheduling more formal in Kubernetes and getting more performance out of it. I'm not a very great open source worker.

B

So these things are working, with a small number of changes on top, but I haven't open sourced them myself. So hopefully, if somebody from Volcano is here, we can talk.

B

We have added the DRF dashboard, Spark-based automatic queue management, more Prometheus metrics, updates to the Grafana dashboards, and kube-state-metrics. Volcano leverages some cluster-wide permissions, and I have reduced those a bit. But essentially what they have done is all working, and I think this is definitely the way to go. It would be cool in the future to also support TensorFlow, PyTorch and other kinds of distributed methods, so we can add all different kinds of tasks on top of Kubernetes.

B

With that, I want to conclude my presentation.