Cloud Native Computing Foundation Kubernetes Batch + HPC Day EU 2022, 19 May 2022

From YouTube: Efficient Deep Learning Training with Ludwig AutoML, Ray, and N... Anne Marie Holler & Travis Addair

Description

Efficient Deep Learning Training with Ludwig AutoML, Ray, and Nodeless Kubernetes - Anne Marie Holler, Elotl & Travis Addair, Predibase

Deep Learning (DL) has been successfully applied to many fields, including computer vision, natural language, business, and science. The open-source platforms Ray and Ludwig make DL accessible to diverse users by reducing complexity barriers to training, scaling, deploying, and serving DL models. However, DL’s cost and operational overhead present significant challenges. DL model dev/test/tuning requires intermittent use of substantial GPU resources, which cloud vendors are well-positioned to provide, though at non-trivial prices. Given the expense, managing GPU resources is critical to the practical use of DL. This talk describes running Ray and Ludwig on cloud Kubernetes clusters, using Nodeless K8s to add right-sized GPU resources when they are needed and to remove them when not. Experiments comparing cost and operational overhead of using Nodeless K8s vs directly on EC2 show sizable improvements in efficiency and usability.

A

Good afternoon, and thanks for hanging in there; it's getting pretty close to the end of the day. I'm Anne Holler, and I'm happy to be here with Travis, on an MP4 that he recorded earlier, to present Efficient Deep Learning Training with Ludwig AutoML, Ray, and Nodeless Kubernetes.

A

I want to start off with a shout-out to several recent articles that contributed material to this presentation, and to the people from the Ludwig, Ray, and Elotl communities who contributed to this material. The first is a recent CNCF blog from February on managing public cloud resources for deep learning training. The second is a Medium blog from that same month on Ludwig AutoML for deep learning.

A

This was focused on tabular data sets. And third, our presentation from Cloud Native Rejekts this past fall, where we created a POC for running Ray on public cloud Kubernetes. So without further ado, let's get on with it. Deep learning has been applied to many fields, but it's well known to be complex to get it from planning to development to production. Ray and Ludwig, both open-source systems, greatly reduce the complexity barriers to training, scaling, deploying, and serving deep learning.

A

However, even when complexity barriers are reduced, the cost and operational overhead of deep learning present significant challenges. Deep learning intermittently needs substantial GPU resources.

A

Public cloud vendors are perfectly happy to provide those, but at non-trivial prices, so managing GPU resources and operational overhead is critical to the practical use of deep learning. Elotl's Nodeless Kubernetes, nicknamed Luna, commoditizes compute for Kubernetes clusters: it provisions just-in-time, right-sized, cost-effective compute for Kubernetes applications when they start, and removes those resources from the Kubernetes cluster when they end. Its purpose is to manage public cloud resources judiciously. So, bringing all this together:

A

This talk is on running Ray and Ludwig on cloud Kubernetes clusters using Luna as a smart cluster provisioner. We'll look at experiments using Ludwig AutoML deep learning training as the experimental workload, which show sizable improvements in efficiency and usability versus the way I was running Ludwig AutoML deep learning training prior to setting it up this way.

A

Compared to my prior way, elapsed time was decreased by 61 percent, computing cost by 54 percent, and idle Ray cluster cost by 66 percent. It also lowered my operational complexity and retained the performance results of the AutoML. So let me now turn it over to Travis.

B

Hi everyone, thanks for coming to our talk today. My name is Travis Addair. I'm the CTO of a company called Predibase, building an enterprise low-code machine learning platform on top of Ludwig. Today I'd like to tell you a little bit about the background behind the Ludwig project and how AutoML fits into the vision of what we're doing with the open-source Ludwig project.

B

To start, I want to present the background on why we believe that Ludwig is a valuable addition to the ML ecosystem. Our observation is that if you look at the way ML is done in industry today, there are essentially two incomplete options available to companies and organizations that want to operationalize ML.

B

On the one hand, you have low-level APIs like TensorFlow and PyTorch that provide a great deal of flexibility. On the other hand, you have traditional AutoML systems that provide a lot of simplicity. But neither of them ends up being ideal, because oftentimes the low-level APIs are difficult for non-expert users to get to production, while the AutoML systems end up being black boxes that you graduate out of, because they don't always solve the problem the first time around.

B

So when we look at what we're doing with Ludwig, the core insight is that we believe there is a third option that needs to be explored, which is what we call declarative machine learning systems.

B

With declarative, what we intend to do is provide a higher level of abstraction that offers the automation and ease of use of AutoML while still giving you the flexibility of lower-level tools like PyTorch, opening the door for non-experts to harness the power of ML without needing to resort to these more granular tools. The way that Ludwig works to make this declarative vision possible is similar to systems that provide infrastructure as code, which I'm sure people in the Kubernetes community are very familiar with.

B

We provide YAML configurations that declaratively define the models that you might wish to train. For example, it's very easy to get started in Ludwig. You just say: here's a YAML config saying what my input features and their types are, and what my output features and their types are, and then everything else, the "how", gets filled in automatically on your behalf. But at the same time we provide a lot of expert-level control as well.

B

So if you want to use a specific type of model architecture to encode a particular feature, or a particular learning rate, regularization, or dropout, all those options are available to you, as well as more advanced features like hyperparameter search on any of the different parameters within the config. What makes this all possible is the Ludwig architecture.
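As a rough illustration of the declarative style described here, the sketch below writes a minimal Ludwig-style config as a Python dictionary rather than YAML. The `input_features` / `output_features` layout with `name` and `type` entries follows Ludwig's config structure; the `utterance` column, the `trainer` override, and the `feature_names` helper are assumptions made for this example and should be checked against the Ludwig release in use.

```python
import json

# Minimal declarative config: only the "what" is stated.
# "utterance" is a hypothetical text column; "intent" is the target column
# mentioned as an example in the talk.
minimal_config = {
    "input_features": [{"name": "utterance", "type": "text"}],
    "output_features": [{"name": "intent", "type": "category"}],
}

# Same config with one expert-level override layered on top.
# The "trainer" section name is an assumption and may differ between releases.
expert_config = {
    **minimal_config,
    "trainer": {"learning_rate": 0.001},
}


def feature_names(config, section):
    """Return the column names declared in one section of the config."""
    return [feature["name"] for feature in config.get(section, [])]


if __name__ == "__main__":
    print(json.dumps(expert_config, indent=2))
    print(feature_names(expert_config, "input_features"))
```

The point of the two dictionaries is that the expert version only adds keys; nothing in the minimal version has to change when more control is wanted.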

B

Every input feature and output feature in your data set passes through an architecture we call ECD, for encoder-combiner-decoder. Every feature is preprocessed according to preprocessing rules that you can configure in the YAML config, and then encoded into a vector by an encoder, which can be a machine learning model, pretrained or otherwise learned. Then all the different features are combined into an embedding space.

B

Individual output features then pass through a very similar decoding step, where we get the final prediction. The benefit of this architecture is that it provides a great deal of task flexibility without a lot of additional complexity. If you want to do a regression problem, you can have any types of inputs and then just specify a numerical output. If you want to do speech verification, you can have two different audio inputs and a binary output that tells you whether or not the audio streams are, for example, from the same speaker.

B

Any number of other problems, including text, image, forecasting, or tabular data problems, are all possible with Ludwig. Another core component of Ludwig is scalability.

B

Because we integrate heavily with Kubernetes, we also integrate heavily with other distributed systems that build on top of Kubernetes, like Ray. All of the preprocessing can be distributed across a cluster of pods using Dask on Ray, and our training system uses a framework called Horovod that allows you to distribute training across multiple nodes and multiple GPUs. Model artifacts can then all be uploaded to a registry like MLflow, which we provide integration with out of the box as well.

B

For hyperparameter search, it's very similar and very modular again. We use Ray Tune, which sits at a level on top of the training process and can perturb different parts of the config. Every one of those config variants then becomes its own trial that goes through the same preprocessing, training, and evaluation steps as any other training process in Ludwig.

B

At the end of the day, you can get all the different model trials that were explored and choose the one you would like to use in production. When we started to look at building an AutoML layer on top of this, our goal was for it to be ultimately a glass box and not a black box. One thing that is very nice about the AutoML system in Ludwig is that you can see it as a co-pilot that's helping you generate an ideal Ludwig config for your data set.

B

You can start by saying something as simple as: create a configuration from my data set, which can be a data frame or a Parquet file or whatever, and I want to predict this particular column, in this case "intent". It then gives you a config that you can do whatever you want with, modifying anything to your heart's content.

B

How this works under the hood is that you just provide those two parameters plus an optional time budget, and then Ludwig AutoML will do some inference to determine the input and output feature types and choose the appropriate model architecture based on your task.

B

It then selects the parameters and hyperparameter ranges it wants to explore, given the time and resource constraints, and launches the hyperparameter search trials on Ray Tune using your GPU workers. The outputs will be the best tuned model, along with the other models that were explored, and you can then take those results and deploy them into production as well. And now I'd like to hand it back to Anne to talk a little bit more about Elotl.

B

The important thing that I want to emphasize here is that there's more to this story than just the AutoML side. When you're running this in production, or in a large distributed setting, there's also the question of how to do this process efficiently, to optimize the usage of commodity resources like GPUs so that you're using them judiciously and not wasting resources.

B

So this is where Elotl fits into the picture, particularly for running Kubernetes workloads. Now I'd like to hand it back to Anne to tell you more about Elotl and the work that she's done on combining AutoML and these other systems together.

A

Sorry about that. Luna is a smart cluster provisioner that runs in standard Kubernetes clusters. It monitors for pending pod creation requests and adds compute to the Kubernetes cluster to satisfy those requests. That compute can be in the form of VMs, on-demand or spot, or in the form of serverless compute like AWS Fargate. It chooses the compute based on the current availability of that kind of compute, its cost, and other user-specified requirements.

A

You may want a GPU, you may prefer not to have a certain kind of GPU, things like this. On an ongoing basis, Luna monitors the node usage in the cluster and removes compute from the Kubernetes cluster when it's no longer needed. So Luna is comparable to the Kubernetes Cluster Autoscaler, but it provides more flexible node selection without the need to create and maintain what can be hundreds of node groups to represent all the instance types.

A

It's similar to AWS Karpenter, but it works across cloud vendors, provides instance family exclusion, and supports a deterministic application of rules. So, with all that background, let's go into what we did to look at the different ways we could run Ludwig AutoML, how much they would cost, and how easy they would be. First, just some background.

A

The Ludwig AutoML heuristics for tabular data sets were developed after analyzing thousands of hours of model training across 12 tabular data sets. After we had those heuristics, we ran them on that training set of 12 data sets to make sure they could produce good models in a short period of time, like an hour rather than thousands of hours. And then we said: okay, that's just running on the training set.

A

Now let's take an additional nine validation data sets, tabular ones that we've never seen before, run Ludwig AutoML on them, and compare the resulting models to highly tuned, publicly reported models. So I did that on those nine data sets.

A

The workload we'll look at here is three of those data sets, running with one-hour, two-hour, and four-hour Ray Tune time budgets, and we'll look at the way I originally ran them and the way I would run them on Kubernetes with Luna available. For the remainder of this talk we'll look at the baseline configuration and the two other configurations. We'll first look at a high-level description of the three and then deep-dive into each one.

A

The top one is how I ran things originally and did the original validation of AutoML for the nine data sets. I had a three-node Ray cluster that I deployed directly on AWS VMs. All three of the nodes were GPU-enabled, meaning the Ray head itself could run part of the workload during the autotune process, as well as the two workers.

A

I ran them on NVIDIA T4 GPU VMs, as a kind of commodity GPU system that did a good job for these workloads.

A

The first alternative to that baseline configuration is, instead of deploying Ray directly onto VMs, to deploy Ray into a Kubernetes cluster that has Luna installed in it, and to enable the Ray autoscaler: deploy Ray with the head being GPU-enabled and allow Ray to scale up to eight workers. You'll see eight here instead of just two; we'll talk about that in a minute.

A

What happens here is that when the Ray autoscaler realizes that more workers are needed, it asks for the workers not in terms of an instance type, but in terms of the amount of resources that are needed. When Luna sees those resource requests, those pending pod requests that aren't satisfied, it goes out, pulls an available instance type, and puts it into the Kubernetes cluster on demand.

A

Later, when the Ray autoscaler doesn't need a worker, it gets rid of the worker, and Luna, the Nodeless Kubernetes system, sees that that node is no longer needed and pulls it out of the Kubernetes cluster.
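The selection step described here, turning a resource request into a concrete instance, can be sketched as a small filter-and-minimize function. This is only an illustration of the idea, not Luna's actual algorithm: the catalog, the instance names, the prices, and the `choose_instance` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceType:
    name: str
    family: str
    cpus: int
    memory_gib: int
    gpus: int
    price_per_hour: float
    available: bool


@dataclass(frozen=True)
class PodRequest:
    cpus: int
    memory_gib: int
    gpus: int


def choose_instance(request, catalog, excluded_families=frozenset()) -> Optional[InstanceType]:
    """Pick the cheapest available instance that fits the pending pod.

    Ties on price are broken by name so the choice is deterministic.
    Returns None when nothing in the catalog fits.
    """
    candidates = [
        it for it in catalog
        if it.available
        and it.family not in excluded_families
        and it.cpus >= request.cpus
        and it.memory_gib >= request.memory_gib
        and it.gpus >= request.gpus
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda it: (it.price_per_hour, it.name))


# Hypothetical catalog; names and prices are illustrative placeholders.
CATALOG = [
    InstanceType("old-gpu.large", "old-gpu", 8, 32, 1, 0.70, True),
    InstanceType("new-gpu.small", "new-gpu", 4, 16, 1, 0.50, False),
    InstanceType("new-gpu.medium", "new-gpu", 8, 32, 1, 0.75, True),
    InstanceType("new-gpu.large", "new-gpu", 16, 64, 1, 1.20, True),
]

if __name__ == "__main__":
    worker = PodRequest(cpus=4, memory_gib=16, gpus=1)
    chosen = choose_instance(worker, CATALOG, excluded_families={"old-gpu"})
    print(chosen.name if chosen else "no instance fits")
```

Because the request is expressed as resources rather than as an instance type, the same pod can land on a smaller or larger machine depending on what is available and allowed at that moment.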

A

In fact, in this case, even for the Ray head when it's originally deployed, Luna is the one that chooses a GPU-enabled node and puts it into the cluster. Alternative two is just like alternative one with one change, which is that the head of the Ray cluster is CPU-only. This is good because GPUs are expensive, so in this case you can run an idle Ray cluster and it costs less money. All right. So how did I choose this baseline?

A

The basic principle behind running the original experiments in this configuration, the fixed-size, three-node Ray cluster with T4 GPU instances, was that I wanted a standardized amount of compute for the AutoML time budget. If I say I'm going to run AutoML for one hour, AutoML for two hours, AutoML for four hours, there needs to be a standard amount of compute behind that time basis.

A

So I cared a lot about that. I also cared a lot about operational complexity. I didn't want to reason about whether my experiment was good or not; I wanted to have confidence that all the compute was available for the entire time budget and that what I was getting was a legit result. And my final thing was that I wanted to control idle cost. When the experiment was finished, I would log into the Ray head, make sure that I didn't see anything bogus about the experiment, and poke around.

A

I would record things, so I was pretty sensitive to idle cost, because I might leave the Ray cluster running for a non-trivial amount of time after the experiment was over. g4dn means T4 GPUs, and I could have used any of the g4dn variants, because for this workload it's all about the GPU and the GPU memory. But I chose 4xlarge, a little bit spendy, because when I tried to get cheaper instances in my region they were often not available, and it was operational complexity for me to keep trying to redeploy the Ray cluster and so on.

A

So that was my choice. Three nodes, I knew, would do a good job running the 10 hyperparameter search trials, which is the default number of search trials run by AutoML in Ludwig. I knew three nodes could complete 10 trials in a reasonable amount of time.

A

I used non-spot instances for the same reason that I had a fixed-size cluster: I didn't want anything to go away during the run. And I ran a single job at a time, rather than running on six or nine nodes, because of that idle cost issue.

A

The baseline run for these three workloads across the three time budgets matched our expectations for the accuracy of the autotuned models versus manually tuned models. The elapsed time for this run was 22.6 hours. You might be saying: why wasn't it 21 hours? That's three times one plus three times two plus three times four. The extra 1.6 hours were for parts of the job that only run on the head.

A

When the data is loaded and preprocessed, that's done on the head, and when the autotune job is complete, the head also runs the evaluation of the best model from each trial. As for the cost of this workload: g4dn.4xlarge instances cost 1.204 dollars an hour, so the overall workload cost was 81.63 dollars, and the idle cost, of course, is 3.612 dollars an hour.
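The baseline figures quoted here follow from simple arithmetic, sketched below. All inputs are the numbers stated in the talk; the variable names are chosen for this example, and the calculation assumes all three nodes are billed for the full elapsed time.

```python
HOURLY_PRICE = 1.204        # dollars per hour for one g4dn.4xlarge, as quoted in the talk
NODES = 3                   # fixed-size Ray cluster: one head plus two workers
TIME_BUDGETS_H = [1, 2, 4]  # Ray Tune time budgets, in hours
DATASETS = 3                # data sets run at each time budget
HEAD_ONLY_H = 1.6           # loading, preprocessing, and final evaluation on the head

budget_hours = DATASETS * sum(TIME_BUDGETS_H)        # 3*1 + 3*2 + 3*4
elapsed_hours = budget_hours + HEAD_ONLY_H           # jobs run one after another
idle_cost_per_hour = NODES * HOURLY_PRICE            # all three nodes stay up
workload_cost = idle_cost_per_hour * elapsed_hours   # every node billed for the whole run

print(f"time budgets: {budget_hours} h")
print(f"elapsed:      {elapsed_hours:.1f} h")
print(f"idle cost:    ${idle_cost_per_hour:.3f}/h")
print(f"workload:     ${workload_cost:.2f}")
```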

A

So, some observations about these baseline runs. It would be nice to get the results quicker than 22.6 hours, and this is just three of the nine data sets. The obvious way to do that would be to run more than one of the jobs at a time: run the three one-hour jobs in parallel, the three two-hour jobs in parallel, and so on.

A

Of course I didn't do that originally, because I was worried that the Ray autoscaler wouldn't be able to obtain the instance types when I needed them. But that's where Luna comes in. If you combine Ray autoscaling with Luna, then the Ray autoscaler asks for resources and Luna satisfies them. That's the magic that allows this to reduce idle cost, because now I don't need all three nodes running at the end, and to reduce elapsed time.

A

But you might be thinking: okay, that's fine, but what about workload cost? Can you really reduce that? Sure, you could get rid of the 1.6 hours where only the head is needed, but what about the workers that are needed to run the autotune hyperparameter searches? Well, actually, those workers aren't always needed during the entire run either.

A

AutoML for Ludwig uses a search strategy called async hyperband. What async hyperband does is discontinue trials that don't look very promising compared to the trials it has already run. So, depending on the data set, a lot of trials may be discontinued quickly, and when there are fewer than three trials left, fewer than three workers are needed.
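To make the early-stopping idea concrete, here is a deliberately simplified sketch of an asynchronous successive-halving rule. It is not Ray Tune's implementation: the class name, the rung milestones, the reduction factor, and the scores are all assumptions for illustration. At each milestone a trial continues only if its score is among the top fraction of scores seen at that milestone so far, without waiting for the other trials to arrive.

```python
from collections import defaultdict


class AsyncHalvingSketch:
    """Simplified asynchronous successive-halving rule (higher score is better)."""

    def __init__(self, rungs=(1, 2, 4), reduction_factor=3):
        self.rungs = tuple(rungs)
        self.reduction_factor = reduction_factor
        self.scores_at_rung = defaultdict(list)

    def should_continue(self, iteration, score):
        """Record a score reported at `iteration` and decide whether the trial goes on."""
        if iteration not in self.rungs:
            return True  # decisions are only made at rung milestones
        seen = self.scores_at_rung[iteration]
        seen.append(score)
        # Keep the top 1/reduction_factor of the trials seen at this rung so far
        # (always at least one), based only on results that have already arrived.
        keep = max(1, len(seen) // self.reduction_factor)
        cutoff = sorted(seen, reverse=True)[keep - 1]
        return score >= cutoff


if __name__ == "__main__":
    rule = AsyncHalvingSketch()
    first_rung_scores = [0.60, 0.55, 0.70, 0.52, 0.65, 0.71]  # made-up validation scores
    decisions = [rule.should_continue(1, s) for s in first_rung_scores]
    print(decisions)
```

Every trial that is stopped frees a worker slot, which is what gives an autoscaler the opportunity to release a GPU node before the time budget ends.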

A

This is a picture with the 22.6 hours on the x-axis, and on the y-axis you see the trials for the three data sets, running for one hour each, then for two hours each, then for four hours each. You see the ten trials for the first data set, and you can see that, for the first data set, only two trials really run to the end of the time budget.

A

The other ones are discontinued as not promising. For the second data set, you can see only one really survives until the end of the hour. For the third data set, the trials are more competitive, but hey, that's what autoscaling is all about. So there's a lot of opportunity here to save resources, even during the run.

A

That brings us to the configuration where Ray is running with its autoscaler, Luna is managing the scaling of the Kubernetes cluster, and we deploy Ray onto a Kubernetes cluster, which is EKS in this case. We used a control to make sure that Luna didn't choose NVIDIA M60 GPUs, which are actually cheaper than T4s, because they didn't work well for machine learning workloads; they're more for graphics workloads. So now we run the three jobs with the same time budget in parallel, and we set max concurrent trials to three so that each still runs only three trials at a time, just as it would have in the fixed-size cluster to begin with.

A

In this configuration we got competitive accuracy results, so there was no compromise on the accuracy of the models that AutoML found using Ray Tune. But the elapsed time was greatly reduced: in this parallel run it was reduced to 8.75 hours, which was a big difference. The idle cost was reduced by two-thirds, because only the GPU head is running at the end, and the workload cost was reduced by 54 percent.

A

Let's look at where that's coming from. More than half of that is coming from exploiting the autoscaling we talked about, scaling down workers not needed during the run, for the head-only parts or for the parts where there are fewer than three trials left. But about 20-some percent was also coming from using cheaper instances.

A

When Luna ran this job, it chose the g4dn.4xlarge for the head, because I had said the head needs more CPU memory to handle the evaluations and the data processing. But I had said that the workers need less CPU memory, and Luna, patient and willing to look for available instances, was able to get 2xlarge instances, which are cheaper.

A

Okay, cool. So that's way better in elapsed time, idle cost, and workload cost, but we still have that GPU machine running while the Ray head is up. I guess there are two things there. One is that you're going to feel a little guilty just leaving it running all the time, so you're probably going to spin it down once you've gotten all the data you needed off the head. The other is that it's a waste of money in general, because you don't need those GPUs once the job is finished.

A

So in this case, with the CPU-only head, we can see how cheap we can get this, and possibly even leave the cluster up if you're comfortable with the cost of idle once you've switched to the CPU head. In this kind of deployment, the head can't run a worker, so we need to bring up nine workers max to get three, three, and three.

A

Also, because Ludwig AutoML checks for resources by looking at the Ray head to see if there's a GPU enabled in the cluster, there's a small option you need to add to Ludwig AutoML to run it this way.

A

Again, in this configuration we were able to match the accuracy of the models, so that's good. On elapsed time we suffered a little bit: instead of 8.75 hours it was nine, because a CPU head running the evaluation was a little less efficient, but it's still way better than 22.6 hours, so if you're willing to take a little hit there, that's fine. The idle cost was 0.452 dollars an hour, so 62 percent less than leaving the GPU head enabled and 87 percent less than the baseline.
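The percentages quoted for the three configurations can be checked with a short calculation. The inputs are the figures stated in the talk; the dictionary layout and the `percent_reduction` helper are just one way to organize them, and the GPU-head idle cost assumes the head is the single g4dn.4xlarge at 1.204 dollars an hour.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


# Figures quoted in the talk.
BASELINE = {"elapsed_h": 22.6, "idle_per_h": 3 * 1.204}  # three GPU nodes always up
GPU_HEAD = {"elapsed_h": 8.75, "idle_per_h": 1.204}      # only the GPU head idles
CPU_HEAD = {"elapsed_h": 9.0, "idle_per_h": 0.452}       # only the CPU head idles

elapsed_gpu = percent_reduction(BASELINE["elapsed_h"], GPU_HEAD["elapsed_h"])
elapsed_cpu = percent_reduction(BASELINE["elapsed_h"], CPU_HEAD["elapsed_h"])
idle_gpu = percent_reduction(BASELINE["idle_per_h"], GPU_HEAD["idle_per_h"])
idle_cpu_vs_gpu = percent_reduction(GPU_HEAD["idle_per_h"], CPU_HEAD["idle_per_h"])
idle_cpu_vs_base = percent_reduction(BASELINE["idle_per_h"], CPU_HEAD["idle_per_h"])

print(f"elapsed, GPU head vs baseline: {elapsed_gpu:.1f}% lower")
print(f"elapsed, CPU head vs baseline: {elapsed_cpu:.1f}% lower")
print(f"idle, GPU head vs baseline:    {idle_gpu:.1f}% lower")
print(f"idle, CPU head vs GPU head:    {idle_cpu_vs_gpu:.1f}% lower")
print(f"idle, CPU head vs baseline:    {idle_cpu_vs_base:.1f}% lower")
```

The idle reduction for the GPU-head configuration comes out at exactly two-thirds, which the talk rounds down to 66 percent.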

A

So basically it was a big savings, and I felt way less guilty leaving the Ray cluster up. In this case, the workload cost was essentially the same as in the previous case.

A

Overall, the lesson we learned here was that the efficiency you can get, along with the operational ease of use, from using Elotl Luna Nodeless Kubernetes and running Ray on top of Kubernetes was really worthwhile. The GPU-enabled head was good; the CPU-only head was better. So we've shown these benefits. In the future, we want to continue to enhance Luna to handle efficient scaling of all sorts of workloads.

A

Certainly deep learning training is not the only workload that can benefit from this; CI/CD and many other workloads can as well. We also want to continue to extend Ludwig AutoML to new domains to enable efficient development and scaling. Just recently, in the past two weeks, we announced that Ludwig AutoML now works for text classification data sets, and it shows good savings for those as well.

A

So that's it. Thank you. Any questions?

C

So do I get it right that Luna doesn't actually add nodes to the cluster, but rather adds some compute, and you're handing off the workload to that thing? And if so, what is the difference from just spawning nodes on demand that fit your needs?

A

Sorry, just to make sure I understand the question. Right now, Luna during the experiment is adding virtual machines to the Kubernetes cluster to satisfy the pending pod requests that couldn't be placed at the current size of the Kubernetes cluster.

A

I'm sorry, I'm not sure if I understand.

A

They're nodes in Kubernetes.

D

You mentioned the Higgs data set there, so I'm very curious: what is this?

A

It's pretty cool. There's a whole bunch of famous tabular data sets, and I wanted to choose famous ones that people had published a lot of best-model results for, so that I could make sure that AutoML was competitive with those. One of them is the Higgs boson data set. It's a very large, very nice data set, and a very challenging one to run, so it's an interesting data set. It's out there; you can download it and run it.

A

These data sets are also built into Ludwig, so they're available if you're using the Ludwig platform. There's a bunch of standard data sets, and each of the ones used for AutoML is checked in.

E

Just a quick question: did you compare Luna with, for example, the GKE autoscaler or Karpenter? I think here you're comparing with yourself, right?

A

Yes, here I'm comparing with my own lame baseline. But I did mention, in the first description of Luna, that it is comparable to AWS Karpenter and to the Cluster Autoscaler. I would say the difference with the Cluster Autoscaler is that you typically end up having to create a whole bunch of node groups to cover every possible instance type you want, whereas you don't have to do that here.

A

As for Karpenter, I've actually run experiments on clouds other than AWS, so one issue with Karpenter, at least right now, is that it's really AWS-only. Karpenter also doesn't allow you to do instance exclusion, and I really wanted to do instance exclusion here, because I didn't want the crappy M60 GPUs. It also has this weird thing with the rules, where the order in which Karpenter applies them is not deterministic. These are little picky things, but I feel like this is kind of robust for my use case.

E