Cloud Native Computing Foundation ArgoCon 2022, 21 Sep 2022

From YouTube: Managing Thousands of Automatic Machine Learning Experiments with... Yuan Tang & Andrey Velichkevich

Description

Managing Thousands of Automatic Machine Learning Experiments with Argo and Katib - Yuan Tang, Akuity & Andrey Velichkevich, Apple

The fairly recent field of Automated Machine Learning (AutoML) provides the richness of powerful algorithms for model selection and hyperparameter (HP) tuning – one of the most important steps of the MLOps lifecycle. However, it’s non-trivial to advance these AutoML algorithms from research papers to production. ML engineers have to learn DevOps and cloud-native technologies to achieve that, but the main focus should be on inventing new ML algorithms. Katib and Argo open source projects provide an infrastructure to use and develop AutoML algorithms easily and fast in a cloud-native environment. In this talk, we will walk through the best practices (such as Argo caching and synchronization) for managing thousands of complex HP tuning experiments that bring the optimal performance. We will demonstrate how Argo Workflows and Katib bring the best of both worlds of Kubernetes-native workflow orchestration and HP tuning at scale.

A

Hi everyone, I'm Yuan from Akuity. I'm a maintainer of Argo Workflows and co-chair of the Kubeflow Training working group.

B

Hi everyone, I'm Andrey. I'm a software engineer at Apple and also a co-chair of the Kubeflow AutoML and Training working group. Today we're going to speak about how we leverage Argo Workflows in Katib and how we can manage thousands of automated machine learning experiments with this integration.

B

So let me first jump to Kubeflow, since Katib is part of the Kubeflow umbrella. If you don't know, Kubeflow is the open source project for MLOps on top of Kubernetes. Kubeflow contains different components for different kinds of ML activities. It has its own solution for notebooks and JupyterLab, and it has components for distributed training operators, with wide support for open source ML frameworks such as TensorFlow, PyTorch, XGBoost, and MPI.

B

Kubeflow also leverages functionality for ML metadata and has its own component for ML pipelines, which I think many of you know about. Kubeflow also has a component for AutoML, specifically Katib, for hyperparameter tuning and neural architecture search, and we have the serving component for model serving in the cloud, with a lot of unique functionality. Kubeflow can also be easily deployed on any of the public clouds or on-prem, and it offers a web UI, SDKs, or kubectl to interact with these components.

B

Let me jump to Katib, because this will be our main focus in this presentation. Katib, as I mentioned before, is one of the Kubeflow components, and it's the project for AutoML, specifically for HP tuning, early stopping, and neural architecture search. In the meantime, we are working on adding support for feature engineering and model compression, to let you use other AutoML features on top of the cloud.

B

You can also run Katib to perform your custom AutoML algorithms, so we provide a platform to do that in a cloud-native way. We also have a unique feature to orchestrate any Kubernetes custom resources, and I will show how in the next couple of slides. Since we run on top of Kubernetes, we are agnostic to ML frameworks, and we have native integration with Kubeflow components such as training, notebooks, and pipelines.

B

So, jumping to the Katib architecture, it's quite straightforward. When the user submits the experiment, we have the experiment controller, which reconciles this experiment.

B

Then we have the suggestion controller, which is responsible for spawning the algorithm service. Once the algorithm service is up, it produces hyperparameters based on the experiment specification. These hyperparameters are then passed to the trial controller, and the trial controller spawns trials that execute in parallel.

B

We have a unique feature to support any type of worker to be run as a trial, whether it's a simple Kubernetes Job, a TFJob, or even an Argo Workflow. In this worker you basically run the training, and then we have a metrics collector, which parses the necessary metrics from the workers and sends these metrics to the DB.
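
The loop this architecture describes (an algorithm suggests hyperparameters, trials run in parallel, metrics are collected and compared) can be sketched in plain Python. This is only an illustration under stated assumptions, not Katib's implementation: `suggest`, `run_trial`, the learning-rate range, and the toy objective are all hypothetical stand-ins, and random search ignores earlier results, whereas other algorithms would use them.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def suggest(rng, count):
    """Stand-in for the algorithm service: random search over one parameter."""
    # The 0.01 to 0.05 range is an assumption made for this sketch.
    return [{"lr": rng.uniform(0.01, 0.05)} for _ in range(count)]


def run_trial(params):
    """Stand-in for a trial worker; returns the metric a collector would parse."""
    # Toy objective that peaks at lr = 0.03 (purely illustrative).
    return {"params": params, "accuracy": 1.0 - abs(params["lr"] - 0.03) * 10}


def run_experiment(max_trials=5, parallel_trials=2, seed=0):
    rng = random.Random(seed)
    results = []  # plays the role of the metrics DB
    with ThreadPoolExecutor(max_workers=parallel_trials) as pool:
        while len(results) < max_trials:
            batch = min(parallel_trials, max_trials - len(results))
            suggestions = suggest(rng, batch)
            results.extend(pool.map(run_trial, suggestions))
    best = max(results, key=lambda r: r["accuracy"])
    return best, results


best, all_trials = run_experiment()
print(len(all_trials), best["params"])
```

In a real experiment each `run_trial` call would be a Kubernetes resource such as a Job or an Argo Workflow, and the metric would come from the metrics collector rather than a return value.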

B

Then these metrics are passed back to the experiment controller, and we send these evaluation results to the algorithm service to produce new hyperparameters. This process is repeated again and again. When the hyperparameter tuning job is finished, the user can get the best hyperparameters and use them in production training. So, jumping to why we actually need Argo and what the current problems are: we have a couple of problems in the evaluation step of hyperparameter tuning.

B

Usually the training process is not so straightforward that you can just run a simple job for your training. Maybe you need to pre-process some data first, maybe you want to post-process some data afterward, and all of these steps may be part of your evaluation.

B

So basically, a simple Kubernetes Job doesn't give us the functionality to cover all these problems, and that is why we are moving toward more complex workflows, such as Argo Workflows, to solve these issues. We also have problems with multi-objective optimization, when we want to tune an experiment against different objectives, and we can also do some parallel training, which Yuan will talk about in the next couple of slides.

B

So next, let me pass over to Yuan, and he will speak about Argo Workflows and how we can solve these problems.

A

With all those challenges in real-world machine learning pipelines, I'm going to talk about how Argo Workflows makes it easy and then introduce a few common use cases for machine learning pipelines. Argo Workflows is a container-native workflow engine for Kubernetes.

A

The main use cases for Argo Workflows include machine learning pipelines, data processing, ETL, infrastructure automation, and continuous delivery and integration. On the right-hand side is a screenshot of what the Argo Workflows UI looks like. The diagram at the bottom gives some example ecosystem projects that use Argo Workflows. More can be found in the Awesome-Argo GitHub repository linked below.

A

Let's first talk about the memoization cache functionality in Argo Workflows, which will be leveraged when dealing with the pre-processing challenges that Andrey mentioned previously. The Argo Workflows controller creates a cache that can save the output of a step to be used in the next step. For example, here step B requires the output from the previous step A. When the workflow is executed for the first time, Argo Workflows will create a cache for step A.

A

The cache contains the result of step A and is saved as a key-value pair in a Kubernetes ConfigMap. Once step A finishes, step B will be executed. The next time the same workflow executes, it will check whether a cache from step A already exists and whether it's still fresh. For example, if the cache was created 10 seconds ago and step B considers this fresh, it will retrieve the saved output from the cache and use it directly in step B, without wasting resources and time re-executing step A.
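
The freshness check described here can be illustrated with a small in-memory stand-in. This is a sketch of the idea only, assuming a dictionary in place of the ConfigMap and a hypothetical `StepCache` class; it is not how the Argo Workflows controller is implemented.

```python
import time


class StepCache:
    """In-memory stand-in for the ConfigMap-backed cache (illustrative only)."""

    def __init__(self, clock=time.time):
        self._entries = {}
        self._clock = clock

    def run(self, key, max_age_seconds, step):
        """Return (output, cache_hit); execute `step` only on a miss or stale entry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry["created"] <= max_age_seconds:
            return entry["output"], True
        output = step()
        self._entries[key] = {"output": output, "created": now}
        return output, False


executions = []


def step_a():
    executions.append("A")  # record that the expensive step really ran
    return "output-of-A"


cache = StepCache()
first_output, first_hit = cache.run("step-a", 10, step_a)
second_output, second_hit = cache.run("step-a", 10, step_a)
print(first_hit, second_hit, len(executions))
```

The second call returns the stored output without running `step_a` again, which is the saving the talk refers to when step B reuses step A's result.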

A

Here's how to use the memoization functionality in Argo Workflows. In the template spec on the left-hand side, we can specify the memoize spec. Here we specify the key in the cache to be cache key.

A

The max age represents the maximum duration before we consider a cache entry old when future workflows or steps try to use the cache. We also specify the name of the ConfigMap that we want to save the cache to. Here's what the ConfigMap looks like on the right-hand side: the key is the cache key, and the data contains the output parameter produced from this particular step, which is the parameter hello with value world.
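
As a rough picture of the two sides of that slide, here is the template and the resulting ConfigMap written as Python dictionaries that mirror the YAML. The `memoize` fields (`key`, `maxAge`, `cache.configMap.name`) follow the Argo Workflows template spec; the template name, ConfigMap name, image, file path, and the exact key string are assumptions made for this sketch, and a real cache entry carries extra bookkeeping fields not shown here.

```python
import json

# Left-hand side: a template with a memoize block.
template = {
    "name": "produce-hello",  # hypothetical template name
    "memoize": {
        "key": "cache-key",  # assumed literal key for illustration
        "maxAge": "10s",  # entries older than this are treated as stale
        "cache": {"configMap": {"name": "my-cache"}},  # hypothetical ConfigMap name
    },
    "container": {
        "image": "alpine:3.16",
        "command": ["sh", "-c"],
        "args": ["echo -n world > /tmp/hello.txt"],
    },
    "outputs": {
        "parameters": [{"name": "hello", "valueFrom": {"path": "/tmp/hello.txt"}}]
    },
}

# Right-hand side: roughly what ends up in the ConfigMap after the step runs.
config_map = {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {"name": template["memoize"]["cache"]["configMap"]["name"]},
    "data": {
        template["memoize"]["key"]: json.dumps(
            {"outputs": {"parameters": [{"name": "hello", "value": "world"}]}}
        )
    },
}

cached = json.loads(config_map["data"]["cache-key"])
print(cached["outputs"]["parameters"][0])
```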

A

Let's take a real-world machine learning workflow as an example to see how memoization can be leveraged. Assume that we triggered a Katib experiment that executes a machine learning pipeline using Argo Workflows. A simple machine learning workflow may look like this. First, there's a data ingestion step that's responsible for ingesting data from the data source. You may have a cache store in place, using Argo Workflows on Kubernetes, to check whether the data has been updated recently, in order to skip this particular data ingestion step if nothing has changed in the dataset.

A

Otherwise you would have to execute the data ingestion from scratch, which costs a lot of computational resources. After we've ingested the data, we start the model training step. The model training can have multiple workers and multiple data shards, depending on the selected distribution strategy. Here, for example, we are running the distributed model training step using allreduce.

A

The model training may consist of code written in frameworks such as TensorFlow or PyTorch, and then you can use Kubeflow to submit a distributed TensorFlow training job, so that the algorithm developers or data engineers don't have to worry about the infrastructure side of things. Kubeflow will communicate with Kubernetes to request the necessary computational resources for each of the workers and parameter servers, so that people can just focus on the algorithms or the models. We can also use Katib for more complicated model training that leverages hyperparameter tuning, neural architecture search, early stopping, and so on.
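
A simple pipeline of this shape, with a cached ingestion step followed by a distributed training step, might be declared roughly as below. The manifest is written as a Python dictionary mirroring the YAML. The `steps`, `memoize`, and `resource` fields are standard Argo Workflows template fields and the TFJob layout follows the Kubeflow Training Operator, but every name, image, bucket path, and replica count is a hypothetical placeholder, and a production manifest would also set success and failure conditions for the TFJob.

```python
TFJOB_MANIFEST = """\
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  generateName: distributed-training-
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: example.com/tf-train:latest
              args: ["--data={{inputs.parameters.dataset-path}}"]
"""

workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "ml-pipeline-"},
    "spec": {
        "entrypoint": "pipeline",
        "arguments": {
            "parameters": [{"name": "dataset-path", "value": "s3://example-bucket/dataset"}]
        },
        "templates": [
            {
                "name": "pipeline",
                # Each inner list is one sequential stage.
                "steps": [
                    [{"name": "data-ingestion", "template": "data-ingestion"}],
                    [{"name": "distributed-training", "template": "distributed-training"}],
                ],
            },
            {
                "name": "data-ingestion",
                # Skip re-ingestion when a fresh cache entry exists.
                "memoize": {
                    "key": "dataset-location",
                    "maxAge": "1h",
                    "cache": {"configMap": {"name": "ingestion-cache"}},
                },
                "container": {
                    "image": "example.com/ingest:latest",
                    "args": ["--output={{workflow.parameters.dataset-path}}"],
                },
            },
            {
                "name": "distributed-training",
                "inputs": {
                    "parameters": [
                        {"name": "dataset-path", "value": "{{workflow.parameters.dataset-path}}"}
                    ]
                },
                # Create the TFJob custom resource instead of running a plain container.
                "resource": {"action": "create", "manifest": TFJOB_MANIFEST},
            },
        ],
    },
}


def stage_order(wf):
    """List the step names of the entrypoint template in execution order."""
    entry = next(t for t in wf["spec"]["templates"] if t["name"] == wf["spec"]["entrypoint"])
    return [step["name"] for stage in entry["steps"] for step in stage]


print(stage_order(workflow))
```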

A

Let's take a look at how this can be achieved with Argo Workflows. On the left-hand side, we define the entry point of the workflow, which consists of sequential steps for both data ingestion and distributed TensorFlow training. The data ingestion step takes a parameter that represents the location that we will save the data to once the data ingestion is finished.

A

In the data ingestion step, we save the dataset to the specified S3 path and then cache the location with a max age of one hour. Then, in the distributed model training step, we train a TensorFlow model using Kubeflow's TFJob with the dataset that we just saved. When this workflow gets executed again within an hour, the data ingestion step will be skipped and the training step will reuse the previously generated dataset.

A

Next, let's take a look at a more complex pipeline that involves multi-objective optimization in order to achieve better overall performance for a machine learning problem. Here, we'd like to build three different models with three different model architectures, such as logistic regression, neural networks, and decision trees, and with different objectives. Here, we're using accuracy, AUC, and loss.

A

There are two different data ingestion steps that ingest two different datasets. The staging model will use different data from the other models. After we've finished training these three models, we then trigger a Katib experiment that collects the metrics and suggests an optimized set of hyperparameters.

A

Once the suggestion is made, we will trigger a new workflow that uses the suggested hyperparameters.

A

Here's how to implement this pipeline in Argo Workflows. First, we construct the DAG that consists of the major components that we showed in the previous diagram. The data ingestion step consists of sub-steps that execute data ingestion from two different data sources. We use the withSequence syntax to loop through the different data sources. The model training steps consist of step templates for different model types, different data sources, and objectives.
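
The DAG described here might look roughly like the following, again written as a Python dictionary mirroring the YAML. `dependencies`, `withSequence`, `withItems`, and the `{{item}}` placeholders are standard Argo Workflows features; the task and template names, and the pairing of each model type with an objective, are assumptions made for this sketch rather than details taken from the slides.

```python
# Assumed pairing of model types and objectives, for illustration only.
MODELS = [
    {"model": "logistic-regression", "objective": "accuracy"},
    {"model": "neural-network", "objective": "auc"},
    {"model": "decision-tree", "objective": "loss"},
]

dag_template = {
    "name": "multi-objective-pipeline",
    "dag": {
        "tasks": [
            {
                "name": "data-ingestion",
                "template": "ingest",
                # One ingestion task per data source: item takes the values 0 and 1.
                "withSequence": {"count": "2"},
                "arguments": {"parameters": [{"name": "source-index", "value": "{{item}}"}]},
            },
            {
                "name": "model-training",
                "template": "train-single-model",
                "dependencies": ["data-ingestion"],
                # One training task per model type.
                "withItems": MODELS,
                "arguments": {
                    "parameters": [
                        {"name": "model", "value": "{{item.model}}"},
                        {"name": "objective", "value": "{{item.objective}}"},
                    ]
                },
            },
            {
                "name": "katib-experiment",
                "template": "create-experiment",
                "dependencies": ["model-training"],
            },
        ]
    },
}


def fan_out(task):
    """Number of task instances a loop construct expands to."""
    if "withSequence" in task:
        return int(task["withSequence"]["count"])
    if "withItems" in task:
        return len(task["withItems"])
    return 1


expansion = {t["name"]: fan_out(t) for t in dag_template["dag"]["tasks"]}
print(expansion)
```

Using one parameterized training template with a loop keeps the DAG short; adding a fourth model only means appending an item to the list.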

A

We assume that the single model training template used in these model training steps supports these different parameters. Next, Andrey will give a live demo.

B

Thanks, Yuan. Yeah, so I'm going to give you a demo of the cache and how we can leverage it in Katib. Let me quickly jump to the UI first. This is the Kubeflow UI, and I hope you're familiar with it. I'm going to jump directly to the Katib one, which is part of the Kubeflow umbrella.

B

So, as you can see here, this is the Katib UI, where we can submit new experiments. We can specify the necessary information for a hyperparameter experiment, such as metadata and trial thresholds: for example, how many trials you want to run in parallel, what the maximum number of trials is, and what the maximum number of failed trials is.

B

You can also specify the objective: the main metric you want to tune, the additional metrics you want to collect, the goal for your objective metric, and other extra information. Then there is the search algorithm. Katib supports a variety of different algorithms out of the box. We are continually adding new algorithms and, as I mentioned before, we even provide an option to deploy custom algorithms.

B

We can also specify early stopping techniques to avoid overfitting in your hyperparameter tuning experiment. Then we can set the hyperparameters. We can add a new parameter. We support various parameter types, such as categorical, double, integer, and discrete. We can specify the range of the hyperparameters, you can specify the step, and you can also edit the hyperparameters. Then we can jump to the metrics collector specification and to the trial template, which actually executes the training during your hyperparameter evaluation.
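
To make the four parameter types concrete, here is how a search space could be written down and sampled. The `parameterType` and `feasibleSpace` field names follow the Katib Experiment spec; the parameter names, ranges, and the `sample` helper are hypothetical and only illustrate what a random-search suggestion looks like.

```python
import random

parameters = [
    {"name": "lr", "parameterType": "double", "feasibleSpace": {"min": "0.01", "max": "0.05"}},
    {"name": "num-layers", "parameterType": "int", "feasibleSpace": {"min": "2", "max": "5"}},
    {"name": "batch-size", "parameterType": "discrete", "feasibleSpace": {"list": ["32", "64", "128"]}},
    {"name": "optimizer", "parameterType": "categorical", "feasibleSpace": {"list": ["sgd", "adam"]}},
]


def sample(parameter, rng):
    """Draw one value from a parameter's feasible space (random search)."""
    space = parameter["feasibleSpace"]
    kind = parameter["parameterType"]
    if kind == "double":
        return rng.uniform(float(space["min"]), float(space["max"]))
    if kind == "int":
        return rng.randint(int(space["min"]), int(space["max"]))
    if kind in ("discrete", "categorical"):
        return rng.choice(space["list"])
    raise ValueError(f"unsupported parameter type: {kind}")


rng = random.Random(42)
suggestion = {p["name"]: sample(p, rng) for p in parameters}
print(suggestion)
```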

B

In this particular example, I want to take the example with an Argo Workflow as the trial template. Let me first copy the whole YAML to our UI and submit this experiment before we analyze the results.

B

Before jumping to this UI, I want to quickly introduce what kind of experiment we are running. We are running a simple Katib experiment with an Argo Workflow as a trial. In terms of the objective, we're going to tune validation accuracy, and the additional metric that I'm going to collect is the training accuracy. For the algorithm, we select a simple random search algorithm, and we're going to run two parallel trials with a maximum of five trials.
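
The experiment settings just listed map onto a Katib Experiment roughly as follows, shown as a Python dictionary mirroring the YAML. The field names come from the Katib `v1beta1` Experiment spec; the experiment name, namespace, exact metric name strings, and the learning-rate range are assumptions, since the talk does not spell them out.

```python
import math

experiment = {
    "apiVersion": "kubeflow.org/v1beta1",
    "kind": "Experiment",
    "metadata": {"name": "argo-workflow-demo", "namespace": "kubeflow"},  # hypothetical
    "spec": {
        "objective": {
            "type": "maximize",
            "objectiveMetricName": "validation-accuracy",  # assumed metric name
            "additionalMetricNames": ["train-accuracy"],  # assumed metric name
        },
        "algorithm": {"algorithmName": "random"},
        "parallelTrialCount": 2,
        "maxTrialCount": 5,
        "parameters": [
            {
                "name": "lr",
                "parameterType": "double",
                "feasibleSpace": {"min": "0.01", "max": "0.05"},  # assumed range
            }
        ],
    },
}

spec = experiment["spec"]
# With 2 trials at a time and 5 in total, the trials finish in at least 3 waves.
waves = math.ceil(spec["maxTrialCount"] / spec["parallelTrialCount"])
print(waves)
```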

B

So basically, in this example, each trial is an Argo Workflow. For each trial, we're going to spawn a separate Argo Workflow in parallel and execute the training inside that workflow. We're going to tune the learning rate within this range. Let me jump directly to the trial template and what you need to specify in a trial template to be able to run an Argo Workflow.

B

You just need to set the primary pod labels, the primary container name, the success condition for when your workflow has finished, and the failure condition for when your workflow has failed. As for the workflow, if you're familiar with Argo, it will be very easy to understand what it looks like. Basically, we have two steps. The first step is data preprocessing, and the second step is model training. In the first step, data preprocessing, we're going to generate the number of examples.
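
The four trial template fields mentioned here can be sketched as below. The field names are from the Katib trial template spec, and the label and condition strings are modeled on the public Katib Argo Workflow example, so they should be checked against the version in use. The image, the single-template workflow, and the `render` helper are hypothetical; `render` only imitates how `${trialParameters.<name>}` placeholders are filled in for each trial.

```python
trial_template = {
    # Label that marks which pod in the workflow does the model training.
    "primaryPodLabels": {"katib.kubeflow.org/model-training": "true"},
    # Argo runs the user container under the name "main".
    "primaryContainerName": "main",
    "successCondition": 'status.[@this].#(phase=="Succeeded")#',
    "failureCondition": 'status.[@this].#(phase=="Failed")#',
    "trialParameters": [
        {"name": "learningRate", "description": "Learning rate", "reference": "lr"}
    ],
    "trialSpec": {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "spec": {
            "entrypoint": "train",
            "templates": [
                {
                    "name": "train",
                    "metadata": {"labels": {"katib.kubeflow.org/model-training": "true"}},
                    "container": {
                        "image": "example.com/train:latest",  # hypothetical image
                        "args": ["--lr=${trialParameters.learningRate}"],
                    },
                }
            ],
        },
    },
}


def render(node, values):
    """Recursively replace ${trialParameters.<name>} placeholders with values."""
    if isinstance(node, dict):
        return {key: render(value, values) for key, value in node.items()}
    if isinstance(node, list):
        return [render(item, values) for item in node]
    if isinstance(node, str):
        for name, value in values.items():
            node = node.replace("${trialParameters.%s}" % name, str(value))
    return node


rendered = render(trial_template["trialSpec"], {"learningRate": 0.02})
print(rendered["spec"]["templates"][0]["container"]["args"])
```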

B

This is a super simple toy example, but it shows the power of using Argo and Katib, because you can have very complicated preprocessing here and, as Yuan mentioned before, we can store the result of the preprocessing in the cache. So basically we generate a random value and store this value in the cache.

B

Then we reuse this value in the next workflows, so that we don't run pre-processing again and again, and we pass this number of examples to our second step, which is model training. As you can see here, we get the number of examples from the previous step and the learning rate from the suggested parameters.
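
A two-step trial workflow of this kind could be written roughly as follows: a memoized preprocessing step that produces the number of examples, and a training step that consumes it together with the suggested learning rate. The `steps`, `memoize`, `outputs.parameters`, and `{{steps...}}` references are standard Argo Workflows syntax and `${trialParameters.learningRate}` is Katib's placeholder syntax; the template names, cache name, images, and file path are assumptions for this sketch and not the manifest shown in the demo.

```python
trial_workflow_spec = {
    "entrypoint": "hp-workflow",
    "templates": [
        {
            "name": "hp-workflow",
            "steps": [
                [{"name": "data-preprocessing", "template": "gen-num-examples"}],
                [
                    {
                        "name": "model-training",
                        "template": "model-training",
                        "arguments": {
                            "parameters": [
                                {
                                    "name": "num-examples",
                                    "value": "{{steps.data-preprocessing.outputs.parameters.num-examples}}",
                                },
                                {"name": "lr", "value": "${trialParameters.learningRate}"},
                            ]
                        },
                    }
                ],
            ],
        },
        {
            "name": "gen-num-examples",
            # Shared cache entry: later trials reuse the value instead of regenerating it.
            "memoize": {
                "key": "num-examples",
                "maxAge": "1h",
                "cache": {"configMap": {"name": "katib-demo-cache"}},
            },
            "container": {
                "image": "python:3.10-alpine",
                "command": ["python", "-c"],
                "args": [
                    "import random; "
                    "open('/tmp/num-examples', 'w').write(str(random.randint(1000, 10000)))"
                ],
            },
            "outputs": {
                "parameters": [{"name": "num-examples", "valueFrom": {"path": "/tmp/num-examples"}}]
            },
        },
        {
            "name": "model-training",
            "metadata": {"labels": {"katib.kubeflow.org/model-training": "true"}},
            "inputs": {"parameters": [{"name": "num-examples"}, {"name": "lr"}]},
            "container": {
                "image": "example.com/train:latest",  # hypothetical image
                "args": [
                    "--num-examples={{inputs.parameters.num-examples}}",
                    "--lr={{inputs.parameters.lr}}",
                ],
            },
        },
    ],
}

templates = {t["name"]: t for t in trial_workflow_spec["templates"]}
memoized = [name for name, t in templates.items() if "memoize" in t]
print(memoized)
```

Only the preprocessing template carries a `memoize` block, so every trial still runs its own training while sharing the same generated value.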

B

Then we can run this training with these two parameters: the first one is the number of examples, and the second one is the learning rate. Again, as I mentioned before, memoization is very important, because you don't want to run preprocessing for every trial. Maybe you want to run it only once, and in the next steps of the evaluation you want to run only the training and collect your metrics, which is what matters for a hyperparameter tuning job. So let me jump back to the Katib UI.

B

As you can see here, the experiment is currently running. This is an experiment I ran before. In this UI we can analyze it: we can get the optimal trial, and we can see some of the metrics that we collected. We can jump to this UI to see the name of the experiment, the current status, the best trial so far, and the best trial's performance.

B

We can also see some experiment conditions. We can jump to the trials to see their metrics, their validation accuracy, and the best hyperparameters, and we can also see some distributions in this UI. Again, as I mentioned, each trial here is a separate Argo Workflow.

B

We can even jump to the Argo Workflows UI. For example, let me take one of the Katib trials, say this one, and we'll see that this represents the whole Argo Workflow, and inside the Argo Workflow we have different steps. Let me jump to one of the workflows here.

B

So basically we have the data pre-processing step, which stores a value in the cache, and then we have model training, which just reuses that value from the cache. As you can see here, we have the number six seven six nine, and if we check the other workflows, we're going to see the same number for each step.

B

If we go here, we can see seven six six nine again, and if I click on the model training, I'm going to see the actual training that is happening. We're just collecting the results from the training. So this is very powerful. Jumping back to the UI, we can click on a trial and see which metrics have been collected and how the metrics were produced.

B

You can also analyze the data based on these trials, so you can see what performance was produced. You can also collect more metrics if you need to and use this UI for the metrics tracking process, and you can also see some details of the experiment. This is a very simple example, but in the end you can create more sophisticated examples.

B

Again, as Yuan mentioned, you can even create multi-objective experiments with a DAG, where you have more than one model training in parallel and one evaluation step, and you can run whatever else Argo Workflows offers.

B

So let me jump back to our presentation. At the end, I want to quickly go through a couple of slides regarding the community, because all of these amazing features wouldn't be available without the great work from the open source community. If you want to check out this experiment and try to run it yourself, you can follow this guide.

B

Also, I strongly encourage you to join the Argo Workflows and Katib community meetings. We meet almost every week, and we're pretty open to new contributors, new proposals, and feature requests that we can integrate into our projects. Also, please check our GitHub repositories and our Slack channels, and if you're using Katib or Argo, please update the adopters list.

B

We really want to interact with users to see what kind of pain points you have and what should be on the next roadmap. If you want to learn more about Katib, please check this presentation list to learn more about AutoML and how to use it together with Argo. And at the end, please feel free to just ping us if you have any questions. With that, thank you so much for listening to us.

B

We are more than happy to answer all of your questions.

A

Hello, can you hear me? Okay. Thank you, everyone, for watching our pre-recorded video today. Andrey will be on the ArgoCon Slack for any offline discussions and questions, and I'll be here for Q&A.

A

Let me know if you have any questions, and I can answer them now. Raise your hand if you have any questions.

A

So how many of you are working in machine learning related applications?

A

Okay, I see a couple of hands. Are you using Argo Workflows? Okay, what do you use for distributed training, for example?

A

Sorry, I can't hear you. SageMaker? Okay. Do you find it easy enough to run all sorts of experiments?

A

Okay, yeah, I hope you will try out many of the sub-projects available in Kubeflow. We have the distributed training operators, and there's the project called Katib that we just mentioned. It's for managing AutoML experiments, and there are a lot of built-in algorithms, for example for hyperparameter tuning, neural architecture search, and so on. Yes?

D

I have an intern pursuing a master's degree, and we have him doing a machine learning research project, but I don't think he's picked his tech stack yet. So, more generally, I was looking for something that I could take back to him and show him, because this probably gives him most everything he needs. I think maybe he'll know better than I do. So, just the basics, like some of the same stuff you showed here: is there any of that I could get?

D

Could I get your Slack information or something and introduce him, so he could maybe get some help with his project?

A

My recommendation is: if you are running things in the cloud or on Kubernetes already, Argo Workflows will be the de facto choice for workflow orchestration, since it's really scalable and easy to use. And if you're running distributed machine learning training, especially on Kubernetes, then the Kubeflow Training Operator is definitely something you want to look into, because you can describe a distributed training job in a CRD. It's very easy to use once you install the operators. Yes?

C

Is it possible to run Katib without the Kubeflow infrastructure? For example, we have, say, Argo and Metaflow sort of integrated for distributed training. Can I leverage Katib independently from Kubeflow?

A

But you still want to use Katib, right? Yeah, so Katib is kind of independent of the rest of the Kubeflow ecosystem. You can run any custom CRD as a trial in an experiment, and then you can spin up a lot of experiments using Katib. Using Katib, you can also automatically start different experiments with different parameters if needed. So if you can describe your SageMaker job in terms of a CRD or a script, then I think you can use it directly.

E

So for somebody developing this, how skillful do they need to be in Argo Workflows, writing all the workflows and the DAGs and all that stuff?

E

So how skillful do they need to be at writing the DAGs and the Argo Workflows syntax? How much skill do they need?

A

It's just like Kubernetes CRDs: once you install the controllers, you can just write everything in YAML, and there are also Python, Java, and Go SDKs that you can use.

A

I think it's pretty easy to use. So if you're a Python developer, or if you are one of the data scientists in your company, you can just use one of our SDKs.

E

Or do they need to know Argo Workflows?

A

Yes, you need to know the basic concepts, but there are also integrations. I know you mentioned Metaflow, right? They also added an integration with Argo Workflows so that you don't have to understand all the concepts behind workflows.

A

You can just write the regular Metaflow steps, and then under the hood it will invoke and create Argo Workflows without users having to worry about it. Does that help? Okay, any other questions?

A

Any questions from here? No? Okay, we can take additional questions offline, either on the ArgoCon Slack, or I'll be at the Akuity booth if you want to stop by. Thank you.