Description
AI and data are changing the healthcare industry. Over the last year, Argo has helped us bring medical-grade AI workflows to production and improve the lives of over 20 million patients.
In this talk, I will share the practices and techniques we use to build reusable, production-grade AI workflows with Argo Workflows, how you can write your workflows as more reusable pipelines, and finally, how we integrated Argo with Jupyter notebooks for robust and fast experimentation.
Hi everyone, good morning. Today I want to talk with you about components, workflows and cookbooks, but more specifically, what I want to talk with you about today is machine learning, pipelines and heart attacks. Before I do that, I wish to do a short introduction.
So, I'm Omri, I'm an architect at a company called Diagnostic Robotics. I'm an engineer; while I'm not a data scientist myself, I can probably understand every second word that you say. What we do at Diagnostic Robotics is AI-driven precision population health, and you might be asking what AI-driven precision population health means. It basically means something like this: a nice meme from the Liam Neeson movie Taken. One of the issues with healthcare in the United States, and in the world in general, is that too many people don't get the healthcare that they need, whether it's because they don't have enough knowledge, or enough resources, or because their insurance company can't give them enough attention.
Now, MLOps is a lot of things, but today I wish to focus on pipelines, and pipeline orchestration is basically a solved problem. There are more than enough tools to choose from, and the one we chose to use is Argo. We love Argo, mostly because it's Kubernetes native and we are a Kubernetes shop, and because of its API and CLI.
It also has a really nice UI, and it allows us to create really robust deployments with Helm and Argo CD, which we use extensively. That lets us take our AI, which started as very basic notebooks, and make it work at scale. Here is an example of part of the pipelines we run; it's actually a very small screenshot of a very long pipeline that continues above and below what you see here. So: excellent job.
Then the business people said: now, can you do the same thing for another condition? I looked at the data scientists on my team, they looked back at me, and after thinking for a second we said: okay, oh dear. Because while going from research to production, we had somehow lost something along the journey. When we look at a pipeline, it might look simple enough in a demo.
But reality is much more complex, and development of these pipelines becomes complicated; nobody wants to work with 15,000-line YAML files to define these pipelines. Deployment becomes much more complicated too, because we have different models at different stages of life and different levels of maturity that need to run together in the same environment and on the same data. Now you have to make sure that one researcher does not create issues or problems for other researchers, or even for our production code, and iterative experimentation suddenly became very slow.
In the old notebook days, we just had to make a change, hit enter, and see the change actually take effect. Now we need to wait for an image to build, wait for a pipeline to actually be deployed, and only then run it. That can take a few minutes, and it really hinders our ability to do fast research. It led us to the understanding that when we're talking about an AI product, or a machine learning product, there are actually two modes we need to think about.
One mode is the research mode, where we need to be very fast, very experimental, very adjustable, and it's more or less an iterative process; in this process, the data scientist is the king. And then there is the development and production mode, where we need to think about scalability and reliability, and make sure that the thing is governable, which is especially important when we're talking about medical-grade products, and we need it to work in a fire-and-forget mode.
So when we're talking about the progression, or the spectrum, between research and production, it ranges from the most basic steps, where we're trying to do basic exploration, just going all over the place and looking for interesting data; there we need to be very, very fast, and we don't care much about scale or the quality of our code. Then, as we move more and more towards production, we start to look into things like retrospective studies, which actually need some governance and some level of scale.
But we still need the ability to react fast and to adjust fast. Once we move beyond that, we need to pass clinical validation, where an actual medical doctor looks at the model and actually signs it off. Then we need to make sure that our code and our system are governable, and that we don't just change things without any control. And then we go into prospective research, which is basically sending it into the market and testing whether it works or not in real life, on a real population.
MLOps adds complexity and gets in the way of experimentation, and the solution, surprisingly, was not necessarily more tools; the solution was architecture. So today I wish to share with you five lessons that we've learned about moving from research to production and back again.
So the first lesson is: Pythonize your pipelines. The reason you might want to Pythonize your pipelines is, first of all, that it's developer friendly. People know how to write Python code; they don't necessarily know how to write YAML. Python is much nicer to work with, and we get all the nice things that we get in code, whether it's abstractions, autocomplete, schema validation, testing or linting.
There are several ways to do that. The way we chose in the beginning was to just generate code from the OpenAPI schema, but today there are a lot of nice open source libraries you can use to generate pipelines from your Python code. Our first attempt looked more or less like this; it's basically a hello-world pattern and how it looks in Python. But it quickly turned out not to be good enough.
It wasn't good enough because it wasn't easy enough for data scientists, and we kept finding more and more ways to make it easier for our data scientists and data engineers to write better and better Python code. The thing we found complex is that for research we sometimes actually needed to run these pipelines locally. We didn't want to go through the entire process of pushing the image, pushing the pipeline to production, and running it on a remote cluster for scale.
Sometimes all we needed was to run our code very quickly, or just run our pipeline locally and maybe stop it or debug it in the middle. When we tried to do that, we actually ended up building another set of pipelines, written in Python, that could only work on our local computers, and now we suddenly had two sets of pipelines.
One is a pipeline for scale, and one is a local pipeline for running locally and debugging, and this actually created an issue, because we had to maintain both of these pipelines. Then we thought about it and asked: can we run our code locally for research, and then run the same code at scale in Argo, without the need to maintain both code bases? We went back to the drawing board and thought about writing a bit of a new DSL, a new library, for writing better pipelines. And here enters the Pythonic DSL.
What is a Pythonic DSL? Well, a Pythonic DSL is a mostly backward-compatible subset of Python. What does a mostly backward-compatible subset of Python mean? It doesn't mean that every piece of code you write in Python will automatically get translated into Argo pipelines. It does mean that everything you write in this DSL can run locally as standard Python code.
So let's have a look at how it works. Here is a very basic piece of Python code, probably the most basic map-reduce job that I can think of, and it looks like legitimate Python code. It has some sort of generate-list step that creates chunks of data, a map task that should be parallelized, and some sort of reduce task that ties it all together.

To turn it into an Argo DAG, all we need to do is add some decorators that define which of these are tasks, which are DAGs that call other tasks, and where we wish to run them; in this specific case, we want to run them on a Python 3 image. The neat thing about this is that from the same Python methods we can create a workflow that will run at scale in Argo.
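As a rough illustration of the idea only (the decorator names and compilation mechanism here are assumptions for the sketch, not the actual library described in the talk), a Pythonic DSL of this kind might look like this:

```python
# Minimal sketch of a "Pythonic DSL" for pipelines: plain Python that runs
# locally as-is, while decorators capture enough metadata to later compile the
# same functions into an Argo Workflows DAG. All names here are illustrative.
from __future__ import annotations

def task(image=None):
    """Mark a function as a pipeline task; `image` hints where it should run."""
    def wrap(fn):
        fn._pipeline_meta = {"kind": "task", "image": image}
        return fn
    return wrap

def dag(fn):
    """Mark a function as a DAG that orchestrates calls to tasks."""
    fn._pipeline_meta = {"kind": "dag"}
    return fn

@task(image="python:3")
def generate_list(n: int) -> list[int]:
    return list(range(n))          # produce the chunks of data

@task(image="python:3")
def square(x: int) -> int:
    return x * x                   # the "map" step, parallelizable in Argo

@task(image="python:3")
def total(values: list[int]) -> int:
    return sum(values)             # the "reduce" step

@dag
def map_reduce(n: int) -> int:
    chunks = generate_list(n)
    mapped = [square(c) for c in chunks]   # fan-out in Argo, a plain loop locally
    return total(mapped)

if __name__ == "__main__":
    # Runs as ordinary Python on a laptop; a compiler pass over the
    # _pipeline_meta attributes could emit the equivalent Argo Workflow spec.
    print(map_reduce(5))
```

Locally this is just Python; the idea is that a small compiler can walk the decorated functions and emit the corresponding Argo templates and DAG tasks for running at scale.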
Now, if you don't know what In-N-Out Burger is: In-N-Out is a very famous fast food chain on the West Coast, and what they became famous for is their secret menu. The secret menu basically allows you to mix and match lots of In-N-Out ingredients into customizable meals with crazy names like the 4x4 Animal Style or the Flying Dutchman.
But when you look a little deeper at the magic of the In-N-Out secret menu, the really neat thing about it is that they let you customize their menu to your will, yet they do it with the same four or five or six ingredients that they already have and already use for their basic meals. And that's, I think, the really nice thing: on one hand they allow you to customize your meal, and on the other hand they use the same system, the same ingredients and processes, which allows them to create something reproducible, always at a high level and very, very consistent.

I promised to tell you about machine learning, pipelines and heart attacks, so that's probably the part about heart attacks. Yes, you can basically customize your burger into crazy territory; I don't know why you would do that, but you can. So what can we learn from In-N-Out about building our pipelines?
When we looked at our pipelines, we actually split them into three types of elements. One element is components: components are basically reusable pieces of code and patterns that we wish to abstract away. Workflows are more or less like recipes: they are reusable, atomic pieces of work that can be consumed independently or as part of a larger cookbook. And cookbooks are a combination of one or more workflows, tailored for a very specific use, just like a very specific customizable meal that is tailored for a specific customer.
So let's drill down and see how we use them. The first thing is components. What are components? Components are basically the pieces of the workflow that we wish to abstract away from our users: things like Kubernetes resource management, configuration injection, secrets, common environment settings. Basically, all the things that our engineers and our data scientists don't really care about. All they want is to configure their workflows and start running; they don't care about resource management for Kubernetes.
What we do is take this configuration and pack it into a really nice abstraction. In this example, we took a very specific configuration of a Kubernetes resource and simply named it a medium-memory machine. Now, when someone wants to write a pipeline that is intended for ETL, they will use this type of machine; when they want a pipeline or a task for training, they can use a training machine, or a train-with-GPU machine, and so on, without thinking about how it works behind the scenes.
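A minimal sketch of that idea, with made-up preset names and values (not our actual component library), could look something like this:

```python
# Sketch: named machine presets that hide Kubernetes resource details from
# pipeline authors. The preset names and sizes here are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class MachinePreset:
    cpu: str
    memory: str
    gpu: int = 0

    def to_k8s_resources(self) -> dict:
        """Expand the preset into a Kubernetes resources block."""
        spec = {"cpu": self.cpu, "memory": self.memory}
        if self.gpu:
            spec["nvidia.com/gpu"] = str(self.gpu)
        return {"requests": spec, "limits": spec}

# The "menu" that engineers and data scientists actually pick from.
MEDIUM_MEMORY = MachinePreset(cpu="2", memory="16Gi")
TRAINING = MachinePreset(cpu="8", memory="64Gi")
TRAINING_GPU = MachinePreset(cpu="8", memory="64Gi", gpu=1)

# A task author only says "run this on a medium-memory machine"; the
# abstraction turns that into the resource spec Argo and Kubernetes need.
print(MEDIUM_MEMORY.to_k8s_resources())
```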
The next part, or the next element, is workflows. A workflow is an atomic piece of pipeline that can work independently: it has an input and an output, and it does some sort of useful job. The basic difference between a workflow and just a simple DAG is that for a workflow we define another decorator, called a workflow template.
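Continuing the hypothetical DSL sketch from before (the decorator name and behaviour are illustrative assumptions), marking a DAG as an independently consumable workflow might look like this:

```python
# Sketch: a workflow is a DAG that is also registered as a named, reusable,
# independently runnable unit (conceptually an Argo WorkflowTemplate).
def workflow_template(name: str):
    """Register a function as a named, reusable workflow with inputs and outputs."""
    def wrap(fn):
        fn._pipeline_meta = {"kind": "workflow_template", "name": name}
        return fn
    return wrap

@workflow_template(name="train-model")
def train_model(dataset_path: str) -> str:
    # ...load the data, train, persist the model...
    model_path = f"{dataset_path}.model"
    return model_path  # the workflow's output, usable on its own or inside a cookbook
```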
The third part is cookbooks. So we have components, which are basically about abstractions, and workflows, which are reusable, well-maintained, production-grade blocks of work. Now we have cookbooks, which are basically a mix and match of these well-maintained pieces of work, plus lots of add-on pieces of work, or ad hoc workflows, that you can combine in order to create a customized workflow that we can use, whether it's for training or serving or whatever.
In the most basic manner, we can just use this to write down our entire pipeline, or a customized pipeline for a specific use case. If our use case becomes more complex, we can easily change it, and if we wish to run it, let's say for testing, we can just add another task, called sampling, and run it really, really quickly.
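As a hedged sketch of how such a cookbook might be composed from the earlier pieces (every step function here is made up for illustration), the shape is roughly:

```python
# Sketch: a "cookbook" composes maintained workflows plus ad hoc steps into a
# pipeline tailored to one use case. All step functions are illustrative stubs.

def extract_cohort(source: str) -> list[dict]:
    return [{"patient": i, "source": source} for i in range(100)]   # ETL workflow

def sample(rows: list[dict], fraction: float) -> list[dict]:
    return rows[: max(1, int(len(rows) * fraction))]   # ad hoc step for quick tests

def train_model(rows: list[dict]) -> str:
    return f"model trained on {len(rows)} rows"        # reusable training workflow

def training_cookbook(source: str, quick_test: bool = False) -> str:
    rows = extract_cohort(source)
    if quick_test:
        rows = sample(rows, fraction=0.1)   # bolt on sampling for a fast iteration
    return train_model(rows)

print(training_cookbook("claims-db", quick_test=True))
```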
The assumption is that cookbooks don't need to survive long, although some cookbooks might need to survive longer; those are more or less production cookbooks, while others are experimental and can live for two or three or four weeks until they're not necessary anymore. This allows us to mix and match the things that need to be more stable with the things that need to be more experimental.
The third lesson that we've learned is about deployment. We've talked about how we have a cookbook menu and how we automatically deploy our cookbooks into our Argo clusters, so our data scientists can run their cookbooks from the UI, and that makes things a lot easier for them. But the issue with that is that very quickly we got a lot of junk into the clusters. Someone wants to create a version with a small change.
They don't want to hurt other developers, so what they do is rename the workflow a little bit and push it. And then we end up with a lot of slightly different versions, and we need to figure out which of them need to be maintained and which don't. The solution we found for that is basically namespace isolation for pipelines: each branch in our Git repository receives its own namespace.
When you push an update to Git, your namespace will be created or updated automatically, and it will never hurt other researchers' namespaces. Each namespace can be isolated by resources and artifacts, so no data gets mixed between production namespaces and research namespaces.
We basically use naming conventions to define these types of namespaces. There are namespaces used for serving, whose branches need to start with the word serving; there are namespaces meant for research; and there is a master branch namespace intended for continuously validating the health of our system. We are, of course, using Argo CD to deploy all of this, and it makes our life a lot easier.
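A tiny sketch of what such a branch-to-namespace convention could look like (the exact prefixes and mapping rules below are assumptions for illustration, not our production logic):

```python
# Sketch: derive an isolated namespace from a Git branch name by convention.
# The prefixes and the mapping below are illustrative assumptions.
import re

def namespace_for_branch(branch: str) -> str:
    slug = re.sub(r"[^a-z0-9-]", "-", branch.lower()).strip("-")
    if branch == "master":
        return "master"              # continuous validation of system health
    if slug.startswith("serving-"):
        return slug                  # serving namespaces keep their prefix
    return f"research-{slug}"        # everything else is a research namespace

print(namespace_for_branch("serving-er-triage"))   # -> serving-er-triage
print(namespace_for_branch("feature/JIRA-123"))    # -> research-feature-jira-123
```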
The fourth lesson I wish to talk about is interactive experimentation. When we look at pipelines, especially in the experimentation phase, sometimes I would like to work at scale, then stop for a second, run a small piece of code locally on my machine, have a look at it, maybe go through a few iterations, and then, once I'm happy with the result, continue with the entire pipeline. The question is: how can I do that with Argo? Our experience in the beginning wasn't very nice; it was a very fragmented experience.
We went into the code, started writing something, started running something, then went into the Argo UI, launched our long-running workflow from there, stopped the workflow, went back to the IDE, back to the UI, and so forth and so forth. One of the things that we found became very effective is a small library that we wrote: a library that allowed us to integrate our Argo workflows into our notebooks and into our basic debugging and research code.
Basically, it looks like a very simple Argo client that allows you to very easily submit a job, whether it's a job that you've just created or a job that already sits on the cluster itself. You can simply call a Python method to initiate the job, wait for it to complete, and get all the relevant logs back to the console.
You can suspend and resume a job, whether it's long-running or not, and you can check the status of the job if you wish. One of the nice things is that you actually get the job results serialized and deserialized for you, so if you need to look into the outputs or artifacts of the job, all you need to do is call results.outputs, name the output, and you can treat it just like a regular Python object or Python dictionary. The same goes for artifacts.
If you wish to read CSV results, you can easily do that. And the really nice thing is that it integrates well with our Pythonic DSL, so we can take the code that we wrote, the pipelines that we wrote in a very Pythonic way, and on one hand run them locally, and on the other hand, with two lines of code, run them on a cluster, which makes things a lot easier.
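To give a feel for the shape of such a notebook-friendly client, here is a hedged, self-contained sketch; the class, method, and parameter names are assumptions made for illustration and the "cluster" behaviour is stubbed out, so only the style of interaction mirrors what the talk describes:

```python
# Sketch of a notebook-friendly client for submitting and inspecting workflows.
# Everything here is an illustrative stub (a real client would talk to the Argo
# API server); only the shape of the interaction mirrors the talk.
import io
import pandas as pd

class RunResults:
    def __init__(self, outputs: dict, artifacts: dict):
        self.outputs = outputs          # deserialized output parameters
        self.artifacts = artifacts      # raw artifact bytes, by name

class WorkflowRun:
    def __init__(self, results: RunResults):
        self.results = results
        self.status = "Running"

    def wait(self, print_logs: bool = False) -> None:
        if print_logs:
            print("workflow logs would stream here")
        self.status = "Succeeded"

    def suspend(self) -> None:
        self.status = "Suspended"       # pause a long-running experiment

    def resume(self) -> None:
        self.status = "Running"

class ArgoClient:
    def __init__(self, namespace: str):
        self.namespace = namespace

    def submit(self, workflow: str, params: dict) -> WorkflowRun:
        # A real client would POST the workflow to Argo; here we fake a result.
        outputs = {"metrics": {"auc": 0.91}}
        artifacts = {"predictions": b"patient,score\n1,0.8\n2,0.3\n"}
        return WorkflowRun(RunResults(outputs, artifacts))

# Typical notebook flow: submit, wait, then inspect outputs and artifacts.
client = ArgoClient(namespace="research")
run = client.submit("training-cookbook", params={"source": "claims-db"})
run.wait(print_logs=True)
print(run.results.outputs["metrics"])                           # a plain Python dict
df = pd.read_csv(io.BytesIO(run.results.artifacts["predictions"]))
print(df.head())                                                # CSV artifact as a DataFrame
```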
The second lesson is how to treat our pipelines as code, and that doesn't only mean writing our pipelines as code, but actually treating them the same way as code: making sure that we can run them the same way that we write and run our code. The next lesson is pipeline architecture: thinking about how to build your pipelines so they are reusable and stable over time, without losing the ability to do research.
So that's basically it. I invite you to go and use Argo Workflows, the Pythonic DSL that we wrote, and the client API that we wrote, and to talk with me more about how to create better Argo pipelines and about Argo pipeline architecture.