Dagster Dagster in Media, 8 Dec 2022

From YouTube: Adding CI/CD to ML Pipeline with MLflow, Dagster, and Github Actions

Description

This is an ML pipeline with both CI & CD components*.

The story today is about CD - Continuous Deployment**. I’m assuming as the Data Scientist I don’t have access to the production environment (no “one-click deploy to prod” for me)***.

I develop my model (new feature branch) locally, using our dev database and local compute. I track experiments with MLflow and orchestrate my code with Dagster.

Once satisfied with my new model, I do the following to deploy:
- I check my code into a feature-branch in source control
- Open a pull request to the dev branch of our code base
- Kick back and watch

This starts a CI job via GitHub Actions that:
- Builds my project
- Tests my code
- Deploys to the dev environment (teeny little deployment :)

If successful, another team member will merge my code into the dev branch.

Upon the merge into dev, this automatically triggers (dare I say, continuously) a deployment job:
- Deploys my code to Staging
- In my case, this job re-runs my ML pipeline and tests on staging data
- The benefit here is that typically staging data is closer to production data than whatever I was using in dev

If successful, this job initiates a manual review process for deployment to prod:
- Prompts an admin to review my code and choose whether to run the final job in the workflow: deploying to production.
- As the admin, I approve the deployment and the CD job finishes by training and deploying the model in prod.
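
The gated flow described above can be sketched in plain Python. This is only an illustration of the ordering and the two gates (merge into dev, admin approval); it is not GitHub Actions syntax, and `Change`, `advance`, and `run_pipeline` are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Environments in promotion order, as described in the text.
ENVIRONMENTS = ["dev", "staging", "prod"]


@dataclass
class Change:
    """One code change moving between environments (hypothetical model)."""
    merged_to_dev: bool = False   # set when a team member merges the pull request
    prod_approved: bool = False   # set by an admin after manual review
    deployed: List[str] = field(default_factory=list)


def advance(change: Change, run_pipeline: Callable[[str], bool]) -> str:
    """Deploy to each environment in order, stopping at a closed gate or a failure."""
    gates = {
        "dev": True,                      # the pull request itself triggers the dev deployment
        "staging": change.merged_to_dev,  # staging only runs after the merge
        "prod": change.prod_approved,     # prod only runs after admin approval
    }
    for env in ENVIRONMENTS:
        if env in change.deployed:
            continue
        if not gates[env]:
            return f"waiting on gate before {env}"
        if not run_pipeline(env):
            return f"pipeline failed in {env}"
        change.deployed.append(env)
    return "deployed to prod"
```

Note that only code moves through `advance`; the pipeline is re-run in each environment, which matches the "move code, never models" pattern in the footnotes.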

Lots of hand-waving in this example, but I hope it helps show the git-based workflow moving between environments and the larger theme of the significant work required to actually deploy an ML project.

* This is part of my ongoing saga to get closer to something resembling an actual production deployment instead of the notebook-based fit/predict/API patterns you tend to see

** I have a tendency of using CI/CD interchangeably (read: incorrectly). Setting up this example has really helped clarify where

*** In my examples, I only move code between environments, never models. This is the pattern I see most often with ML teams. It’s possible you might deploy your model from dev to staging to prod

Continual - we're lucky to work with ML teams that care about software engineering best practices. If that sounds like you, please hit us up.
#python #ml #dagster #mlflow

Feel free to connect with me on LI: https://www.linkedin.com/in/gustafrcavanaugh/

A

So, in my ongoing saga to get closer to a production ML example: you know, typically it's a notebook thing, fit, predict, open up a REST API, one-click push to production, or shift-enter, push to production. That's not realistic. Typically, as a data scientist, we're hearing: hey, I don't have access to production or staging. I have my local environment for compute, and I have access to some dev tables. I'm now going to add in an actual continuous deployment step. So previously I've shown some continuous integration. Now we're going to do some continuous deployment on this end-to-end pipeline.

A

The details here: I'm using MLflow for experiment tracking, Dagster for orchestration, and GitHub Actions for both this explicit CI job as well as the CD job. But jargon aside, let's step through it so you can see what's going on. So I'm going to go to VS Code, and I'm going to completely hand-wave the feature engineering, model training, all the cool data-sciency stuff I would do. Let's just make a quick edit to this README file, and before I

A

Do that, what I should have done was say we need to check out a new feature branch. So the standard workflow would be: as the data scientist, I'm going to check out a feature branch, and we'll call this one GRC cool feature. Now I will make my change in this branch, and in the interest of time I'll just call this foo. We made a change. We can see that here with git status. We've made this change. I'm going to add the change, I'm going to commit it.

A

Oh, sorry for all the bad spelling. And now I'm going to push this change to origin, and we're going to push to GRC cool feature. This will then prompt me to open up a pull request, which is going to be the workflow here. I'm going to create this, open this pull request. This will then kick off the CI job that we've set up in our CI/CD system, which will then test our code.

A

So I will create the pull request, with its misspellings and its ugliness and its silliness, but we get: hey, this branch hasn't been deployed. Boom, we're starting something here. What's going on? This is the CI job, and I should have this nicely labeled as our CI job. And what's happening in this workflow file, just so you can see:

A

"Gus is learning GitHub Actions ML": on pull request to dev, which is the main branch in my case, we're going to pull down our code, set up a Python environment, and run our Dagster job, which very nicely encapsulates all the complexity of the ML pipeline and all the tracking with MLflow.

A

We essentially just let Dagster handle it. And now, once this is complete, which will take a minute or so to spin up, once this runs (and this, by the way, is solving for "hey, pinky swear, I made a change to the code base and I ran it on my machine, it works"; no, no, we're going to test independently of that), once this is done, this will then kick off a second workflow that will actually do a deployment. So we're doing the integration here, integrating all the code, running our tests.

A

After that we want to deploy. We want to have a deployment run in development. We're then going to have a second deployment run in staging, and if that succeeds, we'll finally have a deployment run in prod. And in between these two steps, using GitHub Actions, we're able to say: hold on, don't automatically do this. You can see the build just finished, sorry, on the left. Now this deploy to development is occurring immediately.

A

So as we look at our different steps, our different stages here in the summary: we ran this build step, which is kind of our core CI step, and a small deployment to dev. This succeeded, excellent, which now means someone from the engineering team would come over. They would see this PR, they'll open it up, and they'll say, "This guy doesn't know how to spell, he can't code, but his CI tests passed, his initial deployment passed, we'll go ahead and merge this pull request." Now, upon this merge,

A

This is when the second, our real CD part of the pipeline, is going to kick off. So I'll come back over to Actions so we can see this happening, and then boom, we've got this next workflow occurring. What's happening here now is we're going to rerun our build, just for posterity, and maybe I should eliminate that. Then we're going to deploy to staging. And the reason why we want to deploy to staging is, again, in a traditional environment,

A

As the data scientist, I don't have access. I have access to a development environment. I have some dev data. Typically with ML stuff the data volumes are really large, and in dev I probably have some older data. I want to run this on something that's closer to our production data volumes and what production data looks like, but I can't: it's going to involve either lots of compute or data security issues, so I work with a subset. And then in staging, we keep the staging environment pretty close, if not identical, to production.
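
One way to picture running the same code against different data per environment is a small config lookup keyed on an environment variable. This is a sketch under assumptions: the `DEPLOY_ENV` variable, the table names, and the sampling fractions are all made up for illustration and do not come from the video or from Dagster's own configuration system.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EnvConfig:
    name: str
    table: str              # hypothetical source table for the pipeline
    sample_fraction: float  # share of rows the pipeline reads


# Illustrative values only: dev reads a small, older subset; staging mirrors prod.
CONFIGS = {
    "dev": EnvConfig("dev", "dev.features_sample", 0.05),
    "staging": EnvConfig("staging", "staging.features", 1.0),
    "prod": EnvConfig("prod", "prod.features", 1.0),
}


def load_config(environ: Optional[Mapping[str, str]] = None) -> EnvConfig:
    """Pick the config for the current environment, defaulting to dev."""
    environ = os.environ if environ is None else environ
    env = environ.get("DEPLOY_ENV", "dev")
    try:
        return CONFIGS[env]
    except KeyError:
        raise ValueError(f"unknown environment: {env!r}") from None
```

The pipeline code stays identical across environments; only the value of the variable set by each deployment job changes which data it sees.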

A

This is kind of the real core test. So of course, or hopefully, my pipeline here is going to succeed, but it's very possible that yours would not. You might fail at this step, or the model metrics that come out of this might be like: hey, it looks great in dev, but now that we're actually running on what's close to real data, I'm realizing this isn't actually working, or it's not as effective as I thought.

A

I need to go back to the drawing board. But for this case we have this automated workflow running. We'll go back to our workflow file. This is the part of the workflow we're doing now: we're going to go ahead and run our deploy to staging, and we can see this kick off. Back to the summary step: when this succeeds, this will then kick off the next part of the workflow, which is a human-in-the-loop review. So this succeeded, great, staging passed.

A

This means, you know, all of our tests passed, the code, the model ran, everything worked, and (I need to write some code to do this) our model metrics, our model looks good. Now we get to go to production. But in this case we're going to be explicit. There's no "okay, automatically, one-click deploy, data scientist gets to do it." No, no, there's an admin, there's an engineering team that decides what goes into production. They control this.
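
The metrics check mentioned here as still to be written could look something like the sketch below. It is not the code from the video: the metric names, the thresholds, and the `failed_checks` and `gate` helpers are assumptions for illustration, and the metrics dictionary would have to come from wherever the pipeline records them.

```python
import sys
from typing import Dict, List, Optional

# Hypothetical minimums; a real project would choose its own metrics and thresholds.
THRESHOLDS = {"accuracy": 0.80, "auc": 0.75}


def failed_checks(
    metrics: Dict[str, float],
    thresholds: Optional[Dict[str, float]] = None,
) -> List[str]:
    """Return one message per metric that is missing or below its minimum."""
    thresholds = THRESHOLDS if thresholds is None else thresholds
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} < {minimum:.3f}")
    return failures


def gate(metrics: Dict[str, float]) -> int:
    """Report failures on stderr and return a process exit code (0 = pass)."""
    failures = failed_checks(metrics)
    for line in failures:
        print(f"FAIL {line}", file=sys.stderr)
    return 1 if failures else 0
```

A staging job could end with `sys.exit(gate(metrics))`, so that a non-zero exit code fails the step and the workflow never reaches the manual review for prod.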

A

At this point in time, this person is notified. Now, for me, I'm the only one in this project, so I'm notifying myself, but I need to review this deployment. I need to go in and actually take a look at it, and then, when I'm satisfied, having reviewed metrics, etc., I can then say, "Hey, I approve this deployment, model looks good." Obviously I'm hand-waving all of this, but this is part of the governance, the actual workflow that goes into this. Approve, and now our deploy-to-production job

A

Kicks off. So in this way, the workflow for the end user, the data scientist, to get their model into production could be distilled as: write code, push code to a feature branch or into the development branch. And then, I keep saying "you can," but what we typically see from teams is they'll say, once the data scientist has done this,

A

We have a series of steps, a series of audits, some of which are automated, to build their code, test their code, and then, upon success of those tests, move their code into different environments, which we're calling deployments, maybe dev, staging, and prod, and at each step there'll be different gateways and different review processes.

A

If this is a bank, and this is a model that's related to loans, the model review process might take six or seven months. There might be spreadsheets and, you know, paper going into files as part of this process. In other avenues it might be a bit faster, but typically there's always this sort of governance. It's never: okay, you know, my notebook looks great,

A

Let me just click a button, really push a button inside of my cloud-based IDE, and I'm going to deploy something to production. Right? I'm going to check code into source control, and that's going to kick off other workflows. Anyway, this has already gotten way too long. Way more to come from me around the details of all this: looping back to MLflow for our experiment tracking, using different environments in Dagster, and the other great orchestration tools out there.

A

But at least for me, this was really helpful in seeing a clear difference between the CI workflow and the CD workflow as it pertains to the end user, which is like: hey, I just checked my code in, and then the process kicks off.