Description
Carlson Cheng—Machine Learning Engineer at Thinking Machines—demos running Dagster for pipelines in production
See full July 13, 2021 Community Meeting here: https://www.youtube.com/watch?v=tjDnyE7Xcvo&t=1148s
A
Okay, so hello, everyone. I'm Carlson from Thinking Machines; I'm the head of the ML engineering team there. Thanks for inviting us over so that we can talk about how we use Dagster for running ML pipelines in production.
A
So, a brief intro to Thinking Machines: we are a global technology consultancy building AI and ML solutions and data warehousing platforms to solve high-impact problems for our clients. Our clients range from large corporations in Southeast Asia to global nonprofit organizations, and our main goal is to empower these business users with valuable data and insights so that they can make better decisions.
A
We are internationally recognized in the field of data science. We've presented at top machine learning conferences, most recently ICML and NeurIPS 2020, where we were awarded best paper at one of the NeurIPS machine-learning-for-development workshops for our geospatial research on poverty estimation using satellite imagery.
A
And we've used Dagster for a number of projects, primarily for big data warehousing, where we use Dagster for more traditional ETL and ELT use cases: unifying data from multiple sources and stakeholders, and building up data warehouses and dashboards for analytics.
A
We'll be focusing more on the third use case, which is MLOps, looking at one of our projects where we've used Dagster: building a smart, unified search app. This app consolidates a number of data sources and lets users search through those sources and get relevant information.
A
One example: a user would ask our application something like "How do I apply for vacation leave?" and our application, using ML algorithms, would return the most relevant section of the employee handbook, highlighting the steps you would need to actually file for a leave. Our use cases extend from that, also allowing users to query relevant entities like people and companies.
A
A second use case is Q&A search, which is our initial example, where you get the most relevant section of the employee handbook for your question, and a third is entity search, giving you the most relevant person or company information. All of these search results are piped into a search ranker. The search ranker is an additional ML model that prioritizes, among your search results, which one to list as the top, most relevant result.
A
Here's a simple architecture for our project. You can see here that our training and test sets are piped into the automated training and evaluation pipelines that we've built using Dagster. Based on this, we can do continuous training on newer data and build new models, which we then stage in an S3 bucket. From there we can redeploy our web application servers with these new models.
A
As our users use the application, their interactions give us more relevant information and user feedback, so that we can build more fine-tuned models for their application needs. We also have an additional setup here where we created a few Dagster pipelines for meta pipeline monitoring.
A
This is our ML automation workflow in general: it goes from PoC to dev to prod. Focusing on the first stage, the proof-of-concept phase is primarily done inside Jupyter notebooks, where our data scientists can fully test out their different ML approaches and run experiments. After that, once they finalize their ML methodology, we start migrating their Jupyter notebooks into Dagster pipelines, where we can do further fine-tuning and polishing.
A
So how does this work in practice? On the left side you can see the Jupyter notebook that a typical data scientist would create. In this case, after they've done some initial data prep, they start doing hyperparameter optimization. They're using a module here called Optuna, which is used for hyperparameter search, and they give it a set number of trials, say 100 trials.
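For readers following along, a minimal sketch of this kind of Optuna search looks roughly like the following; the classifier, the stand-in iris data, and the hyperparameter ranges are illustrative assumptions, not details from the talk.

```python
import optuna
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in training data; in the real notebook this would come from the data-prep cells.
X_train, y_train = load_iris(return_X_y=True)

def objective(trial):
    # Hyperparameter ranges suggested per trial.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "max_depth": trial.suggest_int("max_depth", 2, 16),
    }
    model = RandomForestClassifier(**params)
    # Score each trial with cross-validated accuracy.
    return cross_val_score(model, X_train, y_train, scoring="accuracy", cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)  # the "set number of trials" mentioned in the talk

print(study.best_params)  # hyperparameters of the best-scoring trial
print(study.best_value)   # its accuracy score
```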
A
They then get the best-scoring model with its hyperparameters and accuracy score, and that's what we can save and export as our best model from that number of trials. This is a typical thing you would get from a data scientist's notebook. Once we finalize and polish this, we can actually start moving it over to our Dagster pipelines, and the convenient thing here is that you can pretty much just copy-paste your notebook code into a Dagster solid, since Dagster is very Pythonic.
A
It doesn't really require you to write anything extra, since most Dagster solids are just Python functions.
A
You can pretty much just port it over to your Dagster solid. The additional steps, which you do in coordination with your data scientists, are to add descriptions to your solid definition and the input and output definitions. This is important because we'll need these definitions later on when we're validating and debugging our pipeline inside the UI. Some additional steps are just adding logs and Dagster assets for ML metadata tracking; more on this later.
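A rough illustration of what that porting step can look like, using the pre-1.0 solid API that was current at the time of this talk; the solid name, the types, and the run_optuna_search helper are hypothetical placeholders rather than the team's actual code.

```python
from dagster import InputDefinition, Output, OutputDefinition, solid

@solid(
    description="Optuna hyperparameter search ported over from the notebook.",
    input_defs=[
        InputDefinition("train_set", dagster_type=dict, description="Prepared training data"),
    ],
    output_defs=[
        OutputDefinition(str, name="model_path", description="Path of the exported best model"),
    ],
)
def train_best_model(context, train_set):
    # The body is essentially the notebook code pasted in;
    # run_optuna_search is a hypothetical helper wrapping the study shown earlier.
    best_params, best_score, model_path = run_optuna_search(train_set)
    context.log.info(f"Best score {best_score} with params {best_params}")
    yield Output(model_path, output_name="model_path")
```

The description and input/output definitions are what later surface in the UI when validating and debugging the pipeline.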
A
Here are some learnings we've had when running ML pipelines alongside existing Dagster pipelines. Usually, when we're porting our Jupyter notebooks into more standard ML pipelines, we already have an existing Dagster infrastructure set in place for more traditional ETL and ELT pipelines. So we don't really need to create a new platform for our ML workflows; we can just make use of our existing Dagster infrastructure and add our ML pipelines there.
A
One thing to take note of is that we should organize our pipelines into logical groups. For example, you would have a group for your different ETL pipelines for source A and source B, and then other groups for your ML pipelines, say for a specific model X and another model Y. We make use of Dagster's repositories feature, which helps us isolate the individual groups, and this also further helps us isolate the dependencies for each of these pipelines, so you can avoid conflicts.
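In code, those logical groups map onto separate @repository definitions, each of which can be loaded from its own environment so dependencies stay isolated. A minimal sketch with placeholder pipeline names:

```python
from dagster import pipeline, repository, solid

@solid
def ingest_source_a(context):
    context.log.info("ingest source A")

@pipeline
def source_a_etl_pipeline():
    ingest_source_a()

@solid
def train_model_x(context):
    context.log.info("train model X")

@pipeline
def model_x_training_pipeline():
    train_model_x()

# One repository per logical group: ETL sources vs. a specific ML model.
@repository
def etl_sources_repo():
    return [source_a_etl_pipeline]

@repository
def ml_model_x_repo():
    return [model_x_training_pipeline]
```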
A
And some additional learnings we've had: Dagster makes it very easy to move over to prod, since the pipeline implementation is pretty much the same whether you're running on, say, your local machine or in your Kubernetes production environment, so there are very minimal changes when moving your pipelines over. Most of the changes are done in the high-level Dagster configurations, but at the pipeline level you don't really need to change much to port it over.
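A minimal sketch of that idea in the pre-1.0 API: the pipeline body stays the same, and only the mode and resource wiring changes between a local run and the production deployment. The resource choices below are illustrative assumptions; the Kubernetes deployment itself is configured at the instance level rather than in pipeline code.

```python
from dagster import ModeDefinition, fs_io_manager, pipeline, solid
from dagster_aws.s3 import s3_pickle_io_manager, s3_resource

@solid
def train(context):
    context.log.info("training...")

# Local development: intermediate outputs land on the local filesystem.
local_mode = ModeDefinition(name="local", resource_defs={"io_manager": fs_io_manager})

# Production: intermediate outputs go to S3 instead; the solid code does not change.
# (Running this mode requires run config for the S3 bucket.)
prod_mode = ModeDefinition(
    name="prod",
    resource_defs={"io_manager": s3_pickle_io_manager, "s3": s3_resource},
)

@pipeline(mode_defs=[local_mode, prod_mode])
def training_pipeline():
    train()
```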
A
Speaking of production, moving on to that: this is where we can fully make use of our automated model training and evaluation pipelines, producing new models and deploying them onto our servers, where we can do further monitoring and get new data. For pipeline monitoring for MLOps, we make use of Dagster's asset materialization feature so we can keep track of ML metadata. Coming back to our initial example, we have a code snippet here where we create an asset called generated_model.
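That snippet looks roughly like the following; this is a sketch in the pre-1.0 event API, the metadata argument shape has changed across Dagster versions, and the path and values here are placeholders rather than the team's actual code.

```python
from dagster import AssetMaterialization, Output, solid

@solid
def export_best_model(context, best_params, best_score):
    model_path = "s3://models/generated_model.pkl"  # placeholder path

    # Record the trained model as a Dagster asset so each training run's
    # file name, hyperparameters, and accuracy show up on the asset page.
    yield AssetMaterialization(
        asset_key="generated_model",
        metadata={
            "model_path": model_path,
            "hyperparameters": str(best_params),
            "accuracy": best_score,
        },
    )
    yield Output(model_path)
```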
A
This is important because later on, when you're actually checking your Dagster UI, you can view the metadata for each of your training runs. Here, your latest run can show you the file name, the hyperparameters of the best-scoring model, and its accuracy score. Your data scientists can keep track of the scores of each of the training runs, and we get to easily see whether a certain run is performing well.
A
Say a certain model performed really well and then we notice that the score went down: we can easily debug issues within the ML workflow.
A
On the UI side, we also really appreciate the pipeline definitions. Since Dagster pipelines are very data-aware, we can see the inputs and outputs coming through each of the solids, so we can keep track of how our data is processed and changes throughout the pipeline. This is as opposed to the Airflow UI, where the data is abstracted away from you: you don't really get to see how the data is processed in your pipelines, and overall it makes for a more intimidating UI to work with.
A
Our data scientists monitor the pipeline outputs on the Dagster assets page, validating these models and ensuring that the models meet a certain threshold before they deploy. They trigger pipelines with different configurations, running the training pipeline either in a dev mode, where it just runs on a subset of the data, or in a production mode, where it runs on the full set of data.
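One way to express those two configurations is as presets on the training pipeline. In this sketch, sample_fraction is a hypothetical config key standing in for however the real pipeline selects a subset of the data, and it assumes the train solid declares a matching config_schema.

```python
from dagster import PresetDefinition

# Attached to the pipeline via @pipeline(..., preset_defs=[dev_preset, prod_preset]).
dev_preset = PresetDefinition(
    name="dev",
    mode="local",
    run_config={
        "solids": {"train": {"config": {"sample_fraction": 0.1}}}  # small subset of the data
    },
)

prod_preset = PresetDefinition(
    name="prod",
    mode="prod",
    run_config={
        "solids": {"train": {"config": {"sample_fraction": 1.0}}}  # full data set
    },
)
```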
A
For additional pipeline monitoring, we also make use of Slack notifications. In general, our company uses Slack for day-to-day communication, so this allows us to spend less time manually checking the UI for pipeline success or failure messages, and we get to find out when something happens as soon as it happens.
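One way to wire this up is with the dagster_slack integration's success and failure hooks; a minimal sketch, assuming a Slack bot token is available in an environment variable. The channel name and Dagit URL are placeholders, and this is not necessarily how the team implemented their notifications.

```python
from dagster import ModeDefinition, pipeline, solid
from dagster_slack import slack_on_failure, slack_on_success, slack_resource

@solid
def train(context):
    context.log.info("training...")

@pipeline(
    mode_defs=[
        ModeDefinition(
            resource_defs={
                # Hooks below require a "slack" resource; token read from the environment.
                "slack": slack_resource.configured({"token": {"env": "SLACK_BOT_TOKEN"}})
            }
        )
    ],
    hook_defs={
        # Post to Slack on failure (with a link back to Dagit) and on success.
        slack_on_failure("#ml-pipeline-alerts", dagit_base_url="https://dagit.example.com"),
        slack_on_success("#ml-pipeline-alerts"),
    },
)
def training_pipeline():
    train()
```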
A
Just an example here: we get to see that a Dagster hourly ingestion pipeline ran and succeeded, although it didn't actually get any new data, so this still counts as a success for us. We can see here that the elapsed time takes this much, and we can see if there are any issues with the CPU resources.
A
We would notice if ever that elapsed time is unusually high, and we can see the S3 link to the ML model path that we created and used for the pipeline. Similarly, we can check pipeline errors whenever they happen: we can see the specific pipeline that failed and which solid actually failed there, and even the error message that shows up in that solid. Further on, we created a handy link here that sends us over to the Dagster run itself.
A
As you can see here, we have a summary that we get on a day-to-day basis, giving us all of the different production pipelines that we have and their success rates over the past few runs they were executed on. We can easily see here that some of the pipelines are working as expected and some are not doing as well, and some things might need to be flagged for further debugging. We even have the last success date, which helps us with further checks.
A
If there are any issues, we can spot them based on that date. The way we do this is we have a pipeline that accesses the Dagster database, primarily the runs table. We run a simple SQL query that just checks the number of pipeline runs for each pipeline, based on, say, the past 10 runs of a pipeline.
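A sketch of what that check could look like, assuming a Postgres-backed Dagster run storage; the table and column names reflect recent Dagster versions and may differ in yours, and the connection string is read from a placeholder environment variable.

```python
import os
import psycopg2

# Success rate and last success per pipeline over its 10 most recent runs.
QUERY = """
SELECT pipeline_name,
       SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) AS successes,
       COUNT(*) AS total_runs,
       MAX(CASE WHEN status = 'SUCCESS' THEN update_timestamp END) AS last_success
FROM (
    SELECT pipeline_name, status, update_timestamp,
           ROW_NUMBER() OVER (
               PARTITION BY pipeline_name ORDER BY create_timestamp DESC
           ) AS rn
    FROM runs
) recent
WHERE rn <= 10
GROUP BY pipeline_name;
"""

with psycopg2.connect(os.environ["DAGSTER_POSTGRES_URI"]) as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for name, successes, total_runs, last_success in cur.fetchall():
            print(f"{name}: {successes}/{total_runs} succeeded, last success {last_success}")
```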
A
So yeah, in conclusion, why we think Dagster works for ML pipelines in production. Data scientists overall get a very user-friendly UI, which enables them to run data pipelines without fear: they can easily monitor their pipelines and debug them from the UI. Secondly, Dagster is very versatile; we would usually have a Dagster infrastructure that already supports ETL and ELT pipelines, and then we can easily just extend that to support ML pipelines. This removes the overhead of setting up something completely new for our ML workflows.
A
And thirdly, Dagster uniquely works for MLOps because, unlike other orchestrators, Dagster has features that support MLOps on top of automating our training and evaluation.
A
Just some extra things that we're planning on working on next: in the future we plan on migrating to gRPC servers inside Kubernetes, so that we can separate the pipeline code from the core Dagster infrastructure. This helps us update our pipelines separately from the Dagster daemons, like the scheduler and the sensors, so that we can avoid redeploying them all together. This is just a step toward more process isolation, but inside Kubernetes.
A
We also want to try out dynamic orchestration for ETL, allowing us to generate solids dynamically at runtime instead of having to define them manually. This has the bonus of making things very easy to check inside the UI, since it lets you view those dynamic solids a lot more easily than manually defined solids.
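Dynamic orchestration in the pre-1.0 API looks roughly like this; the table names are placeholders, and in some earlier Dagster releases these classes lived under dagster.experimental. Each DynamicOutput fans out into its own copy of the downstream solid, and each copy shows up individually in the UI.

```python
from dagster import DynamicOutput, DynamicOutputDefinition, pipeline, solid

@solid(output_defs=[DynamicOutputDefinition(str)])
def discover_source_tables(context):
    # In a real pipeline this list would be fetched from the source system at runtime.
    for table in ["customers", "orders", "payments"]:
        yield DynamicOutput(value=table, mapping_key=table)

@solid
def ingest_table(context, table: str) -> str:
    context.log.info(f"Ingesting {table}")
    return table

@pipeline
def dynamic_etl_pipeline():
    # One ingest_table invocation is generated per dynamic output at runtime.
    discover_source_tables().map(ingest_table)
```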
A
So yeah, that's pretty much it. Thank you for listening, and I'm free for any questions later on. Thanks.
B
That was fantastic, Carlson. Anyone in the audience, feel free: we can take brief questions now. You can put anything in the chat, or, if you're feeling brave, you can just unmute and pop in. This is a pretty unregulated Zoom call.
C
I actually have a quick question. This is Rebecca here. Thanks for the presentation; it's really cool to see how you guys use it. I just had a question about your Slack pipelines, I mean the notifications about the pipelines that run. The summary one looks really great; the ones that report on individual pipelines, I may...
C
...that, but if that's what it's doing, does it get spammy? Is it something that's helpful for you to monitor systems and that kind of stuff, or how do you use that?
A
Yeah, for the Slack notifications in general: overall, the most important ones are the ones that actually fail. The success notifications might not be as important for us; that's just an additional check that we do. But the ones that actually do fail, those are the ones where we actually tag users. We automate the tagging functionality as well, so whenever we have a pipeline error, we tag the relevant person for that pipeline.