Description
Alessandro Marrella—Staff Software Engineer at Earnest Research—discusses using Dagster to power their machine learning pipelines
See the full September 14, 2021 Community Meeting here: https://www.youtube.com/watch?v=oCakb_tB_dU&t=1643s
Okay, hi everyone. I'm Alessandro, and today I would like to present the work that Earnest has been doing and how the company leverages Dagster to achieve it.
So what does Earnest do in practice? It sells products that let its customers understand how the economy is moving, where consumers are spending their money and, with that information, how merchants are performing in the market.
It has a great track record: in the past five years, more than 500 potential beats or misses were predicted against the consensus.
The consensus is how the markets think a company will do, and Earnest was able to predict results better than those expectations, which were then validated at the end of each quarter.
Having said that, as you can see, data and analytics are a very important function at Earnest, and for this reason the data science enablement team was created, with a mission to grow Earnest's product with data science and machine learning capabilities and to support initiatives across Earnest's engineering and analyst teams. The team wants to enable both engineers and analysts, who make up a larger part of the company, to run data science and machine learning on the data.
The team is integrated with other teams using Dagster sensors, and operations are farmed out to managed services such as AI Platform, Dataflow and BigQuery. To connect all this infrastructure, the team created a library called the Data Science Development Kit or, in short, the SDK, which I will talk about soon.
So let's talk a little bit about the gap between experiments and production pipelines. Experiments in the company are created in Jupyter notebooks, usually as Python code. The company historically runs production pipelines in Airflow using the KubernetesPodOperator, but a lot happens between the first step and the second.
There are actually a lot of things that go in between: write the code in Python, initially as in your experiment; run tests locally; if you're good, refactor these chained transformations into CLI apps; wrap them individually in KubernetesPodOperator tasks; and then either run Airflow locally or push it to a remote instance to test. If everything works, great, but it never works on the first try, so you go back to step two and do it again, and again, and again. Wrapping the CLI apps in particular I found very, very hard to iterate on; we call it "CLI hell", because you need to keep passing arguments to the CLI and hope that they are what it wants.
So why did we choose Dagster? We first looked at Airflow, because it was what was being used in the company, and we saw that, first of all, some features are just not there: directly executing DAGs, or input and output type checking, which would have been nice. Configuration exists, but it's a JSON blob that's not really validated. The most important thing, though, is that the programming model is very task-centric: you do A, then you do B.
If you use the KubernetesPodOperator, as I mentioned, there is CLI hell, so again, especially for data scientists, iterating on this was very painful. So we looked around, and the established solutions, especially on Google Cloud, were Kubeflow and TFX. Both are nicer to work with than Airflow, but they share some of the issues that Airflow has. First of all, data input and output is still a side effect: you don't pass data between tasks.
You pass information about data, if you're lucky, and configuration is validated, but it's not as good. TFX also ticks the boxes of directly executing and testing DAGs and of input and output type checking, but it is very, very tied to TensorFlow, which is not the only library we want to use in the company. So, as you can see from the last column, Dagster, besides ticking all the boxes that we wanted, also has a great API and, in my opinion, a great programming model, by putting data at the edges.
So, just to reiterate, why do we like Dagster? It's easy to run the same code locally, execute pipelines in notebooks (as we will see soon) and write tests, all without Kubernetes. Type checking is everywhere, despite it being Python. It has a data-centric approach, which leads to well-designed DAGs; I think it's very important to put data in as inputs and out as outputs, and not to start reading and writing data randomly inside a DAG. A solid is just a matter of decorating a Python function, and dependency hell is nicely avoided thanks to repositories and gRPC servers.
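To make that concrete, here is a minimal sketch of what "a solid is just a decorated Python function" looks like, using the legacy solid/pipeline API that was current at the time of this talk; the names and data are placeholders of mine, not Earnest's code.

```python
from typing import List

from dagster import execute_pipeline, pipeline, solid


@solid
def load_spend(context) -> List[float]:
    # Data is an output of the solid, not a hidden side effect.
    return [10.0, 20.0, 30.0]


@solid
def total_spend(context, spend: List[float]) -> float:
    # Inputs and outputs are type-checked at run time.
    context.log.info(f"summing {len(spend)} values")
    return sum(spend)


@pipeline
def spend_pipeline():
    total_spend(load_spend())


if __name__ == "__main__":
    # The same code runs locally, in a notebook, or in a test.
    result = execute_pipeline(spend_pipeline)
    assert result.success
```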
So we have this infrastructure, and we wanted to build some tooling for our data scientists to interface with it, so we built the Data Science Development Kit, the SDK.
We needed to integrate with different data sources and services, especially in Google Cloud. Actually, when we started this exercise we were on AWS, then we moved to Google Cloud with the entire company, and Dagster helped with that too. The execution happens in different layers, and the SDK provides integration with all of them, and it supports different file formats and utilities that you can use day to day in notebooks.
Each of these components also translates to a Dagster solid via a to_solid method, so Dagster is really at the core of everything that happens in the SDK.
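As an illustration of the to_solid idea, here is a hypothetical component; the class name, constructor and run method below are my assumptions, not the SDK's real interface.

```python
from dagster import solid


class SqlQueryComponent:
    """Hypothetical SDK-style component that can run ad hoc or as a Dagster solid."""

    def __init__(self, name: str, query: str):
        self.name = name
        self.query = query

    def run(self, context=None):
        # In a real component this would submit the query to BigQuery;
        # here we just echo the SQL as a stand-in.
        return f"-- executed --\n{self.query}"

    def to_solid(self):
        # Wrap the same logic as a solid so it can be composed into pipelines
        # without rewriting anything.
        @solid(name=self.name)
        def _component_solid(context):
            return self.run(context)

        return _component_solid
```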
We also have a custom IO manager and some custom types. First, talking about the types: we have Location, which is essentially a pointer to data that may be computed elsewhere, so this could be BigQuery or GCS or the local file system; and DataFrame, which instead is data that is actually computed in, or needed by, the Python code that the solid is running.
You can see here a weird thing, which is that the types mismatch, or seem to mismatch. In fact, the IO manager that we built takes care of converting between the two, so we can chain solids without having to compromise on performance: for a component that runs a SQL query, for example, it would be weird to load a data frame just to yield it to the next component.
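A minimal sketch of that Location/DataFrame conversion, assuming a local parquet file as the storage behind a Location; the type and class names here are my placeholders, not the SDK's actual IO manager.

```python
from dataclasses import dataclass

import pandas as pd
from dagster import IOManager, io_manager


@dataclass
class Location:
    """A pointer to data that lives elsewhere (here: a local parquet file)."""
    uri: str


class LocationIOManager(IOManager):
    """Converts between Location pointers and in-memory DataFrames between solids."""

    def __init__(self, base_dir: str = "/tmp"):
        self.base_dir = base_dir
        self._locations = {}  # step_key -> Location (in-process sketch only)

    def handle_output(self, context, obj):
        if isinstance(obj, Location):
            # Data was computed elsewhere (e.g. a SQL job); just keep the pointer.
            self._locations[context.step_key] = obj
        else:
            # An in-memory DataFrame: materialize it and keep a pointer to the file.
            path = f"{self.base_dir}/{context.step_key}.parquet"
            obj.to_parquet(path)
            self._locations[context.step_key] = Location(uri=path)

    def load_input(self, context):
        # The downstream solid wants a DataFrame, so dereference the pointer.
        location = self._locations[context.upstream_output.step_key]
        return pd.read_parquet(location.uri)


@io_manager
def location_io_manager(_init_context):
    return LocationIOManager()
```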
You can see here that I put some icons for the Location and DataFrame types that you can import and export, and this is how it looks in Dagit.
You can just swap BigQuery with GCS, with Google Sheets, with the local file system, without changing anything in the code. Just by changing the configuration, you can go from a unit test to what you do in production, essentially.
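Here is how that configuration swap might look in practice; the io_manager keys and fields below are my guesses at a plausible shape, not the SDK's real schema.

```python
# The same pipeline, pointed at a local file for a unit test and at GCS for a
# production-like run, purely through run config.
local_run_config = {
    "resources": {
        "io_manager": {
            "config": {"location": "local", "url": "/tmp/demo.parquet", "format": "parquet"}
        }
    }
}

gcs_run_config = {
    "resources": {
        "io_manager": {
            "config": {"location": "gcs", "url": "gs://my-bucket/demo.parquet", "format": "parquet"}
        }
    }
}

# execute_pipeline(my_pipeline, run_config=local_run_config)  # unit test
# execute_pipeline(my_pipeline, run_config=gcs_run_config)    # production-like run
```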
So, enough talk, let me show you some code; I'm now switching to a notebook interface.
Here, while we are suppressing warnings because it's a demo, I'm importing the SDK, which is the library I just talked about, plus some libraries that I want to use in the demo. And this is the workflow that a data scientist usually goes through: they start experimenting with some data, so they load the data. For example, here we are using a demo dataset (obviously, just like in real-world data science).
If data scientists are able to write their code such that everything fits in these classes, or in further classes that we're building, then they get a lot of stuff for free. It's actually very easy to implement these classes, because they use standard Python types; you just need to forget about where you get the data from and where the data goes. They're just Python functions, really, and because they're just Python functions you can run them and experiment with them locally.
Here we are creating a pipeline with the usual nice Dagster syntax, where results are just results and it looks like Python. The pipeline can then be executed in the notebook thanks to Dagster's execute_pipeline function: we pass in the pipeline and then we pass some configuration, some standard IO manager config and then some solid-specific configuration. Obviously, this pipeline runs.
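In notebook terms, the cell might look roughly like this, continuing the placeholder pipeline from the earlier sketch; the config keys are illustrative, not Earnest's real ones.

```python
from dagster import execute_pipeline

result = execute_pipeline(
    spend_pipeline,  # the placeholder pipeline defined in the earlier sketch
    run_config={
        # IO-manager and solid-specific config would go here, e.g.:
        # "resources": {"io_manager": {"config": {...}}},
        # "solids": {"total_spend": {"config": {...}}},
    },
)
assert result.success
# Inspect an output directly in the notebook to see whether it looks right.
print(result.output_for_solid("total_spend"))
```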
You can now see all the output in the Jupyter notebook, look at all the logs, and then you can also inspect the output and see whether the results are what you want. Maybe the results are not what you want, because here there is too much mismatch between these classes, so you want to experiment again. Since here we are using scikit-learn and the interface for inference stays the same, we are just swapping the training step in this case.
So the data scientist implements a new training class with the same interface, and they can test again locally, just to sanity-check everything and look at the confusion matrix. To make it a pipeline, we just need to transform the new class into a solid and define the pipeline again, and we can run it again; and again we get the confusion matrix. Okay, so here we are loading data locally: as you can see, in the configuration the location is local, and I put the URL and the data format.
And yeah, that was the wrong output. We can also swap executors. For example, if we want to run inference on Apache Beam, because we want to heavily parallelize the inference, we can first run inference in the notebook; the class also has a to_beam function, so you don't need to implement anything.
The key here is that the transformation doesn't read or write any data; we attach the reading of the input and the writing of the output around it, and the IO manager, and the inference component in this case, take care of transforming the data. You can then again define a pipeline and execute it.
The Beam runner is a Dagster resource, so here you can configure it, and here we are specifying that we want to use a local runner, so it's running Beam locally; but running it on Dataflow would just be a matter of using the Dataflow runner instead. So, as you can see, the Dagster resource model also helps by making execution swappable, which makes me think that even the executor in Dagster could be a resource, but that is a topic for another meeting.
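A hedged sketch of what a "Beam runner as a resource" could look like; the config fields and the resource body are my assumptions, not the SDK's actual implementation.

```python
from dagster import Field, String, resource


@resource(
    config_schema={
        # Swap "DirectRunner" for "DataflowRunner" (plus project/region) to move
        # the same pipeline from a local run to Dataflow.
        "runner": Field(String, default_value="DirectRunner"),
        "project": Field(String, is_required=False),
        "region": Field(String, is_required=False),
    }
)
def beam_runner(init_context):
    cfg = init_context.resource_config
    # A real implementation would build apache_beam PipelineOptions here;
    # returning the raw config keeps this sketch dependency-free.
    return {
        "runner": cfg["runner"],
        "project": cfg.get("project"),
        "region": cfg.get("region"),
    }
```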
Maybe. So, anyway, as you can see, you can run your pipelines in the notebook, and it almost looks like production. So what do you need to do to deploy to production? Well, we usually have a Dockerfile, which contains the SDK; a workspace file, which gives Dagster a hint on what to load; and then a Python file where we literally copy-paste the classes and the imports from the notebook, and copy-paste the to_solid calls.
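As a rough illustration of that production layout, the Python file that the workspace file points at might look like this; the component and pipeline names are the placeholders from the earlier sketches, not Earnest's real code.

```python
from dagster import pipeline, repository, solid


# Classes and imports are copy-pasted from the notebook; here we reuse the
# hypothetical SqlQueryComponent sketched earlier.
class SqlQueryComponent:
    def __init__(self, name: str, query: str):
        self.name = name
        self.query = query

    def to_solid(self):
        @solid(name=self.name)
        def _component_solid(context):
            context.log.info(f"running: {self.query}")
            return self.query

        return _component_solid


extract = SqlQueryComponent("extract", "SELECT 1").to_solid()


@pipeline
def production_pipeline():
    extract()


@repository
def demo_repository():
    # workspace.yaml gives Dagster the hint to load this file, e.g.:
    #   load_from:
    #     - python_file: repo.py
    return [production_pipeline]
```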
So this was the notebook and its translation to production. I just want to reiterate that Dagster really helps us go from experiments to production.
It significantly reduced the friction, especially compared to what we had with Airflow before. The type system lets us separate the business logic from computation and from data serialization and deserialization, and the resources and IO manager let us integrate very easily with our cloud. In the future, we want to build more components into the SDK, we want to start using dagstermill to save notebooks as artifacts, we want to start using the asset API, because we are not using it yet, and we want to migrate to the new syntax, obviously. That's it, thanks.