Description
Dmitry Matasov from Bestplace talks about the business problem the company solves, the evolution of their data platform, and how they leverage Dagster in their workflows across GitLab, Jupyter, and even Google Sheets!
🎞 Slides 🎞
Bestplace & Dagster (Dmitry Matasov) ➡️
https://drive.google.com/file/d/1BSaQmSc9szcKTT16-B_HzwPIYUKuxe81/view?usp=sharing
🌟 Socials 🌟
Follow us on Twitter ➡️ https://twitter.com/dagsterio
Check out our GitHub ➡️ https://github.com/dagster-io/dagster
Join our Slack ➡️ http://dagster-slackin.herokuapp.com
Visit our Website ➡️ https://dagster.io/
Check out our Documentation ➡️ https://docs.dagster.io/
Hello everyone, my name is Dmitry, I'm the CTO of Bestplace. To give you some quick context on what we're actually doing with Dagster, I'll show you what Bestplace does. We're building a machine-learning-driven geoanalytical platform: for retail companies, to help them find the most profitable locations for new stores; and for consumer goods companies, to optimize their distribution strategy and to align their product mix with the actual customers around the stores they distribute to.
So it all started like this: we had a team of engineers and data scientists, and the client. Then it all changed, and now we're somewhere in here: we have many more clients, and to deal with them we now have a lot of analysts, not data scientists. So basically, we wanted everybody in this formula to be happy.
We wanted our data scientists to keep working with their pandas, Dask, and Jupyter notebooks, and we wanted our analysts not to dive too deep into coding, managing all the complicated machine learning stuff using configs: YAMLs or Google Sheets.
So we actually have several ways of collaborating with each other. The data science team collaborates with the analyst team through Jupyter: data scientists build new experimental methods, and the analysts can all code in Python and use those notebooks. We also all use GitLab as our main storage for configs and deployments, with branches, and the analysts can use it too, through its nice web IDE.
The best alternative was to build something like this ourselves, because the other pipeline engines and orchestrators were either too immature, too buggy, or too slow for a startup looking for its solution. So our own solution looked like this: we had a YAML config describing the model and the features we wanted to calculate, and it was run from a Jupyter notebook with a small script: import our library and run the pipeline.
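The "YAML config plus a tiny runner script" pattern described above can be sketched in plain Python. This is a minimal stdlib-only illustration of the idea, not Bestplace's actual library: the dict stands in for a parsed YAML file, and all names (`FEATURES`, `run_pipeline`, the feature keys) are hypothetical.

```python
# Registry of feature calculations such a library might expose.
# A point is a dict describing a candidate store location.
FEATURES = {
    "population_density": lambda point: point["population"] / point["area_km2"],
    "competitor_count": lambda point: len(point["competitors"]),
}

def run_pipeline(config, point):
    """Calculate every feature the config asks for, in order."""
    return {name: FEATURES[name](point) for name in config["features"]}

# This dict stands in for a parsed YAML config describing the model.
config = {
    "model": "store_revenue",
    "features": ["population_density", "competitor_count"],
}
point = {"population": 12000, "area_km2": 4.0, "competitors": ["a", "b"]}

print(run_pipeline(config, point))
# {'population_density': 3000.0, 'competitor_count': 2}
```

The appeal of the pattern is that analysts only ever touch the config, while the feature functions stay inside the library.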
A
Then
we
had
more
of
the
pipelines
and
not
each
of
them
were
about
the
predicting
the
new
locations,
but
about
the
other
things,
and
we
wanted
them
to
be
scheduled
and
to
be
reproducible.
So that's why it was a no-go for our analysts, and they went this way. Finally, about half a year ago, we came back to Dagster, and it turned out it was mature enough for us to try again. So we tried Dagster, and it actually helped us with several things: keeping our pipelines reproducible and version-controlled, using our Jupyter notebooks with Papermill, and configuring everything in a YAML-friendly way.
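The "YAML-friendly" configuration mentioned here maps onto Dagster's run config, which in the solids-era API was keyed by solid name under `solids:` and by resource under `resources:`. A hedged sketch of roughly what such a run config could look like (the solid name, config keys, and resource settings are assumptions, not Bestplace's actual files):

```yaml
solids:
  calculate_features:        # hypothetical solid name
    config:
      features:
        - population_density
        - competitor_count
resources:
  io_manager:                # where intermediates land (e.g. their S3 server)
    config:
      s3_bucket: intermediates
```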
That's how our development flow looks. In the shared production environment we have our own self-deployed S3 server for intermediates and log storage, PostgreSQL for run storage, and a Docker plus Dagster-Celery deployment. When the developers work, they start everything with Ansible: we deploy with Ansible both locally and in production, and they deploy a local version of Dagster.
A
Why
jupiter?
Because
it
is
blazingly
fast
in
updating
your
pipelines
code,
you
can
just
restart
the
kernel
and
run
it.
You
don't
need
to
wait
when
dexter,
when
dagit
will
catch
the
new
code
and
restart
and
the
other
things
is
debugging
and
profiling.
You
can
actually
we're
working
working
load
with
data
frames
and
pandas,
and
we
wanted
to
display
it
nicely
and
to
have
profiling
and
ipad.
So
it
looks
like
this.
A
You
just
develop
your
solids
and
by
charm,
for
example,
and
run
it
with
jupiter
locally
deployed
having
this
yamu
config
and
loading
it
and
running,
execute
pipeline
yeah
we're
having
this
nice
tequiliums
and
we
can
preview
everything
just
in
line
yeah.
What
comes
next?
We
commit
it
to
gitlab
and
deploy
it
to
the
shared
environment
and
yeah
it.
You
can
also
test
it
locally
on
the
exact
same
deployment
as
in
production
having
the
nice
stuff
with
presets
yeah.
A
Here
it
looks
like
this
we're
building
versioned
containers
with
code
and
run
it
with
daggett
and
having
the
same
jupiter
in
production.
So
you
can
tweak
something
or
explore
the
failure.
A
So
in
gitlab
we
just
deploy
it
with
sensible
and
have
our
nice
pipelines
and
deck
it
yeah.
You
see
we're
we're
using
tequila
in
dec
yeah,
and
how
does
the
analyst
workflow
looks
like
so
it
starts.
A
It
consists
of
four
elements:
the
developers
solids,
which
are
robust,
documented
and
optimized,
and
you
can
use
them
as
these
from
the
library.
The
second
way
is.
The
second
element
is
analyst
solids,
so
there
are
paper
mill
driven
jupiters
that
are
business,
specific
they're
visual.
You
can
do
graphing
and
displaying
data
frames,
making
your
own
ad-hoc
little
parts
of
the
pipelines
and
I'm
a
kind
of
scientist
myself.
The
analyst
can
tweak
everything
you
hear
she
wants.
So
it
actually
looks
like
this.
Papermill will put the real parameters here and run the notebook top to bottom, and then you actually have two options in our repository. You can just commit your notebook and it will automatically become a solid with no inputs and no outputs, nothing bound; or you can define your own inputs and outputs definition, like this.
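A committed notebook with explicit inputs and outputs might be described by a small definition file alongside the `.ipynb`. The layout below is an assumption based on the talk, not Bestplace's actual schema; every key and name is hypothetical:

```yaml
# Hypothetical per-notebook definition committed next to the notebook
notebook: label_sales_points.ipynb
inputs:
  raw_points:
    dagster_type: DataFrame
outputs:
  labeled_points:
    dagster_type: DataFrame
```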
So these are two identical pipelines defining the same stuff, one in Python and one in YAML. You can reference the existing solids: take this one, rename it. And this one takes its inputs from the example solids, and it has an output named output, with the sum and the product somewhere here. So you can actually do forking with the YAMLs. And the fourth part of it is sophisticated configs from Google Sheets.
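The YAML pipeline definition described above (reference an existing library solid, rename it, wire its inputs to other solids' outputs, fork by reusing a solid) might look roughly like this. The schema is an assumption reconstructed from the talk, not a public Dagster format:

```yaml
pipeline:
  name: example_pipeline
  solids:
    - solid: example_source      # reference an existing library solid
      alias: source_a            # ...and rename it
    - solid: example_source
      alias: source_b            # forking: reuse the same solid twice
    - solid: combine
      inputs:
        left: source_a.output    # wire inputs to upstream outputs
        right: source_b.output
```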
A
It's
actually
quite
a
specific
thing:
we're
storing
google
sheet
that
can
define
way
to
process
your
data,
for
example.
This
is
a
labeling
solid
that
contains
the
rules
and
substrings
for
categorizing.
Some
points
on
map
like
data
sales
points
and
we're
just
having
a
function
in
our
library
to
download
it
from
there.
So
you
can
actually
pass
a
link
to
this
table
in
google
api
to
the
solid
config
and
it
will
download
it
and
use
the
configuration
file
and
yep.
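The "pass a link to the sheet into the solid config" step relies on Google Sheets' standard CSV export URL scheme. Below is a hypothetical stdlib helper in the spirit of the library function mentioned (the function name is mine, not Bestplace's):

```python
import re

def sheet_csv_url(share_link: str) -> str:
    """Turn a Google Sheets share link into its CSV export URL.

    Uses the standard docs.google.com export scheme; a solid could then
    fetch this URL (e.g. with pandas.read_csv) and treat the rows as config.
    """
    m = re.search(r"/spreadsheets/d/([a-zA-Z0-9_-]+)", share_link)
    if not m:
        raise ValueError("not a Google Sheets link")
    gid_match = re.search(r"[#&?]gid=(\d+)", share_link)
    gid = gid_match.group(1) if gid_match else "0"  # first sheet by default
    return (f"https://docs.google.com/spreadsheets/d/{m.group(1)}"
            f"/export?format=csv&gid={gid}")

print(sheet_csv_url("https://docs.google.com/spreadsheets/d/abc123/edit#gid=42"))
# https://docs.google.com/spreadsheets/d/abc123/export?format=csv&gid=42
```

Keeping the link in the solid config, rather than the sheet's contents, means analysts can edit the labeling rules in Google Sheets without touching the repository.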
So the proposal is this: don't substitute the parameters cell, but comment it out and leave it as-is, then insert the Dagster stuff, and at the end, near the teardown cell, add a commented-out commit statement. That way, when you play around with your notebook, find an issue, and fix it, you can just uncomment that statement and execute the cell: it will strip the unnecessary cells, uncomment the original parameters cell, and commit the notebook to GitLab.
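The cell layout proposed above can be sketched as plain Python, with comments standing in for the Papermill cell tags. The parameter names and the `commit_notebook` helper are hypothetical, used only to illustrate the flow:

```python
# --- original "parameters" cell: commented out instead of substituted ---
# city = "moscow"
# radius_m = 500

# --- inserted Dagster cell: the real parameters injected at run time ---
city = "spb"
radius_m = 1000

# ... the notebook body runs top to bottom with these values ...

# --- near the teardown cell: a commented-out commit statement ---
# After fixing an issue interactively, uncomment and execute; it would
# strip the scratch cells, restore the original parameters cell, and
# commit the notebook back to GitLab (commit_notebook is hypothetical).
# commit_notebook("label_sales_points.ipynb", message="fix labeling rule")

print(city, radius_m)
# spb 1000
```

The point of keeping the original parameters cell (commented out) is that the notebook stays runnable standalone: restore it, restart the kernel, and you are back in plain Jupyter.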