From YouTube: Dagster Introduction
Description
Warning: This video is fairly out of date due to the rapid development of Dagster.
This is supposedly a quick introduction to the Dagster tooling for Python.
Dagster is useful for data engineering and, more generally, for any expensive, parallelizable Python script.
https://dagster.io/
https://airflow.apache.org/
Hey guys, today I'm going to do a short little intro to Dagster. Dagster is kind of like Airflow: it's a tool that can take your Python scripts and help you parallelize them, as well as do things like schedule them, monitor them, and a little bit of memoization as well. Dagster is a data-centric DAG orchestrator, so it's a little bit different from Airflow, and I'm going to get into how it's different later.
So why might you want to use this with your scripts? Well, it helps you parallelize your scripts, and it gives you the ability to easily run them on clusters as well.
You can export Dagster pipelines to Airflow DAGs, and then you can run them everywhere Airflow runs, so on Kubernetes clusters and stuff like that. Dagster also gives you free monitoring and limited memoization. There is more complete memoization available, but it's not quite ironed out yet; Dagster is fairly new.
Dagster also gives you a flexible configuration system and scheduler, which is useful for when your scripts take a long time to run and you want to set them up to run one after another, or in parallel, or you want to trigger a script at a certain time of day, and you want to do all of this on the same machine, so the scripts are aware of each other and they don't all run at the same time.
So, all you need to do to turn your Python script into a Dagster script is take the functions that your script is made of and turn those into solids, which really just means you wrap them in a Dagster decorator called solid. Dagster's functions are called solids, and you create those by wrapping your functions in this decorator.
In this decorator you can define the inputs and outputs, including types, for your functions, and then you combine these solids to create pipelines, which are the DAGs in Dagster. Then you get the nice YAML configuration framework that you can use to configure both your pipeline, i.e. your runtime environment (like which executor you're going to use, and whether you want it to run in parallel or not), as well as the actual solids themselves, i.e. your experiments; that could be which dataset you're going to operate on, or how you want to handle that dataset.
Dagster is somewhat data-analysis centric, but it's useful for any Python script that takes a while to run. Dagster also comes with Dagit, a web interface that you can use to monitor everything. You can use it to write your configuration YAML files, and it has a nice little checker and helpful hints for writing those YAML files: it'll let you know, live, if there are any errors in them, and what it's expecting that you're not giving, and all that kind of stuff. All right.
I set up a little script that I'm going to turn into a Dagster script. It's just, I don't know, a really simple pretend-to-be-expensive script, and by expensive I mean it takes a lot of time.
expensive_setup takes 10 seconds and just returns some complicated data that it's set up, and then the expensive_analysis function takes another 10 seconds to do something with the complicated data, and then in the main function you just pass one to the other and run them both.
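The starting script is roughly this (the sleeps are shortened here from the 10 seconds in the video, and the exact return values are made up):

```python
import time


def expensive_setup():
    # Pretend to be expensive; the video sleeps for 10 seconds here.
    time.sleep(0.01)
    return "complicated data"


def expensive_analysis(data):
    # Another pretend-expensive step that does something with the data.
    time.sleep(0.01)
    return f"analyzed {data}"


if __name__ == "__main__":
    # Pass one to the other and run them both.
    print(expensive_analysis(expensive_setup()))
```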
So we have expensive_setup and expensive_analysis, and in here you can see all of the pipelines you've created. We created the expensive_pipeline, which we're in right here. We haven't set anything else up, but we can run this pipeline if we want; oh, that's a sneak peek ahead. So we haven't set up any kind of configuration, so there's no configuration required, and we can just launch our script, and you can see it's working. This is the running-scripts page, or the runs page, and you can see it's working through... oh, here, if I load it up, it's working through expensive_analysis right now, and it finished expensive_setup a second ago, and you can see how long everything takes and what relies on what.
So that's one of the cool features of Dagster, and the reason it cares about your types and all that information (this is another of the ways it differs from Airflow, which I'll talk more about later) is that it cares about the inputs and outputs of the functions and how they compose.
Yeah, so we can see our runs; we had a success. Dagster has a nice logger as well, which we will use later. You can hook it up to a database, like Postgres, for when you have a whole bunch of stuff running in parallel, so that you don't get a file lock. So I'm going to start expanding this script to show you some of the features of Dagster.
So the first thing we're going to do is turn on logging. Each solid decorator provides your function with a context argument, and that argument is used to access all of the Dagster features from within a solid, i.e. from within one of your functions; so, for example, context.log.info. There we go; that's how you log information in Dagster.
You can also use Dagster to simplify how your DAG is organized: you can use something called dynamic outputs.
Okay, we're also going to need our DynamicOutputDefinition, and we need a name, so let's call this one common_data, and then let's call this one just data. What you can do is, if expensive_setup were to return, for example, 10 different data frames, each of which you want to do something different with, then you can create 10 different outputs of the same kind that will all be operated on by whatever your next set of functions is.
Then we're going to want to have a value (in this case, I don't know, let's just make it i), and we're also going to need a mapping key. The mapping key is used to identify which of the data outputs this is, e.g. in the display.
Okay, and then, since we have multiple outputs now, we need to define what the original one is, so this one will just be a standard output definition whose name is common_data.
Okay, so what we've done here is we have two sets of outputs. The first output, which we can actually put above, is just the original: it's just the one standard output that we had before, the "this is a complicated data" string, and we defined that as the common_data output; common_data is the name I gave to that output value. And then we have a second output, which is the DynamicOutputDefinition, which we call data.
So we'll have our common_data first, and then we're going to follow that up with the actual data values, and then we're going to want to run expensive_analysis once per data output, and for that we're also going to want a lambda x.
Okay, so x in this case is one of the data (let's just make this name plural), one of these integers in this case. In expensive_analysis we might want to handle both of these, so we'll take in common and one of these specific cases; so let's give it common and data. All right: "undefined name data_"...
Over in the overview, here we go. Okay, I don't know why that didn't work before. You can see that there are two values being passed from expensive_setup to expensive_analysis, and the kind of overlaid expensive_analysis here indicates that one of these is a dynamic output that gets acted on multiple times.
I should lower that down from 10 seconds; it's a bit long to wait. And currently we haven't set up parallelism yet, so it's all going to run in series, but you should see that expensive_analysis with the integer zero (the zero that's displayed here is the mapping key) is being run. So this will run 10 more times. You can also terminate runs; and then, this one failed.
So what you can do is re-execute from failure, which sometimes doesn't work great with dynamic outputs, but it seems to be working now.
Yes, okay, so it wants a different IO manager for the memoization; we will set that up as well. Okay, yeah, let's do that. Actually, the first thing I'm going to do is set up some configuration.
This is so we can adjust how long these functions take to run, so let's make a time variable and give it an integer type. The way you take in configuration variables is you define a dictionary of them in your solid decorator, and then you can access them from the context. So, for example, instead of sleeping 10, we're going to sleep for context.solid_config["time"]; I always forget what this is called, solid_config.
"time" is what we called it up here, and let's do this down here as well, so this one is going to take a time variable as well.
Now we need a configuration in order to run this script. This could be which dataset you want to use, or which features you want to enable, or what kind of analysis you want to do, and the nice thing about Dagit is it has all these helpers for constructing these YAML files, so you can make it fill in, as best as it can, what it thinks you need as a config. So in this case, let's make things take five seconds instead of ten. Okay.
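The run config he fills in through Dagit's editor would look roughly like this YAML (solid names as used in the script; the exact keys depend on what each solid declares):

```yaml
solids:
  expensive_setup:
    config:
      time: 5
  expensive_analysis:
    config:
      time: 5
```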
Now we're also going to want to parallelize this, because, as you saw last time, doing all this in series is going to take some time. In order to do that, we are going to need to configure our pipeline.
So in the pipeline configuration you can select which executor you're going to use. The standard one just executes solids one at a time, in series, and then there are all sorts available: there's a Dask executor, and there's a standard multiprocessing executor, which is the one I'm going to use, and for that we're going to need the multiprocess executor.
We're also going to need a ModeDefinition. And, so that some of the caching works better, we're going to need to use the fs_io_manager, which we also need for the multiprocessing executor, because the multiprocessing executor needs to cache between solids: it needs to cache the output of a solid in order to feed the input to a new process.
Yeah, okay, very good. In this mode we're just going to include one resource definition, via resource_defs, and it's going to be the one that we wanted, which is going to be, I don't know, the io_manager.
Now, these are cool, because the IO manager is a little class that gets called every time a solid finishes, to save that solid's output. So this default one here just pickles the result and then unpickles it later, but you can create your own very easily, to use something like Feather if you're mainly working with pandas data frames, and then that speeds up the caching and makes your whole multiprocessing script much faster.
Oh, so we're going to want the default one, but we're also going to want the multiprocessing one.
Okay, now, when we reload this bad boy...
Okay, now we have our solids set up, so we can also set up our pipeline, or sorry, our execution environment, and for that we can look in the sidebar here for what it suggests. So we are looking for execution, and we're going to want to configure the multiprocess executor.
We're also going to need to set up the IO manager for this pipeline.
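The matching execution section of the run config, roughly (legacy schema; max_concurrent is optional and shown just for illustration):

```yaml
execution:
  multiprocess:
    config:
      max_concurrent: 4
```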
Okey-dokey, here we go. So expensive_setup is just one function that gets run once, and then its 10 outputs (the dynamic outputs we created, called data) get split into these 10 runs that now get run in parallel, because they're all independent of one another. You can see these start to finish, and when you click on one you can see all sorts of information.
Let me see if I can get something more obvious, including our logged output, and you can go back and figure out how much time it took and all that kind of stuff. Yeah, that's an overview of how Dagster works. You don't only need to run Dagster scripts from inside Dagit.
You can also run them using the CLI, and you can also run them directly from within Python as well, and then, instead of loading YAML files, you can feed it Python dictionaries, if I'm not mistaken. So you can build this into your larger Python infrastructure.
So, that's a quick little demo of how this stuff works. As you create more pipelines, they will show up here, so you're not limited to one pipeline per file or anything like that, and you can see your run history, and you can set up a scheduler, so you can run things at certain times of the day.
So, Dagster versus Airflow. Dagster is a lot easier to set up; Airflow, I found, had a quite involved setup, but Dagster is much more immature. Also, I like the Dagster Python interface much more: the decorator setup, and the way you combine functions into pipelines just using functional programming.
I find it to be much more intuitive than how Airflow sets up its execution dependencies. So that's the shallow overview, but what really separates the two is that Airflow has execution dependencies and Dagster has data dependencies. In Airflow, the scheduler knows nothing, and wants to know nothing, about the information passing between items in its DAG; it just wants you to explicitly define those execution dependencies.
(I haven't used Airflow much, so this is all from reading various blog posts.) Whereas in Dagster it's just like a functional program, and that allows you to include the typing and testing as well, and it ideally makes it much easier to cache in Dagster, and also to set up state for testing, although that's still a little bit immature, because Dagster is so new. That's how I would differentiate these two, though. So, the pros of Dagster; I've gone over a couple of these things already.
The flexible pipeline configuration system is great. I use this YAML configuration extensively when I set up a data analysis script, to allow that script to run on a whole bunch of different datasets with dataset-specific parameters, so you're not sitting there with a directory full of Python scripts with one or two lines changed for different datasets. And then it's easy to parallelize your scripts.
The other thing is that Dagster is very lightweight. I guess I've gone over this before as well, but it's literally just a couple of pip dependencies away. And then the cons are that everything is experimental, so dynamic pipelines aren't terribly strong, because you can't nest them very well, as far as I could find. So, for example, say you had two functions in a row, each of which output 10 of something that you wanted to iterate over,
so within those 10 you want to split each out into 10 again, and then let's say you have another function that wants to operate across that gap; Dagster doesn't handle that very well. You always have to be working directly within your one dynamic output at a time. It's also very awkward for exploratory data analysis, so you can't drop into IPython
at any point. It's very much, I would say, a data engineering tool, as opposed to an exploratory data science tool, but it works great if you do test-driven development. So that's my hopefully quick (I don't know how long this is... oh, half an hour) overview of Dagster. Well, if you stuck around for the whole thing, have a good one and enjoy your day.