From YouTube: Dagster Demo - Dec 2022 - Ben Pankow
Description
Every other week, Dagster hosts a live demo with Q&A. You can join us at a future session by signing up here: https://dagster.io/dagster-demo-signup - alternatively, you can watch a recording of this demo from Dec 2022 featuring Elementl developer Ben Pankow as he builds a data pipeline using Software-defined Assets, dbt models, scheduling, and more.
You know, my name's Ben, I'm an engineer here at Dagster, mostly working on the cloud end. Today I'm going to be walking through a high-level demo of Dagster: starting with a very brief overview of where Dagster fits in the data platform and how we think about orchestration, then moving into building a basic data pipeline using Dagster, and then seeing how that would expand out to a more mature data platform.
So I'll start off here with just a single slide to give some background on Dagster and the way that we think about things.
In general, we believe that the goal of a data platform is to generate data assets. This is a pretty generic term, and it can mean any sort of persistent data: a table, maybe in Snowflake; a file in S3; a notebook, maybe for BI purposes; an ML model. Any persistent artifact that's used for some purpose is a data asset in this context, and data assets have dependencies.
An ML model might depend on the training data that you used to build it, and maybe that training data is actually created by transforming some data that we're ingesting from another source. So there's a clear set of dependencies here between our data assets, and the role of an orchestrator in this ecosystem is to create these assets in sequence. There are a couple of different ways that you might do this.
If you're approaching this with no orchestrator and a fairly simple set of assets and asset dependencies, one tool you might look to is cron. If you have an asset B that depends on an asset A, you could use a cron schedule to run a task that materializes or creates asset B an hour after you created asset A, and this works in a very basic case.
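That "run B an hour after A" setup might look roughly like this as a crontab fragment (the paths and script names here are hypothetical, just to make the idea concrete):

```shell
# Materialize asset A at 2:00, then hope it finished before B runs at 3:00
0 2 * * * /opt/pipelines/materialize_asset_a.sh
0 3 * * * /opt/pipelines/materialize_asset_b.sh
```

The fragility is visible right in the fragment: nothing ties the second line to the first actually succeeding.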
It's particularly tricky to scale this as the number of dependencies grows and your set of assets gets more complicated; you're going to have to manually schedule things in a very complicated manner. This is where tools like Airflow come in, and why folks might turn to a tool like Airflow.
You can tell Airflow to run the task to generate B after the task for A, and this is a lot less fragile. If the task for A fails or takes longer, that's totally fine: Airflow will handle a failure properly and not run the downstream task, or it'll wait for your task to complete before running the one further down in sequence. But Airflow is an older tool, and developers tend to find frictions with it over time. In particular, it's designed to run production code.
So it's tricky to run locally, it's hard to have different environments like a staging environment and a production environment, and it's hard to unit test: the sort of development flow that you'd expect in a software engineering world. It also doesn't really have a first-class concept of assets, which is kind of the entire purpose of orchestration. All of the core abstractions in Airflow are built around tasks rather than the assets that they produce.
So we view Dagster as the next step in this evolution. It's focused primarily on data assets, and also on fixing some of these problems with the development life cycle, which hopefully you'll see in today's demo.
To illustrate this, let's walk through building a very basic pipeline from scratch using assets. In particular, we'll be using Dagster's Software-defined Assets, where we define our assets in Python code. Here we have an empty Python file in our editor, and we'll start by importing pandas and the asset decorator from Dagster, which will let us define software-defined assets. We'll create our first asset, which just looks like the asset decorator applied to a Python function.
So here we have a very basic signature for a software-defined asset. It's just a Python function. The name of the asset is the name of the function, in this case country_population. The type of the asset is a pandas DataFrame, and the body of the function, which we haven't defined yet, is what's actually going to define how our asset is created.
Here I have a Wikipedia page that lists some population information for various countries, so let's just pull the data down from this page. We can get a DataFrame by telling pandas to read from this Wikipedia page and grab the table. And because the column names that are automatically generated are going to be a bit messy, let's go ahead and relabel those columns, then return the DataFrame that we just generated, and, for good measure,
let's have a little comment here. So this is all we need to do to generate a basic software-defined asset, and we can actually go ahead and materialize this asset by running dagit, which is Dagster's UI, and pointing it at our Python file here.
This is going to spin up a local web server, and in our browser we can open it up. We'll see here our asset graph, which shows each of our software-defined assets and the dependencies between them. Right now we just have the country_population asset which we generated. We can go ahead and select it and materialize it. Under the hood, Dagster is going to figure out what computation is needed to regenerate these assets and then queue that computation.
The logical next step here is maybe to build an asset that depends on our country_population asset. So let's go ahead and define another asset. Let's call this continent_population, which will just aggregate the country population into stats for each continent.
So let's add a comment, and let's just group by the region column in our country data and sum up the population.
If we go back to our Dagster UI here and return to the asset graph, we can reload, and we'll see we now have our new continent_population asset. We can see the country_population has been materialized, since we ran it just a couple of minutes ago, so we can select just the continent_population if we wanted and re-materialize it. Since we've already materialized the upstream asset, it'll just use that prior cached value. And great, the run has already succeeded, since it was a pretty simple computation.
Of course, we could easily materialize all of them and run both steps in sequence if we wanted to.
So this is a pretty bare-bones example, but we're already seeing what the differences are between Dagster and a more task-based orchestrator. In a task-based orchestrator world, we would have to create a task to produce the country population data, fetch the data and explicitly store it somewhere, define another task to read the data from wherever we stored it, modify it, store it again somewhere else, and then explicitly define a pipeline with our tasks in order. A lot of that is being done for us by Dagster.
Here, the IO between each of our assets is happening automatically, and the ordering is also happening just based on the data dependencies that we've encoded between our steps.
So let's take a step back. This file that I've been writing is actually in the context of a larger project, so we can see what it looks like to import our assets into maybe a larger data platform.
Here we have our population assets, and we actually have this repository file here, which defines an empty Dagster repository. You can think of the repository as an entry point for a bigger Dagster project. If you have assets coming from a bunch of different places, this is where they'll all come together.
The first thing we'll do is just get our existing assets loaded as part of the repository here. What we'll want to do is create some population assets, and we'll use this load_assets_from_package_module utility function to import our assets from the population Python module. We'll assign them a group name, just for organization purposes, and we'll add this key prefix, which I'll talk about a bit later; it just prefixes the name of the assets.
I'll explain why that's there in just a few minutes. Then all we need to do is return our population assets here as part of our repository, and we can open dagit again, where we should see the same set of assets.
So let's just make sure that we have this looking right. Great. One thing I sort of glossed over earlier is where our assets are actually being saved to and loaded from. I kind of insinuated that the IO is happening automatically, and this is being done using Dagster's IO managers. IO managers are a built-in abstraction that lets you decide where and how the inputs and outputs for your assets are stored. Running locally, by default that's just on the file system, but it's really easy to swap in something else.
So
now
that
we
have
our,
I
o
manager
defined
here,
we'll
want
to
bind
it
to
our
assets
to
let
the
extra
know
that
these
assets
should
use
this
particular
I
o
manager
you're
able
to
have
a
bunch
of
different.
I
o
managers,
if
you
wanted
to
sort
of,
save
and
load
different
assets
from
different
places,
you
know
some
might
want
to
live
in
S3
or
snowflake
or
maybe
locally
is
fine.
Here we'll use our Snowflake IO manager, and we'll actually provide some configuration to point it towards our Snowflake instance. So here I'm just pulling all the credentials from the environment.
If we go back to Dagster here and reload, we're not going to see any immediate change, because we're loading the same set of assets. But if we go ahead and re-materialize our assets, it will actually output them into Snowflake, and we should be able to see that in the logs here. Great, so we're yielding our DataFrame outputs, and this is when they're actually going to be written to Snowflake.
And so here, if we go back to the asset graph and take a look at our country_population asset, we can see we actually get some additional metadata that the Snowflake IO manager is attaching to our asset: the columns that are being output into our Snowflake table and the data types of those columns.
We get the row count, and we actually get a query here that we can run against our Snowflake instance to see the asset that we just materialized. So here we're getting all of that country data that was just written into Snowflake by the IO manager. Now let's see what this looks like when we add in some additional assets; Dagster also has the ability to integrate with other tools in your data stack.
We actually have a dbt project already set up here, defining a couple of transformations from our country and continent data. If you've used dbt before, you know they have their own graph that represents the dependencies between their SQL transformations. So here we have the country population and the continent population being transformed into some ranking information and some cleaned information, and then a summary and some roll-up tables.
We can actually import this entire dbt graph and run it from within Dagster. So let's see what that looks like.
Let's first just specify where our dbt project is located relative to our repository file here, and then we can actually load our dbt assets, all of these transformation assets, using this load_assets_from_dbt_project utility. We'll specify our dbt project directory.
Our profiles directory is going to be the same as the project directory, and then we'll also add the key prefix here. For context, the key prefix here determines which Snowflake schema we're writing to when we're using the Snowflake IO manager. I'm just writing to our sandbox database, and BEN is just the schema that I have write access to, so that's why we're using it as the key prefix.
This is all we need to do to load our dbt assets into Dagster. Dagster will parse the project files and automatically build software-defined assets associated with each of the tables that dbt is going to produce.
So let's go ahead and add our transformation assets here. We'll also need to bind a dbt resource, so we'll set up a dbt CLI resource, which tells Dagster that we want to execute dbt locally using the CLI rather than using dbt Cloud, and we'll just need to point it at the same project and profile directories here.
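Putting those dbt pieces together might look roughly like this; the directory paths are placeholders, and this sketch assumes a dbt project exists on disk at those paths:

```python
from dagster import with_resources
from dagster_dbt import dbt_cli_resource, load_assets_from_dbt_project

DBT_PROJECT_DIR = "./dbt_project"   # placeholder: path relative to the repository file
DBT_PROFILES_DIR = DBT_PROJECT_DIR  # profiles live alongside the project here

# Parse the dbt project and build one software-defined asset per model
dbt_assets = with_resources(
    load_assets_from_dbt_project(
        project_dir=DBT_PROJECT_DIR,
        profiles_dir=DBT_PROFILES_DIR,
        key_prefix=["ben"],  # maps to the target Snowflake schema, as above
    ),
    {
        # Execute dbt locally via the CLI rather than dbt Cloud
        "dbt": dbt_cli_resource.configured(
            {"project_dir": DBT_PROJECT_DIR, "profiles_dir": DBT_PROFILES_DIR}
        )
    },
)
```

This is configuration-level glue: Dagster does the parsing and asset construction from the project files.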
If we go back to Dagster, to the asset graph, let's go ahead and reload our definitions again, and provided I did everything right, we should see our dbt assets appear. Great. So here we have our dbt assets downstream of the Python assets that we wrote earlier, and you can even see we have some metadata attached, for example showing what that dbt transformation looks like for each of our individual assets.
And if we wanted to, we could re-materialize everything, which is going to first run our Python assets, as we'll see, our country population and then our continent population, and then it's going to invoke dbt to generate all of those downstream assets.
Next, let's add some forecasting assets. These are just going to be pulled from another Python module, and they're going to depend on those population assets that we defined earlier, so we'll grab those and add them to our repository definition here. And while we wait for this run to complete, let's go back to the asset graph, reload, and just see what those forecasting assets look like.
Great, so here we can see we have our forecasting assets: they set up some features based on the country and continent population data, create an ML model, and then create a forecast. A lot of this is mocked out, but for the sake of the demo, let's say that we wanted to regenerate this population forecast. How would we go about doing that?
We could choose to re-materialize all of our assets, but this is kind of an expensive operation. One thing we can do is go to this forecasted population asset and view it in the asset catalog. This is built into Dagster and shows us every time this asset has been materialized, along with some of the metadata associated with it. We haven't actually materialized this asset.
So we don't have any data there yet, but we could go to the lineage tab and view just its upstream dependencies, and from here we could just re-materialize that forecast and everything that it depends on, rather than our entire asset graph.
If we wanted to avoid going to this lineage page every time, we could actually define a Dagster job that just re-materializes our forecast and everything that it depends on. So let's see what that would look like.
We can specify the asset keys that we'd like to re-materialize; here we're just specifying that forecasted population asset, and then we can actually tell Dagster to grab everything upstream of that. So let's define this job and reload our set of definitions again.
We should see in the sidebar that we now have our job, which will just re-materialize our forecast-related assets. You can see the other assets that this job doesn't depend on are just linked out, so we can view them externally, but they're not going to be recreated by this job.
Now, this is all well and good: we can, on a whim, update our forecast. But if we wanted to do this on a regular cadence, we might want to attach a schedule to this job to have it update automatically. That's pretty easy as well. We can wrap our job here with a ScheduleDefinition pointed at this job and then give it a cron schedule.
We'll have it run, let's say, once every hour. If we go ahead and reload our project again, we'll see that this job now also has a schedule attached to it. This way it'll run every hour automatically, re-materializing our forecast and all of the upstream assets that it depends on. So here we have our hourly schedule that we can toggle on, and now we're off to the races: our forecast will regenerate automatically every hour.
So hopefully this gives you a brief overview, a brief idea of Dagster's programming model and what using Dagster, both from the Python end and from the UI end, looks like.