From YouTube: Software-Defined Assets Demo
Description
This video shows off how to get started with Dagster’s Software-Defined Assets, and the features they enable. It walks through creating assets in Python, as well as loading them from tools such as dbt.
Read more about Software-defined Assets in Dagster here: https://dagster.io/blog/software-defined-assets
Hi, my name is Owen and I'm a software engineer working on Dagster. This demo will walk through how to get started with software-defined assets and the powerful features that they enable. Through it, we'll build out a sample data platform, showing how to create our own assets in Python, as well as how to load them from external tools such as dbt.
With that, let's jump into the code. Let's start from the very beginning: our team wants to take some raw data and store it in a data warehouse. We'll start with a few imports, then define a function to generate our raw data. To keep things simple, we'll just grab some data from Wikipedia using pandas, but this is completely arbitrary Python code: we can use whatever libraries we want, and load, operate on, and return data of any type. By annotating our function with the asset decorator, it becomes an asset.
The key, or name, of the asset is country_population, and its contents are computed using the function we just defined. You'll notice that we haven't yet told Dagster how and where to store this data. We could include that logic inline, but it's often useful to keep this business logic separate from I/O concerns. By default, our assets will be stored as pickled files on our local file system, but this behavior is completely customizable, and Dagster has built-in support for storing assets with major cloud storage systems such as S3, ADLS, and Snowflake.
We can monitor its status directly from this view, seeing that a run is currently refreshing it, or we can jump into a live-updating timeline to view detailed logs as they come in. Once the run successfully completes, our data is stored and ready to be used. From here, it's natural to want to take our new data and do something with it. For example, we might want to use the raw population data and aggregate it per continent.
We define a data dependency simply by adding an argument to our function with the name of the asset that we want to depend on. Once we turn this into an asset, Dagster will handle the rest, from creating a lineage link between these two assets to loading the contents of country_population as input to this function when it comes time to run it.
While that's running, it's useful to take a step back and consider the benefits we're already getting out of this declarative model after just a few lines of code. At no point in this process did we need to think about tasks: we just wrote the code necessary to compute the contents of our assets, and the orchestrator was able to string these definitions together to execute them in the proper order.
In addition, our orchestrator has a direct understanding of the assets that it's responsible for, giving us insight into how they're computed and how up to date they are. Now that we have a couple of assets working smoothly, let's fast forward a bit and see how this scales. To keep everything organized, we'll break things out into separate files, then combine our assets into a single Dagster repository. First, we'll load in the population data assets that we just created.
Finally, we'll bring in a machine learning team that will use the data processed by dbt to train a machine learning model. These assets were defined in Python, just the same as our original assets. Once again, the code inside these assets is completely arbitrary, and you can use whatever tools or libraries you want. With all these assets added to our repository, we now have a number of assets which are represented as pandas data frames in Python, then serialized as files on local disk.
You might have noticed that changing the storage location of our assets didn't require modifying the definitions themselves. This makes it easy to write unit tests for our assets, as the business logic stays decoupled from the external systems where the asset will be stored. Each asset can be assigned a different I/O manager, allowing precise handling of storage behavior.
Now that we have some more assets defined, let's head back to the UI. The first thing we'll notice is that our original population assets now have some downstream dependencies.
A
If
we
go
to
the
global
lineage
graph,
we
get
a
total
view
over
all
of
the
assets
in
our
data
platform,
this
graph
spans
groups,
jobs
and
code
locations,
giving
you
insights
into
your
data
dependencies.
Regardless
of
how
you
choose
to
organize
your
code
and
execution
clicking.
Any
of
these
assets
will
bring
up
a
sidebar
containing
some
high
level
information
about
the
asset.
This is one of the many views throughout the UI that allows you to see which partitions have been computed for an asset, making it easy to tell if any data is missing. If we want to learn more about a specific asset, such as our population summary table, we can search for it by name and get taken to its asset details page. From here, we can see information on every time it's been materialized, its definition, and which assets it relates to. If we want to refresh this asset, we can do that directly from this page.
This sort of workflow is also easy to put on a schedule. If we go back to our code, we can define a new job that targets a selection of assets that are upstream of our population summary table, and put it on a schedule that will run it once a day. Back in the UI, we can see our asset is now on a schedule, and the job we created shows up, ready to be run.