From YouTube: Dagster Declarative Scheduling of Software-defined Assets - Dagster Community Day - Dec 2022
Description
Dagster 1.1 introduces a system for scheduling data pipelines that allows you to escape writing workflows entirely. Instead, you directly specify how up-to-date you expect each asset to be, as well as how to determine whether source data has changed. Dagster then automatically schedules asset materializations to ensure that data arrives on time while avoiding unnecessary computation.
I'm Sandy, the lead engineer on the Dagster project, and I'm here to talk about functionality in Dagster 1.1 that introduces declarative scheduling for software-defined assets.

We use orchestrators to keep data assets, like tables and machine learning models, up to date. Scheduling data pipelines means managing change in data assets, which boils down to a few basic elements.
First, to compute our data assets, we typically run code. When that code changes, we eventually want to update our data assets to reflect the new logic. Second, we derive our data assets from upstream data. When that upstream data changes or grows, we eventually want to update our assets to incorporate those changes. And last of all, depending on how our data assets are used, we'll have different requirements for how up to date they need to be.

Most orchestration tools don't think in these terms; they think in terms of workflows, which causes a few problems.
First of all, it makes it awkward to express what should happen in some fairly common situations. For example, if some tables are hourly and others are daily, but those tables have shared dependencies, it's difficult to construct a set of DAGs and schedules that run in the right order and don't duplicate work.
Second, every time you add an asset, you have to find a DAG to put it in to get it scheduled. Then you have to worry about whether DAGs are getting too large and unwieldy, or too small and fragmented. And last, you get alerted when your task fails, not when your data is out of date, which is often what you actually care about. If the system can retry and self-correct before the deadline, then nobody needs to get paged.
Dagster 1.1 helps move beyond workflow-based scheduling by introducing a set of features that enable scheduling data pipelines in a declarative, asset-focused way. I'm going to take a deep dive into a few of these to give a taste of what this looks like.
One of these features is freshness policies. You can now construct policies that specify how up to date you expect your assets to be, and then use those policies for monitoring, alerting, and scheduling. To understand how this works, let's look at a simple graph of data assets.
There's a base table of events that we pull into our data warehouse from our app database, and two tables that summarize it for different business users. We want the first summary table to be pretty up to date: events are constantly streaming into our production database, and we want them to be pulled into our data warehouse and represented in this table within five minutes of when they arrive in production.
The second summary table is more expensive to compute, and we only actually care about looking at it once per day, at a team check-in meeting that happens at 9am. So we set freshness policies on these assets, and the way we do that is in code. Here's where we've defined our assets, and you can see the five-minute freshness policy on one and the 9am freshness policy on the other.
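The talk shows these definitions on screen rather than in the transcript, so here is a rough sketch of what a freshness policy carries. The field names `maximum_lag_minutes` and `cron_schedule` are modeled on Dagster 1.1's `FreshnessPolicy`; the rest is illustrative, not Dagster's actual code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class FreshnessPolicy:
    """How stale an asset is allowed to be, optionally evaluated on a cron schedule."""
    maximum_lag_minutes: float
    cron_schedule: Optional[str] = None

# The two policies from the talk: at most five minutes of lag for the first
# summary table, and "have yesterday's data incorporated by 9am" for the second.
events_summary_policy = FreshnessPolicy(maximum_lag_minutes=5)
daily_summary_policy = FreshnessPolicy(maximum_lag_minutes=24 * 60, cron_schedule="0 9 * * *")
```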
Looking at these assets in the UI, the five-minute one is currently late. It's five minutes late because the last time we materialized it was 10 minutes ago, so it's not incorporating all the data that we expect it to. The daily one isn't late, even though we haven't updated it for a while. That's fine, because we don't need it to be updated until 9am tomorrow.
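The lateness calculation behind that display can be sketched in a few lines of plain Python. This is an illustration of the semantics, not Dagster's implementation:

```python
from datetime import datetime, timedelta

def lateness(maximum_lag: timedelta, last_materialized: datetime, now: datetime) -> timedelta:
    """How far past its allowed lag an asset is; zero when it is still fresh."""
    overdue = (now - last_materialized) - maximum_lag
    return max(overdue, timedelta(0))

# The situation above: a 5-minute policy, last materialized 10 minutes ago.
now = datetime(2022, 12, 1, 9, 0)
late_by = lateness(timedelta(minutes=5), now - timedelta(minutes=10), now)
# late_by == timedelta(minutes=5): the asset is five minutes late
```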
Dagster's asset reconciliation sensor will avoid duplicating work when two assets depend on the same upstream asset. It knows that the same materialization of the login events asset can be used to help both of these downstream assets meet their freshness policies.
Achieving the same outcome without freshness-based scheduling would require a complex pyramid of jobs, schedules, and sensors. With freshness-based scheduling, we just tell Dagster how fresh we want our data to be, and it handles the rest.
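To illustrate the de-duplication, here is a toy planner, hypothetical rather than Dagster's actual algorithm, that collects the assets a run needs while visiting each shared upstream only once:

```python
def plan_run(late_assets, parents):
    """Return assets to materialize, parents before children, each asset at most once."""
    to_materialize = []
    for asset in late_assets:
        for parent in parents.get(asset, []):
            if parent not in to_materialize:
                to_materialize.append(parent)
        if asset not in to_materialize:
            to_materialize.append(asset)
    return to_materialize

# Both summary tables are late and share the login_events upstream:
plan = plan_run(
    ["events_summary", "daily_summary"],
    {"events_summary": ["login_events"], "daily_summary": ["login_events"]},
)
# login_events appears once; one materialization serves both downstream assets
```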
Something you'll notice here, which is new, is that we've assigned code versions to all of the assets in our graph. The code version represents the version of the function that computes the asset from its dependencies. If that function changes, we'd expect the contents of the asset to change as well. Because we've changed our function in this case, we're also going to bump the code version to reflect that it changed.
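A sketch of how code versions can feed into staleness: an asset is stale when its current code version differs from the one recorded at its last materialization, or when any upstream asset is stale. This is illustrative only; Dagster's real logic also accounts for data versions:

```python
def stale_assets(code_versions, materialized_versions, parents):
    """Assets whose compute function changed since they last ran, plus their downstream assets."""
    stale = set()

    def is_stale(asset):
        if asset in stale:
            return True
        if code_versions[asset] != materialized_versions.get(asset):
            return True
        return any(is_stale(p) for p in parents.get(asset, []))

    for asset in code_versions:
        if is_stale(asset):
            stale.add(asset)
    return stale

# We bumped login_events's code version from "1" to "2"; it and the summary
# table computed from it are now both considered stale.
stale = stale_assets(
    code_versions={"login_events": "2", "events_summary": "1"},
    materialized_versions={"login_events": "1", "events_summary": "1"},
    parents={"events_summary": ["login_events"]},
)
# stale == {"login_events", "events_summary"}
```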
For a source asset like a file, we can define an observation function that reports a version. One strategy is to take the modification timestamp of the file and use that as its version; an alternative strategy could be to take a hash of the contents of the file. In the web UI, we can then click this button to observe our source asset. That runs our observation function and picks up the latest version. In this case, nothing changes, because the file has the same modification timestamp as the last time that we observed it.
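The two versioning strategies mentioned, modification timestamp versus content hash, can be sketched like this (hypothetical helpers, not Dagster's API):

```python
import hashlib
import os

def version_from_mtime(path: str) -> str:
    """Use the file's modification timestamp as its version: cheap, but any rewrite bumps it."""
    return str(os.path.getmtime(path))

def version_from_hash(path: str) -> str:
    """Hash the file's contents: rewriting identical bytes keeps the version stable."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()
```

If an observation reports the same version as last time, nothing downstream of the file needs to run.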