Description
On June 8th of this year, Sandy Ryza, lead engineer on the Dagster project, gave a presentation at the DATA + AI Summit in San Francisco. The talk was entitled "The Future of Data Orchestration: Asset-Based Orchestration".
We are happy to share the key points of the talk in the video below.
Sandy's thesis: data orchestration is a core component of any batch data processing platform, yet we've been using patterns that haven't changed since the 1980s. Sandy introduces a new pattern and way of thinking for data orchestration, known as asset-based orchestration, with data freshness sensors to trigger pipelines.
Hello, I'm Sandy, and I'm the lead engineer on the open-source Dagster project. I'm here to talk to you today about data pipelines. If you're a data engineer, a data scientist, or an ML engineer, there's a good chance that one of the main things you spend your time on is building and maintaining data pipelines.
What is a data pipeline? First of all, a data pipeline normally culminates in a set of data assets. A data asset is a file, a machine learning model, a table: any persistent object that captures some understanding of the world. The point of a data pipeline is to construct or update data assets that can be used to help make a business decision or power an application. To get to those assets, you're normally going to need to run some computations: pull data from external systems, run a Spark job.
Unless you like to sit at your desk all day clicking buttons, you probably want the computations in your pipeline to run automatically. This aspect of data pipelines, automatically running them, is going to be the topic of this talk. How should you make the decision of when to launch computation to update assets in your pipeline, and how should those decisions be expressed in software?
To answer that question, I believe we need to step back and answer a more fundamental question, which is: why do we run computations to update assets in the first place? Why add to your cloud bill? Why go to the effort? In my experience, there are a couple of reasons. One is that our inputs change. Our data is derived from some upstream data, and that upstream data changes or grows, and we want to keep our downstream data up to date. Different source data changes in different ways; some source data changes constantly.
And the third and final reason is that fresh data is needed by the application or analyst who's using the data. If our upstream data is changing constantly, but we're only going to look at our report once per day, it's often a waste of resources to update it constantly. For example, we might have an asset that we only need updated daily, but for another asset we might want to incorporate new upstream data as soon as that new data arrives.
So there's a set of situations where we want to automatically update our data assets. How do we go about doing this? If you look at the last couple of decades, the answer that you'll probably end up with is a workflow engine. Workflow engines are systems that take responsibility for executing a set of tasks in the right order. Airflow, shown here, is a very popular one, and the way it works is basically that you define a DAG and can then put that DAG on a schedule.
If you squint, this seems like a reasonable way to schedule data pipelines, because your data pipeline is a DAG and you also have this DAG of tasks. But there are some big problems, which have personally caused me much anguish in my years as a data engineer. The biggest one is that it makes the assumption that your computations should be running in lockstep. As we saw earlier, different data arrives at different times, and different data products have different freshness requirements.
A second, related friction with workflow-based orchestration is that every time you add an asset, you have to find a DAG to put it in to get it scheduled. This means that, on one hand, you have to worry about DAGs getting too large and unwieldy, or, at the other extreme, too small and fragmented, which is a code management problem.
Last, you get alerted when your task fails, not when your data is out of date, which is often what you actually care about. If the system can retry and self-correct before the deadline, then nobody needs to get paged.
So imagine we could throw out workflow engines and design an approach from the ground up for orchestrating data pipelines. What would that look like? That's going to be the topic for the rest of this talk. Our ideal orchestration system for data pipelines has a few goals. First of all, there are some important scheduling outcomes: we want data to be ready when it's needed, and we want to avoid redundant work.
Second, as we saw earlier, our requirements for our data assets are the main determinants of when we want computations to run, so we'd like to be able to express scheduling in terms of those assets. And then, last of all, we want to be able to understand the scheduling decisions that our system is making, so that we can debug them. We're going to explore this through Dagster, which is an open-source data orchestrator. Dagster has quote-unquote traditional workflow-based scheduling like we talked about earlier, but we've also recently added a new scheduling subsystem to it that's aimed at the goals we talked about on the last slide. Something that makes Dagster uniquely suited to this kind of scheduling is that it views data pipelines in terms of data assets.
The code inside the function is what Dagster runs when we tell it to materialize the asset, and materialize essentially means updating the asset, that is, running code to update or replace its contents. It's not pictured here, but in this case it reads the event data from the location that it gets dumped to and then writes a processed version of the events table to the data lake. So it's creating the events table.
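The slide with the code isn't reproduced in this transcript, but a minimal sketch of what an asset like that could look like in Dagster follows; the bucket path and the processing logic are hypothetical stand-ins:

```python
from dagster import asset
import pandas as pd


@asset
def events() -> pd.DataFrame:
    # Read the raw event data from the (hypothetical) location it gets
    # dumped to, and return a processed events table. Dagster's I/O manager
    # persists the returned value, for example as a table in the data lake.
    raw = pd.read_json("s3://raw-events-bucket/events.json", lines=True)
    return raw.dropna(subset=["user_id", "event_type"])
```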
The second asset here is a table of logins that's derived from the events table. The decorated function has an argument here that's named after the events table, which tells Dagster that it depends on the events table asset. That dependency is also represented visually here on the right, which is a screenshot from Dagster's UI. And then finally, we've got a third asset that depends on both of the assets that we just discussed.
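A sketch of those two downstream assets, with made-up column names and placeholder logic, showing how the function arguments encode the dependencies that Dagster draws in its UI:

```python
from dagster import asset
import pandas as pd


@asset
def logins(events: pd.DataFrame) -> pd.DataFrame:
    # The "events" argument tells Dagster this asset depends on the events asset.
    return events[events["event_type"] == "login"]


@asset
def fraudulent_logins(events: pd.DataFrame, logins: pd.DataFrame) -> pd.DataFrame:
    # Depends on both upstream assets; the fraud heuristic here is a placeholder.
    busy_users = events.groupby("user_id").size()
    suspicious = busy_users[busy_users > 1000].index
    return logins[logins["user_id"].isin(suspicious)]
```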
The Dagster UI lets you manually materialize assets by clicking a button. Again, materializing just means running some computation to update the data asset. And that's cool, but the whole point of this talk is how you avoid needing to sit at your computer every hour and click that button. How do we materialize this asset automatically at the right time?
Dagster lets you specify this by adding what it calls an auto-materialize policy to the asset definition itself. An auto-materialize policy essentially describes when we want to update a particular asset. The most common kind of auto-materialize policy is what we call an eager auto-materialize policy, and that basically just means: update this asset whenever the upstream assets that it depends on get updated.
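In code, the policy is passed to the asset decorator. A minimal sketch, with the asset body elided:

```python
from dagster import AutoMaterializePolicy, asset


@asset(auto_materialize_policy=AutoMaterializePolicy.eager())
def logins(events):
    # With an eager policy, Dagster re-materializes this asset whenever
    # its upstream events asset gets a newer materialization.
    ...
```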
A source asset is exactly what we just described: an asset that Dagster knows about, but that Dagster doesn't materialize. In this case, it represents the storage bucket that the raw events data gets dumped to. Dagster then lets us write arbitrary code that checks this file and sees if it's changed. Here's a code definition for the source asset that we were just talking about. Just like our other assets, it's a decorated function, but the code in this function isn't generating the asset.
It's checking to see whether the asset has changed. Every time the code runs, it returns a version string, and if the version string is different than last time, that means the file has changed. We can set up this code to run at some interval, like every minute, and when it indicates that the asset has changed, Dagster will then auto-materialize any downstream assets that have eager policies.
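A sketch of what that version-returning source asset could look like, assuming the raw events land in an object-store bucket; the bucket, key, and S3 client usage are illustrative:

```python
import boto3

from dagster import DataVersion, observable_source_asset


@observable_source_asset
def raw_events() -> DataVersion:
    # Instead of producing the asset, this function observes it: it returns
    # a version string that changes whenever the underlying file changes.
    obj = boto3.client("s3").head_object(
        Bucket="raw-events-bucket", Key="events.json"
    )
    return DataVersion(obj["ETag"])
```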
So we looked at policies that materialize downstream data as soon as upstream data changes. That's useful in many situations, but in many others it's too often. For example, you might have a data source that's changing every hour, or even every second, but the downstream data asset doesn't need to be that fresh, so it would be wasteful to constantly recompute it. Or you might have assets whose sole purpose is to power other data assets. Those intermediate assets only need to be materialized if the downstream data assets need them to be up to date.
A
Otherwise,
there's
no
point
in
materializing
them,
so
with
lazy,
Auto,
materialized
policies,
instead
of
eagerly
acting
as
soon
as
Upstream
data
changes,
you
wait
until
data
is
needed.
Downstream
dagster
expresses
this
idea
of
quote-unquote
needed
Downstream
with
a
concept
called
a
freshness
policy.
A
freshness
policy
is
essentially
a
data
SLA.
It
defines
how
fresh
a
data
asset
needs
to
be
here.
We've
added
a
freshness
policy
to
our
fraudulent
logins
model.
It
expresses
that
new
source
data
needs
to
be
incorporated
into
the
model
within
a
day
of
when
it
arrives.
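A sketch of that combination: a lazy auto-materialize policy paired with a one-day freshness policy on the fraudulent logins model, with the body elided:

```python
from dagster import AutoMaterializePolicy, FreshnessPolicy, asset


@asset(
    auto_materialize_policy=AutoMaterializePolicy.lazy(),
    freshness_policy=FreshnessPolicy(maximum_lag_minutes=24 * 60),
)
def fraudulent_logins(logins):
    # Dagster materializes this asset (and whatever it needs upstream) only
    # when doing so is required to keep it within a day of new source data.
    ...
```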
If it isn't, then the model would be considered overdue. To illustrate what this looks like, here is a timeline. In this case, our asset is considered fresh because, after new source data arrived, materializations happened that allowed that source data to flow into our fraudulent logins asset. And here's a case where our asset is considered overdue, because it wasn't materialized in time.
When you give your asset a lazy auto-materialize policy, Dagster will materialize it when doing so will help it, or a downstream asset, meet its freshness policy. So, once per day, Dagster is going to notice that both of these assets need to be materialized in order to meet the freshness policy on the logins model, and then it's going to automatically materialize them. One of the situations where freshness-based scheduling really shines is when the same asset is upstream of assets that have different freshness policies.
So here we have a logins table that's upstream of both the logins dashboard and the fraud model. The dashboard needs to be updated hourly, but the fraud model, which is more expensive to compute, only needs to be updated daily. Trying to schedule this work with workflows gets very awkward quickly. One option would be to use two overlapping workflows, one that runs hourly and one that runs daily, but sometimes those workflows will run at the same time and will redundantly update the upstream table twice when we only need it once.
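A sketch of that fan-out in freshness terms, with the shared logins table upstream of an hourly dashboard and a daily fraud model; the names and bodies are placeholders:

```python
from dagster import AutoMaterializePolicy, FreshnessPolicy, asset

lazy = AutoMaterializePolicy.lazy()


@asset(auto_materialize_policy=lazy)
def logins(events):
    ...  # shared upstream table, materialized only when a downstream asset needs it


@asset(
    auto_materialize_policy=lazy,
    freshness_policy=FreshnessPolicy(maximum_lag_minutes=60),  # fresh within an hour
)
def logins_dashboard(logins):
    ...


@asset(
    auto_materialize_policy=lazy,
    freshness_policy=FreshnessPolicy(maximum_lag_minutes=24 * 60),  # fresh within a day
)
def fraud_model(logins):
    ...
```

Under these policies, the hourly materializations of logins that keep the dashboard fresh can also satisfy the daily fraud model, so the shared table isn't recomputed twice the way two overlapping workflows would recompute it.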
If any of those are true, then we won't materialize. Combined, these two kinds of conditions give a full picture of why an asset is or isn't being automatically materialized, and they let you debug what's happening when you're not getting the behavior that you expect. Let's sum up what we talked about here. First, data pipelines are graphs of data assets, like files, tables, and ML models, connected by computations.