From YouTube: Partitioned Data Pipelines in Data Engineering
Description
Partitioning is a technique that helps data engineers and ML engineers organize data and the computations that produce that data.
I'm the lead engineer on the Dagster project, and I'm here to talk to you about partitioned data pipelines. Partitioning is a technique that helps data engineers and machine learning engineers organize data and the computations that produce that data. Partitioning also makes data pipelines more performant by letting them operate on subsets of data instead of all of it at once.
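The asset demoed in the video isn't shown in the transcript, but a minimal sketch of an hourly partitioned Dagster asset might look like this (the asset name, start date, and output file are hypothetical):

```python
import json

from dagster import AssetExecutionContext, HourlyPartitionsDefinition, asset

# One partition per hour, starting from an arbitrary start date.
hourly_partitions = HourlyPartitionsDefinition(start_date="2023-01-01-00:00")

@asset(partitions_def=hourly_partitions)
def hourly_events(context: AssetExecutionContext) -> None:
    # The partition key names the hour this run covers, e.g. "2023-01-01-00:00".
    hour = context.partition_key
    # Stand-in for real ingestion: write just that hour's slice of data.
    with open(f"events_{hour}.json", "w") as f:
        json.dump({"hour": hour, "events": []}, f)
```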
If we go to the asset details page, we can inspect individual partitions of our asset. I'm clicking here on the partition that corresponds to a particular hour of a particular day, and I can inspect the metadata for that partition, such as the file where it's stored. I can also click into the run that materialized it to see the logs. We can launch runs to fill in or recompute partitions of our asset.
By default, this backfill will launch a separate run for each partition. However, we can alternatively choose the option to launch a single run that covers all partitions, which is helpful if we're using a parallel processing engine like Spark or Snowflake and we want to execute our backfill in a single query or job.
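As I understand it, the single-run option corresponds to the asset's backfill policy. A sketch of how that might be set in code, reusing the hypothetical hourly asset from the earlier sketch:

```python
from dagster import BackfillPolicy, asset

@asset(
    partitions_def=hourly_partitions,  # from the earlier sketch
    # Cover the whole selected range in one run, so a single Spark or
    # Snowflake job can process every partition in the backfill.
    backfill_policy=BackfillPolicy.single_run(),
)
def hourly_events_bulk() -> None: ...
```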
We can also schedule our asset so that each hourly partition will be filled in at the end of that hour.
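A sketch of what that schedule might look like, assuming a hypothetical job that targets the hourly asset from before; build_schedule_from_partitioned_job derives the cron cadence from the partitions definition:

```python
from dagster import build_schedule_from_partitioned_job, define_asset_job

# Hypothetical job that materializes the hourly asset from the earlier sketch.
hourly_events_job = define_asset_job(
    "hourly_events_job",
    selection=["hourly_events"],
    partitions_def=hourly_partitions,
)

# Runs once an hour, targeting the partition for the hour that just ended.
hourly_events_schedule = build_schedule_from_partitioned_job(hourly_events_job)
```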
Partitioned assets can depend on other partitioned assets, creating a partitioned data pipeline. Here's some code that implements this pattern: we have the hourly asset that we looked at before, along with another hourly asset that depends on it. Each hourly partition of this downstream asset depends on the corresponding partition in the upstream asset.
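The code shown on screen isn't captured in the transcript; a minimal sketch of the pattern, building on the hypothetical hourly_events asset from before:

```python
from dagster import AssetExecutionContext, asset

@asset(partitions_def=hourly_partitions, deps=[hourly_events])
def hourly_summaries(context: AssetExecutionContext) -> None:
    # With matching partitions definitions, each hourly partition of this
    # asset depends on the hourly_events partition for the same hour.
    hour = context.partition_key
    ...
```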
If we go back into the UI, we can select both of them and then click the materialize button to launch runs that materialize those partitions in order. Dagster can also handle dependencies between assets with different time partitionings.
Here's code that includes the hourly assets that we looked at before, along with a daily partitioned asset that depends on both of them.
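Again, the on-screen code isn't in the transcript; a sketch under the same assumptions. Dagster's default time-window partition mapping is what relates the daily partitions to the hourly ones:

```python
from dagster import AssetExecutionContext, DailyPartitionsDefinition, asset

daily_partitions = DailyPartitionsDefinition(start_date="2023-01-01")

@asset(partitions_def=daily_partitions, deps=[hourly_events, hourly_summaries])
def daily_rollup(context: AssetExecutionContext) -> None:
    # The default time-window mapping relates each daily partition to the
    # 24 upstream hourly partitions that fall within that day.
    day = context.partition_key
    ...
```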
Each daily partition depends on the 24 upstream hourly partitions for the same day. In the UI, we can select all of these assets and then launch a backfill over a selected time range. While it's executing this backfill, Dagster will wait until all the upstream hourly partitions are filled before filling in the corresponding downstream daily partition.
Partitions don't have to be time windows. Data for different countries might arrive at different times, so partitioning by country allows us to update the data for a particular country without touching the data for other countries. We define this asset by constructing a static partitions definition with a list of countries and assigning it to our asset.
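A sketch of such a statically partitioned asset; the country list and asset name are hypothetical:

```python
from dagster import AssetExecutionContext, StaticPartitionsDefinition, asset

country_partitions = StaticPartitionsDefinition(["usa", "brazil", "japan", "france"])

@asset(partitions_def=country_partitions)
def weather_stations(context: AssetExecutionContext) -> None:
    # Each run targets one country, so we can refresh a single country's
    # data without touching the others.
    country = context.partition_key
    ...
```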
So far, we've looked at time-partitioned assets, and we've looked at statically partitioned assets as well. What if we want an asset to be both? For example, maybe we have an asset that contains weather events from the weather stations that we tracked in our previous asset. Each day, we add weather events for each country, so we want a separate partition for every date, for every country.
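A sketch of how this two-dimensional partitioning might be expressed with Dagster's MultiPartitionsDefinition; the dimension names and asset name are hypothetical:

```python
from dagster import (
    AssetExecutionContext,
    DailyPartitionsDefinition,
    MultiPartitionsDefinition,
    StaticPartitionsDefinition,
    asset,
)

date_country_partitions = MultiPartitionsDefinition(
    {
        "date": DailyPartitionsDefinition(start_date="2023-01-01"),
        "country": StaticPartitionsDefinition(["usa", "brazil", "japan", "france"]),
    }
)

@asset(partitions_def=date_country_partitions)
def weather_events(context: AssetExecutionContext) -> None:
    # A multi-dimensional partition key carries one key per dimension.
    keys = context.partition_key.keys_by_dimension
    date, country = keys["date"], keys["country"]
    ...
```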
When we materialize this asset, we can choose both a country and a date to target. We can also launch a backfill that covers both our country-partitioned asset and our multi-dimensional asset. For example, we could backfill all the historical data for both the USA and Brazil in both of these assets.
In everything that we've looked at so far, the set of partitions is fully determined by the code that defines the asset. But in some situations, we need to be able to add and remove partitions dynamically. For example, consider a data pipeline that creates a derived file for every file that lands in a particular directory: as new files land, we need to create new partitions to represent them. Or consider a machine learning pipeline that we want to run with ad hoc hyperparameters to create a set of ML models that we can compare. Each time we launch a run with a new set of hyperparameters, we want to create a new partition to represent the new machine learning model that is generated by that run. In Dagster, we can handle these situations with dynamically partitioned assets.
Here's a dynamically partitioned asset with a partition for every release of the Dagster project itself. Dagster publishes a release roughly once a week, but some weeks have no release and other weeks have multiple releases.
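The asset's source isn't captured in the transcript; a sketch using DynamicPartitionsDefinition, with hypothetical names:

```python
from dagster import AssetExecutionContext, DynamicPartitionsDefinition, asset

# Partition keys are added and removed at runtime instead of being fixed in code.
releases_partitions = DynamicPartitionsDefinition(name="releases")

@asset(partitions_def=releases_partitions)
def release_metrics(context: AssetExecutionContext) -> None:
    # The partition key is a release identifier, e.g. "1.5.0".
    release = context.partition_key
    ...
```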
It's common to combine dynamically partitioned assets with Dagster sensors. Here's a sensor that monitors GitHub for new Dagster releases. When it finds a new release, it adds a new partition for that release, and then it requests a run to materialize that partition in the whole pipeline of assets that are partitioned by release.
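A sketch of what such a sensor might look like. The GitHub lookup is stubbed out, and the job and helper names are hypothetical; the key pieces are the request to add partitions and the run request keyed to each new partition:

```python
from dagster import (
    RunRequest,
    SensorEvaluationContext,
    SensorResult,
    define_asset_job,
    sensor,
)

release_job = define_asset_job("release_job", selection=["release_metrics"])

def fetch_new_releases() -> list[str]:
    # Stub standing in for a real GitHub API call.
    return []

@sensor(job=release_job)
def release_sensor(context: SensorEvaluationContext) -> SensorResult:
    new_releases = fetch_new_releases()
    return SensorResult(
        # Register a partition for each newly discovered release...
        dynamic_partitions_requests=[
            releases_partitions.build_add_request(new_releases)
        ],
        # ...and request a run to materialize each new partition.
        run_requests=[RunRequest(partition_key=r) for r in new_releases],
    )
```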
We can also add dynamic partitions through the Dagster UI. If we select one of our assets and click the materialize button, we can type in the name of a release and then launch a run for it.
So that was a whirlwind tour of Dagster's partitioning functionality. Using partitions in your data pipelines has some big advantages.
It helps you monitor and materialize the subsets of your data that you care about in a particular context. This gives you peace of mind that you're operating on the data that you need to be, and it avoids wasting computation on data that you don't need to touch. To learn more, visit dagster.io and look for partitions in our docs. Thank you.