From YouTube: Comparing Apache Airflow and Dagster
Description
Many data engineers are looking to get past the limitations of Apache Airflow, the incumbent in the data orchestration layer. Dagster proposes a new paradigm centered on Data Assets and the tools to support a full development lifecycle that radically boosts the productivity of data engineering teams.
I'm the lead engineer on the Dagster project. Before Dagster, I spent years as a data engineer and machine learning engineer, and I used Airflow extensively in those roles. I joined the Dagster project in large part because of my frustrations with Airflow: I found that I was spending more of my time fighting with it than working with the data.
The point of a data pipeline is typically to produce and maintain a set of data assets, like tables, files, or machine learning models. Accomplishing that usually requires modeling a graph of computations and intermediate data that get you from the source data you're starting with to the data products you're trying to create. Airflow helps out with this because it's a workflow engine: it models a graph of tasks and executes them on a fixed schedule.
It was the first Python-based workflow engine to have a full web interface, which set it on the road to becoming one of the most popular tools for running data pipelines. But first does not always mean best. Airflow is designed in a way that we believe actually makes it a poor fit for the task of building and maintaining data pipelines.
First, it schedules tasks, but it doesn't understand that tasks are built to produce and maintain data assets. Second, it's focused on production environments that support heavyweight infrastructure with long-running processes, which makes pipelines hard to work with in local development, unit tests, continuous integration, code review, or debugging. That also results in poor reliability, because if you can't catch errors before your changes make it to production, you'll catch them in production. And third, it makes it hard to understand what's going on when a pipeline is deployed, because it mainly gives you visibility into what tasks have run, not what data assets have been updated.
Dagster takes a broader view. It was designed to assist with the holistic task of developing pipelines of data assets and evolving those pipelines over time. We believe that taking this broader view can make data teams dramatically more productive and data pipelines dramatically more reliable. To make this more concrete, let's start by zooming in on the phases of the development lifecycle: what's the difference between Dagster and Airflow when developing data pipelines?
Developing with Airflow is difficult because Airflow pipelines are heavyweight and difficult to run quickly as part of an iterative development loop. All Airflow runs go through its scheduler loop, which means that to run any pipeline in Airflow, you need a long-running scheduler process that's monitoring a database, and after launching a run, you need to wait for the scheduler to see it.
Also, to avoid dependency conflicts, most guides recommend defining Airflow tasks with operators like the KubernetesPodOperator, which dictate that the task gets executed in a particular environment, like Kubernetes. When a DAG is written in this way, with the pipeline bound to a particular execution environment, it's near impossible to run it locally or as part of continuous integration, unless you want to set up a Kubernetes cluster on your laptop. Dagster, on the other hand, was built from the start to support rapid development and prototyping of data pipelines. Dagster's programming model encourages separating business logic from infrastructure.
This means that you can have a pipeline that runs distributed across Kubernetes in production, but also run it within a single Python process during a unit test, without sacrificing dependency isolation. Dagster execution is extremely lightweight: it doesn't require any long-running services or schedulers if you don't want them. If you do want to access Dagster's UI, you can just type dagster dev on the command line and be up and running.
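The separation of business logic from infrastructure can be sketched in plain Python. This is an illustrative pattern, not Dagster's actual resource API, and all the names here (InMemoryStorage, build_cleaned_users) are hypothetical: the transformation code depends only on an abstract storage interface, so production can inject a real object store while a unit test injects an in-memory fake and runs everything in one process.

```python
# Illustrative sketch of separating business logic from infrastructure.
# Not Dagster's real resource API -- just the underlying pattern: the
# pipeline logic depends on a storage interface, so the same code can
# run against production infrastructure or an in-memory stand-in.

class InMemoryStorage:
    """Stand-in for a production object store (e.g. S3) in tests."""
    def __init__(self):
        self._data = {}

    def write(self, key, value):
        self._data[key] = value

    def read(self, key):
        return self._data[key]


def build_cleaned_users(storage):
    """Business logic: read raw users, drop rows without an email."""
    raw = storage.read("raw_users")
    cleaned = [u for u in raw if u.get("email")]
    storage.write("cleaned_users", cleaned)
    return cleaned


# In a unit test, the whole "pipeline" runs in a single process:
storage = InMemoryStorage()
storage.write("raw_users", [{"email": "a@x.com"}, {"email": None}])
result = build_cleaned_users(storage)
```

Swapping InMemoryStorage for a class with the same read/write methods backed by real infrastructure leaves the business logic untouched.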
Dagster also has rich testing APIs, which make it easy to write unit tests for any component of a data pipeline and to stub out external services that the pipeline interacts with. Another big difference between Dagster and Airflow is the abstractions they offer for building and operating data pipelines. Dagster sees the goal of a data pipeline as producing a set of data assets, like tables, files, or machine learning models. Dagster's programming model and user interface are heavily focused on that goal, so it allows you to think in assets when you're building and operating your data pipelines.
Airflow, on the other hand, is primarily a task orchestrator. An Airflow DAG is a workflow of tasks connected by execution dependencies. Airflow recently introduced a dataset abstraction, but it's bolted loosely on top, not a core part of the operating model or programming model. Thinking in assets allows you to express your intentions more directly, which means less boilerplate code.
As an example, here's a comparison of the same data pipeline written in both Airflow and Dagster. The pipeline has one data asset that's derived from another data asset. With Airflow's APIs, you need to tell Airflow that the task building the second asset should run after the task building the first asset, and then also read from the first asset in the second task. It's a lot to keep track of. In Dagster's APIs, you just express the dependency between assets in one place.
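The contrast can be made concrete with a toy sketch of the asset style. The asset decorator and materialize helper below are hypothetical, not the real Dagster API: in the task style you would separately wire the execution order (task1 runs before task2) and read task1's output inside task2, whereas here a function's parameter name declares the upstream asset once, and the framework derives both the ordering and the data handoff from it.

```python
import inspect

# Toy sketch of "thinking in assets" -- hypothetical helpers, not the
# real Dagster API. An @asset function's parameter names declare its
# upstream assets; materializing resolves the graph from that alone.

_ASSETS = {}

def asset(fn):
    """Register a function as an asset, keyed by its name."""
    _ASSETS[fn.__name__] = fn
    return fn

def materialize(name, cache=None):
    """Recursively build an asset and everything upstream of it."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn = _ASSETS[name]
        deps = inspect.signature(fn).parameters  # parameter names = upstream assets
        cache[name] = fn(*(materialize(d, cache) for d in deps))
    return cache[name]

@asset
def raw_users():
    return [{"email": "a@x.com"}, {"email": None}]

@asset
def cleaned_users(raw_users):
    # The parameter name alone declares the dependency -- no separate
    # "run after" wiring, no separate read of the upstream output.
    return [u for u in raw_users if u["email"]]

result = materialize("cleaned_users")
```

The dependency between the two assets appears in exactly one place: the signature of cleaned_users.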
After you've written your data pipeline, you typically use your orchestrator's web UI to monitor it. Airflow's UI is primarily concerned with what tasks ran, but Dagster's web UI, which is pictured here, also focuses on the data that was produced by those tasks. It makes it easy to include metadata about that data and track how it evolves over time.
Another benefit of Dagster's asset focus is that it enables much deeper integrations with modern data stack tools. For example, consider dbt, a tool that helps analytics engineers write SQL to build tables. Airflow focuses on tasks, so it represents the entire dbt model graph as a single node in its DAG. In Dagster, dbt models are easy to represent as Dagster assets.
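The difference can be sketched with a toy dbt manifest. The manifest dict below is hypothetical, only loosely inspired by dbt's real manifest.json, and the node dicts stand in for either orchestrator's abstractions: a task-centric orchestrator wraps the whole dbt run in one opaque node, while an asset-centric one expands each model, with its dependencies, into its own node, preserving the lineage.

```python
# Toy sketch: expanding a dbt project into per-model asset nodes.
# The manifest structure here is hypothetical, loosely modeled on
# dbt's manifest.json; the node dicts stand in for orchestrator
# abstractions in either system.

manifest = {
    "stg_orders": {"depends_on": []},
    "stg_customers": {"depends_on": []},
    "order_summary": {"depends_on": ["stg_orders", "stg_customers"]},
}

# Task-centric view: the entire dbt run is one opaque node.
task_graph = [{"name": "run_dbt", "deps": []}]

# Asset-centric view: one node per dbt model, preserving lineage.
asset_graph = [
    {"name": model, "deps": info["depends_on"]}
    for model, info in manifest.items()
]
```

In the asset-centric view, the orchestrator can show, schedule, and backfill order_summary individually, rather than rerunning the whole project.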