From YouTube: Comparing Apache Airflow and Dagster
Description
Many data engineers are looking to get past the limitations of Apache Airflow, the incumbent in the data orchestration layer. Dagster proposes a new paradigm centered on Data Assets and the tools to support a full development lifecycle that radically boosts the productivity of data engineering teams.
I'm the lead engineer on the Dagster project. Before Dagster, I spent years as a data engineer and machine learning engineer, and I used Airflow extensively in those roles. I joined the Dagster project in large part because of my frustrations with Airflow: I found that I was spending more of my time fighting with it than working with the data.
The point of a data pipeline is typically to produce and maintain a set of data assets, like tables, files, or machine learning models. Accomplishing that usually requires modeling a graph of computations and intermediate data that get you from the source data you're starting with to the data products you're trying to create. Airflow helps out with this because it's a workflow engine: it models a graph of tasks and executes them on a fixed schedule.
It was the first Python-based workflow engine to have a full web interface, which set it on the road to becoming one of the most popular tools for running data pipelines. But first does not always mean best. Airflow is designed in a way that we believe actually makes it a poor fit for the task of building and maintaining data pipelines.
First, it schedules tasks, but it doesn't understand that tasks are built to produce and maintain data assets. Second, it's focused on production environments that support heavyweight infrastructure with long-running processes, which makes pipelines hard to work with in local development, unit tests, continuous integration, code review, or debugging. That also results in poor reliability, because if you can't catch errors before your changes make it to production, you'll catch them in production. And third, it makes it hard to understand what's going on when a pipeline is deployed, because it mainly gives you visibility into what tasks have run, not what data assets have been updated.
Dagster takes a broader view. It was designed to assist with the holistic task of developing pipelines of data assets and evolving those pipelines over time. We believe that taking this broader view can make data teams dramatically more productive and data pipelines dramatically more reliable. To make this more concrete, let's start by zooming in on the phases of the development lifecycle: what's the difference between Dagster and Airflow when developing data pipelines?
Developing with Airflow is difficult because Airflow pipelines are heavyweight and difficult to run quickly as part of an iterative development loop. All Airflow runs go through its scheduler loop, which means that to run any pipeline in Airflow, you need a long-running scheduler process that's monitoring a database, and after launching a run, you need to wait for the scheduler to see it.
Also, to avoid dependency conflicts, most guides recommend defining Airflow tasks with operators like the KubernetesPodOperator, which dictate that the task gets executed in a particular environment, like Kubernetes. When a DAG is written in this way, with the pipeline bound to a particular execution environment, it's near impossible to run it locally or as part of continuous integration, unless you want to set up a Kubernetes cluster on your laptop. Dagster, on the other hand, was built from the start to support rapid development and prototyping of data pipelines. Dagster's programming model encourages separating business logic from infrastructure.
This means that you can have a pipeline that runs distributed across Kubernetes in production, but also run it within a single Python process during a unit test, without sacrificing dependency isolation. Dagster execution is extremely lightweight: it doesn't require any long-running services or schedulers if you don't want them. If you do want to access Dagster's UI, you can just type dagster dev on the command line and be up and running.
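The separation of business logic from infrastructure can be sketched in plain Python. This is an illustrative pattern, not Dagster's actual resource API, and all the names here (InMemoryStorage, build_cleaned_users) are hypothetical: the transformation code depends only on an abstract storage interface, so production can inject a real object store while a unit test injects an in-memory fake and runs everything in one process.

```python
# Illustrative sketch of separating business logic from infrastructure.
# Not Dagster's real resource API -- just the underlying pattern: the
# pipeline logic depends on a storage interface, so the same code can
# run against production infrastructure or an in-memory stand-in.

class InMemoryStorage:
    """Stand-in for a production object store (e.g. S3) in tests."""
    def __init__(self):
        self._data = {}

    def write(self, key, value):
        self._data[key] = value

    def read(self, key):
        return self._data[key]


def build_cleaned_users(storage):
    """Business logic: read raw users, drop rows without an email."""
    raw = storage.read("raw_users")
    cleaned = [u for u in raw if u.get("email")]
    storage.write("cleaned_users", cleaned)
    return cleaned


# In a unit test, the whole "pipeline" runs in a single process:
storage = InMemoryStorage()
storage.write("raw_users", [{"email": "a@x.com"}, {"email": None}])
result = build_cleaned_users(storage)
```

Swapping InMemoryStorage for a class with the same read/write methods backed by real infrastructure leaves the business logic untouched.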
Dagster also has rich testing APIs, which make it easy to write unit tests for any component of a data pipeline and to stub out external services that the pipeline interacts with. Another big difference between Dagster and Airflow is the abstractions they offer for building and operating data pipelines. Dagster sees the goal of a data pipeline as producing a set of data assets, like tables, files, or machine learning models. Dagster's programming model and user interface are heavily focused on that goal, so it allows you to think in assets when you're building and operating your data pipelines.
Airflow, on the other hand, is primarily a task orchestrator. An Airflow DAG is a workflow of tasks connected by execution dependencies. Airflow recently introduced a dataset abstraction, but it's bolted loosely on top, not a core part of the operating model or programming model. Thinking in assets allows you to express your intentions more directly, which means less boilerplate code.
As an example, here's a comparison of the same data pipeline written in both Airflow and Dagster. The pipeline has one data asset that's derived from another data asset. With Airflow's APIs, you need to tell Airflow that the task building the second asset should run after the task building the first asset, and then also read from the first asset in the second task. It's a lot to keep track of. In Dagster's APIs, you just express the dependency between assets in one place.
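The contrast can be made concrete with a toy sketch of the asset style. The asset decorator and materialize helper below are hypothetical, not the real Dagster API: in the task style you would separately wire the execution order (task1 runs before task2) and read task1's output inside task2, whereas here a function's parameter name declares the upstream asset once, and the framework derives both the ordering and the data handoff from it.

```python
import inspect

# Toy sketch of "thinking in assets" -- hypothetical helpers, not the
# real Dagster API. An @asset function's parameter names declare its
# upstream assets; materializing resolves the graph from that alone.

_ASSETS = {}

def asset(fn):
    """Register a function as an asset, keyed by its name."""
    _ASSETS[fn.__name__] = fn
    return fn

def materialize(name, cache=None):
    """Recursively build an asset and everything upstream of it."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn = _ASSETS[name]
        deps = inspect.signature(fn).parameters  # parameter names = upstream assets
        cache[name] = fn(*(materialize(d, cache) for d in deps))
    return cache[name]

@asset
def raw_users():
    return [{"email": "a@x.com"}, {"email": None}]

@asset
def cleaned_users(raw_users):
    # The parameter name alone declares the dependency -- no separate
    # "run after" wiring, no separate read of the upstream output.
    return [u for u in raw_users if u["email"]]

result = materialize("cleaned_users")
```

The dependency between the two assets appears in exactly one place: the signature of cleaned_users.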
After you've written your data pipeline, you typically use your orchestrator's web UI to monitor it. Airflow's UI is primarily concerned with what tasks ran, but Dagster's web UI, which is pictured here, also focuses on the data that was produced by those tasks. It makes it easy to include metadata about that data and track how it evolves over time.
Another benefit of Dagster's asset focus is that it enables much deeper integrations with modern data stack tools. For example, consider dbt, a tool that helps analytics engineers write SQL to build tables. Airflow focuses on tasks, so it represents the entire dbt model graph as a single node in its DAG. In Dagster, dbt models are easy to represent as Dagster assets.
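The difference can be sketched with a toy dbt manifest. The manifest dict below is hypothetical, only loosely inspired by dbt's real manifest.json, and the node dicts stand in for either orchestrator's abstractions: a task-centric orchestrator wraps the whole dbt run in one opaque node, while an asset-centric one expands each model, with its dependencies, into its own node, preserving the lineage.

```python
# Toy sketch: expanding a dbt project into per-model asset nodes.
# The manifest structure here is hypothetical, loosely modeled on
# dbt's manifest.json; the node dicts stand in for orchestrator
# abstractions in either system.

manifest = {
    "stg_orders": {"depends_on": []},
    "stg_customers": {"depends_on": []},
    "order_summary": {"depends_on": ["stg_orders", "stg_customers"]},
}

# Task-centric view: the entire dbt run is one opaque node.
task_graph = [{"name": "run_dbt", "deps": []}]

# Asset-centric view: one node per dbt model, preserving lineage.
asset_graph = [
    {"name": model, "deps": info["depends_on"]}
    for model, info in manifest.items()
]
```

In the asset-centric view, the orchestrator can show, schedule, and backfill order_summary individually, rather than rerunning the whole project.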