Description
On June 8th of this year, Sandy Ryza, lead engineer on the Dagster project, gave a presentation at the DATA + AI Summit in San Francisco. The talk was entitled "The Future of Data Orchestration: Asset-Based Orchestration".
We are happy to share the key points of the talk in the video below.
Sandy's thesis: data orchestration is a core component of any batch data processing platform, yet we've been using patterns that haven't changed since the 1980s. Sandy introduces a new pattern and way of thinking for data orchestration, known as asset-based orchestration, with data freshness sensors to trigger pipelines.
Hello, I'm Sandy, and I'm the lead engineer on the open-source Dagster project. I'm here to talk to you today about data pipelines. If you're a data engineer, a data scientist, or an ML engineer, there's a good chance that one of the main things you spend your time on is building and maintaining data pipelines.
What is a data pipeline? First of all, a data pipeline normally culminates in a set of data assets. A data asset is a file, a machine learning model, a table: any persistent object that captures some understanding of the world. The point of a data pipeline is to construct or update data assets that can be used to help make a business decision or power an application. To get to those assets, you're normally going to need to run some computations: pull data from external systems, run a Spark job.
Unless you like to sit at your desk all day clicking buttons, you probably want the computations in your pipeline to run automatically. This aspect of data pipelines, automatically running them, is going to be the topic of this talk. How should you make the decision of when to launch computation to update assets in your pipeline, and how should those decisions be expressed in software?
To answer that question, I believe we need to step back and answer a more fundamental question, which is: why do we run computations to update assets in the first place? Why add to your cloud bill? Why go to the effort? In my experience, there are a couple of reasons. One is that our inputs change. Our data is derived from some upstream data, and that upstream data changes or grows, and we want to keep our downstream data up to date. Different source data changes in different ways; some source data changes constantly.
And the third and final reason is that fresh data is needed by the application or analyst who's using the data. If our upstream data is changing constantly, but we're only going to look at our report once per day, it's often a waste of resources to update it constantly. For example, we might have an asset that we only need updated daily, but for another asset we might want to incorporate new upstream data as soon as that new data arrives.
So there's a set of situations where we want to automatically update our data assets. How do we go about doing this? If you look at the last couple of decades, the answer that you'll probably end up with is a workflow engine. Workflow engines are systems that take responsibility for executing a set of tasks in the right order. Airflow, shown here, is a very popular one, and the way it works is basically that you define a DAG and can then put that DAG on a schedule.
If you squint, this seems like a reasonable way to schedule data pipelines, because your data pipeline is a DAG and you also have this DAG of tasks. But there are some big problems, which have personally caused me much anguish in my years as a data engineer. The biggest one is that it makes the assumption that your computations should be running in lockstep. As we saw earlier, different data arrives at different times, and different data products have different freshness requirements.
A second, related friction with workflow-based orchestration is that every time you add an asset, you have to find a DAG to put it in to get it scheduled. This means that, on one hand, you have to worry about DAGs getting too large and unwieldy, or, at the other extreme, too small and fragmented, which is a code management problem.
Last, you get alerted when your task fails, not when your data is out of date, which is often what you actually care about. If the system can retry and self-correct before the deadline, then nobody needs to get paged.
So imagine we could throw out workflow engines and design an approach from the ground up for orchestrating data pipelines. What would that look like? That's going to be the topic for the rest of this talk. Our ideal orchestration system for data pipelines has a few goals. First of all, there are some important scheduling outcomes: we want data to be ready when it's needed, and we want to avoid redundant work.
Second, as we saw earlier, our requirements for our data assets are the main determinants of when we want computations to run, so we'd like to be able to express scheduling in terms of those assets. And then, last of all, we want to be able to understand the scheduling decisions that our system is making, so that we can debug them. We're going to explore this through Dagster, which is an open-source data orchestrator. Dagster has quote-unquote traditional workflow-based scheduling like we talked about earlier, but we've also recently added a new scheduling subsystem to it that's aimed at the goals we talked about on the last slide. Something that makes Dagster uniquely suited to this kind of scheduling is that it views data pipelines in terms of data assets.
The code inside the function is what Dagster runs when we tell it to materialize the asset, and materialize essentially means updating the asset, that is, running code to update or replace its contents. It's not pictured here, but in this case it reads the event data from the location that it gets dumped to and then writes a processed version of the events table to the data lake. So it's creating the events table.
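The slide with the code isn't reproduced in this transcript, but a minimal sketch of what an asset like that could look like in Dagster follows; the bucket path and the processing logic are hypothetical stand-ins:

```python
from dagster import asset
import pandas as pd


@asset
def events() -> pd.DataFrame:
    # Read the raw event data from the (hypothetical) location it gets
    # dumped to, and return a processed events table. Dagster's I/O manager
    # persists the returned value, for example as a table in the data lake.
    raw = pd.read_json("s3://raw-events-bucket/events.json", lines=True)
    return raw.dropna(subset=["user_id", "event_type"])
```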
The second asset here is a table of logins that's derived from the events table. The decorated function has an argument here that's named after the events table, which tells Dagster that it depends on the events table asset. That dependency is also represented visually here on the right, which is a screenshot from Dagster's UI. And then finally, we've got a third asset that depends on both of the assets that we just discussed.
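A sketch of those two downstream assets, with made-up column names and placeholder logic, showing how the function arguments encode the dependencies that Dagster draws in its UI:

```python
from dagster import asset
import pandas as pd


@asset
def logins(events: pd.DataFrame) -> pd.DataFrame:
    # The "events" argument tells Dagster this asset depends on the events asset.
    return events[events["event_type"] == "login"]


@asset
def fraudulent_logins(events: pd.DataFrame, logins: pd.DataFrame) -> pd.DataFrame:
    # Depends on both upstream assets; the fraud heuristic here is a placeholder.
    busy_users = events.groupby("user_id").size()
    suspicious = busy_users[busy_users > 1000].index
    return logins[logins["user_id"].isin(suspicious)]
```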
The Dagster UI lets you manually materialize assets by clicking a button. Again, materializing just means running some computation to update the data asset. And that's cool, but the whole point of this talk is how you avoid needing to sit at your computer every hour and click that button. How do we materialize this asset automatically at the right time?
Dagster lets you specify this by adding what it calls an auto-materialize policy to the asset definition itself. An auto-materialize policy essentially describes when we want to update a particular asset. The most common kind of auto-materialize policy is what we call an eager auto-materialize policy, and that basically just means: update this asset whenever the upstream assets that it depends on get updated.
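In code, the policy is passed to the asset decorator. A minimal sketch, with the asset body elided:

```python
from dagster import AutoMaterializePolicy, asset


@asset(auto_materialize_policy=AutoMaterializePolicy.eager())
def logins(events):
    # With an eager policy, Dagster re-materializes this asset whenever
    # its upstream events asset gets a newer materialization.
    ...
```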
A source asset is exactly what we just described: an asset that Dagster knows about, but that Dagster doesn't materialize. In this case, it represents the storage bucket that the raw events data gets dumped to. Dagster then lets us write arbitrary code that checks this file and sees if it's changed. Here's a code definition for the source asset that we were just talking about. Just like our other assets, it's a decorated function, but the code in this function isn't generating the asset.
It's checking to see whether the asset has changed. Every time the code runs, it returns a version string, and if the version string is different than last time, that means the file has changed. We can set up this code to run at some interval, like every minute, and when it indicates that the asset has changed, Dagster will then auto-materialize any downstream assets that have eager policies.
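A sketch of what that version-returning source asset could look like, assuming the raw events land in an object-store bucket; the bucket, key, and S3 client usage are illustrative:

```python
import boto3

from dagster import DataVersion, observable_source_asset


@observable_source_asset
def raw_events() -> DataVersion:
    # Instead of producing the asset, this function observes it: it returns
    # a version string that changes whenever the underlying file changes.
    obj = boto3.client("s3").head_object(
        Bucket="raw-events-bucket", Key="events.json"
    )
    return DataVersion(obj["ETag"])
```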
So we looked at policies that materialize downstream data as soon as upstream data changes. That's useful in many situations, but in many others it's too often. For example, you might have a data source that's changing every hour, or even every second, but the downstream data asset doesn't need to be that fresh, so it would be wasteful to constantly recompute it. Or you might have assets whose sole purpose is to power other data assets. Those intermediate assets only need to be materialized if the downstream data assets need them to be up to date.
A
Otherwise,
there's
no
point
in
materializing
them,
so
with
lazy,
Auto,
materialized
policies,
instead
of
eagerly
acting
as
soon
as
Upstream
data
changes,
you
wait
until
data
is
needed.
Downstream
dagster
expresses
this
idea
of
quote-unquote
needed
Downstream
with
a
concept
called
a
freshness
policy.
A
freshness
policy
is
essentially
a
data
SLA.
It
defines
how
fresh
a
data
asset
needs
to
be
here.
We've
added
a
freshness
policy
to
our
fraudulent
logins
model.
It
expresses
that
new
source
data
needs
to
be
incorporated
into
the
model
within
a
day
of
when
it
arrives.
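A sketch of that combination: a lazy auto-materialize policy paired with a one-day freshness policy on the fraudulent logins model, with the body elided:

```python
from dagster import AutoMaterializePolicy, FreshnessPolicy, asset


@asset(
    auto_materialize_policy=AutoMaterializePolicy.lazy(),
    freshness_policy=FreshnessPolicy(maximum_lag_minutes=24 * 60),
)
def fraudulent_logins(logins):
    # Dagster materializes this asset (and whatever it needs upstream) only
    # when doing so is required to keep it within a day of new source data.
    ...
```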
If it isn't, then the model would be considered overdue. To illustrate what this looks like, here is a timeline. In this case, our asset is considered fresh because, after new source data arrived, materializations happened that allowed that source data to flow into our fraudulent logins asset. And here's a case where our asset is considered overdue, because it wasn't materialized in time.
When you give your asset a lazy auto-materialize policy, Dagster will materialize it when doing so will help it, or a downstream asset, meet its freshness policy. So, once per day, Dagster is going to notice that both of these assets need to be materialized in order to meet the freshness policy on the logins model, and then it's going to automatically materialize them. One of the situations where freshness-based scheduling really shines is when the same asset is upstream of assets that have different freshness policies.
So here we have a logins table that's upstream of both the logins dashboard and the fraud model. The dashboard needs to be updated hourly, but the fraud model, which is more expensive to compute, only needs to be updated daily. Trying to schedule this work with workflows gets very awkward quickly. One option would be to use two overlapping workflows, one that runs hourly and one that runs daily, but sometimes those workflows will run at the same time and will redundantly update the upstream table twice when we only need it once.
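A sketch of that fan-out in freshness terms, with the shared logins table upstream of an hourly dashboard and a daily fraud model; the names and bodies are placeholders:

```python
from dagster import AutoMaterializePolicy, FreshnessPolicy, asset

lazy = AutoMaterializePolicy.lazy()


@asset(auto_materialize_policy=lazy)
def logins(events):
    ...  # shared upstream table, materialized only when a downstream asset needs it


@asset(
    auto_materialize_policy=lazy,
    freshness_policy=FreshnessPolicy(maximum_lag_minutes=60),  # fresh within an hour
)
def logins_dashboard(logins):
    ...


@asset(
    auto_materialize_policy=lazy,
    freshness_policy=FreshnessPolicy(maximum_lag_minutes=24 * 60),  # fresh within a day
)
def fraud_model(logins):
    ...
```

Under these policies, the hourly materializations of logins that keep the dashboard fresh can also satisfy the daily fraud model, so the shared table isn't recomputed twice the way two overlapping workflows would recompute it.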
If any of those are true, then we won't materialize. Combined, these two kinds of conditions give a full picture of why an asset is or isn't being automatically materialized, and they let you debug what's happening when you're not getting the behavior that you expect. Let's sum up what we talked about here. First, data pipelines are graphs of data assets, like files, tables, and ML models, connected by computations.