From YouTube: Dagster Data Orchestration 10 min walkthrough - Jan 2023
Description
Dagster is a cloud-native data orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. In this short overview, Sean Lopp—data engineer at Elementl—gives us a tour of Dagster's capabilities and how this modern orchestrator helps data engineering teams break out of a typical vicious cycle.
Hi, my name's Sean and I'm an engineer working on Dagster. I get to talk to a lot of different engineering teams, and unfortunately they all say that they're struggling: they spend too much time babysitting production, and they don't have a chance to build new things and be proactive with stakeholders. So why is this? Well, unfortunately, a lot of those teams are using task-based orchestrators, like Airflow, and that puts them into this vicious cycle.
They can't test code out locally, so they have to push it straight to production. But because it's hard for them to reason ahead of time about what new code will do, pushing straight into production often leads to failures and outages, and that's what ends up paging on-call and interrupting those engineers who are trying to do new work. Because of those interruptions, the team is slow and often criticized for being behind, and that in turn means they're unable to pay down technical debt.
Dagster breaks this cycle. It allows you to think about individual assets and to take a declarative approach. So instead of having to build one monolithic DAG that's tied to your production resources, you can write new code incrementally, and the orchestrator will figure out when those new data assets need to be run. If this approach sounds familiar, it's because many modern web engineers have taken this declarative approach; in fact, the migration from Angular to React was all about adopting these benefits.
Here's the global data asset graph for the Hooli data engineering team. You can see they start by grabbing some data from an API. That data is fed through a series of transformations, and eventually a daily order summary table is created. That table is then used by the data science team to run forecasting routines and create predictions, and it's also used by the marketing team for KPI reporting and executive dashboards.
So what are the benefits of using assets? Well, imagine an executive has a question about the daily order summary: something doesn't look quite right. In a normal orchestrator, you would have to go spelunking through all the different tasks' logs, trying to figure out what task might have impacted that table. With Dagster, you can immediately look at the daily order summary and see metadata about it, see the run logs associated with it, and even information like the SQL that generated the table.
In addition, an asset-first approach allows Dagster to do declarative scheduling. Instead of having to create a single monolithic DAG, or trying to reason through when different cron schedules should be applied to different jobs, you can simply define new assets and encode the SLA that stakeholders have for them. So, for example, this average order asset that the marketing team relies on needs to be updated pretty frequently, because it's in a KPI dashboard, so a policy has been set that the asset should never be more than 90 minutes stale.
In contrast, the daily order summary asset only needs to be updated every day by 9 AM. Dagster figures out when these assets should run, and because it's aware of all the different data assets that your team cares about and how they depend on one another, Dagster is smart enough to avoid redundant work. So here we're seeing that the average order dataset, which has that SLA encoded to be up to date every 90 minutes, needs to have itself and two other stale assets upstream updated, but everything else is already fresh enough.
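The decision Dagster is making here can be sketched as a toy staleness check: walk each SLA'd asset's upstream graph, and only mark for a run the assets whose last materialization misses the deadline (or that feed a stale asset). This is a simplification for illustration; Dagster's real scheduler is considerably more sophisticated:

```python
from datetime import datetime, timedelta

def assets_to_refresh(deps, materialized_at, max_lag, now):
    """Return the set of assets that must run so every SLA is met.

    deps:            asset -> list of direct upstream assets
    materialized_at: asset -> datetime of last materialization
    max_lag:         asset -> timedelta SLA (only for assets that have one)
    """
    stale = set()

    def needs_run(asset, deadline):
        # Check every upstream (no short-circuit, so all stale ancestors
        # get recorded), then check this asset's own materialization time.
        upstream = [needs_run(up, deadline) for up in deps.get(asset, [])]
        if any(upstream) or materialized_at[asset] < deadline:
            stale.add(asset)
            return True
        return False

    for asset, lag in max_lag.items():
        needs_run(asset, now - lag)
    return stale
```

With a fresh `orders` extract but a three-hour-old summary, only the summary and the SLA'd average-order asset are marked for a run; everything else is left alone.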
Let's take a look at a Dagster project. Dagster projects are formatted as Python packages, and within a project we can create an asset by simply writing a new function. Assets in Dagster can be pandas DataFrames, they can be Jupyter notebooks, they can be Spark DataFrames, or really any arbitrary code.
Once we have our asset created in Dagster, we can run everything locally. We'll fire up a local copy of the Dagster user interface, and here I can test that the code I just wrote runs. When I run things locally, I don't have to use production resources: when I run all of my code, I'm going to be using just the local file system to store intermediate results, and the SQL that I'm writing will execute against a local DuckDB warehouse.
Normally, when data teams open pull requests, you can review the code, but you have to guess what that code will actually do once it's in production. With Dagster, we create what's called a branch deployment, which is essentially an isolated copy of our entire data platform, just for this pull request. That allows my team to actually run the code and see what it's going to look like. In this case, we're running against resources that are very similar to production: we're using Snowflake to clone a copy of our production database that this pull request can run against.
Once you're ready to put code into production, Dagster was built with all the modern bells and whistles. So, for example, multiple teams can collaborate together in different virtual environments and different projects, so you don't have to try to get everyone on the same version of pandas, while still having a global asset view where those teams can depend on one another's work.
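Per-team projects are stitched into one deployment through Dagster's workspace file, where each entry becomes its own code location running in a separate process, so teams can pin their own dependencies. The package names here are hypothetical:

```yaml
# workspace.yaml - each entry is a separate code location, loaded in its
# own process, so the two teams can pin different versions of pandas.
load_from:
  - python_package: hooli_data_eng
  - python_package: hooli_forecasting
```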
Dagster has a variety of settings to help ensure that the orchestrator is robust, including things like automatic op retries and run queues with different priorities. And finally, Dagster supports a variety of different alerting policies. Like many orchestrators, you can alert on failure, but Dagster actually helps teams avoid alert fatigue by also allowing you to alert on SLA violations. That means you're only going to get notified when datasets are outside of the SLAs that actually matter to stakeholders, and not get notified based on spurious failures that are automatically recoverable.
So we hope you're excited about Dagster and ready to give it a shot. If that's the case, we've made it really easy to get started with Dagster Cloud: you can clone an example project and get running in no time, or you can start out by developing locally. Once you're ready to run things in production, you can either host Dagster open source yourself, or Dagster Cloud comes with a fully serverless option, with a hybrid computation model available as well. So be sure to check us out, find us on GitHub, and give us a star.