From YouTube: Dagster Demo - Dec 2022 - Ben Pankow
Description
Every other week, Dagster hosts a live demo with Q&A. You can join us at a future session by signing up here: https://dagster.io/dagster-demo-signup - alternatively, you can watch a recording of this demo from Dec 2022 featuring Elementl developer Ben Pankow as he builds a data pipeline using Software-defined Assets, dbt models, scheduling, and more.
You know, my name's Ben, I'm an engineer here at Dagster, mostly working on the cloud end. Today I'm going to be walking through a high-level demo of Dagster: starting with a very brief overview of where Dagster fits in the data platform and how we think about orchestration, then moving into building a basic data pipeline using Dagster, and then seeing how that would expand out to a more mature data platform.
So I'll start off here with just a single slide to give some background on Dagster and the way that we think about things.
In general, we believe that the goal of a data platform is to generate data assets. This is a pretty generic term, and it can mean any sort of persistent data: a table, maybe in Snowflake; a file in S3; a notebook, maybe for BI purposes; an ML model. Any persistent artifact that's used for some purpose is a data asset in this context, and data assets have dependencies.
An ML model might depend on the training data that you used to build it, and maybe that training data is actually created by transforming some data that we're ingesting from another source. So there's a clear set of dependencies here between our data assets, and the role of an orchestrator in this ecosystem is to create these assets in sequence. There are a couple of different ways that you might do this.
If you're approaching this with no orchestrator and a fairly simple set of assets and asset dependencies, one tool you might look to is cron. If you have an asset B that depends on an asset A, you could use a cron schedule to run a task that materializes or creates asset B an hour after you created asset A, and this works in a very basic case.
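That "run B an hour after A" setup might look roughly like this as a crontab fragment (the paths and script names here are hypothetical, just to make the idea concrete):

```shell
# Materialize asset A at 2:00, then hope it finished before B runs at 3:00
0 2 * * * /opt/pipelines/materialize_asset_a.sh
0 3 * * * /opt/pipelines/materialize_asset_b.sh
```

The fragility is visible right in the fragment: nothing ties the second line to the first actually succeeding.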
It's particularly tricky to scale this as the number of dependencies grows and your set of assets gets more complicated; you're going to have to manually schedule things in a very complicated manner. This is where tools like Airflow come in, and why folks might turn to a tool like Airflow.
You can tell Airflow to run the task to generate B after the task for A, and this is a lot less fragile. If the task for A fails or takes longer, that's totally fine: Airflow will handle a failure properly and not run the downstream task, or it'll wait for your task to complete before running the one further down in sequence. But Airflow is an older tool, and developers tend to find frictions with it over time. In particular, it's designed to run production code.
So it's tricky to run locally, it's hard to have different environments like a staging environment and a production environment, and it's hard to unit test: the sort of development flow that you'd expect in a software engineering world. It also doesn't really have a first-class concept of assets, which is kind of the entire purpose of orchestration. All of the core abstractions in Airflow are built around tasks rather than the assets that they produce.
So we view Dagster as the next step in this evolution. It's focused primarily on data assets, and also on fixing some of these problems with the development life cycle, which hopefully you'll see in today's demo.
To illustrate this, let's walk through building a very basic pipeline from scratch using assets. In particular, we'll be using Dagster's Software-defined Assets, where we define our assets in Python code. Here we have an empty Python file in our editor, and we'll start by importing pandas and the asset decorator from Dagster, which will let us define software-defined assets. We'll create our first asset, which just looks like the asset decorator applied to a Python function.
So here we have a very basic signature for a software-defined asset. It's just a Python function. The name of the asset is the name of the function, in this case country_population. The type of the asset is a pandas DataFrame, and the body of the function, which we haven't defined yet, is what's actually going to define how our asset is created.
Here I have a Wikipedia page that lists some population information for various countries, so let's just pull the data down from this page. We can get a DataFrame by telling pandas to read from this Wikipedia page and grab the table. And because the column names that are automatically generated are going to be a bit messy, let's go ahead and relabel those columns, then return the DataFrame that we just generated, and, for good measure,
let's have a little comment here. So this is all we need to do to generate a basic software-defined asset, and we can actually go ahead and materialize this asset by running dagit, which is Dagster's UI, and pointing it at our Python file here.
This is going to spin up a local web server, and in our browser we can open it up. We'll see here our asset graph, which shows each of our software-defined assets and the dependencies between them. Right now we just have the country_population asset which we generated. We can go ahead and select it and materialize it. Under the hood, Dagster is going to figure out what computation is needed to regenerate these assets and then queue that computation.
The logical next step here is maybe to build an asset that depends on our country_population asset. So let's go ahead and define another asset. Let's call this continent_population, which will just aggregate the country population into stats for each continent.
So let's add a comment, and let's just group by the region column in our country data and sum up the population.
If we go back to our Dagster UI here and return to the asset graph, we can reload, and we'll see we now have our new continent_population asset. We can see the country_population has been materialized, since we ran it just a couple of minutes ago, so we can select just the continent_population if we wanted and re-materialize it. Since we've already materialized the upstream asset, it'll just use that prior cached value. And great, the run has already succeeded, since it was a pretty simple computation.
Of course, we could easily materialize all of them and run both steps in sequence if we wanted to.
So this is a pretty bare-bones example, but we're already seeing what the differences are between Dagster and a more task-based orchestrator. In a task-based orchestrator world, we would have to create a task to produce the country population data, fetch the data and explicitly store it somewhere, define another task to read the data from wherever we stored it, modify it, store it again somewhere else, and then explicitly define a pipeline with our tasks in order. A lot of that is being done for us by Dagster.
Here, the IO between each of our assets is happening automatically, and the ordering is also happening just based on the data dependencies that we've encoded between our steps.
So let's take a step back. This file that I've been writing is actually in the context of a larger project, so we can see what it looks like to import our assets into maybe a larger data platform.
Here we have our population assets, and we actually have this repository file here, which defines an empty Dagster repository. You can think of the repository as an entry point for a bigger Dagster project. If you have assets coming from a bunch of different places, this is where they'll all come together.
The first thing we'll do is just get our existing assets loaded as part of the repository here. What we'll want to do is create some population assets, and we'll use this load_assets_from_package_module utility function to import our assets from the population Python module. We'll assign them a group name, just for organization purposes, and we'll add this key prefix, which I'll talk about a bit later; it just prefixes the name of the assets.
I'll explain why that's there in just a few minutes. Then all we need to do is return our population assets here as part of our repository, and we can open dagit again, where we should see the same set of assets.
So let's just make sure that we have this looking right. Great. One thing I sort of glossed over earlier is where our assets are actually being saved to and loaded from. I kind of insinuated that the IO is happening automatically, and this is being done using Dagster's IO managers. IO managers are a built-in abstraction that lets you decide where and how the inputs and outputs for your assets are stored. Running locally, by default that's just on the file system, but it's really easy to swap in something else.
So
now
that
we
have
our,
I
o
manager
defined
here,
we'll
want
to
bind
it
to
our
assets
to
let
the
extra
know
that
these
assets
should
use
this
particular
I
o
manager
you're
able
to
have
a
bunch
of
different.
I
o
managers,
if
you
wanted
to
sort
of,
save
and
load
different
assets
from
different
places,
you
know
some
might
want
to
live
in
S3
or
snowflake
or
maybe
locally
is
fine.
Here we'll use our Snowflake IO manager, and we'll actually provide some configuration to point it towards our Snowflake instance. So here I'm just pulling all the credentials from the environment.
If we go back to Dagster here and reload, we're not going to see any immediate change, because we're loading the same set of assets. But if we go ahead and re-materialize our assets, it will actually output them into Snowflake, and we should be able to see that in the logs here. Great, so we're yielding our DataFrame outputs, and this is when they're actually going to be written to Snowflake.
And so here, if we go back to the asset graph and take a look at our country_population asset, we can see we actually get some additional metadata that the Snowflake IO manager is attaching to our asset: the columns that are being output into our Snowflake table and the data types of those columns.
We get the row count, and we actually get a query here that we can run against our Snowflake instance to see the asset that we just materialized. So here we're getting all of that country data that was just written into Snowflake by the IO manager. Now let's see what this looks like when we add in some additional assets; Dagster also has the ability to integrate with other tools in your data stack.
We actually have a dbt project already set up here, defining a couple of transformations from our country and continent data. If you've used dbt before, you know they have their own graph that represents the dependencies between their SQL transformations. So here we have the country population and the continent population being transformed into some ranking information and some cleaned information, and then a summary and some roll-up tables.
We can actually import this entire dbt graph and run it from within Dagster. So let's see what that looks like.
Let's first just specify where our dbt project is located relative to our repository file here, and then we can actually load our dbt assets, all of these transformation assets, using this load_assets_from_dbt_project utility. We'll specify our dbt project directory.
Our profiles directory is going to be the same as the project directory, and then we'll also add the key prefix here. For context, the key prefix here determines which Snowflake schema we're writing to when we're using the Snowflake IO manager. I'm just writing to our sandbox database, and BEN is just the schema that I have write access to, so that's why we're using it as the key prefix.
This is all we need to do to load our dbt assets into Dagster. Dagster will parse the project files and automatically build software-defined assets associated with each of the tables that dbt is going to produce.
So let's go ahead and add our transformation assets here. We'll also need to bind a dbt resource, so we'll set up a dbt CLI resource, which tells Dagster that we want to execute dbt locally using the CLI rather than using dbt Cloud, and we'll just need to point it at the same project and profile directories here.
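Putting those dbt pieces together might look roughly like this; the directory paths are placeholders, and this sketch assumes a dbt project exists on disk at those paths:

```python
from dagster import with_resources
from dagster_dbt import dbt_cli_resource, load_assets_from_dbt_project

DBT_PROJECT_DIR = "./dbt_project"   # placeholder: path relative to the repository file
DBT_PROFILES_DIR = DBT_PROJECT_DIR  # profiles live alongside the project here

# Parse the dbt project and build one software-defined asset per model
dbt_assets = with_resources(
    load_assets_from_dbt_project(
        project_dir=DBT_PROJECT_DIR,
        profiles_dir=DBT_PROFILES_DIR,
        key_prefix=["ben"],  # maps to the target Snowflake schema, as above
    ),
    {
        # Execute dbt locally via the CLI rather than dbt Cloud
        "dbt": dbt_cli_resource.configured(
            {"project_dir": DBT_PROJECT_DIR, "profiles_dir": DBT_PROFILES_DIR}
        )
    },
)
```

This is configuration-level glue: Dagster does the parsing and asset construction from the project files.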
If we go back to Dagster, to the asset graph, let's go ahead and reload our definitions again, and provided I did everything right, we should see our dbt assets appear. Great. So here we have our dbt assets downstream of the Python assets that we wrote earlier, and you can even see we have some metadata attached, for example showing what that dbt transformation looks like for each of our individual assets.
And if we wanted to, we could re-materialize everything, which is going to first run our Python assets, as we'll see, our country population and then our continent population, and then it's going to invoke dbt to generate all of those downstream assets.
Next, let's add some forecasting assets. These are just going to be pulled from another Python module, and they're going to depend on those population assets that we defined earlier, so we'll grab those and add them to our repository definition here. And while we wait for this run to complete, let's go back to the asset graph, reload, and just see what those forecasting assets look like.
Great, so here we can see we have our forecasting assets: they set up some features based on the country and continent population data, create an ML model, and then create a forecast. A lot of this is mocked out, but for the sake of the demo, let's say that we wanted to regenerate this population forecast. How would we go about doing that?
We could choose to re-materialize all of our assets, but this is kind of an expensive operation. One thing we can do is go to this forecasted population asset and view it in the asset catalog. This is built into Dagster and shows us every time this asset has been materialized, along with some of the metadata associated with it. We haven't actually materialized this asset.
So we don't have any data there yet, but we could go to the lineage tab and view just its upstream dependencies, and from here we could just re-materialize that forecast and everything that it depends on, rather than our entire asset graph.
If we wanted to avoid going to this lineage page every time, we could actually define a Dagster job that just re-materializes our forecast and everything that it depends on. So let's see what that would look like.
We can specify the asset keys that we'd like to re-materialize; here we're just specifying that forecasted population asset, and then we can actually tell Dagster to grab everything upstream of that. So let's define this job and reload our set of definitions again.
We should see in the sidebar that we now have our job, which will just re-materialize our forecast-related assets. You can see the other assets that this job doesn't depend on are just linked out, so we can view them externally, but they're not going to be recreated by this job.
Now, this is all well and good: we can, on a whim, update our forecast. But if we wanted to do this on a regular cadence, we might want to attach a schedule to this job to have it update automatically. That's pretty easy as well. We can wrap our job here with a ScheduleDefinition pointed at this job and then give it a cron schedule.
We'll have it run, let's say, once every hour. If we go ahead and reload our project again, we'll see that this job now also has a schedule attached to it. This way it'll run every hour automatically, re-materializing our forecast and all of the upstream assets that it depends on. So here we have our hourly schedule that we can toggle on, and now we're off to the races: our forecast will regenerate automatically every hour.
So hopefully this gives you a brief overview, a brief idea of Dagster's programming model and what using Dagster, both from the Python end and from the UI end, looks like.