From YouTube: Orchestrating dbt with Dagster
Description
dbt defined an entire new subspecialty of software engineering: Analytics Engineering. But it is one discipline among many: analytics engineers must collaborate with data scientists, data engineers, and data platform engineers to deliver a cohesive data platform. In this video, Nick Schrock of Elementl talks about how orchestrating dbt with Dagster allows you to place dbt in context, de-silo your operational systems, improve monitoring, and enable self-service operations.
So here's our agenda. Well, there's this intro, and our opportunity to introduce our emojis: the speaker from Drizly yesterday really led by example, so we decided to copy her presentation, where she used our emojis. We're going to talk about this intro, and then we're going to talk about what is Dagster, the, I guess, elusive Dagster, and how it relates to dbt, because we often get the question of "should we use Dagster or dbt?", and it's understandable, because they both have dependency graphs, for example.
But we want to make it clear that we view them as complementary tools with very aligned values. Then, as mentioned, there's going to be a live demo, also some sample code, and then we'll have a discussion about the evolution of roles within the modern data platform, and then a brief conclusion.
So why would you use this thing? You use Dagster when you need to operate and deploy dbt alongside other tools within the context of a project or a larger data platform. So who would be the type of personas who would do this? Well, you're at a dbt conference, you're at dbt Coalesce, so very likely you're an analytics engineer. But often you might be left without support, and you also have to self-serve your own infrastructure, because you have to orchestrate your computations, or help out a machine learning person or a data scientist.
This morning Drew, one of the founders of Fishtown, talked about a postmodern data platform. I'm looking forward to seeing what that is; I was unable to catch the presentation. So when you spin up a new analytics team, a lot of times it looks like this: you have Fivetran or Stitch or a similar tool doing ingest, which is effectively replication into a cloud data warehouse like Snowflake, Redshift, or BigQuery.
You have analytics engineers who are writing templated SQL within the context of dbt to produce the consumable assets, and that is then consumed downstream by tools like Mode or other BI tools, which interface directly with the cloud data warehouse using SQL.
Now an interesting thing comes up here. We actually view this, and deem it internally, the modern analytics stack, and it begs a question, which we ask: is orchestration even necessary? One of the appealing things about this stack as presented is that it's very lightweight operationally, because you can delegate a lot of the ops to an out-of-the-box tool like Fivetran, or to the cloud data warehouse.
But that's not the world we see. What we see in reality is a data platform at every company customized for its own domain; every company and organization we see has customized needs that go beyond just a cloud data warehouse. So let's take an example here. Let's imagine you had a legacy data lake, where you had data engineers writing Spark, producing data assets on top of S3, and then you want to introduce this modern analytics stack.
Now you set up your dbt installation, you start using dbt within the data warehouse, and you start using Mode or a similar tool to consume it for downstream analytics. But there's a whole set of other use cases here, like data scientists who want to use Python to build machine learning pipelines on top of the data in the data warehouse. And this architecture that I'm laying out, this instance of the architecture, is a simple one. We see architectures like this with engineering teams as small as one, two,
or three people, maybe serving a couple of analysts. Then, going to a larger organization, you just have an explosion of complexity. So we think that orchestration is the beating heart of a data platform like this. If you visualize it, at almost every edge you have the orchestrator instigating and managing the computations. That puts the orchestrator in a unique place, where it is interacting with every computational runtime and tool.
Every single persona, either directly or indirectly, has to deal with it and interface with it, and then, by extension, it is interacting with or instigating computations which store data in every single data store in the system. It is the common denominator throughout the data platform, and so naturally we think that Dagster is a good fit for this. Okay.
So what is Dagster, the elusive Dagster? Our shorthand for it is "the data orchestrator," but a slightly longer description that we have for it is: Dagster is an orchestration platform for producing trusted data assets. I think the key word here is "trusted." Doing that is a really challenging problem, and in order to do it, we really think about, model, and manage the full application lifecycle.
So what are we talking about? What outcome are we trying to achieve? During the development and test phase, we want to be able to efficiently build well-structured, testable computations, and I really want to emphasize the word "testable," because making these systems testable is extremely challenging. You need to design for it from first principles, and it's always been a goal of the project since day one. Next, you want to reliably execute, debug, and operate those computations.
And lastly, the end goal of these systems is to produce data assets that are consumed by your downstream stakeholders. That is why these systems exist, so we think it's natural for a data orchestrator to have out-of-the-box data observability. And I want to note that all of these outcomes mutually reinforce each other: by building well-structured, testable computations, you make it more likely that you're going to reliably execute, debug, and operate those computations, by having a programming model and an API.
A couple of things, again, reinforce each other. Because we designed for testability, our actual infrastructure is very pluggable, meaning you can execute it in a wide variety of contexts, whether that's CI/CD systems or, you know, some people want out-of-the-box Kubernetes support, which we provide. But you don't need to use Kubernetes in order to use this system. It's designed to be cloud native, but it doesn't prescribe any specific vertically integrated infrastructure.
So again, let's put up this lifecycle, and let's start with "efficiently build well-structured, testable computations." Well, if I'm an analytics engineer, I'm thinking: I already have a tool that can do that. I love dbt. I can execute locally, I can build these well-structured compute DAGs of Jinja-templated SQL, I can inject data quality tests. It's great. Why would I need Dagster to do that? And we agree.
If you're an analytics engineer and you're embedded within a Dagster-enabled data platform, you are using the tools you know and love to develop dbt-driven computations within the data warehouse. The difference between these things is that dbt is for SQL-only transformations within the data warehouse, whereas Dagster is a generalized compute framework. The only time you would be interacting with the Dagster local development experience
is when you're dealing with heterogeneous data tools. You might be dependent on some upstream tool, or someone might be complaining about your data assets being out of date, and really a focus of Dagster is enabling folks like analytics engineers to self-serve ops in these cases where something has gone wrong, because you often need to interact with your production systems in order to unblock people and get your job done. All right, we're going to go on to the demo now.
My colleague Max is in the Coalesce Dagster channel, and I believe he is about to post a link to a Google Sheet. So what we're going to do here is, let's see. Max, have you posted this? Anyway, I will continue on; I do not see the Google Sheet yet. Okay, great. I want people to have the opportunity to fill this out
as I talk through the structure of this pipeline. You're going to fill out some information in a Google Form, and it's going to be ingested into Google Sheets. We're going to consume that with pandas, and we do that because we want to use some timestamp manipulation stuff in pandas that's a pain in the butt to do with SQL. We're going to load that into Snowflake and then do some aggregations using dbt.
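The exact transform in the demo isn't shown here, but as a minimal sketch of the kind of timestamp manipulation that's easy in pandas and clunky in SQL (the column names and data below are made up for illustration):

```python
import pandas as pd

# Hypothetical form responses as they might arrive from Google Sheets:
# timestamps come through as plain strings.
raw = pd.DataFrame({
    "country": ["US", "US", "AO"],
    "submitted_at": ["2020-12-08 06:30:00",
                     "2020-12-08 06:45:00",
                     "2020-12-08 14:10:00"],
})

# Parse the strings into real timestamps, then count responses per hour
# of day -- a one-liner in pandas.
raw["submitted_at"] = pd.to_datetime(raw["submitted_at"])
per_hour = raw.groupby(raw["submitted_at"].dt.hour).size()
```

A solid wrapping logic like this would then hand the frame downstream to the warehouse-loading step.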
There we go. So here you can see we're doing this Google Sheets ingest process, we're spitting out a data frame, and we're doing some post-processing. And look at all this rich metadata: there are user-defined descriptions, there are types. You'll notice that every node in this graph (we call each node a "solid") has inputs and outputs. This helps accessibility.
The DAG is typed, which both makes it self-descriptive and makes it more reliable. We ingest that pandas data frame into the warehouse. This is the node that actually runs the dbt model, and here's the node that runs the Jupyter notebook, which we can actually render inline, which is cool; here's a preview of what you're going to see. And then we're actually going to post this to Slack. So why don't we just go for it? All right, one thing you'll notice:
here we have these presets. I just ran this in test mode, which pushed it to a secret channel we have. I'm going to switch this, and you'll notice that these pipelines are configurable: you can parameterize them and run them in different contexts. And this config system (I'm not doing a deep demo of it) is actually fully typed and self-describing, which is just super, super powerful.
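To illustrate the idea of typed, self-describing config (this is a toy sketch, not Dagster's actual config API; the schema and field names are invented):

```python
from dataclasses import dataclass, fields

# Toy sketch of a typed config schema: each field declares a type, so
# supplied config can be checked before a run launches, and the schema
# itself documents what the pipeline expects.
@dataclass
class SlackConfig:
    channel: str   # where to post the rendered notebook
    dry_run: bool  # push to a test channel instead of the real one

def validate(schema, values):
    """Return a list of type errors for user-supplied config values."""
    errors = []
    for f in fields(schema):
        if f.name not in values:
            errors.append(f"missing field: {f.name}")
        elif not isinstance(values[f.name], f.type):
            errors.append(f"{f.name}: expected {f.type.__name__}")
    return errors

ok = validate(SlackConfig, {"channel": "#demo", "dry_run": True})
bad = validate(SlackConfig, {"channel": "#demo", "dry_run": "yes"})
```

The payoff of a schema like this is exactly what's demoed: bad config is rejected with a precise message before anything executes, and tooling can render the schema as documentation.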
When we launch, here we go, we're using this multiprocess executor. This is the Gantt viewer. This is the DAG we're going to execute, and you'll see this is a live-updating, beautiful UI that gives you a live view of what's going on. This works both locally and in prod.
Now, each one of these blue bits, you see this thing, it's "preparing": this is because we're using the multiprocess executor, which spins up a unique process for every single step in the system. That gives it process isolation, etc., and it incurs a little spin-up time. So here, actually, I'll just show you. Watch this; this is nice.
I'm going to show this structured event log. This is much more than just unstructured logging, which I call "developers thinking aloud." These are actually structured event logs, and we can do stuff like this: we can view a preview of the data that was flowing through the system. Note this timestamp over here, and then we can say:
oh, the output. Oh great, it's telling me how to spell things. And then you can see the munged data over here, so you can get a sense and observe whether your stuff is actually working, and often the sample data is sufficient for that. Okay, the dbt run is logged. I went a little slow, so it didn't show the live-updating log, but it actually worked. And now you'll notice that our integration emits different structured events. So what this does is:
it emits an event for every single model in your dbt graph, with all this interesting structured information about it. And now we can go over here, and we have all this interesting information. You can see this is the run that last touched it; it's still running; you can see what's going on. As you go down here, you can see familiar concepts, like it's materializing this as a table; here's the database, here's the schema, etc.
Now you can go down here, and there's all this information. For example, we can kind of give it away here: you can see that around 6:30 this morning I woke up and ran this a couple of times; then I had a meeting with a colleague and ran it a few times; etc. And look here, it just shows you all these nice mouseovers. Anyway, this is a really nice observability tool, and you can do things like this:
let's say we wanted to look this up. You can just hop directly to it, and you can see information about your assets. All right, let's go back to this. We've executed now, and now I'm going to go to another step where we're posting to Slack. I'm actually going to filter this down to materializations. We've actually executed the notebook, and now we've pushed this to our Slack channel.
Open this up. No, I need to open up the link, and here we go: here's the information about where folks are coming from. It looks very US-centric, but I think there's a little action here. I believe that's, is that Angola? I'm going to betray myself there. But anyway, we have a live demo, and let's see if it posts to Slack. There we go. Okay.
Well, thank heavens, that worked; that's always terrifying. So let me just quickly go through what the code that enables that looks like. Let's look at this DAG, and I'm just going to quickly show some Python code examples of what this looks like, just so you get a sense. This is by no means a tutorial. So this is our DAG structure. In order to define this DAG, we call a pipeline.
All of the solids take inputs and outputs, so we actually just use Python syntax to flow the data through this pipeline. It's very straightforward and intuitive, and this is actually constructing the DAG. Next we have these solids. A solid is a node in our graph, kind of the leaf node, which performs the actual computation. You'll see here it's a function that takes a data frame, produces a data frame, and then it's just performing some compute. You can see this infer-datetime-format is a capability in pandas that we wanted to use.
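As a simplified stand-in for the pattern being described (this toy is not the real Dagster API, which at the time used @solid and @pipeline decorators with typed inputs and outputs): each node is a plain function, and ordinary Python call syntax wires the DAG together.

```python
def solid(fn):
    # In real Dagster the decorator attaches metadata, input/output
    # definitions, and types; here it just tags the function.
    fn.is_solid = True
    return fn

@solid
def ingest_sheet():
    # stand-in for pulling form responses out of Google Sheets
    return [{"country": "US"}, {"country": "AO"}]

@solid
def postprocess(rows):
    # stand-in for the pandas munging step
    return [dict(r, processed=True) for r in rows]

@solid
def load_warehouse(rows):
    # stand-in for loading the frame into Snowflake; returns row count
    return len(rows)

def pipeline():
    # Plain function-call syntax expresses the dependencies:
    # ingest -> postprocess -> load
    return load_warehouse(postprocess(ingest_sheet()))
```

The design choice worth noting is that data dependencies are expressed as function composition, so the framework can recover the DAG from how outputs flow into inputs.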
And then, to integrate with dbt, we have a library that was community-driven and then donated to our monorepo. Thank you very much, David Wallace. Here you can just take the dbt CLI run solid, point it to your project and profiles, and you are off to the races. So I just wanted to quickly go through this: I made the claim earlier that dbt and Dagster have highly aligned values, and I want to run through that quickly.
So again, the difference between them is that dbt's domain is exclusively SQL (Jinja-templated SQL), and Dagster's is generalized compute in Python. So you have these analogous concepts. For functional compute dependencies, in dbt it's the models, which are defined by SELECT statements (you can just think of them as functions), and they're dependent on other models via the ref syntax, and that forms your dependencies. Dagster is very similar.
You have solids, which are meant to be effectively pure business logic, and they have logical inputs and outputs, which would be similar to models. Next, there's a real focus on fast dev workflow: in dbt you can view docs locally, you can use the dbt Cloud IDE, and then... I think I'm running out of time, so I'm going to skip this, but effectively there are very analogous concepts.
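Both systems resolve those declared dependencies into a DAG and run nodes in dependency order, which is just a topological sort. A small sketch of that shared idea, with made-up model names and dbt-style ref() edges:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical dbt-style project: each model lists the models it ref()s,
# the dependencies that {{ ref('...') }} calls would imply.
refs = {
    "stg_responses": [],
    "responses_cleaned": ["stg_responses"],
    "responses_by_hour": ["responses_cleaned"],
    "responses_by_country": ["responses_cleaned"],
}

# TopologicalSorter maps {node: predecessors} to a valid execution
# order, which is what both dbt and Dagster compute before running
# anything.
order = list(TopologicalSorter(refs).static_order())
```

The same sort applies whether the nodes are SQL models in a warehouse or Python solids in a generalized pipeline; only the runtime differs.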
So I want to talk about one more subject quickly while I have some time, and that is how the general ecosystem is developing. There's this line from Marshall McLuhan: "we shape our tools, and thereafter our tools shape us." What that means is that as we build or shape tools, it affects the way people work, it affects their career paths, and therefore it affects organizational structure.
Before dbt there were data engineers and analysts, and in order to deliver an end-to-end capability, the analyst actually had to talk to a data engineer in order to get a production asset produced. This formed what some people have deemed "data breadlines." So here's a breadline: it's a big line of people waiting for bread, and, as before, the data engineers are somewhere off-screen to the right, and the analysts and business users are somewhere in this perimeter. I don't know the guy.
Data engineers are responsible for the core assets and infrastructure, and analytics engineers are responsible for all the consumable assets in the data warehouse. But this is not a complete picture of what's going on. What we're seeing is that, just as most data platforms consist of heterogeneous tools, there are also many, many different roles or jobs happening within a data platform. By "job" I mean it's your job to produce an asset, it's your job to do an ML model, it's your job to produce the platform.
So if you think about this, you have data engineers responsible for maybe some core assets; you have analytics engineers using dbt, data scientists, and other subject matter experts, who are responsible for delivering assets; and then you have this emerging category of people who work exclusively on the platform. They are setting up everyone else to be successful within the context of their tool. But this current state of things is usually aspirational.
There are a lot of problems here. Let's go back to our data breadline. The original breadline was about the data assets themselves: tables, columns, etc. But there's this whole other domain. Right now, often there's an ops breadline: the moment that something goes wrong, analysts and business users kind of fall off a cliff, and they need to go to the data engineers in order to deal with their ops. We view Dagster as an empowerment tool on the ops dimension.
I was kind of inspired by the Apple M1 launch. I'm very excited to have a laptop that I don't have to put in my refrigerator; that's literally true, I've had to do that in order to get it to run. But I thought I had a great analogy: there's this overall machine that orchestrates computations, and there are lots of specialized co-processors, and I think that's what's happening in the data landscape today. So if the data platform is the machine, dbt is like the analytics GPU.
It performs a really important, huge subset of the computations that are critical to the functioning of this thing, but it's domain-specific and very powerful, and this all has to be governed by an orchestrator. dbt and Snowflake show up in the GPU, but then there's this whole other universe of co-processors: all these other tools that exist either for fundamental reasons or for legacy reasons. This is kind of the analogy that feels like it's making sense to me.
So to summarize: dbt and Dagster are complementary systems, and I kind of view dbt as the analytics GPU in the machine. One of the core values this provides to analysts and analytics engineers is that it allows them to self-serve ops. So you might be asking: listen, I don't think I'm ever going to write any Python, so why is this thing good, why is this thing important? And the answer is that we think it will enable you to self-serve your ops.
It's designed to be testable, and it's really fun stuff to build. So thank you so much for having me. We're an open source project, so we have a GitHub and a growing Slack community; feel free to join us. And without further ado, I will pass it back and take questions in the Slack. Thank you.