Description
Nick will cover the principles and origin of Dagster. Dagster is a new type of workflow engine: a data orchestrator. Moving beyond just managing the ordering and physical execution of data computations, Dagster considers the entire data application lifecycle. Practitioners in Dagster build data-aware dependency graphs designed for local development and testing; deploy those graphs to a multi-tenant, cloud-native orchestration engine; and then monitor and observe the data assets produced by those computations.
This talk will also cover our new major release, which makes significant changes to our core API that dramatically improve usability and ergonomics.
A
We have our sponsor, Aya Kemp, providing the Zoom for us, so we have a great talk today. Before I introduce the speaker, I'll just give you a ten-second introduction to our upcoming events. Besides today's talk, we also have another talk in October by Ryan Blue, who's going to talk about Apache Iceberg, and after that we have a machine-learning-related talk on semantic search and information retrieval, which is in November.
A
So for today, as I said, we have a special talk from... I'm just going to click on this.
A
Well, the Meetup page is a little slow. So, we have a special talk by Nick Schrock, who's the founder and CEO of Elementl. This is the company behind Dagster. He previously worked at Facebook, where he famously co-created GraphQL.
A
As we know, when working on a data platform and orchestrating different pipelines, we need some kind of workflow engine. Dagster is one of the alternatives: people probably know about Airflow, but Dagster is one of the other engines that's currently pretty popular. So with that, I'll turn it over to Nick and he'll tell you the whole story. I'm going to stop sharing here. Nick, it's all yours.
B
Great, thanks for that intro. Let me share my screen here. Thanks for having me, I'm really excited to speak here today. I'm going to trust that everyone can hear me and see my slides, so I'm just going to go for it.
B
Let me know in the chat if there are any problems. So yeah, as mentioned, my name is Nick Schrock. I'm the founder of Elementl, the company behind Dagster. As mentioned, the bulk of my engineering career was spent at Facebook, where I founded a team called Product Infrastructure. That team ended up creating React, React Native, and the thing I was personally involved with, which is GraphQL.
B
I like to joke that I was present at the creation of the full hipster stack. I really saw the impact of open-source projects and how broadly they can be adopted, and it was really exhilarating to be part of that.
B
Data must come from somewhere and it must go somewhere, and that means dependencies. It's the workflow manager, the orchestrator, that models these dependencies and ensures that the computations that produce data are scheduled and ordered correctly. You can see that even in the example on the screen here.
B
This is not a particularly complicated data platform, but as you can see, it has multiple different technologies at play: ingest tools like Fivetran pulling from SaaS, Python scraping the web and storing those results in S3, the data warehouse, dbt over that data warehouse. You have Census at the bottom funneling that data back into the SaaS products, and a BI tool and ML after the data warehouse. There's just a ton going on here, and it's typical at companies for data platforms to be more complicated than this. And it's the orchestration layer that's really the beating heart of it, the natural center of gravity for the data platform. Because the orchestrator encodes dependencies between all the tools, it encodes the very structure of the data platform itself, and it's a sort of beast, nearly an octopus, you could say, like the logo. I'm just joking around there, but it grows across your entire organization.
B
It serves a ton of different constituencies, and it's the natural place where all the operational work happens as well. And today's workflow managers really struggle under the weight of this task. They can't handle the complexity, dev life cycles are slow, and it's really difficult to understand what DAGs are doing. So today I'm here to talk about Dagster, which we like to call a data orchestration platform built for productivity, because we think productivity is a key underlying problem in data platforms. Everything is just too slow.
B
It's too hard to make changes. So I want to dig into this notion of the orchestrator being the central point of gravity. We divide interactions with the orchestrator into three different roles. Role one is the data practitioner: the person responsible for producing data assets for downstream consumers and stakeholders. A data practitioner could be a data engineer producing Parquet files, an analytics engineer producing tables in a data warehouse, or a data scientist producing ML models.
B
Then you have the infrastructure engineers who support all of this. They're responsible for reliable data infrastructure and they naturally interface with the orchestrator as well. And lastly, you have the asset stakeholders, who are the consumers of data assets. This could be a business user interested in the state of a critical data asset produced by one of these business processes, or it could be a peer practitioner team. We think it's really critical to serve all of these different stakeholders.
B
So, as mentioned in the intro, Airflow is the dominant incumbent in the workflow management space, and what we hear from users about that system, as well as other peer systems in the space, is the following. We hear that you can't develop your DAGs locally.
B
It's just a very slow developer life cycle: the moment you hit the orchestrator, you run into this productivity wall. Related to that, you can't test your DAGs, and testing is the bedrock of productivity, in my opinion, because without tests you can't have that fast feedback loop.
B
Next, there are all these infrastructure problems with the existing orchestrators. A huge one is dealing with dependency management, meaning that team A wants Python packages X, Y, and Z.
B
It's also difficult to reliably and independently deploy code. And last, there's the monitor-and-observe phase: it's difficult to debug DAGs quickly, there's a lack of observability and visibility into the computations inside them, and you can't keep track of the data assets within the orchestration context. We think that's critically important, because the whole point of these systems is typically to produce, monitor, and observe data assets. And so, as you'll see, we think in a deep, fundamental way that the orchestration layer should be data-asset aware.
B
So here's how we approach this. I didn't mention it, but the previous slide chopped that feedback up into three parts of the life cycle, and we really try to think about the full life cycle in a thoughtful way. The first phase is develop and test, and with Dagster we want you to be able to efficiently and productively build well-structured, testable computations.
B
Call it a data pipeline or a data platform; the application here is that we're building a recommendation engine for Hacker News. We're going to download the data via a web API, use Spark to compact it into Parquet, and load it into BigQuery. Then we're going to split it and have an ML team take it out of BigQuery and build a recommendation model using pandas, and an analytics engineering team use dbt to build analytics dashboards.
B
Right: if a data error gets to your CEO's dashboard, you might get fired, or it might take a really long time to fix, whereas if you catch it in local development, that's no big deal. Errors are orders of magnitude more expensive in later stages, along a few different dimensions. So what's the goal here? We think that a properly engineered orchestrator can bend this curve so that more errors are caught earlier in the developer life cycle.
B
Now, we're not going to claim that all errors can be, because of the nature of this domain: data quality tests, for example, often don't pass or fail until you catch them in production. But because of the dynamic that errors are so much more expensive later, this bending of the curve represents a massive increase in the overall productivity of a data organization.
B
Okay, let's actually dig into some code. This is hello world in Dagster, and this is how we accomplish it. We have the notion of a job, which is essentially a graph of computations bound to a particular environment, and then we have an op, which is our unit of computation. And you'll see here at the bottom that you can simply take that job and execute it in process. It's a very straightforward Python API.
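The job/op shape described above can be sketched in plain Python, with no Dagster dependency. The names `execute_in_process` and the dict-based graph here are illustrative, not Dagster's actual API (Dagster uses `@op` and `@job` decorators); the point is just that an op is a function and a job is a dependency graph of ops you can run in-process:

```python
# Plain-Python sketch of the op/job idea (hypothetical names, not Dagster's API):
# an "op" is just a function; a "job" is a graph of ops wired together and
# executable in-process.

def get_name():          # op: produces a value
    return "dagster"

def hello(name):         # op: consumes the upstream output
    return f"hello, {name}"

def execute_in_process(graph):
    """Run ops in the order listed, passing upstream outputs downstream.
    graph: {op_name: (fn, tuple_of_upstream_op_names)}, listed in dependency order."""
    results = {}
    for op_name, (fn, deps) in graph.items():
        results[op_name] = fn(*(results[d] for d in deps))
    return results

# The "job": a graph of computations encoded as {op: (fn, upstream_ops)}.
hello_job = {"get_name": (get_name, ()), "hello": (hello, ("get_name",))}

print(execute_in_process(hello_job)["hello"])
```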
B
We think inherently that all data pipelines are effectively graphs of functions, and systems that don't capture parameterization like that are not capturing some of the real underlying complexity and the true nature of what these things are doing. So every node in our graph, every op, is a function. The body is completely arbitrary Python: you can do whatever you want. You can do data processing in Python directly, or you can call out to external systems.
B
And,
lastly,
this
critical
notion
of
separation
of
I
o
and
compute.
This
is
the
bedrock
of
our
testability
and
development
loop.
So
you
notice
here
that
we
output
a
data
frame
which
is
a
logical
in
memory
construct.
We
don't
output
a
4k
file,
we
don't
output
a
csv
file,
we
capture
how
to
persist
the
data
frame
in
a
different
dimension.
We
call
I
o
managers.
This
allows
for
a
lot
of
power.
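The separation of IO and compute can be sketched as follows. This is an illustration of the pattern, not Dagster's actual `IOManager` class; the class and method names are hypothetical. The op returns an in-memory value, and a pluggable IO manager decides how it gets persisted, so tests can swap in an in-memory store where production would write, say, Parquet to S3:

```python
# Sketch of separating compute from IO (illustrative, not Dagster's IOManager API):
# the op returns an in-memory value; a pluggable "IO manager" decides persistence.

class InMemoryIOManager:
    """Test-friendly store; a prod variant might write Parquet files to S3."""
    def __init__(self):
        self.store = {}

    def handle_output(self, key, obj):
        self.store[key] = obj

    def load_input(self, key):
        return self.store[key]

def make_rows():
    # Op body: pure compute, returns a logical in-memory result.
    return [{"id": 1}, {"id": 2}]

def run_step(io_manager):
    # The framework, not the op, routes outputs through the IO manager.
    io_manager.handle_output("rows", make_rows())
    return io_manager.load_input("rows")

io = InMemoryIOManager()
assert run_step(io) == [{"id": 1}, {"id": 2}]
```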
B
The moment you structure your code like this within the Dagster Python framework, you immediately get tools. The first one you'll use is Dagit, which is our UI in the browser. All you do is hop into your terminal, type `dagit`, and boom: you load your pipeline, and you can use this tool almost like a local IDE, which I'm going to show you right now.
B
So we have some code here which we've loaded up. Oh, and by the way, this is a completely new reskin. If you go to our website right now, everything looks completely different; on Thursday we're doing a major new release with completely revamped core APIs and a completely new look and feel, so this is a little preview. Chester has promised not to upload this YouTube video until after that, and I will hold you to that, Chester. But back to the topic at hand.
B
The moment you put one of these pipelines in this format, you get all this tooling. For example, this is the code I was showing you before, and you can see that in the web UI these descriptions, which live in the code, are exposed right here. You can see what the inputs and outputs are, you can see where the outputs go, et cetera. It's this very rich UI for figuring out what's going on, and what's unique to Dagster is that we render this prior to computation.
B
So here we have this UI, which allows you to launch ad hoc computations, which ends up being super useful in both development and operational contexts. Now I'm going to simulate an error, so I'm going to go in here and put in a programming error.
B
Okay, this is going to take a couple of seconds, but you'll notice this is a live, updating, reactive UI with a live-updating Gantt chart, so you can get a sense of the performance. Down here is a structured event log where you can tell what's going on in the system, which makes it very searchable. Lo and behold, there's an error here, so I'm going to click on that, and here's the error.
B
I go back here, and then I can just re-execute the single step right here, or I can launch all the computations after that error. You'll see this boot up; it should take a couple of seconds.
B
And there we go, it now completes. So if you're a user of, say, Airflow, this type of rich local development loop is just not in the realm of possibility in that system, and not for incidental reasons: it's because of deep philosophical reasons and how the system manifests that philosophy. Dagster was built from the ground up to enable exactly this use case, as well as other local testability, and we just got through that demo.
B
So working with an IDE-like tool is nice, but you might be an engineer and say to me: that's all cute, but I'm never going to use it because I work in unit-testing code. Well, that's actually how I feel too. I spend a lot of time in unit-testing code, so I want to be able to run this in a CI/CD pipeline, and I want to be able to do TDD if that's my thing. And it's really straightforward to call individual ops in your jobs.
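Because an op is just a function over logical inputs, a plain pytest-style test can call it directly, with no scheduler or cluster involved. A minimal sketch (the op name and data are made up for illustration):

```python
# Sketch: unit-testing an op body directly, since it is just a Python function.

def dedupe_comments(rows):
    """Op body: pure business logic that drops duplicate ids, keeping order."""
    seen, out = set(), []
    for r in rows:
        if r["id"] not in seen:
            seen.add(r["id"])
            out.append(r)
    return out

def test_dedupe_comments():
    rows = [{"id": 1}, {"id": 1}, {"id": 2}]
    assert dedupe_comments(rows) == [{"id": 1}, {"id": 2}]

test_dedupe_comments()
```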
B
This chunk of code is actually from a presentation at last year's Airflow Summit, and it shows the code you need to write in order to detect cycles in all your Airflow DAGs. Because Airflow wasn't designed for local development and doesn't have flexible Python APIs, you effectively have to re-implement the logic of the scheduler. In order to do that, you have to load every DAG, assert that it's valid, call this undocumented API, et cetera.
B
By contrast, with Dagster you simply assert that the job exists, because if it has a cycle, it won't get constructed. Very simple, very straightforward, and we have tons of other examples of places where you catch errors earlier in the life cycle.
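The idea of failing at construction time can be sketched with a small topological-sort check (an illustration of the principle, not Dagster's internals): building the job validates the dependency graph, so a cyclic job raises immediately and the test only has to assert that construction succeeds.

```python
# Sketch of catching cycles at job-construction time (illustrative).

def build_job(deps):
    """deps: {op: set of upstream ops}. Returns a valid execution order,
    or raises ValueError on a cycle, long before anything runs."""
    order, seen, in_progress = [], set(), set()

    def visit(node):
        if node in in_progress:
            raise ValueError(f"cycle detected at {node!r}")
        if node not in seen:
            in_progress.add(node)
            for up in deps.get(node, ()):
                visit(up)
            in_progress.discard(node)
            seen.add(node)
            order.append(node)

    for n in deps:
        visit(n)
    return order

# An acyclic job constructs fine:
assert build_job({"load": set(), "transform": {"load"}}) == ["load", "transform"]

# A cyclic one fails at construction:
try:
    build_job({"a": {"b"}, "b": {"a"}})
    raised = False
except ValueError:
    raised = True
assert raised
```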
But I also want to talk about deeper testing, meaning testing the actual business logic of these DAGs, not just their structure, and this is actually super challenging.
B
What you really want is to be able to take your business logic and hold it constant while you change something else about the computation, and we model this desire directly in our abstractions. We have ops, which are environment-neutral business logic that input and output logical constructs like ints and data frames, and then we have resources, which are responsible for binding that computation to a specific environment.
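The op/resource split can be sketched like this (illustrative names, not Dagster's `resource_defs` API): the op holds environment-neutral logic and only sees a logical "warehouse" resource, and which concrete implementation it binds to is decided per environment.

```python
# Sketch of ops + resources: the same op binds to different environments.

class FakeWarehouse:
    """Test resource. A prod resource with the same interface might be
    backed by BigQuery or Snowflake instead."""
    def __init__(self):
        self.tables = {}

    def write(self, name, rows):
        self.tables[name] = rows

def publish_comments(warehouse, rows):
    # Op: environment-neutral business logic against the resource interface.
    warehouse.write("comments", rows)
    return len(rows)

# In tests, bind to the in-memory fake; the business logic is held constant.
wh = FakeWarehouse()
assert publish_comments(wh, [{"id": 1}]) == 1
assert "comments" in wh.tables
```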
B
But Dagster is also a production orchestrator that orders computations in prod, and we really designed it from the ground up for multi-tenancy and for the cloud era.
B
Data platforms are naturally multi-tenant systems, even just internally at a company: both the data science team and the data engineering team are going to use it, and usually it's way more than that. And the world is cloud-native; that's the default way things are computed now, and that's how we designed our system.
B
So we have these core components that are deployed. One is what we call the daemon, effectively the thing responsible for scheduling runs. The other is the web server, Dagit, which is responsible for observation, monitoring, et cetera. A critical architectural point about this system is that all interaction with user-defined code happens over a structured API, and this provides process isolation, which is critical for reliability.
B
So, for example, if one of your teams somehow pushed up a Python syntax error, the system just notes that it wasn't able to load that code and everything continues on normally, rather than having the scheduler load those DAGs directly into its own process and bring down the entire system, which is what would happen in Airflow.
B
This lends itself to dramatically more horizontal scalability and the ability to do things like run on spot instances for cost. So this is very much designed for the cloud, and all of this is customizable and pluggable, so you can run it on your own infrastructure with a lot of flexibility. But we also come with an out-of-the-box, production-grade Kubernetes deployment: a very flexible, nice Helm chart.
B
So what does this look like? I'm going to hop to a different web UI. You'll see this is demo.elemental.show, and it looks very similar to what I just showed you. The difference here is that this is the same set of ops bound to a different set of resources, and that's just a code question. Now I can launch this, and instead of just launching a little local process on my computer, as you can see down here, this is actually spinning up Kubernetes run workers.
B
Every single step here is booted in its own Kubernetes pod for process isolation. So all this infrastructure stuff has changed, but the business logic has been held constant, and that is really the critical component of testability. I'll be spending a little more time in this UI as we go on. Let me go back here.
B
So we've deployed the computations and kicked off a computation in prod while holding the business logic constant. Now let's talk about the monitor-and-observe part of the life cycle, and the way we're going to do this is by adding data science. So, the computation I kicked off ad hoc:
B
What it did is it downloaded the data, kicked off a Spark job to compact it into Parquet, and then loaded it into BigQuery. Now we want a different team to fetch that data out of BigQuery and use it to build a recommendation model. But before we get into the details of that:
B
I want to step back and think about how a data platform should think about the different roles it serves. The way we think a platform should think about it is that there's this full life cycle, and every single stakeholder has it: every practitioner develops an asset, and they need to monitor it. You want to enable end-to-end ownership, and really the job of a data platform engineer is to fill in all the boxes here so that each role has its own end-to-end life cycle.
B
You want a uniform surface for deployment, execution, and monitoring and observation; you should be able to use this unified substrate, and then the different practitioners can focus on their business logic using the tool of their choice. A data scientist just wants to write Python; maybe they just want to use scikit-learn, pandas, et cetera.
B
They want to be able to leverage infrastructure built by the data engineers, and that's great, but in the end the goal is to be able to focus on business logic in the tool of your choice. So, thinking about this a little: we're building a recommendation engine, which typically takes a DAG of computation, and this is how you do that in Dagster.
B
Each one of these nodes in the graph is an op, and you can see here that you construct the graph the job contains by calling functions. This doesn't actually initiate the computation; it just builds the dependency tree, and the actual bodies of those functions are invoked later.
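The "calling a function only wires the graph" idea can be sketched in a few lines. This is a toy decorator, not Dagster's actual `@op`/`@job` machinery: inside the graph definition, calling an op records a node and its upstream edges instead of running the body, which an executor invokes later.

```python
# Sketch of deferred graph construction (toy decorator, not Dagster's API).

class Node:
    def __init__(self, fn, upstream):
        self.fn, self.upstream = fn, upstream

def op(fn):
    # Calling the decorated name builds a graph node; the body runs later.
    def build(*upstream):
        return Node(fn, upstream)
    return build

@op
def extract():
    return [1, 2, 3]

@op
def train(data):
    return sum(data)

# This "call" only wires the dependency tree; no op body has executed yet.
model_node = train(extract())

def run(node):
    # The executor invokes bodies in dependency order.
    return node.fn(*(run(up) for up in node.upstream))

assert run(model_node) == 6
```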
B
So this is what the world of Dagster looks like to a data science user, and it's just plain old Python. You might not know what `TruncatedSVD.fit` does, but data scientists do, so they're just using the plain old Python they know, love, or tolerate.
B
But you can add some sugar with Dagster, which allows you to attach metadata to all these different events, both the software artifacts and the persisted events, and this enables a lot of power. The way this is configured is that every single step in this pipeline, every single op, produces what we call an asset.
B
It could be a pickled model, and it's been really powerful, both for us as dogfooders, meaning using our own technology, and for our user base, to really embrace this concept, because operationally we have what we think is a critical insight, obvious in retrospect: people care about assets, not pipelines. I hate to break it to you, the data engineer, but if you go talk to a business stakeholder, they don't care about your pipeline. No one cares.
B
They
only
care
about
the
assets
that
you
produce,
and
you
know
this
way
of
thinking.
If
you
fully
embrace
it
is
really
really
empowering.
B
One great quote from a favorite user of ours, David Wallace, who's a staff engineer, previously at Drizly, is that Dagster empowers his stakeholder teams to own their data assets like no other orchestrator can, and that's specifically because of this philosophical view we take: that assets should be part of the game.
B
So, Dagster computations emit a stream of structured events which tell the system what is going on, and one of the events the user can emit is what we call an asset materialization. (See how nice and fast that UI is.) These are events that say: hey, I produced a specific materialization, it's going to outlive the compute, and I'm going to attach metadata to it.
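The structured event stream can be sketched as a generator of event records. The event shapes here are hypothetical, simplified stand-ins for Dagster's event objects: an op yields an asset-materialization event carrying an asset key and metadata, and the system indexes those events into a catalog.

```python
# Sketch of the structured event stream (hypothetical event shapes).

def download_items():
    """Op body as a generator: yields structured events, not bare values."""
    rows = [{"id": 1}, {"id": 2}, {"id": 3}]
    # ...persist rows somewhere durable, then record the materialization:
    yield {"type": "asset_materialization",
           "asset_key": "s3://bucket/items",
           "metadata": {"row_count": len(rows)}}
    yield {"type": "output", "value": rows}

# The system consumes the event stream and indexes materializations.
catalog = {}
for event in download_items():
    if event["type"] == "asset_materialization":
        catalog[event["asset_key"]] = event["metadata"]

assert catalog["s3://bucket/items"]["row_count"] == 3
```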
B
The interesting thing about that is that we keep track of it in what we call the asset catalog. If you go and view this asset, here's this items asset, which is an S3 file containing the items produced by something, and you can see we have all this interesting metadata about what's going on. We have the row count, which goes up and down over time.
B
We have the step execution time, which has gone up recently. That's interesting, and useful too. But where this gets really interesting, in my view, is that it allows a completely new way to index into the information encoded by the orchestrator.
B
So let's pretend we're some stakeholder and we know there's a comments table in the data warehouse. I can just go to Dagster and search for that thing. Let's see: comments. And actually, oh, we're in Snowflake now, not BigQuery; we ported it and I need to update the slides. But look, we can find the Hacker News comments table, search for it, and see its properties, and then we can see: oh, this was produced by this Hacker News API download pipeline. The last time it was touched was at 1:31, six minutes ago, by this run. So you can navigate back and forth from the computations to the assets and back to the computations, and this is a super powerful way to navigate and operationalize your data systems. Let me go back here.
B
You might be familiar with the concept of the data mesh, which is all the rage, but I think the most powerful and interesting idea to come out of the data mesh is that assets should be the interface between teams, and our system encodes that directly. So in this case we have one data science job that is, in the abstract, downstream from the data engineering job, and here's the way we hook them together.
B
You'll notice this live-updating page, and these dots represent the last time we checked whether the asset key has been updated. My demo took a little longer, so I need to go back to the one-day view, and you'll see that a run was kicked off at 1:32, which happened right after that.
B
The ad hoc job that was kicked off has just completed, and we can go here, click on this, and get all the exact tooling you're used to. I can go here, check out the asset materializations, and see what's been going on, and this ends up being a super powerful operating modality for data platforms.
B
Even as we were developing this internally, the people who built the data science job told me: hey, I didn't need to know anything about their pipeline. All we did was agree on a mutual asset key at the very beginning of development: you're going to produce this table, I'm going to consume that table, and after that we live in different worlds. That's a super powerful way of operating.
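The "asset key as the contract between teams" idea can be sketched with a polling sensor. The sensor shape here is a hypothetical simplification (Dagster has its own sensor APIs): the downstream team only knows the agreed key, and a sensor watches the catalog and requests a downstream run whenever a newer materialization lands.

```python
# Sketch: an asset key is the only contract between two teams (illustrative).

catalog = {"comments": {"last_materialized": 100}}  # written by the upstream team
downstream_runs = []                                # runs of the data-science job

def comments_sensor(last_seen):
    """Poll the agreed key; request a downstream run if it was re-materialized.
    Returns the new cursor."""
    ts = catalog["comments"]["last_materialized"]
    if ts > last_seen:
        downstream_runs.append(ts)  # kick off the downstream job
        return ts
    return last_seen

cursor = comments_sensor(0)                          # first materialization seen
catalog["comments"]["last_materialized"] = 200       # upstream publishes again
cursor = comments_sensor(cursor)                     # downstream reacts
assert downstream_runs == [100, 200]
```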
B
So that's the life of a data scientist in the system. I'm just going to quickly go over what the workflow of an analytics engineer looks like. If you're not familiar with the term, "analytics engineer" is effectively a term invented because a technology called dbt exists, and you can think of dbt as a way to turn an analyst into an engineer, an analytics engineer. It's a super powerful tool.
B
So if you're an analytics engineer listening to this presentation up until now, you might hear "develop and test" and think: oh, the orchestrator is going to take over the develop-and-test part of my life cycle. Well, I don't want that, because I use dbt.
B
I like dbt a lot, I like to talk about how much I like dbt, and I have no interest in abandoning that tool. And we totally agree with you. We fundamentally believe that people should use the tools they want for data processing and that the orchestrator should facilitate that, so in this case we have a fully functioning dbt integration that integrates very nicely with the system.
B
dbt metrics: this is a very straightforward DAG; all it does is invoke `dbt run` and `dbt test`. If you go here... oh, the sensor is off, how sad, I messed that up. If the sensor had been on, you would have seen one running about ten minutes ago. But just like with our other tools, we can go to the run that I did last night.
B
One of the built-in capabilities we have is the ability to produce asset materializations as a result of a dbt run: we ingest the metadata that comes out of dbt and persist it in our system. This allows Dagster to be the single operational pane of glass where you can manage all your assets and computations, no matter what tools they end up being computed in or stored in, and that's extremely powerful.
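The dbt-to-materialization translation can be sketched against dbt's `run_results.json` artifact (the fields used here, `unique_id`, `status`, and `execution_time`, appear in that artifact, though this is a simplified stand-in, not Dagster's actual dbt integration): each successful model becomes an asset materialization with the dbt metadata attached.

```python
# Sketch: translating a dbt run_results.json into asset materializations.
import json

# Simplified excerpt of the artifact dbt writes after `dbt run`.
run_results = json.loads("""{
  "results": [
    {"unique_id": "model.analytics.comments_daily",
     "status": "success", "execution_time": 1.9},
    {"unique_id": "model.analytics.broken_model",
     "status": "error", "execution_time": 0.2}
  ]
}""")

# One materialization per successful model, keyed by the model name,
# carrying dbt's metadata into the orchestrator's catalog.
materializations = [
    {"asset_key": r["unique_id"].split(".")[-1],
     "metadata": {"execution_time_s": r["execution_time"]}}
    for r in run_results["results"] if r["status"] == "success"
]

assert materializations == [
    {"asset_key": "comments_daily", "metadata": {"execution_time_s": 1.9}}
]
```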
B
So, to sum up: we think orchestration is a point of leverage, both for you as a practitioner or platform engineer and for broad ecosystem progress. The ecosystem is incredibly fragmented and difficult to navigate, and the orchestration layer is really where all this stuff has to come together: all the tools, all the practitioners, all the storage systems. So we think it's a massive point of leverage for improvement.
B
We fully consider developing and testing in this system, deploying and executing the computations, and monitoring and observing both the computations and the produced assets. This enables a really fast developer workflow and end-to-end ownership, which means dramatic increases in both individual and organizational productivity. And it's an open-source Python framework: you can end this presentation, sign off, and go install it for free.
B
As with any open-source project, you can check us out on GitHub; we have our docs, and we have our Slack, which is very active and the best place to interact with our community. And, like I mentioned, this is actually the first presentation we've given with our updated look and feel and our updated core APIs. The release is on Thursday, and it's probably the biggest release since the initial release of the project, so we're super excited about that. And without further ado, I will take questions.
A
Great talk, Nick. So I'll wait for your instructions on when I can make the video public.
B
Well, you'll send me the video, and then I'll see if it's good, and then I'll decide when we'll push it out. No, I'm just kidding.
A
Let me say, that's a very interesting topic, because it echoes some of the stuff we do on my team. Frankly, we did build our own separation of dev, staging, and production environments, and we have our own testing environments for our workflow engines, trying to let the same code run in different environments without changing it. So it's very interesting to see how Dagster does it.
A
I do have a question on one of the pieces you mentioned, which is also interesting. Dagster actually comes with the asset catalog, which is almost like another system, similar to data cataloging, where you can discover what tables you have, what metadata you have, the min and max, that sort of thing. So it seems like Dagster actually bundles that together. In other words, this is probably something like Amundsen.
B
So the most obvious one is a direct linkage to the run that produced the asset, and being able to click that really easily, being able to set that up without integrating yet another tool. Our goal is not to replace all the asset catalogs out there; there are tons of companies devoted to that.
B
They often have asset catalogs that are also scraping for information that was manually created, they have manual annotation systems, and they have more complicated ontologies. What we wanted to do is, one, provide a built-in data asset catalog for simple use cases that people get out of the box, and two, be able to leverage it for operational use cases where it really makes sense to integrate with the orchestrator, and you'll continue to see us double down in that direction.
B
We fully plan on having our metadata database be queryable, so you can query it yourself and ingest it, and we produce a structured event stream, so we fully expect that to produce exhaust that would be consumed by the likes of DataHub or Amundsen, et cetera.
A
Yeah, I can see that, but it seems like the majority of a company's data results are actually produced through the orchestration workflow engines, so there'll be a big chunk coming from here, I guess.
B
Correct, yeah, but lots of companies will have many different orchestrators, with teams making their own decisions about stuff, so we don't have any illusions that we're going to be the one asset catalog to rule them all anytime soon, nor do we want to cover a lot of the use cases that I covered.
B
I mean, I think the real answer is nearly every company. We like to say that every company has a data platform; it's a question of whether you acknowledge it or not, because if you don't acknowledge it and staff it, it still exists, but it becomes a complete, unorganized mess that's not operationalized or well engineered. And what a data platform is is where you manage and curate all of the data assets of your org, meaning:
B
The
data
has
been
ripped
out
of
its
original
context,
whether
it's
a
sas
app
one
of
your
operational
databases,
slash
system
of
record
or
like
scrape
from
the
web
or
something
and
so
any
time
that
you
are
doing
that
and
you're
stitching
together,
more
than
one
say,
computational
runtime
and
you
need
operational
robustness
around
that.
You
need
something
like
dagster
and
at
that
point
it's
kind
of
like
it's,
the
type
of
thing
where,
like,
if
you're
going
to
build
something
you
want
to
do
it
right.
B
So
it's
you
know
to
me
if
you
don't
use
something
like
dexter,
it's
kind
of
like
saying
when
you
start
writing
a
computer
program.
It's
like!
Oh
I'm
just
going
to
write
some
code,
I'm
going
to
refactor
into
functions
later.
It's
like
no,
you
just
want
to
start
like
engineering
it
properly
from
day
zero
in
order
to
build
a
well-structured
system
where
the
entropy
isn't
going
to
get
out
of
control.
B
So
you
know
most
companies,
I
know
of
require
a
system
like
this
because
they
need
to
ingest
data
from
sas
apps.
They
need
to
integrate
those,
they
need
to
compute
data
that
resides
in
their
data
warehouse
and
then
they
do
something
with
that,
and
even
in
the
simplest
cases
of
that,
you
need
an
orchestrator
to
make
that
all
work
well
and
reliably.
A
So, the next question: this one says it's a great demo, and they want to know, can you share the integration with Databricks?
B
And so there's an expectation of a sustainable business at some point.
B
So this is the now more broadly used, so-called hybrid SaaS model, used by the likes of Buildkite and Databricks. We handle upgrades for you, maintain the performance of the metadata database and the web server, handle all upgrades, and so on and so forth, and then we also add enterprise features like authentication, RBAC, auditing, and a bunch of CI/CD help.
A
Yeah, I think that clears up a lot of people's security concerns and other concerns about the company, right. Let me see what's next. Someone agrees with your idea that the interface should be assets; that's just a comment.
B
Yeah, well, Dagster is actively used on all the major cloud providers, and Dagster Cloud is itself hosted on AWS, if you're asking about that. But that's actually very opaque to you, and Dagster is flexible enough to use on any cloud provider, as well as on-prem.
A
Cool. So how big is your team? If I recall, one of my acquaintances, somebody from Canada, joined your team, if I remember right.
B
You're probably referring to Sandy. Yes, very talented; he leads our practitioner team, which is responsible for most of the open-source, practitioner-facing APIs. We have a team of 18 people, including me, and we are actively hiring: a head of product, devrel, a tech writer, and also engineers.
A
Okay, we've got one more question, actually two more, kind of related. Do you have an SDK to get job dependency lineage, to integrate with Amundsen or DataHub?
B
That does not exist. We're not opposed to building it or having a community contribution, but it just hasn't come up, actually. We build and support integrations based on demand, but we're philosophically very open to it, as I mentioned.
B
So yes: if you go to our open-source repo, you can see all the different libraries that have been built, and there are also libraries out there in the wild. Something we need to do is build a catalog of those. But one of the advantages of an open-source system is that you can see all the integrations, and often those are community-contributed, which is great.
A
Great. I think that's all the questions from the audience, and I'd like to thank Nick again for the great talk and for introducing Dagster to our communities. We hope to welcome you back next time to give more insights on Dagster.