From YouTube: Prezi: Migrating Data Pipelines to Dagster
Description
Tamas Nemeth presents why and how Prezi migrated their production data pipelines into Dagster from a homegrown orchestration solution.
🎞 Slides 🎞
Prezi & Dagster (Tamas Nemeth)➡️ :
https://prezi.com/view/kveaLi8KasReSs4pyP5l/
🌟 Socials 🌟
Follow us on Twitter ➡️ https://twitter.com/dagsterio
Check out our GitHub ➡️ https://github.com/dagster-io/dagster
Join our Slack ➡️ http://dagster-slackin.herokuapp.com
Visit our Website ➡️ https://dagster.io/
Check out our Documentation ➡️ https://docs.dagster.io/
Thanks for the nice words. I'm super happy to be here and to show you what we did in the Dagster world: our journey to Dagster and how we migrated to Dagster at Prezi. So, let's start with how we started.
I think the data engineering team, the data team, started around eight years ago at Prezi. We started, I think, like most companies: with a bunch of shell scripts scheduled with cron. Of course, at some point, as the number of ETL jobs started to grow, we figured out that it wouldn't scale, so we had to come up with some kind of solution. That was about six years ago, and we looked around the open-source world
for an orchestrator that would work for us, and that's how we decided: okay, let's build our own. We call it Flowkeeper; that's what you can see on the screen. This is our homegrown orchestrator, and the main design decision behind creating a new one, rather than going with an existing one, was simplicity. That was one of the core requirements from our users.
What you can see here is a pretty simple JSON descriptor, where you can define the scheduling type (we have two types, daily and hourly schedules) and the inputs your job is using. As you can see there, you can give each input a friendly name, and there is a path you can define; in this case, it's an S3 path.
You can also define what kind of datasets your job will generate. So in this case, this job takes some S3 location as input and then produces a Redshift table. And you should know that these inputs and outputs are what we use to build up the whole dependency graph in our orchestrator.
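To make the descriptor concrete, here is a sketch of what such a job descriptor could look like. The talk only mentions a schedule type, inputs with friendly names and paths, outputs, a job type, and a tier; every field name and value below is illustrative, not Prezi's real schema.

```python
# Illustrative Flowkeeper-style job descriptor; all field names and
# values are assumptions for the sake of the example.
descriptor = {
    "name": "load_user_events",          # hypothetical job name
    "type": "redshift-load",             # one of the predefined job types
    "schedule": "daily",                 # daily or hourly
    "tier": 2,                           # lower tier = scheduled earlier
    "inputs": [
        {"friendly_name": "raw_events",  # name used inside the job
         "path": "s3://bucket/events/"}  # S3 location produced elsewhere
    ],
    "outputs": [
        {"friendly_name": "user_events",
         "path": "redshift://analytics.user_events"}  # Redshift table
    ],
}
```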
We did not go with the concept you can see in other orchestrators, where you define dependencies between two jobs by referencing job names. Here we went down the path where you only have to know which datasets you want to work with, and based on that, we figure out the dependencies and which jobs need to be connected.
So basically, if you said that your input is the S3 location you can see here, and we saw that another job generates that same S3 location, we connected the two jobs. That is how we set up the dependencies between these jobs. I think it's pretty simple, and it worked for us, because the users usually know which datasets they are working with, but they are not really aware of which job, or jar, produces them. We also defined a couple of predefined job types you could use; the example here is a Redshift load.
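The dataset-based dependency resolution described above can be sketched in a few lines: map each output path to the job that produces it, then connect any job whose input path appears in that map. This is a toy reconstruction of the idea, assuming the descriptor shape used for illustration earlier, not Flowkeeper's actual code.

```python
# Sketch of Flowkeeper's dataset-based dependency resolution: jobs are
# connected when one job's input path matches another job's output path.
def build_dependencies(descriptors):
    # Map each output path to the name of the job that produces it.
    producers = {}
    for job in descriptors:
        for out in job["outputs"]:
            producers[out["path"]] = job["name"]
    # For each job, look up which upstream jobs produce its inputs.
    deps = {}
    for job in descriptors:
        upstream = {producers[inp["path"]]
                    for inp in job["inputs"]
                    if inp["path"] in producers}
        deps[job["name"]] = sorted(upstream)
    return deps

jobs = [
    {"name": "export_events", "inputs": [],
     "outputs": [{"path": "s3://bucket/events/"}]},
    {"name": "load_events",
     "inputs": [{"path": "s3://bucket/events/"}],
     "outputs": [{"path": "redshift://analytics.events"}]},
]
print(build_dependencies(jobs))
# {'export_events': [], 'load_events': ['export_events']}
```

The benefit of this design is exactly what the talk calls out: users only declare the datasets they read and write, and the graph falls out of the descriptors.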
Basically, what a Redshift load does: you specify an input, and we load the input data into Redshift with the parameters you can see down there. We have a few job types, like Redshift load, Redshift transform (which basically runs a SQL script), Spark jobs, Python jobs, and a few others. We also defined tiers: every dataset is put into some kind of tier, which is basically its priority. What does that mean?
You can imagine having a bunch of datasets, especially if you have hundreds of datasets and hundreds of ETL jobs. Then it can happen that two jobs could run at the same time, but the resource you want to run them on can't handle running the two in parallel. In that case, you have to make sure the more important dataset will be ready earlier, and this is what tiers mean here.
The lower the tier, the earlier the job gets scheduled, if possible. Another thing I failed to mention: the job type (in this example, a Redshift load) also defines the resource we are going to use, in this case Redshift. Even in our homegrown scheduler we had these resource queues, where we basically made sure that you can't overload the resources a job is using.
You can imagine, I guess: if you have hundreds of jobs that can run in parallel, and you actually ran those hundreds of heavy jobs on Redshift, you would most probably kill it. So this was the state: this was our own scheduler that we built. And we built a nice, user-friendly UI, which is a pretty simple grid where you can see which jobs finished and what their status is, and if something fails, you can see that there as well.
So things were looking good, and it seemed like the users really liked it, and we ended up with a dependency graph like this. We had around 900 jobs, and if you have 900 jobs, you will face a few issues. That's why we were seriously considering whether we wanted to fix those in our current homegrown orchestrator or look for some open-source alternative. So why did we decide not to keep improving our homegrown orchestrator?
One thing is the maintenance overhead. The data engineering team is a handful of people at Prezi, so we did not really have the capacity to fully focus on working on the orchestrator. Another thing is the grid you saw before: you can only see the actual job that fails, but you can't really see the dependencies between the jobs. So if a job fails, you can't tell from the grid which other jobs are affected by that failure.
Another thing: Flowkeeper was running on one EC2 machine, and if that machine died, we were in trouble; we had to start a new machine and set everything up there. There were also problems because we were running all of our jobs on one machine, so it could happen that two jobs interfered with each other.
You can imagine one job generating too high a CPU load, or just eating up the disk space. Or even worse: you have users who start expecting that they can write to a temporary folder, and without defining a dependency between the two jobs, one job puts a file down there and the other one expects to pick it up. And of course, the infrastructure at Prezi is moving to Kubernetes, so our data infrastructure needed to move to Kubernetes as well.
We did not have much time to fully work on that, and it was written in a, you know, not very extendable way, so it was hard to add new job types, etc. And the last one is the lack of a testing environment. Before, when users wanted to test their jobs, they mostly had to log into one machine, copy their files over, and try things out from that specific machine.
We wanted to provide a way better user experience for them, and that was basically the time when we talked with the Dagster team. They convinced us to try out their tool and see how it works for us, and that's when we decided: okay, let's try to migrate to this new system. But of course, if you want to migrate to a new system, you don't want to rewrite all of your ETL jobs from scratch.
So our first requirement when we tried to move to Dagster was basically being able to keep our descriptors and use them to generate solids in Dagster. Basically, we had a car and we wanted to replace the engine with a way better, much more reliable engine. And this is what we did: we kept our job descriptors.
First of all, we took the job descriptor and started to generate solids from it. What does this look like? First, we generated a solid config (which, I now see, should be called a config schema). Basically, if you treat a solid as a function that has parameters, then the config schema describes those parameters and their types. As you can see here, we had the original JSON descriptor (you can see down there that it's a Redshift transform), and we generated a nice config schema for it.
What you can see on the right side is a screenshot from Dagster. For every descriptor type, we generate one specific solid, and that's why the schema is so strict: you can't change the Redshift transform here to any other type, because the inputs, and even the processing, wouldn't make sense. So, as you can see, you can only specify a Redshift transform there, together with all the parameters that can be used in a Redshift transform. In this case, for example, the SQL file parameter says which SQL file needs to be run on Redshift when this job runs.
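The idea of deriving a strict, per-job-type schema from the descriptor can be sketched in plain Python (this deliberately avoids the real Dagster `config_schema` API; the job-type-to-parameter mapping is an assumption for illustration):

```python
# Rough sketch: derive a config schema from a job descriptor, where each
# generated solid is locked to one job type. Parameter names are assumed.
PARAMS_BY_TYPE = {
    "redshift-transform": {"sql_file": str, "output_table": str},
    "redshift-load": {"input_path": str, "output_table": str},
}

def config_schema_for(descriptor):
    job_type = descriptor["type"]
    # The type is fixed: other job types wouldn't make sense for this solid.
    schema = {"type": job_type}
    for param, py_type in PARAMS_BY_TYPE[job_type].items():
        schema[param] = py_type.__name__
    return schema

print(config_schema_for({"type": "redshift-transform"}))
# {'type': 'redshift-transform', 'sql_file': 'str', 'output_table': 'str'}
```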
Of course, now you have a function and all its parameters, so you have a solid and its config schema, but you still need the actual values you want to pass in. These are the presets. We also generate the preset YAML from our JSON descriptor; if you check the right side here, that one is the generated one.
The left side is basically what is in our JSON, and as you can see there, we generated a nice preset where we pre-fill all the values that are in the JSON descriptor. Later on, in the playground, you can of course change it if you want to do a test run, but basically, you don't have to do anything: we pre-fill it for you.
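A minimal sketch of what generating such a preset from the descriptor might look like; the nesting shape loosely follows Dagster's solids/config run-config layout, but the exact keys here are assumptions, not Prezi's generator:

```python
# Sketch: build the pre-filled preset values from a job descriptor, so
# users never have to type config by hand in the playground.
def preset_from(descriptor):
    return {
        "solids": {
            descriptor["name"]: {
                "config": {
                    "type": descriptor["type"],
                    "sql_file": descriptor.get("sql_file"),
                }
            }
        }
    }

preset = preset_from({"name": "daily_rollup",
                      "type": "redshift-transform",
                      "sql_file": "rollup.sql"})
```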
Basically, the solid body is predefined by us. It gets all the properties from the solid presets, and based on those, we decide what kind of job type we need to run; so if it's a Redshift transform, then we will run a Redshift transform, and we do some other steps as well. In the solid body, basically, what we do is check the inputs and do the actual job execution.
For the solid inputs: in your SQL, you can use the friendly name of an input, and then we replace the friendly name in your SQL with the actual table names.
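That friendly-name substitution can be sketched as a simple string replacement. The `{name}` placeholder syntax here is an assumption for illustration, not necessarily Flowkeeper's real convention:

```python
# Sketch of the friendly-name substitution the solid body performs:
# users write SQL against an input's friendly name, and the actual
# table name is swapped in at execution time.
def resolve_sql(sql, input_tables):
    # input_tables: friendly name -> fully qualified table name
    for friendly, table in input_tables.items():
        sql = sql.replace("{" + friendly + "}", table)
    return sql

sql = "SELECT count(*) FROM {user_events}"
print(resolve_sql(sql, {"user_events": "analytics.user_events"}))
# SELECT count(*) FROM analytics.user_events
```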
Now you have a nice solid, with the config, the presets, and the body, but you still have to define the dependencies between the solids, and what kind of inputs and outputs there are. Here as well, we use the JSON descriptor, and as you can see there, we generate a typed input. In this case, because it's a Redshift table, we generate a Redshift Flowkeeper table type for it.
A
Basically,
what
we
do
we
do,
the
same
depends
and
dependency
set
up.
What
we
what
I
mentioned
earlier,
basically
based
on
the
inputs
and
outputs
output
paths
and
and
table
names,
we
look
up
which
job
generate
that
and
we
do
the
connection
between
the
solids.
Based
on
that-
and
here
you
go,
there
is
a
nice
small
pipeline
defined.
A
And
last
but
not
least,
we
also
add
some
solid
metadata
which
not
needed
for
the
solid
itself,
but
it's
more
like
like
dexter
as
the
orchestrator,
and
also
because
we
want
to
add
some
nice
tagging
onto
these
solids.
So
just
a
few
examples
here
when
we
set
the
max
retries.
Basically
this.
This
is
what
what
which
says
that,
how
many
times
we
want
to
retry
a
failing
job
before
failing,
actually
and
and
stopping
retrying,
and
also
we
set
the
tier
here
and
based
on
the
this
tier.
A
A
Or
for
the
resource
cues,
so
now
we
have
a
nice
solid.
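A sketch of what generating those tags might look like. The `dagster-celery/queue` key is, to my knowledge, the tag the Celery executor actually reads for queue routing; the other keys and the type-to-queue mapping are illustrative assumptions:

```python
# Sketch: derive the solid tags described in the talk (retries, tier,
# and the resource queue implied by the job type).
QUEUE_BY_TYPE = {"redshift-transform": "redshift",
                 "spark": "hadoop",
                 "python": "python"}

def tags_for(descriptor):
    return {
        "max_retries": descriptor.get("max_retries", 3),  # retry limit (assumed default)
        "tier": descriptor["tier"],                       # lower = scheduled earlier
        "dagster-celery/queue": QUEUE_BY_TYPE[descriptor["type"]],
    }

print(tags_for({"type": "redshift-transform", "tier": 1}))
# {'max_retries': 3, 'tier': 1, 'dagster-celery/queue': 'redshift'}
```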
We could do this for one specific job, but we wanted to make this migration as painless as possible: basically, transparent to our users. To give you an example, we wanted to see our daily pipeline as one huge pipeline. That was one of the requirements from the beginning, because what we saw with other orchestrators is that they start to have problems if you have hundreds of jobs, or solids, in one pipeline, and that's why you basically have to split your pipeline into multiple pipelines and make the connections between those pipelines.
But the problem with most of these tools is that you can't really see the connections between the pipelines, and that was one of the reasons we really wanted to keep everything in one place and not have it taken apart. Another thing, of course, is that in the current state we are not really able to do this splitting anyway: we have 900 jobs now, and it would take a significant amount of time.
There is this nice selector where you can select just a subset of the pipeline, which can be super useful, especially if you are trying to understand your pipeline, or if you want to change some job and are interested in which other jobs could be affected by that change, or even if you are doing some kind of debugging where you are interested in what else could be affected if this job failed.
So I think it's a pretty cool thing. Now we have all of the jobs, we can generate solids from our jobs, and all of those solids can be loaded into Dagster.
And here is the workflow we came up with for how you develop a new ETL job. Basically, the workflow is the following: you, as a user, start working on your new shiny ETL job. You start the local development environment, which is basically Dagster running locally in Docker, and there you can start working on your job and testing it, and you can even access services with your own credentials.
When you are happy with your job, you create a pull request in GitHub, somebody reviews it, and in the meantime, Jenkins runs a check on the job as well. What we are actually checking relies on another, I think, pretty nice feature in Dagster: modes. You can create multiple modes, and we introduced a test mode which doesn't actually touch any of the resources; what it does is run the whole pipeline and basically check things.
It checks whether there are circular dependencies, whether there are any config issues, and whether we are able to run the whole pipeline without running on the actual resources, which is cool.
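One of the structural checks mentioned, circular-dependency detection, can be illustrated with a small depth-first search. Dagster performs this kind of validation itself when a pipeline is constructed; this is just a toy reconstruction of the idea, not Prezi's Jenkins check:

```python
# Toy cycle detection over a job dependency graph, the kind of structural
# check a no-resources test run can perform.
def find_cycle(deps):
    # deps: job name -> list of upstream job names
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # back edge found: circular dependency
        visiting.add(node)
        if any(visit(up) for up in deps.get(node, [])):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(visit(n) for n in deps)

assert find_cycle({"a": ["b"], "b": ["a"]}) is True   # cycle
assert find_cycle({"a": ["b"], "b": []}) is False     # acyclic
```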
If that check passes, then you can deploy. We are using the Kubernetes executor, specifically the Celery Kubernetes executor. So what happens in this case? In the end, what you do is basically just commit a JSON file into a repo, and based on that, we run everything.
Then, when you start a run, or a new pipeline run is scheduled, these jobs go into Celery, into various resource queues. We defined a separate queue for Redshift, for Presto, for Hadoop (that's where the Spark jobs run), and for Python. Basically, in this way, we can make sure that when jobs are executing from the Redshift queue, only, say, five parallel jobs are running, and it can't happen that we overrun the Redshift cluster so that no one else can query it, which would not be a good thing.
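The per-queue throttling can be illustrated with a toy counter. In production this limiting is done by Celery worker concurrency, not application code, and the limit of five is just the example figure from the talk:

```python
# Toy illustration of per-resource-queue throttling: only a fixed number
# of jobs may run concurrently per queue (e.g. five on the Redshift queue).
from collections import defaultdict

class ResourceQueues:
    def __init__(self, limits):
        self.limits = limits             # queue name -> max parallel jobs
        self.running = defaultdict(int)  # queue name -> currently active

    def try_start(self, queue):
        if self.running[queue] >= self.limits[queue]:
            return False                 # queue saturated, job must wait
        self.running[queue] += 1
        return True

    def finish(self, queue):
        self.running[queue] -= 1

q = ResourceQueues({"redshift": 5})
started = [q.try_start("redshift") for _ in range(6)]
print(started)  # [True, True, True, True, True, False]
```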
And another benefit of running on Kubernetes is that all of these jobs run in separate pods, which is nice, because jobs can no longer interfere with each other: if a job uses too much memory or CPU, or whatever, its pod gets killed, but the other jobs can keep running, which is cool. Another cool thing in the Celery executor is that all the prioritization is there; it even respects our priority settings, which is super nice. And we also got a few nice additional benefits from using Dagster.
One is the nice data lineage visualization that I showed you before. The other one is pipeline performance monitoring, which is pretty nice, because most of the time, if it turns out your pipeline is running slower than expected, you are interested in why, and in which solid your job runs longer than before. Maybe somebody committed a change there which caused it, or maybe there is some issue with your Hadoop cluster, or whatever.
Another thing is easier pipeline debugging. I think it's a pretty nice UI, where you can basically see the logs immediately, and you have this nice filter as well for narrowing down what you are looking at, and the solid selectors, where you can see only the portion of the pipeline you are really interested in.
And the testing capability is super nice. Actually, that's what I showed you in the GitHub and Jenkins example. It's super cool, and we can make sure that we are letting way less garbage in by running the whole pipeline in a test run, and of course, there is this nice type and config checking, which comes automatically.
We need to do more extensive user testing and also onboard all of our analysts, and in this way, we can basically speed up the migration. Actually, that's what we are currently working on: some kind of migration guide that we can hand over to them, so they can migrate their own jobs on their own.
Another item is improved backfill capabilities. I was super happy to see that there will be a bunch of improvements around that; we would really like to see them. This is something we are working on improving, together with introducing better quality checks.
So currently, as I told you, the quality checks are basically: is there a file or not, is there a table or not, is there at least one row in the table or not.
A
We
would
like
to
introduce
more
sophisticated
quality
checks
as
well
later
on
and
last
but
not
least,
thank
you
dexter
team.
I
think
it's
super
nice
and
I
I'll
be
really
happy
with
the
cooperation
and
all
of
these
things
what
you
implemented.
I
think
it's
super
nice
and
I
we
started
to
work
on
this
almost
a
year
ago
and
when
it
was
the
extra
year
again,
but
now
it's
incredible
where
you
get
there
and
I
think
you
are
like
really
in
a
ludicrous
mode,
so
releasing
new
features.