From YouTube: Data Engineering Podcast: Running Dagster in Production
Description
Tobias Macey from the Data Engineering Podcast discusses migrating from a cron job to Dagster, including the resulting tech stack (Pulumi, Packer, SaltStack, HashiCorp Vault, Consul, Vdist/FPM) and the Dagster features he used (scheduler, resources, hooks, assets, etc.).
🎞 Slides 🎞
MIT Open Learning & Dagster (Tobias Macey) ➡️
https://docs.google.com/presentation/d/1TKL9kem6SDyPr0MADOQIRqwvFgHOF7_gJ9Hqdubiuhs/edit
🌟 Socials 🌟
Follow us on Twitter ➡️ https://twitter.com/dagsterio
Check out our GitHub ➡️ https://github.com/dagster-io/dagster
Join our Slack ➡️ http://dagster-slackin.herokuapp.com
Visit our Website ➡️ https://dagster.io/
Check out our Documentation ➡️ https://docs.dagster.io/
All right, yeah, so I've been using Dagster, got it into production probably on the order of about a month ago now. So this is just kind of recapping my overall experience, from the problem that I was dealing with, of having a homegrown cron job running with poor visibility and error reporting, to using Dagster to replace that and serve as the foundation for bootstrapping other pipelines in my organization.
So a bit of backstory first: I actually met Nick a while before he went public with Dagster, and he told me that he had been using the Data Engineering Podcast as one of the critical sources for the research he was doing that ultimately led to his decision to build Dagster and identify that as a problem space that was worth pursuing.
So we had a script that would take SQL dumps of some select tables from the platform, from a MySQL database. It would run exports of the actual courseware, which is a shell script that generates a blob of XML files that we then tar up and send off to S3, and it would also dump the contents of a database of forum responses, to analyze some of the communications between students among themselves and also with the instructors. All of that would be bundled up and uploaded to S3 using a date-stamped path, so we would have one extract per day.
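As a rough illustration of that last step, here is a minimal sketch of bundling a day's extract and pushing it to a date-stamped S3 path with boto3; the bucket name, local paths, and function name are placeholders rather than the actual project code.

```python
import datetime
import tarfile

import boto3  # assumes AWS credentials are already configured in the environment


def upload_daily_extract(local_dir: str, bucket: str = "example-extracts-bucket") -> str:
    """Tar up a day's extract directory and upload it to a date-stamped S3 path."""
    date_stamp = datetime.date.today().isoformat()  # e.g. "2020-10-14"
    archive_path = f"/tmp/extract-{date_stamp}.tar.gz"

    # Bundle the SQL dumps, courseware XML, and forum dump into one archive.
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(local_dir, arcname=date_stamp)

    # One extract per day, keyed by date.
    key = f"daily-extracts/{date_stamp}/extract.tar.gz"
    boto3.client("s3").upload_file(archive_path, bucket, key)
    return key
```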
So, you know, as I said, we had a custom Python script, it ran on a cron job, and it would post some very poorly formatted output to Slack each time it ran, just kind of dumping the contents of standard out. It was just very noisy and was quickly ignored. We didn't really pay a lot of attention to that Slack channel because it was really hard to get any useful information out of it.
So, in order to break down the problem, I took each of the different sources that we were pulling from, where each SQL dump would be a separate source, and then the mongodump, and broke that down into its own solid definition. So there would be one solid definition for each database table we would need to export, or for the Mongo collection, and it all starts with that.
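As a sketch of that breakdown, using the legacy solid API from the 0.8/0.9-era releases; the table name, resource keys, and the `dump_table` helper are hypothetical stand-ins for the real definitions.

```python
from dagster import solid


# One solid per table to export; the MySQL connection and the daily extract
# folder come in as resources defined elsewhere in the repository.
@solid(required_resource_keys={"mysql_db", "results_dir"})
def export_auth_user(context) -> str:
    """Dump a single table to a file in the daily extract folder."""
    out_path = f"{context.resources.results_dir}/auth_user.csv"
    context.resources.mysql_db.dump_table("auth_user", out_path)  # hypothetical helper
    context.log.info(f"Exported auth_user to {out_path}")
    return out_path
```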
I also wrote a resource definition for the daily extract folder, so that it would be consistent in terms of creating the folder and then removing the folder at the end of the pipeline, just being able to take advantage of those lifecycle hooks that the resource capabilities in Dagster bring. Dagster also has much better logging and error reporting out of the box, and the visibility with Dagit, being able to go in and see: this is the stage that failed.
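That folder resource might look roughly like the following, using Dagster's generator-style resource so that setup happens before the `yield` and cleanup happens after the run finishes; the config key and temp-directory approach are assumptions for illustration.

```python
import shutil
import tempfile

from dagster import resource


@resource(config_schema={"date_stamp": str})
def daily_results_dir(init_context):
    """Create the day's extract folder, hand the path to solids, clean it up afterwards."""
    path = tempfile.mkdtemp(prefix=f"extract-{init_context.resource_config['date_stamp']}-")
    try:
        yield path  # solids receive this via context.resources
    finally:
        shutil.rmtree(path, ignore_errors=True)  # teardown at the end of the run
```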
So I wanted to be able to alert on cases where the pipeline failed. For that I'm using a service called Healthchecks and, as I mentioned, that's just an endpoint where we send a POST request when the task finishes, and it also gives you the possibility of sending negative acknowledgements. So if the task fails, you can signal that as well, and then that has integrations: the default one we use is email, but you can also have it post to Slack or various paging services. And so that way we know on a daily basis.
If I don't hear anything, then that's a good sign. It means that everything's up and running, but if it does fail I'll get a notification, so I know to go in and use the tools available in Dagster and Dagit to debug the pipeline. So for integrating with that, I actually wrote a hook definition.
First, I wrote a resource definition for being able to communicate with Healthchecks, and then wrote a hook that wraps the pipeline definition with a success and failure event, so that if the pipeline fails it'll tell Healthchecks that there was a failure event, I'll get notified of that, and then I can go in and debug.
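One way to wire that up, sketched with a resource plus the success/failure hook decorators that appeared around the 0.9/0.10 releases; the ping URL handling and client class are illustrative rather than the actual implementation, and note that hooks attached at the pipeline level fire per solid event.

```python
import requests

from dagster import ModeDefinition, failure_hook, pipeline, resource, success_hook


@resource(config_schema={"ping_url": str})
def healthchecks(init_context):
    """Thin Healthchecks client: ping the check URL on success, append /fail on failure."""
    url = init_context.resource_config["ping_url"]

    class HealthchecksClient:
        def ok(self):
            requests.post(url, timeout=10)

        def fail(self):
            requests.post(f"{url}/fail", timeout=10)

    return HealthchecksClient()


@success_hook(required_resource_keys={"healthchecks"})
def notify_healthchecks_success(context):
    context.resources.healthchecks.ok()


@failure_hook(required_resource_keys={"healthchecks"})
def notify_healthchecks_failure(context):
    context.resources.healthchecks.fail()


# Hooks applied at the pipeline level run for each solid's success/failure event.
@pipeline(
    mode_defs=[ModeDefinition(resource_defs={"healthchecks": healthchecks})],
    hook_defs={notify_healthchecks_success, notify_healthchecks_failure},
)
def daily_extract_pipeline():
    ...  # solids composed here
```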
The other element that I realized, once I first started working on getting it into production in our QA environment, is that I went to the Dagit UI and saw that there were no authentication or authorization restrictions. I could just go directly to the website, and if I clicked on the link for the Dagster instance definition, there would be secret values there or in the pipeline configuration.
I think it was the 0.8 release that added the capability of having pipelines deployed independently of each other and having Dagit be able to communicate between them, and I wanted to be able to take advantage of that and have each of the pipeline definitions siloed in terms of the deployment and development cycle.
So I went through the process of writing some tooling to package up the pipeline definition, all of its dependencies, and the version of Python into a Debian package, so that it's a single object that I download onto the host machine and use the standard apt tools to just install it. Then, when there are new versions, I just repackage the project with a different version number, install it using apt, and it installs cleanly over the previous version.
So it gives me a very clean way to upgrade in place without having to rebuild everything from scratch all over again. There's a tool, and I've got links to all the tools I used at the end, but it's called vdist, that just gives you the option of packaging all that information up into a Debian package, or an RPM if that's your flavor of choice. So, yeah, the tools used: I'm using a tool called Pulumi for provisioning all the infrastructure on AWS. It's similar to Terraform, but it's just infrastructure as code, so I'm able to version it and run it all cleanly. Packer is for building the EC2 machine image for the Dagster application, so it packages up all the pipeline code and dependencies so I have a basis to work from, as well as bringing in all the Caddy definitions.
Then I'm using Consul for service discovery, so that in the process of writing my Dagster code I don't have to worry about how to determine what the source destination is for being able to communicate with it. In terms of the Dagster features that were very useful in getting all of this running and into production: using the scheduler for being able to set up the timelines and make sure that the job runs every day.
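For the daily schedule, a minimal sketch with the legacy `@daily_schedule` decorator; the pipeline name, execution time, and run config shape are assumptions for illustration.

```python
import datetime

from dagster import daily_schedule


@daily_schedule(
    pipeline_name="daily_extract_pipeline",
    start_date=datetime.datetime(2020, 9, 1),
    execution_time=datetime.time(hour=4, minute=0),
)
def daily_extract_schedule(date):
    """Return the run config for the given execution date."""
    return {
        "resources": {
            "results_dir": {"config": {"date_stamp": date.strftime("%Y-%m-%d")}},
        }
    }
```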
There are also the resource definitions for being able to extract out the communication with my MySQL database and the daily folders for uploads; taking advantage of some of the resource definitions available in the Dagster ecosystem, such as the dagster-aws package for communicating with S3; and then also being able to use pipeline presets to separate out the configuration loading for different deployments of edX, because I have one environment that's for students at MIT and then another deployment.
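A sketch of how presets can separate out configuration per deployment; the preset names, config file paths, and mode are hypothetical.

```python
from dagster import ModeDefinition, PresetDefinition, pipeline


@pipeline(
    mode_defs=[ModeDefinition("default")],
    preset_defs=[
        # One preset per edX deployment, each pointing at its own config file.
        PresetDefinition.from_files(
            "residential_mit", config_files=["run_config/residential.yaml"]
        ),
        PresetDefinition.from_files(
            "global_learners", config_files=["run_config/global.yaml"]
        ),
    ],
)
def edx_daily_extracts():
    ...  # solids composed here
```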
Then I think it was maybe in 0.9 that hooks were added, for being able to have that on-success and on-failure logic attached to the health check system. And then, for being able to plan for future uses of Dagster, using the workspace definitions, so that I can have Dagit running as a service pointed at a single YAML file that tells it where to communicate with all the different pipelines, and have a stable and flexible way of bringing on new use cases. And then, within the pipeline definition itself, having the use case for being able to write expectations, so that I can see if certain quality checks are passing or failing; asset materializations, to understand what files are being produced by these pipelines; and then having flexible metadata to be able to generate another source of information that can be used for future analysis of how the use of this pipeline is trending.
How are the volumes of data that it's processing changing over time? So just a lot of really useful capabilities there.
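Inside a solid, the expectations and materializations look roughly like this, using the legacy event APIs; the asset key, labels, and metadata shown are illustrative.

```python
import os

from dagster import AssetMaterialization, EventMetadataEntry, ExpectationResult, Output, solid


@solid
def upload_extract(context, archive_path: str):
    size_bytes = os.path.getsize(archive_path)

    # Quality check surfaced in Dagit as a pass/fail expectation.
    yield ExpectationResult(
        success=size_bytes > 0,
        label="non_empty_archive",
        description="Daily extract archive should not be empty",
    )

    # Record what was produced, with metadata that can be trended over time.
    yield AssetMaterialization(
        asset_key="daily_extract_archive",
        metadata_entries=[
            EventMetadataEntry.path(archive_path, "local path"),
            EventMetadataEntry.int(size_bytes, "size (bytes)"),
        ],
    )

    yield Output(archive_path)
```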
Here's a quick diagram of the overall data flow: on the left, you see Dagster with its Postgres database in its own VPC.
So that's where we are now. In terms of next steps, I'm looking at using the Pants build tool for being able to use a monorepo structure, having all my pipeline definitions in one source control repository but being able to build and package all of the pipelines separately and deploy them independently from each other, while still having shared resource definitions and common libraries that can be used across those pipelines.
I'm going to be putting up a package archive to make it easier to upload and deploy new versions of the pipeline code as Debian packages, so using something like Artifactory or the Pulp project. As the use case grows, I'll probably end up bringing in Dask for being able to distribute the overall workload.
And then, to simplify onboarding new use cases, I'll be creating some pipeline templates using a tool called Copier, similar to Cookiecutter, to provide some scaffolding to users who want to write their own pipelines, so that I don't have to be hands-on with all of those new use cases; and then probably new resource definitions as time goes by, for things like Vault, for being able to pull in secret values in memory at runtime so they don't have to sit on disk as a potential source of compromise.
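That Vault resource would probably look something like this sketch built on the hvac client; the config keys, secret path, and KV v2 layout are assumptions, since this integration doesn't exist yet.

```python
import hvac
from dagster import resource


@resource(config_schema={"vault_addr": str, "vault_token": str, "secret_path": str})
def vault_secrets(init_context):
    """Fetch secrets into memory at run time instead of leaving them on disk."""
    cfg = init_context.resource_config
    client = hvac.Client(url=cfg["vault_addr"], token=cfg["vault_token"])
    secret = client.secrets.kv.v2.read_secret_version(path=cfg["secret_path"])
    return secret["data"]["data"]  # dict of key/value pairs for solids to use
```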
So the things that went really well are the flexibility and granularity of the Dagster framework: being able to define resources and have them be reusable, having the solid definitions be easy to define and hook together in various pipeline formations, using the hooks for success and failure events, and asset tracking and materializations for rich metadata.
The fact that it doesn't have a prescribed deployment methodology is useful, because it means that I can use my existing tooling to deploy and manage it without having to buy into another ecosystem such as Kubernetes if I don't want to, while still having that as an option.
Getting the cron scheduling set up is still a little bit opaque and requires a bit of manual intervention, but all in all it worked well. Being able to add in new ways to define scheduling, maybe doing it dynamically or with better granularity, or, you know, trigger-based scheduling, would be useful; for that I'm probably going to be hooking into the GraphQL API.
So yeah, if you want to find me online or follow up with the work I'm doing, here are the places you can find me. I run the platform and data engineering team at MIT Open Learning. I host the Data Engineering Podcast and Podcast.__init__, which some of you folks might be familiar with, and you can find me on LinkedIn or Twitter; I'm not very active there, but I do exist. And then all the code that I'm using for building the pipeline and managing deployment is actually open sourced.