Description
Carlson Cheng—Machine Learning Engineer at Thinking Machines—demos running Dagster for pipelines in production
See full July 13, 2021 Community Meeting here: https://www.youtube.com/watch?v=tjDnyE7Xcvo&t=1148s
A
Okay, so hello, everyone. I'm Carlson from Thinking Machines; I'm the head of the ML engineering team there. Thanks for inviting us over so that we can talk about how we use Dagster for running ML pipelines in production.
A
So, a brief intro to Thinking Machines: we are a global technology consultancy building AI and ML solutions and data warehousing platforms to solve high-impact problems for our clients. Our clients range from large corporations in Southeast Asia to global nonprofit organizations, and our main goal is to empower these business users with valuable data and insights so that they can make better decisions.
A
We are internationally recognized in the field of data science. We've presented at top machine learning conferences, most recently ICML and NeurIPS 2020, where we were awarded best paper at one of the NeurIPS machine-learning-for-development workshops for our geospatial research on poverty estimation using satellite imagery.
A
And we've used Dagster for a number of projects, primarily for big data warehousing, where we use Dagster for more traditional ETL and ELT use cases: unifying data from multiple sources and stakeholders, and building up data warehouses and dashboards for analytics.
A
We'll be focusing more on the third use case, which is MLOps, looking at one of our projects where we've used Dagster: building a smart, unified search app. This app consolidates a number of data sources and lets users search through those sources and get relevant information.
A
One example: a user would ask our application something like "How do I apply for vacation leave?" and our application, using ML algorithms, would return the most relevant section of the employee handbook, highlighting the steps you would need to actually file for a leave. Our use cases extend from that, also allowing users to query relevant entities like people and companies.
A
A second use case is Q&A search, which is our initial example, where you get the most relevant section of the employee handbook for your question, and a third is entity search, giving you the most relevant person or company information. All of these search results are piped into a search ranker. The search ranker is an additional ML model that prioritizes, among your search results, which one to list as the top, most relevant result.
A
Here's a simple architecture for our project. You can see here that our training and test sets are piped into the automated training and evaluation pipelines that we've built using Dagster. Based on this, we can do continuous training on newer data and build new models, which we then stage in an S3 bucket. From there we can redeploy our web application servers with these new models.
A
As our users use the application, their interactions give us more relevant information and user feedback, so that we can build more fine-tuned models for their application needs. We also have an additional setup here where we created a few Dagster pipelines for meta pipeline monitoring.
A
This is our ML automation workflow in general: it goes from PoC to dev to prod. Focusing on the first stage, the proof-of-concept phase is primarily done inside Jupyter notebooks, where our data scientists can fully test out their different ML approaches and run experiments. After that, once they finalize their ML methodology, we start migrating their Jupyter notebooks into Dagster pipelines, where we can do further fine-tuning and polishing.
A
So how does this work in practice? On the left side you can see the Jupyter notebook that a typical data scientist would create. In this case, after they've done some initial data prep, they start doing hyperparameter optimization. They're using a module here called Optuna, which is used for hyperparameter search, and they give it a set number of trials, say 100 trials.
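For readers following along, a minimal sketch of this kind of Optuna search looks roughly like the following; the classifier, the stand-in iris data, and the hyperparameter ranges are illustrative assumptions, not details from the talk.

```python
import optuna
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in training data; in the real notebook this would come from the data-prep cells.
X_train, y_train = load_iris(return_X_y=True)

def objective(trial):
    # Hyperparameter ranges suggested per trial.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "max_depth": trial.suggest_int("max_depth", 2, 16),
    }
    model = RandomForestClassifier(**params)
    # Score each trial with cross-validated accuracy.
    return cross_val_score(model, X_train, y_train, scoring="accuracy", cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)  # the "set number of trials" mentioned in the talk

print(study.best_params)  # hyperparameters of the best-scoring trial
print(study.best_value)   # its accuracy score
```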
A
They then get the best-scoring model with its hyperparameters and accuracy score, and that's what we can save and export as our best model from that number of trials. This is a typical thing you would get from a data scientist's notebook. Once we finalize and polish this, we can actually start moving it over to our Dagster pipelines, and the convenient thing here is that you can pretty much just copy-paste your notebook code into a Dagster solid, since Dagster is very Pythonic.
A
It doesn't really require you to write anything extra, since most Dagster solids are just Python functions.
A
You can pretty much just port it over to your Dagster solid. The additional steps, which you do in coordination with your data scientists, are to add descriptions to your solid definition and the input and output definitions. This is important because we'll need these definitions later on when we're validating and debugging our pipeline inside the UI. Some additional steps are just adding logs and Dagster assets for ML metadata tracking; more on this later.
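A rough illustration of what that porting step can look like, using the pre-1.0 solid API that was current at the time of this talk; the solid name, the types, and the run_optuna_search helper are hypothetical placeholders rather than the team's actual code.

```python
from dagster import InputDefinition, Output, OutputDefinition, solid

@solid(
    description="Optuna hyperparameter search ported over from the notebook.",
    input_defs=[
        InputDefinition("train_set", dagster_type=dict, description="Prepared training data"),
    ],
    output_defs=[
        OutputDefinition(str, name="model_path", description="Path of the exported best model"),
    ],
)
def train_best_model(context, train_set):
    # The body is essentially the notebook code pasted in;
    # run_optuna_search is a hypothetical helper wrapping the study shown earlier.
    best_params, best_score, model_path = run_optuna_search(train_set)
    context.log.info(f"Best score {best_score} with params {best_params}")
    yield Output(model_path, output_name="model_path")
```

The description and input/output definitions are what later surface in the UI when validating and debugging the pipeline.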
A
Here are some learnings we've had when running ML pipelines alongside existing Dagster pipelines. Usually, when we're porting our Jupyter notebooks into more standard ML pipelines, we already have an existing Dagster infrastructure set in place for more traditional ETL and ELT pipelines. So we don't really need to create a new platform for our ML workflows; we can just make use of our existing Dagster infrastructure and add our ML pipelines there.
A
One thing to take note of is that we should organize our pipelines into logical groups. For example, you would have a group for your different ETL pipelines for source A and source B, and then other groups for your ML pipelines, say for a specific model X and another model Y. We make use of Dagster's repositories feature, which helps us isolate the individual groups, and this also further helps us isolate the dependencies for each of these pipelines, so you can avoid conflicts.
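In code, those logical groups map onto separate @repository definitions, each of which can be loaded from its own environment so dependencies stay isolated. A minimal sketch with placeholder pipeline names:

```python
from dagster import pipeline, repository, solid

@solid
def ingest_source_a(context):
    context.log.info("ingest source A")

@pipeline
def source_a_etl_pipeline():
    ingest_source_a()

@solid
def train_model_x(context):
    context.log.info("train model X")

@pipeline
def model_x_training_pipeline():
    train_model_x()

# One repository per logical group: ETL sources vs. a specific ML model.
@repository
def etl_sources_repo():
    return [source_a_etl_pipeline]

@repository
def ml_model_x_repo():
    return [model_x_training_pipeline]
```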
A
And some additional learnings we've had: Dagster makes it very easy to move over to prod, since the pipeline implementation is pretty much the same whether you're running on, say, your local machine or in your Kubernetes production environment, so there are very minimal changes when moving your pipelines over. Most of the changes are done in the high-level Dagster configurations, but at the pipeline level you don't really need to change much to port it over.
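A minimal sketch of that idea in the pre-1.0 API: the pipeline body stays the same, and only the mode and resource wiring changes between a local run and the production deployment. The resource choices below are illustrative assumptions; the Kubernetes deployment itself is configured at the instance level rather than in pipeline code.

```python
from dagster import ModeDefinition, fs_io_manager, pipeline, solid
from dagster_aws.s3 import s3_pickle_io_manager, s3_resource

@solid
def train(context):
    context.log.info("training...")

# Local development: intermediate outputs land on the local filesystem.
local_mode = ModeDefinition(name="local", resource_defs={"io_manager": fs_io_manager})

# Production: intermediate outputs go to S3 instead; the solid code does not change.
# (Running this mode requires run config for the S3 bucket.)
prod_mode = ModeDefinition(
    name="prod",
    resource_defs={"io_manager": s3_pickle_io_manager, "s3": s3_resource},
)

@pipeline(mode_defs=[local_mode, prod_mode])
def training_pipeline():
    train()
```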
A
Speaking of production, moving on to that: this is where we can fully make use of our automated model training and evaluation pipelines, producing new models and deploying them onto our servers, where we can do further monitoring and get new data. For pipeline monitoring for MLOps, we make use of Dagster's asset materialization feature so we can keep track of ML metadata. Coming back to our initial example, we have a code snippet here where we create an asset called generated_model.
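That snippet looks roughly like the following; this is a sketch in the pre-1.0 event API, the metadata argument shape has changed across Dagster versions, and the path and values here are placeholders rather than the team's actual code.

```python
from dagster import AssetMaterialization, Output, solid

@solid
def export_best_model(context, best_params, best_score):
    model_path = "s3://models/generated_model.pkl"  # placeholder path

    # Record the trained model as a Dagster asset so each training run's
    # file name, hyperparameters, and accuracy show up on the asset page.
    yield AssetMaterialization(
        asset_key="generated_model",
        metadata={
            "model_path": model_path,
            "hyperparameters": str(best_params),
            "accuracy": best_score,
        },
    )
    yield Output(model_path)
```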
A
This is important because later on, when you're actually checking your Dagster UI, you can view the metadata for each of your training runs. Here, your latest run can show you the file name, the hyperparameters of the best-scoring model, and its accuracy score. Your data scientists can keep track of the scores of each of the training runs, and we get to easily see whether a certain run is performing well.
A
Say a certain model performed really well and then we notice that the score went down: we can easily debug issues within the ML workflow.
A
On the UI side, we also really appreciate the pipeline definitions. Since Dagster pipelines are very data-aware, we can see the inputs and outputs coming through each of the solids, so we can keep track of how our data is processed and changes throughout the pipeline. This is as opposed to the Airflow UI, where the data is abstracted away from you: you don't really get to see how the data is processed in your pipelines, and overall it makes for a more intimidating UI to work with.
A
Our data scientists monitor the pipeline outputs on the Dagster assets page, validating these models and ensuring that the models meet a certain threshold before they deploy. They trigger pipelines with different configurations, running the training pipeline either in a dev mode, where it just runs on a subset of the data, or in a production mode, where it runs on the full set of data.
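One way to express those two configurations is as presets on the training pipeline. In this sketch, sample_fraction is a hypothetical config key standing in for however the real pipeline selects a subset of the data, and it assumes the train solid declares a matching config_schema.

```python
from dagster import PresetDefinition

# Attached to the pipeline via @pipeline(..., preset_defs=[dev_preset, prod_preset]).
dev_preset = PresetDefinition(
    name="dev",
    mode="local",
    run_config={
        "solids": {"train": {"config": {"sample_fraction": 0.1}}}  # small subset of the data
    },
)

prod_preset = PresetDefinition(
    name="prod",
    mode="prod",
    run_config={
        "solids": {"train": {"config": {"sample_fraction": 1.0}}}  # full data set
    },
)
```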
A
For additional pipeline monitoring, we also make use of Slack notifications. In general, our company uses Slack for day-to-day communication, so this allows us to spend less time manually checking the UI for pipeline success or failure messages, and we get to find out when something happens as soon as it happens.
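One way to wire this up is with the dagster_slack integration's success and failure hooks; a minimal sketch, assuming a Slack bot token is available in an environment variable. The channel name and Dagit URL are placeholders, and this is not necessarily how the team implemented their notifications.

```python
from dagster import ModeDefinition, pipeline, solid
from dagster_slack import slack_on_failure, slack_on_success, slack_resource

@solid
def train(context):
    context.log.info("training...")

@pipeline(
    mode_defs=[
        ModeDefinition(
            resource_defs={
                # Hooks below require a "slack" resource; token read from the environment.
                "slack": slack_resource.configured({"token": {"env": "SLACK_BOT_TOKEN"}})
            }
        )
    ],
    hook_defs={
        # Post to Slack on failure (with a link back to Dagit) and on success.
        slack_on_failure("#ml-pipeline-alerts", dagit_base_url="https://dagit.example.com"),
        slack_on_success("#ml-pipeline-alerts"),
    },
)
def training_pipeline():
    train()
```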
A
Just an example here: we get to see that a Dagster hourly ingestion pipeline ran and succeeded, although it didn't actually get any new data, so this still counts as a success for us. We can see here that the elapsed time takes this much, and we can see if there are any issues with the CPU resources.
A
We would notice if ever that elapsed time is unusually high, and we can see the S3 link to the ML model path that we created and used for the pipeline. Similarly, we can check pipeline errors whenever they happen: we can see the specific pipeline that failed and which solid actually failed there, and even the error message that shows up in that solid. Further on, we created a handy link here that sends us over to the Dagster run itself.
A
As you can see here, we have a summary that we get on a day-to-day basis, giving us all of the different production pipelines that we have and their success rates over the past few runs they were executed on. We can easily see here that some of the pipelines are working as expected and some are not doing as well, and some things might need to be flagged for further debugging. We even have the last success date, which helps us with further checks.
A
If there are any issues, we can spot them based on that date. The way we do this is we have a pipeline that accesses the Dagster database, primarily the runs table. We run a simple SQL query that just checks the number of pipeline runs for each pipeline, based on, say, the past 10 runs of a pipeline.
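A sketch of what that check could look like, assuming a Postgres-backed Dagster run storage; the table and column names reflect recent Dagster versions and may differ in yours, and the connection string is read from a placeholder environment variable.

```python
import os
import psycopg2

# Success rate and last success per pipeline over its 10 most recent runs.
QUERY = """
SELECT pipeline_name,
       SUM(CASE WHEN status = 'SUCCESS' THEN 1 ELSE 0 END) AS successes,
       COUNT(*) AS total_runs,
       MAX(CASE WHEN status = 'SUCCESS' THEN update_timestamp END) AS last_success
FROM (
    SELECT pipeline_name, status, update_timestamp,
           ROW_NUMBER() OVER (
               PARTITION BY pipeline_name ORDER BY create_timestamp DESC
           ) AS rn
    FROM runs
) recent
WHERE rn <= 10
GROUP BY pipeline_name;
"""

with psycopg2.connect(os.environ["DAGSTER_POSTGRES_URI"]) as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for name, successes, total_runs, last_success in cur.fetchall():
            print(f"{name}: {successes}/{total_runs} succeeded, last success {last_success}")
```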
A
So yeah, in conclusion, why we think Dagster works for ML pipelines in production. Data scientists overall get a very user-friendly UI, which enables them to run data pipelines without fear: they can easily monitor their pipelines and debug them from the UI. Secondly, Dagster is very versatile; we would usually have a Dagster infrastructure that already supports ETL and ELT pipelines, and then we can easily just extend that to support ML pipelines. This removes the overhead of setting up something completely new for our ML workflows.
A
And thirdly, Dagster uniquely works for MLOps because, unlike other orchestrators, Dagster has features that support MLOps on top of automating our training and evaluation.
A
Just some extra things that we're planning on working on next: in the future we plan on migrating to gRPC servers inside Kubernetes, so that we can separate the pipeline code from the core Dagster infrastructure. This helps us update our pipelines separately from the Dagster daemons, like the scheduler and the sensors, so that we can avoid redeploying them all together. This is just a step toward more process isolation, but inside Kubernetes.
A
We also want to try out dynamic orchestration for ETL, allowing us to generate solids dynamically at runtime instead of having to define them manually. This has the bonus of making things very easy to check inside the UI, since it lets you view those dynamic solids a lot more easily than manually defined solids.
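Dynamic orchestration in the pre-1.0 API looks roughly like this; the table names are placeholders, and in some earlier Dagster releases these classes lived under dagster.experimental. Each DynamicOutput fans out into its own copy of the downstream solid, and each copy shows up individually in the UI.

```python
from dagster import DynamicOutput, DynamicOutputDefinition, pipeline, solid

@solid(output_defs=[DynamicOutputDefinition(str)])
def discover_source_tables(context):
    # In a real pipeline this list would be fetched from the source system at runtime.
    for table in ["customers", "orders", "payments"]:
        yield DynamicOutput(value=table, mapping_key=table)

@solid
def ingest_table(context, table: str) -> str:
    context.log.info(f"Ingesting {table}")
    return table

@pipeline
def dynamic_etl_pipeline():
    # One ingest_table invocation is generated per dynamic output at runtime.
    discover_source_tables().map(ingest_table)
```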
A
So yeah, that's pretty much it. Thank you for listening, and I'm free for any questions later on. Thanks.
B
That was fantastic, Carlson. Anyone in the audience, feel free: we can take brief questions now. You can put anything in the chat, or, if you're feeling brave, you can just unmute and pop in. This is a pretty unregulated Zoom call.
C
I actually have a quick question. This is Rebecca here. Thanks for the presentation; it's really cool to see how you guys use it. I just had a question about your Slack pipelines, I mean the notifications about the pipelines that run. The summary one looks really great; the ones that report on individual pipelines, I may...
C
...that, but if that's what it's doing, does it get spammy? Is it something that's helpful for you to monitor systems and that kind of stuff, or how do you use that?
A
Yeah, for the Slack notifications in general: overall, the most important ones are the ones that actually fail. The success notifications might not be as important for us; that's just an additional check that we do. But the ones that actually do fail, those are the ones where we actually tag users. We automate the tagging functionality as well, so whenever we have a pipeline error, we tag the relevant person for that pipeline.