Description
Alessandro Marrella—Staff Software Engineer at Earnest Research—discusses using Dagster to power their machine learning pipelines
See the full September 14, 2021 Community Meeting here: https://www.youtube.com/watch?v=oCakb_tB_dU&t=1643s
Okay, hi everyone. I'm Alessandro, and today I would like to present the work that Earnest has been doing and how the company leverages Dagster to achieve it.
So what does Earnest do in practice? It sells products that let its customers understand how the economy is moving, where consumers are spending their money and, with that information, how merchants are performing in the market.
It has a great track record: in the past five years, more than 500 potential beats or misses were predicted against the consensus.
The consensus is how the markets think a company will do, and Earnest was able to predict results better than those expectations, which were then validated at the end of each quarter.
Having said that, as you can see, data and analytics are a very important function at Earnest, and for this reason the data science enablement team was created, with a mission to grow Earnest's product with data science and machine learning capabilities and to support initiatives across Earnest's engineering and analyst teams. The team wants to enable both engineers and analysts, who make up a larger part of the company, to run data science and machine learning on the data.
The team is integrated with other teams using Dagster sensors, and operations are farmed out to managed services such as AI Platform, Dataflow and BigQuery. To connect all this infrastructure, the team created a library called the Data Science Development Kit or, in short, the SDK, which I will talk about soon.
So let's talk a little bit about the gap between experiments and production pipelines. Experiments in the company are created in Jupyter notebooks, usually as Python code. The company historically runs production pipelines in Airflow using the KubernetesPodOperator, but a lot happens between the first step and the second.
There are actually a lot of things that go in between: write the code in Python, initially as in your experiment; run tests locally; if you're good, refactor these chained transformations into CLI apps; wrap them individually in KubernetesPodOperator tasks; and then either run Airflow locally or push it to a remote instance to test. If everything works, great, but it never works on the first try, so you go back to step two and do it again, and again, and again. Wrapping the CLI apps in particular I found very, very hard to iterate on; we call it "CLI hell", because you need to keep passing arguments to the CLI and hope that they are what it wants.
So why did we choose Dagster? We first looked at Airflow, because it was what was being used in the company, and we saw that, first of all, some features are just not there: directly executing DAGs, or input and output type checking, which would have been nice. Configuration exists, but it's a JSON blob that's not really validated. The most important thing, though, is that the programming model is very task-centric: you do A, then you do B.
If you use the KubernetesPodOperator, as I mentioned, there is CLI hell, so again, especially for data scientists, iterating on this was very painful. So we looked around, and the established solutions, especially on Google Cloud, were Kubeflow and TFX. Both are nicer to work with than Airflow, but they share some of the issues that Airflow has. First of all, data input and output is still a side effect: you don't pass data between tasks.
You pass information about data, if you're lucky, and configuration is validated, but it's not as good. TFX also ticks the boxes of directly executing and testing DAGs and of input and output type checking, but it is very, very tied to TensorFlow, which is not the only library we want to use in the company. So, as you can see from the last column, Dagster, besides ticking all the boxes that we wanted, also has a great API and, in my opinion, a great programming model, by putting data at the edges.
So, just to reiterate, why do we like Dagster? It's easy to run the same code locally, execute pipelines in notebooks (as we will see soon) and write tests, all without Kubernetes. Type checking is everywhere, despite it being Python. It has a data-centric approach, which leads to well-designed DAGs; I think it's very important to put data in as inputs and out as outputs, and not to start reading and writing data randomly inside a DAG. A solid is just a matter of decorating a Python function, and dependency hell is nicely avoided thanks to repositories and gRPC servers.
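To make that concrete, here is a minimal sketch of what "a solid is just a decorated Python function" looks like, using the legacy solid/pipeline API that was current at the time of this talk; the names and data are placeholders of mine, not Earnest's code.

```python
from typing import List

from dagster import execute_pipeline, pipeline, solid


@solid
def load_spend(context) -> List[float]:
    # Data is an output of the solid, not a hidden side effect.
    return [10.0, 20.0, 30.0]


@solid
def total_spend(context, spend: List[float]) -> float:
    # Inputs and outputs are type-checked at run time.
    context.log.info(f"summing {len(spend)} values")
    return sum(spend)


@pipeline
def spend_pipeline():
    total_spend(load_spend())


if __name__ == "__main__":
    # The same code runs locally, in a notebook, or in a test.
    result = execute_pipeline(spend_pipeline)
    assert result.success
```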
So we have this infrastructure, and we wanted to build some tooling for our data scientists to interface with it, so we built the Data Science Development Kit, the SDK.
We needed to integrate with different data sources and services, especially in Google Cloud. Actually, when we started this exercise we were on AWS, then we moved to Google Cloud with the entire company, and Dagster helped with that too. The execution happens in different layers, and the SDK provides integration with all of them, and it supports different file formats and utilities that you can use day to day in notebooks.
Each of these components also translates to a Dagster solid via a to_solid method, so Dagster is really at the core of everything that happens in the SDK.
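As an illustration of the to_solid idea, here is a hypothetical component; the class name, constructor and run method below are my assumptions, not the SDK's real interface.

```python
from dagster import solid


class SqlQueryComponent:
    """Hypothetical SDK-style component that can run ad hoc or as a Dagster solid."""

    def __init__(self, name: str, query: str):
        self.name = name
        self.query = query

    def run(self, context=None):
        # In a real component this would submit the query to BigQuery;
        # here we just echo the SQL as a stand-in.
        return f"-- executed --\n{self.query}"

    def to_solid(self):
        # Wrap the same logic as a solid so it can be composed into pipelines
        # without rewriting anything.
        @solid(name=self.name)
        def _component_solid(context):
            return self.run(context)

        return _component_solid
```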
We also have a custom IO manager and some custom types. First, talking about the types: we have Location, which is essentially a pointer to data that may be computed elsewhere, so this could be BigQuery or GCS or the local file system; and DataFrame, which instead is data that is actually computed in, or needed by, the Python code that the solid is running.
You can see here a weird thing, which is that the types mismatch, or seem to mismatch. In fact, the IO manager that we built takes care of converting between the two, so we can chain solids without having to compromise on performance: for a component that runs a SQL query, for example, it would be weird to load a data frame just to yield it to the next component.
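A minimal sketch of that Location/DataFrame conversion, assuming a local parquet file as the storage behind a Location; the type and class names here are my placeholders, not the SDK's actual IO manager.

```python
from dataclasses import dataclass

import pandas as pd
from dagster import IOManager, io_manager


@dataclass
class Location:
    """A pointer to data that lives elsewhere (here: a local parquet file)."""
    uri: str


class LocationIOManager(IOManager):
    """Converts between Location pointers and in-memory DataFrames between solids."""

    def __init__(self, base_dir: str = "/tmp"):
        self.base_dir = base_dir
        self._locations = {}  # step_key -> Location (in-process sketch only)

    def handle_output(self, context, obj):
        if isinstance(obj, Location):
            # Data was computed elsewhere (e.g. a SQL job); just keep the pointer.
            self._locations[context.step_key] = obj
        else:
            # An in-memory DataFrame: materialize it and keep a pointer to the file.
            path = f"{self.base_dir}/{context.step_key}.parquet"
            obj.to_parquet(path)
            self._locations[context.step_key] = Location(uri=path)

    def load_input(self, context):
        # The downstream solid wants a DataFrame, so dereference the pointer.
        location = self._locations[context.upstream_output.step_key]
        return pd.read_parquet(location.uri)


@io_manager
def location_io_manager(_init_context):
    return LocationIOManager()
```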
You can see here that I put some icons for the Location and DataFrame types that you can import and export, and this is how it looks in Dagit.
You can just swap BigQuery with GCS, with Google Sheets, with the local file system, without changing anything in the code. Just by changing the configuration, you can go from a unit test to what you do in production, essentially.
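Here is how that configuration swap might look in practice; the io_manager keys and fields below are my guesses at a plausible shape, not the SDK's real schema.

```python
# The same pipeline, pointed at a local file for a unit test and at GCS for a
# production-like run, purely through run config.
local_run_config = {
    "resources": {
        "io_manager": {
            "config": {"location": "local", "url": "/tmp/demo.parquet", "format": "parquet"}
        }
    }
}

gcs_run_config = {
    "resources": {
        "io_manager": {
            "config": {"location": "gcs", "url": "gs://my-bucket/demo.parquet", "format": "parquet"}
        }
    }
}

# execute_pipeline(my_pipeline, run_config=local_run_config)  # unit test
# execute_pipeline(my_pipeline, run_config=gcs_run_config)    # production-like run
```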
So, enough talk, let me show you some code; I'm now switching to a notebook interface.
Here, while we are suppressing warnings because it's a demo, I'm importing the SDK, which is the library I just talked about, plus some libraries that I want to use in the demo. And this is the workflow that a data scientist usually goes through: they start experimenting with some data, so they load the data. For example, here we are using a demo dataset (obviously, just like in real-world data science).
If data scientists are able to write their code such that everything fits in these classes, or in further classes that we're building, then they get a lot of stuff for free. It's actually very easy to implement these classes, because they use standard Python types; you just need to forget about where you get the data from and where the data goes. They're just Python functions, really, and because they're just Python functions you can run them and experiment with them locally.
Here we are creating a pipeline with the usual nice Dagster syntax, where results are just results and it looks like Python. The pipeline can then be executed in the notebook thanks to Dagster's execute_pipeline function: we pass in the pipeline and then we pass some configuration, some standard IO manager config and then some solid-specific configuration. Obviously, this pipeline runs.
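In notebook terms, the cell might look roughly like this, continuing the placeholder pipeline from the earlier sketch; the config keys are illustrative, not Earnest's real ones.

```python
from dagster import execute_pipeline

result = execute_pipeline(
    spend_pipeline,  # the placeholder pipeline defined in the earlier sketch
    run_config={
        # IO-manager and solid-specific config would go here, e.g.:
        # "resources": {"io_manager": {"config": {...}}},
        # "solids": {"total_spend": {"config": {...}}},
    },
)
assert result.success
# Inspect an output directly in the notebook to see whether it looks right.
print(result.output_for_solid("total_spend"))
```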
You can now see all the output in the Jupyter notebook, look at all the logs, and then you can also inspect the output and see whether the results are what you want. Maybe the results are not what you want, because here there is too much mismatch between these classes, so you want to experiment again. Since here we are using scikit-learn and the interface for inference stays the same, we are just swapping the training step in this case.
So the data scientist implements a new training class with the same interface, and they can test again locally, just to sanity-check everything and look at the confusion matrix. To make it a pipeline, we just need to transform the new class into a solid and define the pipeline again, and we can run it again; and again we get the confusion matrix. Okay, so here we are loading data locally: as you can see, in the configuration the location is local, and I put the URL and the data format.
And yeah, that was the wrong output. We can also swap executors. For example, if we want to run inference on Apache Beam, because we want to heavily parallelize the inference, we can first run inference in the notebook; the class also has a to_beam function, so you don't need to implement anything.
The key here is that the transformation doesn't read or write any data; we attach the reading of the input and the writing of the output around it, and the IO manager, and the inference component in this case, take care of transforming the data. You can then again define a pipeline and execute it.
The Beam runner is a Dagster resource, so here you can configure it, and here we are specifying that we want to use a local runner, so it's running Beam locally; but running it on Dataflow would just be a matter of using the Dataflow runner instead. So, as you can see, the Dagster resource model also helps by making execution swappable, which makes me think that even the executor in Dagster could be a resource, but that is a topic for another meeting.
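A hedged sketch of what a "Beam runner as a resource" could look like; the config fields and the resource body are my assumptions, not the SDK's actual implementation.

```python
from dagster import Field, String, resource


@resource(
    config_schema={
        # Swap "DirectRunner" for "DataflowRunner" (plus project/region) to move
        # the same pipeline from a local run to Dataflow.
        "runner": Field(String, default_value="DirectRunner"),
        "project": Field(String, is_required=False),
        "region": Field(String, is_required=False),
    }
)
def beam_runner(init_context):
    cfg = init_context.resource_config
    # A real implementation would build apache_beam PipelineOptions here;
    # returning the raw config keeps this sketch dependency-free.
    return {
        "runner": cfg["runner"],
        "project": cfg.get("project"),
        "region": cfg.get("region"),
    }
```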
Maybe. So, anyway, as you can see, you can run your pipelines in the notebook, and it almost looks like production. So what do you need to do to deploy to production? Well, we usually have a Dockerfile, which contains the SDK; a workspace file, which gives Dagster a hint on what to load; and then a Python file where we literally copy-paste the classes and the imports from the notebook, and copy-paste the to_solid calls.
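As a rough illustration of that production layout, the Python file that the workspace file points at might look like this; the component and pipeline names are the placeholders from the earlier sketches, not Earnest's real code.

```python
from dagster import pipeline, repository, solid


# Classes and imports are copy-pasted from the notebook; here we reuse the
# hypothetical SqlQueryComponent sketched earlier.
class SqlQueryComponent:
    def __init__(self, name: str, query: str):
        self.name = name
        self.query = query

    def to_solid(self):
        @solid(name=self.name)
        def _component_solid(context):
            context.log.info(f"running: {self.query}")
            return self.query

        return _component_solid


extract = SqlQueryComponent("extract", "SELECT 1").to_solid()


@pipeline
def production_pipeline():
    extract()


@repository
def demo_repository():
    # workspace.yaml gives Dagster the hint to load this file, e.g.:
    #   load_from:
    #     - python_file: repo.py
    return [production_pipeline]
```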
So this was the notebook and its translation to production. I just want to reiterate that Dagster really helps us go from experiments to production.
It significantly reduced the friction, especially compared to what we had with Airflow before. The type system lets us separate the business logic from computation and from data serialization and deserialization, and the resources and IO manager let us integrate very easily with our cloud. In the future, we want to build more components into the SDK, we want to start using dagstermill to save notebooks as artifacts, we want to start using the asset API, because we are not using it yet, and we want to migrate to the new syntax, obviously. That's it, thanks.