Description
Daniel Blinick—Software Engineer at Immunai—presented on Immunai’s use of Dagster to tackle bio-tech data engineering challenges.
🌟 Socials 🌟
Check out (and star!) our GitHub ➡️ https://github.com/dagster-io/dagster
Check out our Documentation ➡️ https://docs.dagster.io/
Join our Slack ➡️ http://dagster-slackin.herokuapp.com
Visit our Website ➡️ https://dagster.io/
Follow us on Twitter ➡️ https://twitter.com/dagsterio
A little bit about me: I actually started in web development. I joined Immunai about three years ago, when it was six months old, and I did web work there for most of my time. Only about nine months ago did I really get into data engineering and pipelines, so from my very beginning in this field I've been with Dagster, and I think it's made the transition a really good one.
A little bit about Immunai: we are a therapeutics company trying to develop drugs that help the immune system fight infection. To do that, we need a very comprehensive understanding of how the immune system works and what's really going on there. There are many approaches you could take to achieve that, and the one we've chosen is genomic sequencing. You may know the basic idea: you take a saliva sample, send it to a lab, and learn more about your DNA. But the DNA in every cell of your body is essentially the same; what differentiates cells from each other is which part of that blueprint each cell actually uses. When we can delve into that, we get a picture of what's going on in your immune system, and that's really what we're trying to do.
A
The
challenge
for
me
as
an
engineer
and
our
engineering
team
is
essentially
converting
that
biological
data
that
we
get
from
our
lab
into
digital
data
and
then
transforming
it,
enriching
it
in
order
for
our
computational
biologists
and
our
our
data
analysts
downstream
of
the
pipeline
to
do
analysis,
ai
machine
learning
and
and
bring
insight
into
how
we
can
better
develop
drugs.
A
So,
just
to
again
make
this
a
little
bit
more
concrete.
A
I
I
I
wanted
to
present
one
of
the
main
data
structures
we
work
with,
which
is
the
c
the
cell
gene
matrix
and
and
as
I
was
trying
to
explain
before
this,
this
matrix
is
essentially
made
up
of
columns
which
are
our
cells,
and
then
the
the
rows
are
genes
and
the
values
are
are
how
how
prevalent
that
gene
is
expressed
within
the
cell
and
so
different
types
of
cells
will
have
different
gene
signatures,
and
by
doing
that,
we
can
understand.
What's
going
on
in
the
immune
system.
For example, one patient might have a higher prevalence of T cells as opposed to B cells, and that tells us something: when we know that patient responded to a treatment, we know that alignment of signatures might be playing a role. And these matrices get really, really big.
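As a toy illustration of that structure (the gene names and counts here are made up for the example, not Immunai's actual data):

```python
# Toy cell-gene matrix: rows are genes, columns are cells, and each
# value is how strongly that gene is expressed in that cell.
genes = ["CD3E", "CD19", "MS4A1"]  # CD3E marks T cells; CD19/MS4A1 mark B cells
cells = ["cell_1", "cell_2"]
matrix = [
    [12, 0],  # CD3E
    [0, 7],   # CD19
    [1, 9],   # MS4A1
]

def expression(gene: str, cell: str) -> int:
    """Look up the expression value for one gene in one cell."""
    return matrix[genes.index(gene)][cells.index(cell)]

def dominant_marker(cell: str) -> str:
    """Return the gene with the highest expression in the given cell."""
    return max(genes, key=lambda g: expression(g, cell))
```

Here `cell_1` expresses the T-cell marker most strongly and `cell_2` the B-cell markers, which is the kind of signature comparison described above. Real matrices have tens of thousands of genes and up to millions of cells.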
To give an overview of what our pipeline looks like: the steps themselves, the business logic, are written by the computational biologists. We on the data engineering team are really responsible just for the orchestration logic, what gets run when, and we try not to concern ourselves with the actual business logic itself.
Initially, when the company was about a year old, we developed a homegrown solution, and what that gave us was a lot of flexibility. It was written in Python, and the people who wrote it did a great job. The company was so new that we just didn't know exactly where we would want the flexibility, and the solution did give us a lot of it. It worked, and it served us well for a while. But as time went on we started to have issues with it, as naturally happens as a company grows, and three main problems kept coming up. The first was that we didn't have a dev environment.
A
You
know
creating
that
kind
of
environment
obviously
takes
a
lot
of
resources,
and
we
just
didn't
have
that
and
because
the
computational
biologists
were
the
ones
changing
the
business
logic,
they
had
no
different
environment,
they
couldn't
test
things
really
until
it
hit
production
and-
and
so
the
development
cycle
was
very
brittle.
A
We
also
didn't
have
a
ui.
We
had
something
very
hacked
together,
but
it
wasn't
very
usable
and
the
flexibility
that
served
us
well
in
the
beginning,
as
the
orchestration
logic
got
more
and
more
complex,
it
just
got
very
unwieldy
and
we
ended
up
in
a
situation
where
there
was
a
there
wasn't
a
lot
of
transparency.
One
person
really
only
one
person,
really
knew
what
was
going
on
within
the
code
and
it
just
became
very
unworkable,
as
the
company
grew,
so
we.
This is essentially what it looks like now. When we started exploring, I was tasked with doing a bit of research into the different frameworks, and we boiled it down to the main player, Airflow, which everyone has heard of even if they're not a data engineer, and Dagster. I actually didn't find Dagster; one of my colleagues did. So it really came down to those two.
Were we going to go with the name everyone knows, or take a risk on a smaller project, which definitely felt like it suited us better? We obviously ended up going with Dagster, for five reasons I'll delve into a little. First, the abstractions in Dagster just seemed very well thought through.
A
We,
when
we
were
doing
the
analysis,
it
was
just
around
the
time
when
dexter
switched
from
pipelines
and
solids
and
composite
solids
to
graphs
and
ops,
and
just
even
just
that,
you
know
switching
to
the
graphs
which
are
kind
of
you
can
have
as
many
graphs
within
graphs
and
there's
no
difference
between
a
sub
graph
and
a
graph,
just
the
the
thought
behind
that
really
appealed
to
us.
A
The
other
thing
we
really
liked
was
the
data
centric
pipeline
model,
seeing
the
pipeline
as
a
flow
of
data
really
just
made
a
lot
of
sense
to
us
and
it's
much
more
intuitive
than
the
air
flow
model
and
what
I've
taken
the
screenshot
here
up
here
is
the
the
being
able
to
run
from
a
point
in
the
graph
which
really
revolves
around
the
fact
that
it's
that
the
pipeline
is
centered
around
around
data
and
we've
used
that
countless
times.
So
it's
really
served
us
well.
Our pipeline also makes use of the dynamic mapping feature, which not all orchestration platforms have: the ability to define the graph at runtime based on certain metadata. This greatly simplified how many different jobs we needed, and again, we really like this feature.
A
The
asset
materializations
truthfully,
we
haven't
actually
utilized
this
as
much
as
we
should
and
we're
really
excited
about
the
new
developments
in
the
asset
space,
but
but
really
even
just
the
idea
of
kind
of
like
declaring
the
assets
that
have
been
created
and
being
able
to
track
it
and
link
it
back
to
the
run
very.
Very
easily
makes
debugging
debugging
things
easier
and
and
again
just
like
that
thought
behind.
It
was
really
attracted
us
as
well.
A
I
also
I
don't-
I
didn't
even
add
this
here,
but
I
guess
this
is
kind
of
a
meta
point
for
all
the
slides
that
came
before
it.
But
dagit
was
just
in
our
mind
so
much
better
than
than
the
airflow
ui
and
a
lot
of
the
other
uis.
We
saw
it's
just
very
simple:
there's
really
there
there
aren't
that
many
bells
and
whistles,
which
in
my
mind,
is
a
feature.
It's
very
like
straightforward.
A
You
don't
really
have
to
guess
about
what
you're
doing
so,
that's
kind
of
like
what
we
the
reason
we
chose
it
and
but
obviously
we
ended
up
finding
so
many
more
things
that
really
made
us
happy
and
continue
to
make
us
happy.
So
I
just
want
to
talk
a
little
bit
about
those
and
how
we
use
them
so
the
first
one
which
should
have
been
obvious
to
me,
but
not
being
a
data
engineer.
A
I
guess
I
didn't
realize
how
painful
this
would
have
been
without
resources,
but
just
the
ability
to
use
resources
to
set
up
different
environments
have
a
test
environment
very
easily,
a
dev
environment
very
easily
and
not
change.
The
business
logic
has
been
amazing
and
it
really
just
takes
away
so
much
of
the
magic
in
the
sense
that
you,
you
know
exactly
what
you're
doing
in
a
test
environment,
you
don't
have
to
rely
on
frameworks
that
are
filling
in
all
these
gaps,
for
you,
you're,
you're,
feeding
it
the
resources
yourself
and
yeah.
A
It
just
made
things
so
much
easier
to
work
with
custom.
Aisle
managers
we've
also
made
use
of
this
is
one
of
our
our
audio
managers
that
basically
just
extends
the
in-memory
I
o
manager,
but
in
addition,
it
just
dumps
the
content
into
a
bucket,
and
we
have
a
few
others
like
this
and
again
just
added
flexibility
is
great.
A
The
different
sensors
we've
made
use
of
we
make
use
of
the
failure.
Success,
sensors,
obviously
the
standard
sensors
to
kick
off
jobs,
and
we
at
one
point
were
making
use
of
an
asset
materialization
sensor
and
yeah.
It's
just
a
wealth
of
things
to
choose
from
to
do
what
you
want
just
makes
everything
the
code
a
lot
simpler
and
and
more
obvious
in
terms
of
what
you're
trying
to
do.
A
The
graphql
api
is
something
new
that
we've
just
been
exploring
we're
using
it
to.
We
call
like
nuking
our
jobs
that
went
wrong.
Sometimes
we
have
jobs
that
run
with
with
the
wrong
metadata,
and
so
we
use
it
to
kind
of
query
the
runs.
This
is
still
in
development
and
then
just
get
rid
of
them,
and
then
the
sensor
kicks
off,
kicks
them
off
again
with
the
updated
metadata
and
actually
as
an
added
bonus.
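A sketch of that kind of cleanup script, using only the standard library. The endpoint URL is hypothetical, and the query shape follows Dagster's GraphQL schema as we understand it, so check it against the schema browser in your own Dagit instance.

```python
import json
from urllib import request

DAGIT_URL = "http://localhost:3000/graphql"  # hypothetical Dagit host

# Query runs so the script can find the ones launched with stale metadata.
RUNS_QUERY = """
query RunsByStatus($filter: RunsFilter) {
  runsOrError(filter: $filter) {
    ... on Runs {
      results { runId status }
    }
  }
}
"""

def build_payload(query, variables=None):
    """Encode a GraphQL request body."""
    return json.dumps({"query": query, "variables": variables or {}}).encode()

def post_graphql(query, variables=None):
    """POST a query to Dagit's GraphQL endpoint and decode the response."""
    req = request.Request(
        DAGIT_URL,
        data=build_payload(query, variables),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())
```

From there the script would iterate over the returned run IDs and issue the corresponding delete/terminate mutations before the sensor re-launches the runs.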
A
So
I
just
want
to
share
one
debate
that
we
had
on
our
team.
This
was
earlier
on
just
to
share
it
with
you
guys
and
maybe
it'll
spark
some
interesting
conversation.
A
So
we
we
rely
the
pipe
our
pipeline
relies
on
on
a
bunch
of
metadata
and
we
really
weren't
sure
what
to
do
with
that
metadata,
whether
to
hold
it
hold
it
in
an
external
database
and
query
it
in
during
the
during
the
pipeline
run
or
to
embed
it
within
the
run,
configuration
and
so
the
pro
of
putting
in
the
run
configuration
is
that
it's
explicit.
A
We
also
take
the
run
configuration
and
when
we
create
assets,
we
we
store
the
run
configuration
as
as
as
provenance
for
that
data
set,
and
so
anything
we
put
in
the
run,
configuration
is,
is
implicitly
stored
or
explicitly
stored
as
provenance,
which
so
storing
the
metadata
and
the
run
config
gave
us
that
ability.
Also,
more
importantly,
the
metadata
can't
change
mid-run
and
that's
what
we
were
saying
we
were
nervous
about.
A
The
cons
are
obviously
that
it
makes
running
from
the
launch
pad
almost
impossible,
because
you
need
to
then
like
copy
and
paste
this
massive
file.
It
just
wouldn't
really
have
been
workable
and
the
provenance
that's
created
on
the
data
set
is
is
much
harder
to
read
because
it
just
gets
really
big,
and
so
the
compromise
we
came
to
is
we
basically
created.
We took the files out of the run config and store them as versioned files in Google Cloud Storage, and we just put the version number into the run config. That guarantees the metadata doesn't change mid-run.
If the file does change, that simply becomes a different version of it. And again, we can run from the Launchpad, and it doesn't bloat the run configuration. And that's it.