From YouTube: OpenShift Commons ML Briefing: ML/AI Data Pipelines on Kubernetes Daniel Whitenack (Pachyderm)
Description
Daniel Whitenack (Pachyderm) discusses how to enable Machine Learning and AI Data Pipelines on Kubernetes and OpenShift with the Machine Learning on OpenShift SIG of OpenShift Commons.
Learn more at http://docs.pachyderm.io/en/latest/getting_started/getting_started.html
Join OpenShift Commons https://commons.openshift.org#join and join the conversation
A: Alright, perfect. So I'm excited, because Michael and Carol and others have already given part of the motivation for what I'm going to talk about, so I can talk a little bit quicker. Thank you for that, and thanks for the great presentations. I'm going to describe a little bit about Pachyderm today and how we're enabling machine learning, AI, and other pipelines on top of Kubernetes and on top of OpenShift. These are production pipelines that we have going with people in a lot of different spaces, so I'll describe that a little bit more. What I'm going to do is describe, as motivation, what a typical machine learning pipeline looks like for our users. Some of this motivation has already been given, so I can jump over it pretty quickly.
A: Okay, great. So let's start talking about machine learning pipelines. I shamelessly stole David's format from the last meeting to illustrate some of these things, just as a reiteration of some of the things he said and some of the things that were already said by Michael. A lot of emphasis is put on training and inference when people think about machine learning and AI, and a lot of people see the value of AI in their business. But when it comes down to actually integrating machine learning and AI into their infrastructure, and building out pipelines that can be managed over time and scaled, there's a whole lot more that needs to be thought through, and this is really where we see the challenge. As David said in the last meeting, there's a whole lot more than training and inference. There's a whole host of things related to pre-processing and feature engineering; there's model export and optimization; there are data transforms when we do inference, possibly post-processing and visualization. And we might not even be using the same frameworks, tools, and languages for all of these steps.
A: What we would often see is people trying to use file names and their own sort of tooling to figure this out, in a not very successful way. Saying, you know, "this is my feature set for training, from this config, with this timestamp, dot CSV, and that goes into my model training." It's just not sustainable or scalable over time. And in addition to just having those pieces of data, there's an element of the sequence that these things happen in, right? I actually need to move data.
A: We need to manage all of this data, and we need to get all of it to the right place at the right time. This is really what Pachyderm is seeking to do on top of Kubernetes. So now let's transition to thinking about what we actually need to do on top of Kubernetes to enable this in some sort of sane way.
A: We know that one way to get these pieces of processing off of our laptop and running in a dependable, reproducible way somewhere else is by containerizing them, right? But we don't want people spreading these Docker images around, and data scientists logging into a bunch of machines and running docker run. This is really where Kubernetes has come up, right? We're going to have a bunch of Docker images that we need to deploy on a diverse set of resources.
A
We
need
to
deploy
those
in
a
portable
way
in
a
reproducible
way
and
kubernetes
allows
us
to
get
those
things
running
on
a
set
of
nodes
in
it
in
a
very
nice
way,
but
actually,
if
we,
if
we
think
about
this
now
like
in
terms
of
the
pipeline
that
we
just
talked
about
these
individual
stages
of
processing,
aren't
isolated
right
and
they're.
Actually
not
the
only
thing
that
we're
managing
we're,
also
managing
data
all
right.
So
let's
say
that
we
have
all
our
data
in
an
object
store.
So
now
we
have
all
these
pieces.
A: We have our stages that are running as containers, as pods in Kubernetes, and we have data. Somehow we need to solve the problem of getting the right data to the right ones of those pods, and then collect the corresponding output from those stages. And we need to do that in a sustainable way. Like Michael was saying, we need to do this in some sort of version-controlled way, so that we can:
A: Remember what we've done, debug what we've done, maintain what we've done, and have audit trails for compliance. And somehow we need to string these things together in a series of events: I want my pre-processing to run on certain data, I want that to output other data which is used in training, and then maybe that outputs a model which is used in inference.
A: So all of these problems are additional things on top of Kubernetes that we somehow need to enable, similar to what was already mentioned about solving certain things on top of Kubernetes, like service mesh, or secret management with Vault or something like that. So the things we need to do are: get the right data to the right code.
A: All of this is the extra stuff that we're really concerned with. So how do we do this? Well, our solution is Pachyderm. If you're not familiar with Pachyderm, it's an open source data pipelining and data management layer for Kubernetes. Thinking about the other layers that were already mentioned, whether that's service mesh with Istio or secret management with Vault, Pachyderm is providing this layer on top of Kubernetes that does specific tasks to accomplish these sorts of pipelines.
A: Those tasks are both the pipelining piece and the data management piece, and those things are done together in a unified way. The different components of Pachyderm, the core features that enable these sorts of workflows, are: first, data versioning. Like I've already mentioned, we need a way to version our training data sets, our parameters, and our visualizations.
A: That versioning lets us go back in time and run specific processing on specific data for reproducibility, but we also need it as we work on larger teams with data scientists and data engineers. We of course utilize containers for analysis, so that gives us the flexibility to run any languages or frameworks, whether that's TensorFlow or Python or Julia or OpenCV, or whatever it is, or just a bash command. Then we want data scientists to be able to develop these stages of processing in whatever languages and frameworks they want, but also to scale those. So we have this concept of distributed pipelines, where each stage of our data pipelines is individually scalable. You can automatically parallelize each stage of a Pachyderm pipeline, and Pachyderm will take care of the data sharding, getting the right data to those workers, and then gathering all the results on the other end as well.
A: Okay, so because this is a technical audience, I definitely wanted to give you a sense of how we actually enable this, and I'm happy at the end, when we're going through Question and Answer, to give more details on any of these things. But basically, again, we have Kubernetes as the foundation for all of this, and a backing object store.
A: Great, so let me give you a little bit of a sense of what this actually looks like in the real world. So I can go over here. This is the Pachyderm dashboard, one way to interact with a Pachyderm cluster. Under the hood here, again, I have a Kubernetes cluster, I have an object store, I deployed Pachyderm to that Kubernetes cluster, and now I'm interacting with it via our dashboard. There are other ways to interact.
A: You know, via the CLI and the language clients, but this is one way. You can see here that I've deployed a couple of data pipelines to the cluster. I'm going to start on the right here with the machine learning one, and then I'll emphasize the other one at the end, just to illustrate some flexibility. The first thing you can see here is that I have these blue icons.
A: Each of these blue icons represents a versioned collection of data. Here I have a training data set, and in this machine learning pipeline I'm just doing the "hello world" of machine learning, the iris demo. So I have this CSV training data, and if I look at this repo, or actually this one is probably a better example: these are attributes that I'm putting in, that I want to do inference on. I can actually see up here at the top some information about this repo.
A: It tells me that this is a versioned collection of data. I can have branches of this data, and I see multiple commits into this repo, so I can see that in a previous state I actually just had one file here, and then in the most recent state I have two files; I put an additional file in. For any number of ways that you change the data, all of it is automatically versioned. So that's the first core principle, data versioning.
A: The second piece is around data pipelining and those independently scalable pipeline stages. That's what these other icons represent: these are processing stages in my pipeline. This first one, this model stage, is doing model training on that training data set and then outputting a persisted version of the model. You can see that the way this pipeline stage is defined, it's just a Docker image that my code is running in, and then I'm just running Python code.
A: To do this training. I can show you at the end, I don't want to take time now, but this Python code isn't pulling in any sort of special Pachyderm libraries or anything. It's the same type of Python code that you would run locally to do your training, or in a Jupyter notebook or whatever it is. Then that processing stage produces output, and that output is versioned in an output repository on the other side of this pipeline.
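The model stage described here is declared in a pipeline spec. As a rough sketch (the pipeline name, image, command, and glob pattern below are illustrative assumptions, and the exact spec fields vary across Pachyderm versions), a training stage might look like:

```json
{
  "pipeline": { "name": "model" },
  "transform": {
    "image": "example/iris-train:latest",
    "cmd": ["python3", "/code/train.py"]
  },
  "input": {
    "pfs": { "repo": "training", "glob": "/" }
  }
}
```

Pachyderm mounts the input repo inside the container under /pfs/&lt;repo&gt; and versions whatever the command writes to /pfs/out, which becomes the stage's output repository.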
A: So you can see here I've output a pickle file that serializes my model, and then I've chained that into the next pipeline, which is inference. That inference pipeline is running another Python script that pulls in that model and those attributes, and does inference with the model to produce my eventual results, which are the species of those iris flowers. One thing to note here is that I have all this data versioned and I have these processing stages created.
A: Pachyderm is aware of what data is changing, because it's versioning it, right? So if I go over here and put new data in, and now I'm going to do this a second way, via the command line: I'm going to put a file into my attributes repo on the master branch, this third file, which I'll actually see if I go back here. Okay, this is automatically updated, right? My new file is in that repo, and you can see that I've made three commits now.
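The command-line step above goes through Pachyderm's pachctl client against a running cluster. A minimal sketch (the file name is hypothetical, and the exact flag syntax differs between pachctl versions):

```shell
# Add a third attributes file as a new commit on the master branch
pachctl put file attributes@master:/attributes_3.csv -f attributes_3.csv

# Each put creates a commit; the repo's version history is visible with:
pachctl list commit attributes@master
```

These commands require a deployed Pachyderm cluster; the dashboard view reflects the new commit automatically.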
A: All of that is automatically versioned. Not only that, Pachyderm was aware that I added some data that hadn't been processed yet, so it knew that my results weren't up to date with the current state of my input data. It actually went ahead and ran this inference pipeline again, and it updated my results such that now I have my third result. So yeah, this is how we think about our pipelines.
A: We think about them as DAGs of data, where you put data in at the top and Pachyderm triggers all the stages downstream that need to run to update your results. The final core principle of Pachyderm I wanted to emphasize here is the data provenance element. If you remember, data provenance was the idea that we can tie any specific result, or actually any specific version of data, to all the pieces of data and processing that led to that result.
A: So if I look at this particular inference repo, which holds my results, I have four commits. And if I look at one of those, I can see all the data associated with it, but I can also see all of the upstream commits that actually contributed to that result. So there was training data.
A: There were attributes, there was a master model version that was used in that inference, and there were a couple of specs, which are the way that your pipelines are defined, so those represent the processing associated with that particular result. Again, this is definitely something that we view as very important. The final thing I'd like to emphasize is that we try to keep this flexible, in the same vision as Kubernetes.
A: We want you to be able to deploy this anywhere and scale it on any infrastructure. We also want you to be able to use any types of data and any types of framework. You can see here in this pipeline, just to give you a sense, I'm actually processing image data: I'm using OpenCV to do edge detection on that image data. This is just to illustrate that you're not constrained by the type of data or the type of processing.
A: The processing you can use is anything you can run in a container, and the data is anything that can be stored in an object store, which I would say is pretty flexible. To circle things back here before we jump into questions, I just wanted to give a couple of future directions that we're going in, and also give you some resources where you can find out more. So let me go ahead and present here again: future directions.
A: We already have deployments on top of OpenShift, and we're actually working on another production deployment right now that includes some pretty interesting components. So I'm very happy to be part of the SIG and happy to see things moving forward on that front. We're also really excited to further cooperate with OpenShift and others in the future. And one thing I wanted to draw your attention to: just yesterday we submitted a proposal for a little bit more seamless integration between Kubeflow and Pachyderm.
A: We have an example here of running distributed TensorFlow via TFJob as a stage of a Pachyderm pipeline, and we're working to improve that functionality as we go, as well as working with people like NVIDIA to better support things like the DGX and other boxes like that. I'll also draw your attention to a few resources: I'm going to send out the link to the slides, and of course I would recommend maybe watching our KubeCon talk.
A: I did a little bit more of an advanced workflow there that included GPUs and TensorFlow. Of course, you can run all of these machine learning examples locally in Minikube and try them out. We have a public Slack channel and docs where you can get help, and of course, feel free to reach out to me anytime. So I'll wrap up at this point and see if there are any questions.
C: Yeah, thanks for the demo, that's really cool. So you talked a lot about how, when you rev your data, it reruns the DAG, and I just wonder if you have the corresponding case where, let's say, I rev some code in my feature extraction. That also implicitly implies a rerun, right?
A: So, actually, I can show you that here. If I go back here, this is the job specification that I used for my training, and actually I have another version of it which doesn't use an SVM model but uses an LDA model. So when you update your code, let's say I ran something, updated my code, and committed that to Git, all you would have to do is update your pipeline, and that will, like you said, trigger the exact same thing.
A: So if I look at what jobs are running (oops, sorry for the wrapping here), that actually automatically started a new job that retrained my model, and it's rerunning inference with that. Now, of course, that's configurable. Some people don't want to rerun if they update their model or something, and that's totally configurable.
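A sketch of that code-rev flow with pachctl, under the same caveats (the spec file name is hypothetical, and the reprocess flag may differ by pachctl version):

```shell
# After rebuilding the image or editing the code referenced by the spec,
# update the pipeline; this triggers retraining and downstream inference
pachctl update pipeline -f model.json --reprocess

# Inspect the automatically started jobs
pachctl list job
```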
C: [question inaudible]

A: Yeah, you're exactly right. So if I look at the commit structure, each of these commits in a repo has an ID associated with it. If you want the latest from a certain branch, you can just reference that branch name, but you can also reference any commit ID from history to get that particular version of the data. And you can do that either via the dashboard or the CLI, but you could also do it like...
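For illustration, reading by branch head versus pinning to an exact commit might look like this (the commit ID and file path are made up, and the syntax is version-dependent):

```shell
# Latest version of a file on a branch
pachctl get file attributes@master:/attributes_1.csv

# Exact historical version, pinned by commit ID
pachctl get file attributes@b54fbb6e7f8a:/attributes_1.csv
```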
B: As Carol had said in the sidebar here, the visuals in this are just wonderful, and I'm quite interested in the topic of data provenance as well. I think that's something we don't talk about a lot, because it's a really hard thing to do, so kudos for getting that into your story and into your workflow. I think that's going to be very important for a lot of the folks that are trying to utilize this. Does anyone else have any questions, or is there anything further you want to add?
D: [question inaudible]

A: Oftentimes what we see in this sort of scenario, and this actually goes the same for ingressing data from a database or something like that, is that we actually have pipelines that pull data in, rather than necessarily being driven by pushes. So basically, what you could do is trigger a pipeline that would pull in the data from that external source.
A: You know, maybe save a timestamp associated with that, but then on the output side it would version your transformed version of that data set as the output of that pipeline. In a similar way, if you were to pull in from a database, each time you make that query to pull the data in, the results of that query could be versioned in Pachyderm.
E: Do you have any plans, or do you have anything in the pipeline, around serverless? My thinking was that, in the same way that data scientists would use a notebook to essentially do their work, for data engineers and developers it would be really cool to have a serverless integration there. It doesn't really matter which serverless framework, but anything around that?
A: That's a great question, so I'll answer in a couple of ways, I guess. We've added a lot of flexibility in our pipeline spec recently. This is pretty small, so let me pull it up here, but one of the things that's relevant to this is the scale-down threshold that's mentioned here. Let's say that you're running batch jobs every once in a while, or you only want to spin up these pods when they need to do the work and then scale them down.
A: That's what this field is meant to control, in the sense that if you want to spin up a thousand workers to do some batch processing in parallel, but you only want to do that every Friday, you don't want to keep all of those workers up all the time. So you can spin those up immediately and then scale them right back down afterwards.
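In Pachyderm pipeline specs of this era, that behavior was expressed roughly as follows (field names and the duration format here are a best-effort sketch, not authoritative, and may differ by version):

```json
{
  "pipeline": { "name": "batch-processing" },
  "parallelism_spec": { "constant": 1000 },
  "scale_down_threshold": "600s"
}
```

With a spec along these lines, Pachyderm keeps up to 1000 workers while data is being processed and scales the workers down after the configured amount of idle time.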
A: The other thing we have in terms of services is actually fairly recent. If you wanted to serve versioned data, or handle versioned data in some way as a service, that's what this is meant to include. I would say this is an experimental feature at this point, but it's meant to handle some of those use cases like you mentioned.