From YouTube: Orchestrating Cloud Native ML workflows in Kubernetes
Description
A Machine Learning model is only a tiny piece in a series of multiple processing steps executed as part of an ML workflow. A pipeline is a description of an ML workflow, including all the components in the workflow and how they combine in the form of a graph. This talk provides an overview of ML pipelines, their common components, and how to orchestrate the execution of Cloud Native ML pipelines in Kubernetes using the open-source project Kubeflow.
I am also the maintainer of an open-source project called kube-fledged. Kube-fledged is an operator which helps you cache container images directly on the worker nodes, so it helps in use cases where you need quick startup of applications or rapid scaling of applications. I am also a speaker; I speak occasionally, not very regularly, on Kubernetes, Cloud Native, and very recently even on MLOps. And I am a tech blogger.
So the agenda for the day is actually very simple. I am going to talk about ML workloads: how ML workloads are unique in nature when compared to typical software workloads, and the importance of workflows in an ML system.
And when we talk about workflows, there is a need for pipelining tools. For instance, if you are familiar with DevOps and CI/CD, you need a pipelining tool for running your CI/CD processes, for instance tools like Jenkins. Similarly, in the ML world you will need a pipelining tool, and I am going to introduce you to an open-source, Kubernetes-native pipelining tool which is heavily used in the ML space, called Kubeflow.
So what are the unique characteristics of machine learning? This is a very famous picture, originally published back in 2015 in one of Google's papers.
You need to cleanse the data, normalize the data, and label the data. There is a feature-extraction phase where feature engineering happens. And machine learning training is a resource-intensive activity, so you need to take care of the infrastructure requirements for this highly resource-intensive stage, both training and serving infrastructure. Whenever you talk about a highly scalable application that uses ML, your serving infrastructure has to cater to that high scalability, and then there is reliability.
For instance, paramount importance is placed on monitoring, because in traditional software monitoring you watch certain metrics and act once they go beyond a threshold. With machine learning, you need to know whether you can still rely on the predictions of the model, so you need to constantly monitor the performance of the model.
There are various ways in which you can monitor. So there is a whole lot of other components that need to work together coherently for you to be successful in building a machine learning system, deploying it successfully, and running it on a production-grade system.
Okay, so that's where you tend to become more focused on how you can solve all these problems in a more holistic fashion. And by the way, what does a machine learning model development life cycle look like? It typically starts with you defining the metrics and acceptance criteria: what will be the success criteria for your machine learning project itself?
And it's all about data: gathering data, cleansing the data, transforming the data, analyzing the data, and sometimes visualizing the data. You need a whole bunch of data-store technologies, be it SQL or NoSQL; it all depends on what type of use case you are developing with your machine learning. And of course you need to spend time developing the machine learning model, training it with whatever training data you have, and once you are satisfied with the accuracy of the model, that's when you decide to deploy the model into production. Then you need to constantly monitor the performance of the model, so you need to define metrics, and these metrics have to be emitted by the model itself.
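The idea that the model itself should emit its metrics can be sketched in plain Python: a thin wrapper counts predictions and tracks average confidence so a monitoring system could scrape them. The model and metric names below are hypothetical stand-ins, not from the talk.

```python
from collections import Counter

class InstrumentedModel:
    """Wraps a predict function so the model 'emits' its own metrics."""

    def __init__(self, model_fn):
        self.model_fn = model_fn      # the underlying predict function
        self.metrics = Counter()      # counters a scraper could export
        self.confidence_sum = 0.0

    def predict(self, features):
        label, confidence = self.model_fn(features)
        self.metrics["predictions_total"] += 1
        self.metrics[f"predictions_label_{label}"] += 1
        self.confidence_sum += confidence
        return label

    def export_metrics(self):
        total = self.metrics["predictions_total"]
        avg_conf = self.confidence_sum / total if total else 0.0
        return dict(self.metrics, avg_confidence=avg_conf)

# Toy stand-in model: "positive" if the sum of features is >= 1.
def toy_model(features):
    score = sum(features)
    return ("positive" if score >= 1 else "negative", min(abs(score), 1.0))

model = InstrumentedModel(toy_model)
for x in [[0.5, 0.7], [-1.0, 0.2], [1.5, 0.1]]:
    model.predict(x)
print(model.export_metrics())
```

In a real deployment these counters would be exposed via an endpoint that a monitoring system polls; here they are just returned as a dictionary.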
You don't have to develop a new kind of infrastructure just because you need to run machine learning models, or just because you need to process huge amounts of data: Kubernetes is well equipped to handle all of these requirements, and Kubernetes is fast becoming the substrate for machine learning. There is no denying that Kubernetes is a fundamental piece of infrastructure for machine learning model development, whether you develop the model, deploy it, monitor it, or scale it.
Now let me introduce you to an example ML workflow. First of all, there is a model-building phase. In this phase you rely heavily on training data, and out of the model-building phase you come up with some candidate models. These are models which have to be evaluated, and that is where you use test data: using test data, you evaluate the models, and then you get into an iterative loop of experimentation until you come up with a chosen model.
So it is a highly iterative process in nature. You come up with the chosen model which fulfills your acceptance criteria and your metrics criteria, and once you are satisfied with the accuracy of the model and with its predictions, that is when you productionize your model. Productionizing a model is basically how you take your model from your Jupyter notebooks into a production-ready artifact which can be deployed into a production environment. That is what we essentially mean by productionizing the model.
You have the application code, and you deploy the model into the production infrastructure. The application code will invoke the model, use the predictions, and then perform its business logic. More importantly, a huge amount of emphasis needs to be placed on monitoring the model and monitoring its performance. For instance, once you deploy the model, the model can experience drift, and drift can be of various natures.
For instance, the model could have been trained on certain training data, whereas in production the statistical distribution of the data itself could be different. This could cause the model to underperform, and this is very normal, so you need to constantly monitor the performance of the model.
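As a rough illustration of that kind of drift monitoring, here is a minimal sketch (assuming a single numeric feature) that flags production data whose mean has shifted far from the training mean, measured in training standard deviations. Real systems use stronger tests, such as Kolmogorov-Smirnov or the population stability index; the threshold here is purely illustrative.

```python
import statistics

def drift_score(training_values, production_values):
    """Distance of the production mean from the training mean,
    in units of the training standard deviation."""
    mu = statistics.mean(training_values)
    sigma = statistics.stdev(training_values)
    return abs(statistics.mean(production_values) - mu) / sigma

def has_drifted(training_values, production_values, threshold=3.0):
    # threshold=3.0 is an illustrative choice, not a standard
    return drift_score(training_values, production_values) > threshold

train = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]
prod_ok = [10.1, 9.9, 10.4]        # similar distribution: no alert
prod_shifted = [15.0, 16.2, 14.8]  # shifted distribution: alert

print(has_drifted(train, prod_ok))       # False
print(has_drifted(train, prod_shifted))  # True
```

A check like this would typically run periodically on a window of recent production inputs, triggering retraining when it fires.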
Actually, what I am trying to explain is this: forget about all the phases I have spoken about. What I am trying to bring out is that there is a workflow. There is a distinct chunk of work that happens in a stage, with a defined input and a defined output, and the next phase takes the output of the previous phase, performs some actions, and then delivers an output. Work gets done in each and every stage, and together it is a complete workflow.
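The stage-with-a-defined-input-and-output idea can be sketched as plain Python functions chained together; the stage names and data below are hypothetical, just to show how each stage consumes the previous stage's output.

```python
def ingest():
    # Stage 1: defined output is a list of raw records.
    return [" 4.0", "2.0 ", "bad", "6.0"]

def cleanse(raw_records):
    # Stage 2: defined input is raw records; defined output is parsed numbers.
    cleaned = []
    for record in raw_records:
        try:
            cleaned.append(float(record.strip()))
        except ValueError:
            pass  # drop records that cannot be parsed
    return cleaned

def train(values):
    # Stage 3: defined input is numbers; defined output is a trivial
    # "model" (here, just the mean of the values).
    return sum(values) / len(values)

def run_workflow():
    raw = ingest()
    cleaned = cleanse(raw)       # output of stage 1 feeds stage 2
    model = train(cleaned)       # output of stage 2 feeds stage 3
    return model

print(run_workflow())  # 4.0
```

A pipelining tool formalizes exactly this chaining: each stage runs in isolation, and the orchestrator moves artifacts between them.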
So that is what is essentially happening in any typical ML project: there is a workflow, and the workflow has to be executed on suitable infrastructure. And speaking of workflows, you need pipelining tools. There are plenty of pipelining tools available, both open source and from the public cloud vendors, but today I am going to talk about an open-source tool called Kubeflow. By the way, Kubeflow has seen wide adoption in recent times; it is treated as the machine learning toolkit for Kubernetes. It actually started as an open-sourcing of the way Google used to run their TensorFlow models. By the way, TensorFlow is a machine learning framework that was originally developed at Google; it too is open source.
It all began just as a simpler way to run your TensorFlow jobs on Kubernetes, but after a period of time it took on its own roadmap and finally developed into an end-to-end machine learning workflow system. What I mean by an end-to-end machine learning workflow system is that whatever you need for your ML life cycle is provided by Kubeflow, whether you need to do model exploration, model training, or deploying the model into production and monitoring it.
So all this is offered by Kubeflow itself, and that's why it is called the machine learning toolkit for Kubernetes. And it runs entirely on Kubernetes.
And specifically, within Kubeflow you have the Kubeflow Pipelines platform component. Kubeflow has many different features, and one of the prominent, widely used ones is Kubeflow Pipelines, a purpose-built pipeline platform for running machine learning workflows. It has these main components. First of all, Kubeflow Pipelines provides you with a user interface with which you can submit your workflows, see how your workflows are running, perform experiments, and things like that.
It provides you with a Python SDK, so you can use the SDK to codify your pipelines; you can create reusable components and use those components in your Kubeflow pipelines. And of course it also provides you with Jupyter notebooks, so that you can use the Kubeflow Pipelines SDK from your Jupyter notebook to create your pipelines as well.
And this is how the architecture of Kubeflow Pipelines looks. By the way, don't get overwhelmed by its complexity. The notable thing here is that you have a pipeline service: the pipeline service is the core service which accepts the input, so whenever you want to submit a pipeline, it is submitted to the pipeline service that you see over here. Once the pipeline is submitted to the pipeline service, there are multiple workflow orchestrators that can be used with Kubeflow Pipelines. For instance, in this picture you will see that Argo Workflows is being used. By the way, Argo Workflows is widely used in Kubernetes and cloud-native systems for executing workflows, so Kubeflow Pipelines also uses Argo Workflows, and Kubeflow Pipelines can also be plugged into other workflow systems. But for this talk I limit myself to the Argo workflow controller.
So basically what happens is this: once you submit a pipeline, the pipeline service creates an Argo Workflow, and from there on the Argo controller watches the workflow and starts executing it. That is how the workflow essentially gets executed, and the artifacts that come out of the workflow are stored in artifact storage, which is nothing but a MinIO object store.
So this is actually a very simple architecture, though there are other nuances and nitty-gritty details, certain other components for caching and things like that, which we will not touch upon. And that is pretty much about Kubeflow. Next, we are going to look at a very simple demo. In this demo, in step number one, a model will be trained initially using some training data, and in step number two we evaluate the model's accuracy.
If the accuracy is not as per our expectation, we once again retrain the model, so it goes into a loop. This is a very typical machine learning workflow that you will come across in production systems, and that is what I am going to show you now.
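The train, evaluate, retrain loop just described can be sketched in plain Python. The "training" below is a toy stand-in with made-up numbers; only the control flow mirrors the demo pipeline.

```python
import random

random.seed(7)

def train(epochs):
    # Toy stand-in: more epochs gives higher (noisy) accuracy, capped at 1.0.
    return min(1.0, 0.6 + 0.05 * epochs + random.uniform(-0.02, 0.02))

def run_training_loop(target_accuracy=0.9, max_rounds=10):
    """Train, evaluate, and retrain until accuracy meets the target
    (or give up after max_rounds)."""
    for round_no in range(1, max_rounds + 1):
        accuracy = train(epochs=round_no)
        if accuracy >= target_accuracy:
            return round_no, accuracy    # loop terminates: model accepted
    return max_rounds, accuracy          # loop exhausted without acceptance

rounds, acc = run_training_loop()
print(f"accepted after {rounds} round(s), accuracy={acc:.2f}")
```

In the real pipeline, each iteration of this loop is a set of pods (train, predict, calculate metrics) rather than in-process function calls, but the decision logic is the same.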
So this is the Kubeflow Pipelines UI; I have already opened it. By the way, I have already installed Kubeflow Pipelines in one of my namespaces. Here you will see all the components running; just ignore the component that is in CrashLoopBackOff, because we don't really need that component. So here you will see all the components up and running.
And as soon as I start the UI, I am able to see the list of pipelines that I have. For this demo I am going to use this pipeline, and this is how the pipeline looks. Basically, as I said earlier, an initial model will get trained, and then, based on the metrics output from the model's predictions, the model will be retrained continuously. So I'm going to run this pipeline.
Okay, an Argo Workflow is again a custom resource that is watched by the Argo workflow controller, and the workflow will contain all the steps that need to be executed as part of it. Each and every step is executed inside a pod; that is how Argo Workflows works. So if you look at the terminal now, you will see a list of pods getting created.
Okay, so these are the pods that are actually created by Argo Workflows in order to execute this workflow, and as and when each pod completes its activity, it gets updated here. You can click on one of these items and see what the input artifact for this particular step is, and what the output artifacts of the step are.
In this case, this step has taken an input from MinIO, the training data set, and it has transformed it: a transformation step has happened, and it has transformed the data. Likewise the rest of the workflow gets executed. What is happening here is that the initial model has been trained, and then the model has been trained again, because we have a training loop.
So this is the initial model training, and this is the training that happened again with another set of data, and then a prediction happens. The output of the prediction will again be a data set; the workflow calculates metrics from this data set and decides whether the model is performing as per expectations or not. That is the calculation that was performed, and then it decided that it has to retrain, so it retrained again.
That is what you see here, and then again it is predicting; this step is running. So again a prediction is being run. Let's see what happens now: let's see if the loop gets terminated, whether the workflow has determined that it is satisfied with the output of the model. That is the prediction that is happening.
Okay, so let's close this, and here you go: you now see that the workflow has completed. What has happened is that once the model was retrained, it reached a stage where the workflow is satisfied with the accuracy of the model, and finally the workflow has also completed. And that is actually pretty much what I wanted to show you. So, basically, to sum it up: Kubernetes is the de facto infrastructure that is also used for running ML workflows and ML…