From YouTube: Leveraging Kubernetes for Machine Learning
Description
This talk will demonstrate how a developer can build machine learning pipelines on top of Kubernetes. This presentation will include a deep dive on how Kubernetes is being enhanced to add GPUs as a resource. The presentation will also demonstrate how an operator can spin up a Kubernetes cluster on OpenStack with Terraform.
Hello, thank you very much for coming. My talk is on machine learning with Kubernetes. I'm very excited to be the first presenter here at Kubernetes Day. It's great to be here in Boston. So, who am I?
My name is Christopher Luciano. I'm part of the open source technology team at IBM's Digital Business Group. I'm blessed with being able to work on Kubernetes full time; I mostly concentrate on SIG Node and SIG Network.
My GitHub handle is up there, along with my Twitter ID. Feel free to tweet during the keynote, but please make sure to add the underscore in _cmluciano on Twitter. The Luciano without the underscore founded some junket service; he's a lot more successful than I am and he doesn't need the PR, but I do, so make sure to keep the underscore in there. I have a very unsuccessful blog for some reason. I'm hoping to fix that pretty soon, so check back later on; I'll be posting some more stuff there.
So, a lot of talks talk about the more technical things: how to set things up, how to twist some of the knobs, how to tune things. But not a lot of talks talk about the who and the why. So I've coined a series of talks that I've been calling the what, where, and why series: why do I want to use these types of technologies before actually using them?
When we talk about machine learning, it's very important to note what we're trying to accomplish here. With machine learning, it's very important to get the most accurate results possible. So if we have a traditional bell curve, that's garbage; we don't want anything like that. We're going to continually train our system to eke out the best possible accuracy that we can. So the goal: we start with some sort of base knowledge. We have points of analysis, a corpus of unstructured data. Then we feed that into our system, and we notice the errors.
So let's take a very simple example. This is my cat, Sprinkles. She has very distinct features. If you look, her ear is very pointy. If you notice her feet, you can see how they come down and connect; their shape is very similar to that of her paws. You can kind of see her tail, not really make it out, but what we're highlighting here are some of the features that say that this is a cat. We've figured out a few different data points, and this is some of the base knowledge we're going to be feeding in.
So we move on. Again, here's another picture of Sprinkles, a little darker, but we can still make out the pointy ears. We have a circular face, and we can see her feet. Maybe we start to notice some more patterns about the system. Then we get to here: we see myself, my fiancée, and a penguin. We ask the system: is this a cat? Well, I don't see any of the features that I noticed before. I don't see the circular face, I don't see the tail, I don't see the fur, I don't see the small nose.
What is this? So here we have a sloth. Does the system know what that is? Maybe not. What it does know is: it has a smaller face, it's circular in nature, it has a smaller nose and a distinct mouth. There's fur, and the feet seem to come together. Everything's kind of coming together; it kind of looks like a cat. So in this instance our system might become confused, and this is where we would see on our error rate that maybe we'll have a jump, and we'll have some false negatives or, sorry, false positives.
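The sloth confusion can be sketched as a toy feature-matching classifier. This is a hedged illustration, not the system from the talk: the feature names and the 0.6 threshold are made up for the example.

```python
# Toy feature-overlap classifier: label something a "cat" when enough of
# the known cat features are present. Purely illustrative.

CAT_FEATURES = {"pointy ears", "circular face", "small nose", "fur", "long tail"}

def looks_like_cat(observed, threshold=0.6):
    """Return True when the fraction of matched cat features >= threshold."""
    overlap = len(CAT_FEATURES & observed) / len(CAT_FEATURES)
    return overlap >= threshold

sprinkles = {"pointy ears", "circular face", "small nose", "fur", "long tail"}
sloth = {"circular face", "small nose", "fur"}  # enough overlap to confuse us
penguin = {"beak", "flippers"}

print(looks_like_cat(sprinkles))  # True
print(looks_like_cat(sloth))      # True: a false positive, the jump in the error rate
print(looks_like_cat(penguin))    # False
```

The sloth trips the threshold because three of the five cat features match, which is exactly the kind of error that another round of training is meant to correct.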
So moving on here: oh, we found another cat, potentially. We notice the pointy ears, we notice a circular face. It's a little harder to see, but this is Sprinkles again. We notice the long tail. Is this a cat? What if we have many cats? Did we tell our system that we might be expecting many cats? I don't know. Once again, another opportunity to come back around and train the system. So you might be thinking: why would I care about this? Is this really a very useful example?
I already have a ton of cats; I don't need any more. Give me something I can work with here. So: IBM first started to surface some of its more public AI knowledge with Watson. Prior to the IBM Digital Business Group, I worked on Watson for two years; this is why I'm here today. Watson started out on our ever-famous Power machines. If you watched it on Jeopardy!, you saw it; if you have not, I encourage you to do so: it essentially beat Ken Jennings at his own game. But this also isn't very useful for the common person. Unless you're really trying to impress your friends, inviting them over for Jeopardy! every day only to dominate them with Watson, this isn't a very effective method for you to utilize. So we can see that Watson started off as a research project, there was the demonstration at Jeopardy!, and then some of the more advanced features of Watson started to be separated out into smaller services.
So first off we started in healthcare, moved on to financial services, and then it kind of ballooned such that we had a ton of services. The Watson Developer Cloud is something that goes through the IBM Cloud today. These are services you can hook into, and you end up using some of the key pieces of the older Watson application that are actually useful to you. What are some uses that you can think of for these smaller examples?
Another interesting example that I've found it being used in today is security. Intrusion detection systems also have a corpus of knowledge: the cases that they know about. As far as "does this look like a security breach or not," they'll classify it, and then maybe, if you have an intrusion prevention system, it will try to actively block it. But attacks are getting so advanced today that it's necessary to potentially incorporate artificial intelligence in order to detect and adapt to newer types of attacks.
So now we'll get into how you can do this yourself. These links: I'll provide the slides, and obviously you can't click on these now, but we're going to start with GPUs. You're going to need a machine that exposes GPUs. GPUs are being leveraged because of the number of cores; these training jobs take a long time. So you can have the case where you think you're going to have a short, iterative solution: spin up some GPU virtual machines or bare metal, do your training, and then tear them down. But that's often not the case.
If you think back to the examples we had with cats and dogs, it takes a lot of time to train your system to notice these things and to error-correct. So it's not uncommon for these jobs to take weeks, even months, and Kubernetes is going to help you cut out some of the corner cases you have to deal with if you're deploying this on bare metal directly. TensorFlow is also an interesting project that has come out of Google, one that allows you to leverage some of these more advanced APIs.
A
You
will
need
to
build
it
yourself.
There
is
also
examples
of
a
deploying
tensorflow
atop
kubernetes
now
I
know
you're
thinking,
there's
lot
going
on
here.
I've
got
stacks
on
stacks
on
stacks.
If
I'm
going
all
out,
I
have
to
start
off
with
the
bare
metal,
then
I
put
some
OpenStack
on
it.
Now
a
virtual
machine
I
deploy
a
containerized,
runtime
docker
rocket,
then
I
put
kubernetes
on
it.
Then I put TensorFlow on it. That's a lot going on, and I can understand wanting to cut a lot of these layers out yourself. You want to cut out the OpenStack? Cut out the OpenStack. You want to cut out the TensorFlow, you want to do this yourself? Go for it. But what I want you to think of is an Irish breakfast. You're not beating my Irish breakfast: I'm not going to eat the blood sausage without the pork sausage, the toast perfectly complements the eggs, and I wouldn't want it any other way.
So when you're thinking about these systems, think about how Kubernetes can help you better deliver some of these machine learning training systems. The information I present in the next few slides is hot off the press; some of these proposals were just discussed last week in SIG Node. It is important to note some of the characteristics of GPUs that distinguish them from other types of resources, and we'll go through each of these right now. Multiple video cards: one node, one blade, could have a ton of different video cards.
You could even potentially have different models of video cards in there. You have some faster ones, you have some slower ones, you have some purpose-built ones for a certain topology. Kubernetes is going to help you out with this by allowing you to use a node selector, which is in the next slide and which I'll discuss in a couple more slides; it will allow you to specifically target the exact GPU that you want.
There's a lot of discussion going on in the community about exposing topology up through Kubernetes so that you can target the exact topologies you want, but in the next few slides I'll show you ways that you can get around that and still do the right thing. The first item is driver installation. If you're using a video card, you need to install the drivers; these are proprietary drivers, and you're going to get them from Nvidia's website.
An important thing to note, though, is that the driver version on the host most often needs to match whatever you're deploying in your container, whatever the workload. If you don't, they clash, you get a weird error message, and nothing works. So when you're matching driver versions, you want to be sure that version 2 of the driver matches up with version 2 of your container; a mismatch is bad. You can deploy these things with a Kubernetes DaemonSet, and the DaemonSet will essentially deploy a target across all of your nodes and do whatever work is necessary.
So when you spin up new nodes, it's automatically going to install these things for you. Now, there is some confusion around whether you need to reboot the machine. The official Nvidia docs say that you should reboot to grab all of the latest kernel modules prior to utilizing the machine. I've seen mix and match: sometimes you have to reboot, sometimes you don't. Certainly on my laptop it said I had to reboot when I installed the Nvidia drivers, so you might find discrepancies.
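A driver-installing DaemonSet like the one described might look roughly like this. Everything here is an assumption for illustration: the image name is hypothetical, the API version reflects the Kubernetes 1.6 era, and a real installer needs host access tuned to your distribution.

```yaml
# Hypothetical driver-installer DaemonSet: runs on every node,
# including nodes added later.
apiVersion: extensions/v1beta1   # DaemonSet API group in the Kubernetes 1.6 era
kind: DaemonSet
metadata:
  name: nvidia-driver-installer
spec:
  template:
    metadata:
      labels:
        app: nvidia-driver-installer
    spec:
      containers:
      - name: installer
        # Made-up image; pin it to the same driver version your workloads expect,
        # since host and container driver versions need to match.
        image: example.com/nvidia-driver-installer:375.26
        securityContext:
          privileged: true          # needs host access to load kernel modules
        volumeMounts:
        - name: host-root
          mountPath: /host
      volumes:
      - name: host-root
        hostPath:
          path: /
```

The point of the DaemonSet is exactly what the talk describes: new nodes get the installer automatically, so you never have a GPU node missing its drivers.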
So this is for when you have multiple video cards. Kubernetes allows you to have a node selector in order to target the specific video cards you want. You have some Tesla P100s over there? Go ahead and target them right here. And with the pointer down here, you're saying how many GPUs you want. This is a normal pod spec that you're going to be using to deploy whatever system, like TensorFlow, or your application-specific solution.
The link down here will show you this example and give you a little bit of base knowledge about that resource. Fragmentation is also a big thing. As I said, these jobs take a long time to run, so you want them always to run on time, as fast as possible, with the most correct results. If you are consistently working around slower systems in your data center (maybe you have some older nodes mixed in with newer nodes), you're always going to want to target the newer nodes if this is an important job, because these take weeks or months.
Kubernetes is trying to figure out how to use topology in a way that doesn't necessarily expose it to the end user, because that increases complexity, and people who just want to get a job done would have to figure out how to use it; it could get a little messy. GPUs fail in different ways, and it's not always a hard failure. Sometimes it just gets too hot in there: if you're running a training job, just grinding away at it, it may start to overheat.
Then you're stuck: you're going to start to get inconsistent builds, insufficient power problems, what have you. I have an example here of the error messages you might see when that happens. In a normal node setup, if you have one card in your blade that fails, what are you going to do? Are you going to go in there and hot-swap it out, or are you just going to hope it's fine in the end? What Kubernetes is going to do is proactively mark it as unavailable, and it's going to target only the working GPUs.
So at your leisure, you can come back and fix it. Again, this type of performance and problem analysis is active in the node problem detector, and we're starting to add some pull requests to Kubernetes to actively do that blocking. However, if a node is just completely dead, your job will get migrated somewhere else. So, on to Kubernetes 1.6, just released a few weeks ago.
A lot of the work went into trying to make the GPU experience a little better, so it officially reached the alpha stage, and you can have multiple pods using GPUs on your nodes (a pod in Kubernetes is just your unit of work, really). Video card discovery also got a little better: now it's using some fancy regex to figure out any active video card so that it can expose basic failures.
Recovery, as I mentioned, is in there. The only problem at the moment is that it only works with Docker, and that's because of some very interesting handoff with the kubelet trying to figure out which containers are actively using things; this is something that's going to come in a future version. So where are we going with GPUs in Kubernetes? We're going to start with device recovery. It is very important to be able to segment off one card and allow that job to continue somewhere else.
That's where the health checking features come in. Topology, as I mentioned before: you could just full stop allow the user to configure these things themselves. However, you want to schedule it right the first time, and Kubernetes has a degree of quality-of-service features. BestEffort is: I just want this to run; if it fails a few times, if I can't look at my resources right now, it's fine. Burstable is: start out with these resources, this amount of CPU, this amount of RAM.
However, it might creep up to this limit, in which case I want you to cut it off. Or Guaranteed: that's where you know how this application works, and Kubernetes is going to do its best to almost always guarantee those resources for you. Now, if you bubble up topology to the Guaranteed level and you say, "I want this thing to be guaranteed to have these things," building topology into that is a lot more consumable for a user. Then they don't even need to know what the best topology for GPUs is; they're just going to be assured that, because of the base knowledge you have about GPUs, they're going to get that every time.
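The three quality-of-service classes come straight from the pod's resources stanza; the numbers below are made up for illustration. Omitting requests and limits entirely yields BestEffort.

```yaml
# Burstable: starts at the request, may creep up toward the limit,
# and gets cut off there.
resources:
  requests:
    cpu: "1"
    memory: 2Gi
  limits:
    cpu: "2"
    memory: 4Gi
---
# Guaranteed: requests equal limits, so Kubernetes does its best to
# reserve exactly these resources for the pod.
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "2"
    memory: 4Gi
```

The idea in the talk is that a user who writes "Guaranteed" shouldn't also have to write GPU topology by hand; the scheduler would apply its own base knowledge to honor the guarantee.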
That's another thing we're going to work on: metrics. Metrics are always a good thing. Kubernetes utilizes cAdvisor for every container that's spinning up, so it's going to give you metrics per container that you launch in there. It's also going to give you metrics node by node. Some upcoming features we want to work on: we want to make it work with things that are not Docker.
There was a significant effort in Kubernetes 1.5 to attempt to abstract a lot of these things out into the Container Runtime Interface, so that you can pass the same base information to Kubernetes and deploy a VM, or a container through rkt or through Docker, whatever you want.
Last week we met with a lot of people from Nvidia (Nvidia is having their conference this week on the other side of the country), and they're going to help us out with some of the libraries where they try to capture some of their best practices, via NVML and, in their newest addition, libnvidia-container. This will be something that the kubelet, which runs the worker nodes of Kubernetes, will call out to to gain this functionality.
So that's basically the end of my talk. I apologize: I did put at the end that I was going to have a demo of deploying Kubernetes on top of OpenStack, but I noticed there are five to ten talks here targeting exactly that, and I didn't want to steal their material. So instead I'll send up a blog post on the IBM Code page with an example, if you really want to see my specific example later on. But come talk to me about these things; I'm very interested.
[Audience] Thanks for the presentation; I have a question. In my area, GPU is less interesting, but the networking part is way more interesting, because I'm basically working with service providers and they are looking into networking performance. Is there any plan to do something similar in the Kubernetes environment, to support something similar to GPU but for networking cards?

Yes.
This relates to some of the topology work. We're trying to get it right at the node level first, before we move on to inter-node communication. Scheduling on the network takes place in kind of an upstream, potentially soon-to-be CNCF project: a lot of those networking things happen in the Container Network Interface, which is a separate project where a lot of these plugins come together in order to achieve some of those advanced things.
So a lot of the hope is to contain a lot of that knowledge in those plugins and allow you to chain them together to get what you're trying to achieve. This both keeps a cleaner code base for Kubernetes and still will potentially allow you to achieve those same results. All right, thank you. Any other questions? Nope? Well, thank you very much.