Description
Deep Learning Workflows on NVIDIA GPUs on OpenShift, with Mehnaz Mahbub of Supermicro and Mayur Shetty of Red Hat.
Filmed October 28th, 2019 in San Francisco.
Mayur Shetty (Red Hat):
So the agenda for today is going to be quick. The way we've divided the talk is that I'm going to cover why containers and Kubernetes for ML workloads, and in particular why OpenShift, and I'm also going to talk about how we prepared the system to run the ML workloads. Mehnaz is going to talk about the Supermicro hardware and also the results that we collected during our exercise. Before I proceed, I would like to walk through the pipeline and the various personas involved in ML workload deployment.
First and foremost, you collect all the data. We collect data from various sources, and this is the raw data that's coming in; the persona here is the data engineer. In the next phase the data is stored in data lakes, and we have the models being tuned, tested, and trained. All of that happens in the second phase, and the persona here is the data scientist.
In the first phase you start to see some trends, but the second phase is where all the training happens. It then takes certain models, and those are deployed using the update process; the application developers are involved in this phase. The data scientists are also involved, because they want to make sure that the right models are deployed.
B
Also,
they
want
to
make
sure
if
there's
any
drift
in
data,
because
if
there's
new
data
coming
in
this
needs
to
be
some
tuning
done
and
some
retraining,
so
the
data
scientists
are
also
involved
along
with
the
app
developers
and
at
the
end,
what
you
see
is
an
intelligent
application
which
meets
some
Business.
Objects
object
is
all
across
across
all
these
phases.
One
rule
which
is
common
is
the
IT
operations
folks,
so
they're
common
across
all
this
they're
also
responsible
for
yesterday,
two
operations
involved
in
the
pipeline
so.
Now let's talk about why containers, and why Kubernetes in particular, for AI/ML workloads on a hybrid cloud. First and foremost is the agility it provides. By this I mean the automation for the platform and also for the model frameworks that the data scientists are using; all of that can be automated. There is also the autoscaling feature: with autoscaling, the data scientist does not have to rely on the IT folks to provide them the infrastructure.
They can basically just use their own tools to autoscale and get the infrastructure they need to do their work. What we've seen is that the training and the testing are all very compute intensive, so any hardware acceleration provides a key benefit to the data scientist. What we've noticed is that GPU acceleration, integration with security features, and uptime are all key value-adds for ML workloads.
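As a rough illustration of the autoscaling feature mentioned above, here is a minimal sketch of a HorizontalPodAutoscaler that a data scientist could apply on their own; the deployment name and thresholds are assumptions, not part of the project described in this talk.

```yaml
# Minimal sketch (assumed names and thresholds): scale a model-serving
# deployment between 1 and 8 replicas based on CPU utilization,
# without having to involve the IT folks.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving            # hypothetical deployment name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving
  minReplicas: 1
  maxReplicas: 8
  targetCPUUtilizationPercentage: 70
```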
Also, you can now offer ML as a service. We could already do this with containers, but now you can offer it as a service, so that the data scientist does not have to focus on writing all the services into their application code and can rely on existing services instead. They can just go to a registry, download these services, integrate them with their applications, and benefit from that.
There are products and services which help a lot here, mainly around the automation and the CI/CD pipelines that the platform brings in. All of this boosts productivity, and last but not least there is the lifecycle management and operations that help with the deployment of AI/ML workloads. Some of you may have already seen this slide earlier: ML workloads are highly data and compute intensive, and at the same time OpenShift is a distributed platform.
Also, like Sharad mentioned earlier, these services can now sit behind load balancers and be scalable. What we mean by this is that you can add more resources as and when you need them, or even shrink resources when you don't need them, so that is a huge value-add. Also, with OpenShift, the ML workloads can now be truly portable, meaning they could be running on your private cloud or on your public cloud.
Just like containers benefit from lifecycle management, this is something that is new to the data scientist world. Data scientists can now just focus on writing their code and putting it into a Git repository, and a source-to-image kind of feature, together with the CI/CD pipeline that is integrated into OpenShift, takes that source, creates an image, and then deploys it. All the testing that needs to happen, or the testing hooks that are available in OpenShift, can be leveraged by the data scientist.
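The talk doesn't show the actual build definition; as a minimal sketch of the source-to-image flow described here, a BuildConfig along these lines would pull the data scientist's code from Git and build it into an image (the repository URL, names, and builder image are assumptions):

```yaml
# Hypothetical sketch of an OpenShift source-to-image BuildConfig:
# pull source from a Git repo, build it with a Python S2I builder image,
# and push the result to an image stream that a deployment can roll out.
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: model-service                      # assumed name
spec:
  source:
    type: Git
    git:
      uri: https://example.com/data-science/model-service.git  # hypothetical repo
  strategy:
    type: Source
    sourceStrategy:
      from:
        kind: ImageStreamTag
        name: python:3.6                   # assumed builder image
        namespace: openshift
  output:
    to:
      kind: ImageStreamTag
      name: model-service:latest
```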
Let me move on to the actual project that we did together. I want to talk about GPU-as-a-service on OpenShift. Before we even started running our benchmarks, there were some prerequisites we had to take care of. First and foremost, you have to make sure that you have the GPU drivers running on the servers which have the GPUs, and verify that things are fine.
What we did was actually collect the GPU names; we were going to use those GPU names later on for labeling the machines. The Docker that comes with RHEL already has the OCI runtime hooks, so we didn't have to do anything there. We focused on the NVIDIA container runtime hook and got that configured. Once that was done, we had our system ready to deploy Docker containers, and at this point we used oc commands and docker commands to deploy these containers on the GPU machines.
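The transcript doesn't include the actual labels or manifests; as a rough sketch of what targeting the labeled GPU node with a quick test container could look like (the label key and value, pod name, and image are all assumptions):

```yaml
# Hypothetical sketch: run a quick check pod on a node that was labeled with
# its GPU model. The label key/value, pod name, and image are illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-node-check
spec:
  restartPolicy: Never
  nodeSelector:
    gpu-model: tesla-v100-sxm2       # assumed label applied to the GPU node
  containers:
  - name: smi
    image: nvidia/cuda:10.0-base     # assumed CUDA base image
    command: ["nvidia-smi"]          # list the GPUs visible inside the container
```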
The device plugin API is already enabled, so we didn't have to do anything there; we just had to focus on the device plugins themselves and make sure that the NVIDIA device plugin was running on the hosts which had the GPUs. Once we had that configured, we tested it again using the CUDA containers, and I'll show you that on the next slide. What you see there, on the last line, is the YAML file which declares that a GPU is required for this particular container.
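The YAML from the slide isn't reproduced in the transcript; a minimal sketch of that kind of manifest, with the GPU request on the last line, might look like this (the pod name and image are assumptions):

```yaml
# Hypothetical sketch of a CUDA test pod. The NVIDIA device plugin exposes GPUs
# as the extended resource "nvidia.com/gpu"; the limits entry at the bottom is
# what tells the scheduler that this container needs one GPU.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-test                    # assumed name
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-test
    image: nvidia/cuda:10.0-base     # assumed CUDA image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1            # request one GPU for this container
```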
Mehnaz Mahbub (Supermicro):
Thank you, Mayur. I'm Mehnaz, by the way, from Supermicro. Before I dive into the benchmark numbers, I just want to tell you a little bit about Supermicro. Supermicro is one of the leading providers of SuperServers in the industry today. Our headquarters are in San Jose, and we also have branches in the Netherlands and in Taiwan as well. Supermicro is one of the leading manufacturers of a huge array of hardware, including servers,
networking devices, server management software, HPC, AI; you name it, and we provide the whole hardware stack for you. We also do exciting solution builds like this one, and we at Supermicro are glad to partner with Red Hat here, where we have run real-life AI workloads for the first time on top of OpenShift. So let me start with the solution reference architecture. We built a 10-node cluster, which I will show you in detail on the next slide.
For the actual OpenShift building block, we used Supermicro's famous BigTwin SuperServer, which is known for its very dense parallel compute power as well as its large memory footprint. For running the actual AI workloads, we used Supermicro's GPU SuperServer, and I will give you the actual spec details of the servers in later slides. For networking, we used our own Supermicro switches.
We employed both 10G and 100G switches for this project. And this is a summary of the software stack: for example, if you want to know which OS we used, it was RHEL 7.6, and, as Mayur mentioned, this project was done on OpenShift 3.11, along with the CUDA versions and details like that.
Coming back to the solution building blocks: on your left, you have the actual OpenShift cluster building block, which is the Supermicro BigTwin. Again, I'm not going to go into all the details of the CPUs and memory, but if you have any questions about any of the servers, please let me know and I'll get back to you on that. On your right, you have our Supermicro GPU server, which can hold up to eight Tesla V100 SXM2 GPUs, which are the actual GPUs
we used for this benchmarking as well. Again, if you have any detailed questions about the specs, I'll be happy to answer them. So, moving on to the actual hardware setup: in our Supermicro lab we created a 10-node OpenShift cluster, with the standard three master nodes, three infra nodes, and three application nodes, along with one load balancer node. One of those application nodes, as you can guess, is the GPU server where we actually ran the AI workload. The network topology is pretty straightforward.
We implemented two different layers of network, 10G and 25G. The 10G was basically for management purposes, and the 25G was implemented because, as you know, the wider the bandwidth and the lower the latency we can provide to machine learning workloads, the faster the results we are going to get. And again, this whole solution architecture, reference architecture, and network topology was built in such a way that it would be scalable to whatever scale your deep learning project might be.
For running the actual benchmark suite, we chose MLPerf. If you're familiar with the world of machine learning, you're very familiar with MLPerf. MLPerf is a wide range of benchmarks that covers a lot of the main applications of machine learning. MLPerf basically gives you a set of rules, some specific datasets, and a bunch of specific models, so that the results you produce are comparable across any hardware platform or across any framework. From the MLPerf suite
we basically chose two categories of benchmarks. The first one is object detection and the other one is machine translation. I want to talk just a little bit about the datasets we used. For object detection we used a dataset called the COCO dataset, from Microsoft, which contains around 328K images along with more than 2.5 million labeled instances in those images, and for machine translation we used an English-to-German translation dataset.
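The transcript doesn't show how the training runs were actually submitted; as a purely hypothetical sketch of how an MLPerf-style training job could be launched on the cluster and claim the GPUs on the GPU node (the image, dataset path, and names are assumptions, not the project's actual setup):

```yaml
# Hypothetical sketch only: submit an MLPerf-style training run as a batch Job
# that claims all eight GPUs on the GPU node. Image and paths are illustrative.
apiVersion: batch/v1
kind: Job
metadata:
  name: mlperf-object-detection        # assumed name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: training
        image: registry.example.com/mlperf/object-detection:latest  # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 8          # claim the eight V100s on the GPU node
        volumeMounts:
        - name: coco
          mountPath: /data/coco        # COCO dataset mounted into the container
      volumes:
      - name: coco
        hostPath:
          path: /mnt/datasets/coco     # assumed dataset location on the node
```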
So, moving on to the actual benchmarking: as I mentioned, the first one is object detection. Before I talk about the numbers, the basic metric that we're comparing here is the training time. On the very right side, you see that if you go to the MLPerf website, the only numbers they have published are mainly run on NVIDIA's DGX-1 platform. So what we have done in our lab, with the software stack that we have
C
The
hardware
stack
that
we
have
created
is
very
much
comparable
to
Nvidia's
DG
x1,
whether
its
CPU
cores
a
number
of
GPUs
GPU
memories.
Things
like
that,
so
we
have
tried
to
create
a
very
comparables
hardware
stack
to
dgx
one
so
that
we
can
compare
our
results
with
the
ML
published
results.
So,
moving
on
to
the
first
number
here,
the
first
object
detection,
which
is
the
hip
heavyweight
object,
detection,
and
that
was
actually
the
longest
training
that
we
ran
and
incredibly,
we
got
even
better
training
time
than
in
videos.
Tgx
one.
C
As
you
can
see,
ours
was
a
little
around
205
minutes
where
Nvidia's
was
around
two
hundred
and
seven
minutes
and,
as
you
know,
this
much
of
a
difference
in
timing
makes
a
huge
impact
in
the
real
life
AI
trainings
and
for
the
next
one,
which
is
the
lightweight
object.
Detection.
We
have
also
got
like
very
close
results,
and
I
will
also
explain
why
all
these
numbers
are
very
important,
even
if
they're
not
better
than
nvidias
Dziedzic.
C
So
the
next
one
movie
line
is
the
machine
translation,
as
I
mentioned
English
to
German,
and
we
have
ran
two
different
two
different
sets
of
algorithms
here,
but
again,
both
on
PI
torch.
So
the
first
one
is
the
recurrent
translation
which,
as
you
can
see,
the
training
time
is
very
close
to
dgx
wine
and
the
next
one
again
is
the
non
recurrent
translation
which,
for
which
we
got
the
exact
same
training
time
as
nvidia
DJ,
x1,
and
one
more
thing
I
do
want
to
mention.
C
C
These are, again, as Sharad mentioned earlier when he showed the cool demos, examples of the OpenShift GUI that you can play around with. Both of these dashboards were created using Prometheus and Grafana, and on your left there you can see the actual GPU usage, how much of each GPU is being used, along with the GPU memory usage, and on the other one
C
You
can
see
the
actual
GPU
temperatures
which,
when
you're
learning
training
workloads
it's
very
important,
to
monitor
the
overall
health
of
your
GPUs
as
long
as
power
usage
as
well.
So
the
open
shaped
GUI.
It
gives
you
a
really
lot
of
really
cool
to
monitor,
in
fact,
every
aspect
of
your
project.
However,
you
want
to
monitor
later
control
it.
So
this
is
one
of
the
like
really
cool
examples
of
openshift
features.
I
think
that
we
have.
C
We
were
able
to
implement
for
our
project
as
well,
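The talk doesn't go into how the GPU metrics reach Prometheus; a minimal sketch of a scrape job for a GPU metrics exporter running on the GPU node could look like this (the job name, host, and port are assumptions, not the project's actual configuration):

```yaml
# Hypothetical sketch of a Prometheus scrape job for GPU metrics.
# The exporter endpoint below is assumed; Grafana dashboards for GPU usage,
# memory, temperature, and power would then be built on top of these metrics.
scrape_configs:
- job_name: gpu-metrics
  scrape_interval: 15s
  static_configs:
  - targets:
    - gpu-node-01:9400          # assumed GPU metrics exporter host:port
```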
On this last slide, I want to talk a little bit about why these numbers matter, or what the impact is of the numbers that I just showed you. First of all, to our knowledge, this is the first real-life AI workload that was run on OpenShift. Another very important point is that the numbers that we're comparing with, from the NVIDIA DGX-1, were all run on bare metal.
There is supposed to be a little bit of a lag when you compare bare metal with a workload running on top of OpenShift. So the fact that we can match those bare-metal numbers, if not beat them in one case, like I showed you, is a huge statement by itself. Being able to match those numbers, or get close to those numbers, showcases not only OpenShift's performance;
it also shows the overall hardware performance and how well we integrate with OpenShift. The last advantage, from the hardware point of view, is cost: NVIDIA's DGX-1 is a very expensive piece of hardware, as you might be aware, and compared to that, the Supermicro hardware stack that we have developed, which is very comparable to the DGX-1, is much more cost efficient.
C
So
the
fact
that
the
customers
are
getting
getting
the
same
training
performance,
if
not
better
in
one
case,
for
a
much
in
a
much
more
cost-efficient
way,
is
another
huge
statement
on
its
own.
So
before
I
finish,
I
want
to
let
you
know
I
want
to
share
some
links
with
you
in
this
slide.
The
first
one
is
the
white
paper
that
we
have
jointly
published
with
the
Red
Hat
Red
Hat
on
Supermicro.
C
You
can
that
white
paper
has
all
the
details
of
this
project
and
all
the
numbers
and
hard
words
and
everything,
and
also
I've,
also
provided
the
get
account
information
here.
If
you
want
to
go
there,
you
can
download
all
the
data
sets
that
we
have
used
all
the
yellow
files
and
everything's
in
the
get
account
and
also
I
have
linked
the
super
micros
openshift
solution
page
here.
If
you
want
to
take
a
look
at
the
hardware,
stack
details,
so
thank
you.