From YouTube: Deep Learning on OpenShift with GPUs | Tripti Singhal (NVIDIA) | OpenShift Commons Gathering Seattle 2018
Description
Deep Learning on OpenShift with GPUs | Tripti Singhal (NVIDIA) | Tushar Katarki (Red Hat)
at OpenShift Commons Gathering Seattle 2018
https://commons.openshift.org/gatherings/Seattle_2018.html

Tripti Singhal (NVIDIA): All right, thanks so much. OK, so this is just a quick agenda. I'll start off with a brief overview of what deep learning is, focus a little bit more on the inference side, and then jump right into the NVIDIA TensorRT Inference Server, which was announced in September, so it's fairly new. I'll go into the features, the internal architecture, and where it fits into the larger inference ecosystem. I have one quick performance slide, then I'll jump into a demo, and then I'll pass it back to Tushar to talk about OpenShift and Kubernetes.

So deep learning, at a high level, is the idea of using large amounts of data to train neural networks, teaching these neural networks how to make human-like decisions. It's typically broken down into training and inference, and inference is what I'll be focusing on mostly today. Training is using large amounts of data to teach these neural networks how to make those human-like decisions, and inference is taking that trained model, which has been iterated over with that data several times, deploying it into the real world, and giving it new data to make new decisions and new predictions. So that's deep learning at a high level. Like I said, I'll be focusing more on the inference side and on the TensorRT Inference Server.

Focusing more on inference and why GPUs are necessary, there's this idea of PLASTER, which stands for programmability, low latency, accuracy, size of the network, throughput, efficiency, and the rate of learning.

Also, solutions today typically only offer support for one framework, and that really restricts the internal teams working on developing these AI models. It restricts them to that one framework: some teams work in PyTorch, some teams work in TensorFlow, and there isn't one solution that serves them all.

Excuse me. This is just a high-level overview of where it fits into the larger ecosystem. On the left you'll see the clients sending requests to some sort of cloud application running in the data center, and from there those requests are sent to a load balancer, which directs traffic to the appropriate instance of the TensorRT Inference Server.

So here are some of the current features that the TensorRT Inference Server has to offer, and just to single out a few, we can roughly separate them into performance features and usability features. For performance, there's a feature called concurrent model execution, which is what allows you to run multiple models, or multiple instances of the same model, on one GPU at the same time. This is how you're really going to maximize the utilization of your GPU and get the most capacity out of it. With dynamic batching, you're able to batch up your inference requests inside the inference server, based on a user-defined latency SLA, rather than having to build that logic outside of the inference server. Now for more of the usability features: the TensorRT Inference Server exposes metrics for utilization, count, and latency to enable autoscaling, and we support multiple model frameworks, such as Facebook's Caffe2 and TensorFlow.

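To make those two performance features concrete, here is a minimal sketch of a per-model configuration from the inference server's model repository, assuming the config.pbtxt format the server reads; the model name, tensor names, shapes, and batch sizes below are made-up examples, and exact fields and defaults should be checked against the release you deploy:

# config.pbtxt (illustrative): an image-classification model served by the TensorFlow backend
name: "flowers_classifier"            # hypothetical model name
platform: "tensorflow_graphdef"
max_batch_size: 32
input [
  {
    name: "input"                     # tensor names depend on the exported model
    data_type: TYPE_FP32
    dims: [ 224, 224, 3 ]
  }
]
output [
  {
    name: "probabilities"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
# Concurrent model execution: run two instances of this model on each GPU.
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
# Dynamic batching: let the server group incoming requests into larger batches,
# waiting only a bounded time so the latency SLA still holds.
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 100
}

With an instance group like this, several copies of a model (or several different models) can execute on the same GPU at once, which is what drives the utilization gains shown in the demo later.
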
So in this case it's an image classification model. Based on that, and on the request itself, the request will go to the scheduler queue for the particular model that's needed to execute that inference, and from there it's sent to the framework backend that actually does the inference compute. If that image classification model was written in TensorFlow, it will go to the TensorFlow backend, and from there the result is sent back up through the response handling and back to the client.

Okay, so what we just saw was kind of a zoomed-in view of what is inside the inference server, and now we're taking a step back, zooming out, and looking at where it fits into the larger inference ecosystem. You'll notice that at the far right is the inference server (I guess I can move my mouse here), so that's what I just showed you, the zoomed-in portion, and now, starting on the left-hand side, the user, in this...

B
When
those
kind
of
go
up.
It's
a
good
indicator
that
it's
time
to
spin
up
a
new
instance
and
then
you'll
also
notice
a
dotted
line
around
the
around
that
portion,
and
that
shows
our
collaboration
with
cube
flow
to
support
the
tensor
RT
inference
server,
and
so
there's
a
detailed
blog
describing
that
collaboration
and
all
the
code
is
available
on
github.
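Those utilization, count, and latency metrics are what an autoscaler can key off. As a rough sketch, assuming the server's Prometheus metrics are scraped and surfaced to Kubernetes through a custom-metrics adapter (the metric name, target value, and replica counts below are illustrative, not something prescribed in the talk):

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: trt-inference-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: trt-inference-server        # the Deployment running the inference server
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      # Assumed to be exported by the inference server and exposed through a
      # Prometheus adapter; substitute whichever utilization or latency metric
      # you actually expose, and scale the target to how that metric is reported.
      metricName: nv_gpu_utilization
      targetAverageValue: "80"
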
Okay, so the chart on the right shows the performance gains you get when using the TensorRT Inference Server across three separate deployments of ResNet-50, which is an image classification model; it's the typical one used for performance benchmarking. There's TensorFlow FP32 on CPU and on GPU, and then you'll obviously see the most performance gain when using the NVIDIA TensorRT version of ResNet-50. And the main takeaways...

So this is actually a video of our TensorRT Inference Server flowers demo, and there's a live version of this that we'll be running at our booth, so feel free to check it out. Let me just describe what you're seeing here: at the top is our flowers client, and all those little images are images of flowers being classified, whether it's a daisy or a rose and so on, and the flashing bar going down indicates that classification.

So what just happened was the demand increased: we increased the demand to five thousand images per second, and keep in mind that we're still directing traffic to the manual, non-TensorRT-Inference-Server cluster. You'll notice that the two GPUs that are running this flowers model are completely maxed out, while the other models, whether it's a deep recommender or anything like that, those GPUs remain underutilized.

B
You
see
the
spike
on
the
chart
happened
there
and
the
average
the
average
GP
utilization
is
around
38,
and
so
a
typical
solution
here
would
be
to
increase
the
hardware
and
just
add
more
GPUs
to
support
this.
This
flowers
model,
but
you're
still
left
with
underutilized
hardware
in
your
data
center,
which
is
really
inefficient
and
you'll
also
notice
in
the
images
per
second
on
the
bottom
left
corner
that
we're
not
meeting
the
5,000
images
per
second
demand
we're
only
getting
around
4,800,
so
that's
not
ideal
in
a
production
workflow.
So soon what will happen is we'll stop the traffic going to the non-TensorRT-Inference-Server cluster and move all that traffic to the cluster with the TensorRT Inference Server enabled, and soon enough you'll see it drop on the left and peak on the right. There it goes.

B
At
the
bottom,
all
eight
GPUs
have
all
four
models
loaded
on
to
it.
So
when
this
peak
happens,
you'll
notice
that
beforehand
the
GPU
utilization
was
around,
maybe
17%,
I.
Think
and
now
it's
back
it's
up
to
39
or
40
percent
similar
to
the
manual
to
the
manual
deployment.
But
this
one
you
get
the
same
average
GPU
utilization.
B
But
in
this
case
all
your
hardware
is
being
utilized
and
you
can
also
see
that
we're
easily
meeting
the
demand
of
5,000
images
per
second,
with
plenty
of
capacity
to
even
spin
up
a
new
workload
or
and
in
or
increase
the
demand.
And
so
another
thing
to
make
it
more
realistic
is
that
we
show
that
it
can
also,
with
the
10:30
inference
server,
enabled
and
having
your
models
evenly
distributed
across
your
hardware.
So you can see that happen, and none of them are really being maxed out yet. The fact that they're distributed across all eight GPUs allows for more capacity to be able to handle that spiky workload, and in a little bit you'll see that, with this configuration, we're actually able to get to 18,000 images per second. If you notice the grey box there: 18,000 images per second, where we've completely maxed out our GPU utilization, and this is using the TensorRT Inference Server.

Tushar Katarki (Red Hat): There you go. All right, so, as I said, what can we do with this beautiful demo and the TensorRT Inference Server? What is the road ahead for this on OpenShift?

Deep learning on OpenShift is what I'm going to describe next. We'll start with that basic TensorRT Inference Server that we saw the demo for, that Tripti described a little earlier, and then we'll say: oh, that's actually a bunch of containers, and the containers need a container platform, and we've talked all day about that today.

So we've got OpenShift Container Platform, which is basically Kubernetes. It can run across the data center and the cloud, and, by the way, it supports GPUs, NVIDIA GPUs, and therefore you can run it either in the data center or in the cloud, or a combination of those two.

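To give a feel for what that looks like in practice, here is a minimal sketch of running the inference server container on OpenShift with a GPU; the NGC image tag, command-line flag, model-repository volume, and names are assumptions for illustration rather than a reference deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: trt-inference-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: trt-inference-server
  template:
    metadata:
      labels:
        app: trt-inference-server
    spec:
      containers:
      - name: trtserver
        # NGC image name and tag are assumptions for illustration
        image: nvcr.io/nvidia/tensorrtserver:18.09-py3
        command: ["trtserver", "--model-store=/models"]   # flag name may differ across releases
        ports:
        - containerPort: 8000    # HTTP
        - containerPort: 8001    # gRPC
        - containerPort: 8002    # metrics
        resources:
          limits:
            nvidia.com/gpu: 1    # schedule onto a node exposing an NVIDIA GPU via the device plugin
        volumeMounts:
        - name: model-repository
          mountPath: /models
      volumes:
      - name: model-repository
        persistentVolumeClaim:
          claimName: model-repository   # hypothetical PVC holding the model repository
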
But now you have the TensorRT Inference Server, and you actually need to use it. Maybe, as Tripti described, it's part of a recommendation system, or you have a chat bot which is using natural language processing. You have an app that you want to create, and you want to deploy it in production. So you have your cloud-native, intelligent app, and that's running on the TensorRT Inference Server, and, by the way, it needs things such as load balancing, it needs things such as routing, it needs encryption.

Then we have the OpenShift Service Mesh that was talked about earlier. So this is kind of your production setup: now you have a cloud-native, intelligent app, which is actually using the TensorRT Inference Server, and the underlying infrastructure provided by OpenShift for this to be deployed and shown in production. So now, okay, but where did these models come from, right? Somebody has to actually create these models, so you can use OpenShift as a platform to train your models.

Your data scientists can do that by bringing their own, or, actually, NVIDIA has a bunch of pre-built framework containers, such as for TensorFlow and so on, that they call NVIDIA NGC, the GPU Cloud containers, and you can run them on top of OpenShift. So there your data scientists have access to the best of the best frameworks to create the models to feed into this TensorRT Inference Server.

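For instance, a training run using one of those NGC framework containers can be submitted as an ordinary Kubernetes Job on OpenShift that requests GPUs; the image tag, script path, and GPU count here are illustrative assumptions:

apiVersion: batch/v1
kind: Job
metadata:
  name: flowers-training
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: train
        # NGC TensorFlow container; the tag is an assumption for illustration
        image: nvcr.io/nvidia/tensorflow:18.09-py3
        command: ["python", "/workspace/train.py"]   # hypothetical training script
        resources:
          limits:
            nvidia.com/gpu: 2    # GPUs requested through the device plugin
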
Okay, so what's next? As you saw earlier, you probably need to do some pre-processing of the data, some post-processing of the data; you need to set up the data pipelines. Your data scientists might want to, you know, visualize the data that is coming in using Jupyter, something that they might be used to. They might want to use Kafka as a message bus, and maybe use Spark for real-time processing. So all those patterns are available on OpenShift.

I won't go into the details of that, but I'll describe some references at the end, and all these different, quote-unquote, frameworks can run on top of OpenShift. So now what do you have? You have your inference server, you have your models, you have set up your data pipelines on OpenShift. Now what do you want to do? You also want to actually write that cloud-native application, and you want your developers to do that.

So that gives you an idea of how OpenShift can be used in this entire ecosystem, from end to end. Some of the things that have happened so far are things such as Device Manager support, which enables the support of GPUs; that has been available, I'd say, for about a couple of releases already. We have introduced other features with the community, such as priority and preemption, which will become essential if you are trying to do things such as model training using Jobs, for example.

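As a small sketch of how priority and preemption come into play for those training jobs (the class name and value are illustrative, and the PriorityClass API group version differs across Kubernetes releases), you define a PriorityClass and reference it from the training pod, so lower-priority work can be preempted when GPUs are scarce:

apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: gpu-training-high          # hypothetical class name
value: 100000                      # higher value = scheduled first and able to preempt lower-priority pods
globalDefault: false
description: "High priority for GPU model-training jobs"

# Referenced from the training Job's pod template with:
#   priorityClassName: gpu-training-high
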
So, recently, back in October, we announced jointly with NVIDIA support for Red Hat Enterprise Linux, which is now certified on the DGX-1 and Tesla GPUs; the DGX-1 is basically the GPU appliance from NVIDIA, and that support is available now on OpenShift and RHEL. And then here we're showing you a preview of the TensorRT Inference Server on OpenShift. We are planning to write a reference architecture, and obviously the road ahead is really to make this a much easier deployment

and install experience, with Operators. We have other exciting stuff that we're going to talk about at KubeCon over the next couple of days with the community, doing things such as GPU sharing and heterogeneous clusters, for example. I mean, this picture on the left here looks nice, but for some of these things, like truly heterogeneous clusters, there is still work that needs to be done in the upstream community. GPU topology is another thing that we are working on.

So if you step back now: how does Red Hat see AI? I think of the four pillars you see here. Running AI as a workload on top of OpenShift, and our ecosystem, is obviously the first pillar, and you saw a little bit of that today. On the rightmost is how you build intelligent applications; I touched a little bit on that also.

So, if you're going to build an intelligent application which uses AI, how do you build that? And then there are the two in the middle. The first one really is: how do we continue to enhance our core business using open source tools and AI? And the second one really is: how do we enhance our products themselves using AI? This is what Chris Wright was pointing out earlier about automation and self-driving products and self-driving components.

Oh, and I missed the most important piece: data as the foundation. So that's kind of how Red Hat sees this, and you'll hear some of this throughout the discussion over the next few days. So here are some references and KubeCon highlights to round this up. NVIDIA is going to present, basically, scaling AI inference workloads, what you saw today, at a much larger breakout session; that's tomorrow, Tuesday, December 11, I believe at 1:45 p.m.

That talks about how to use the TensorRT Inference Server. The second one is using Kubeflow, which Tripti mentioned: how to use Kubeflow to deploy the TensorRT Inference Server on Kubernetes. And then the last one is actually on how to basically enable GPUs on OpenShift. So these are some of the resources that you have. There's also a webinar coming up

that talks about how to maximize GPU utilization. We have booth setups at both the NVIDIA and the Red Hat booths, so come and check it out, and, as we mentioned, there's a live demo going on there. Then I want to also mention that the TensorRT Inference Server is open source; it's available on GitHub, and you'll find a link there. And then there's the OpenShift Commons Machine Learning SIG; that's something where we as a community, from OpenShift and NVIDIA and others,

get together and talk about machine learning. And then we have Open Data Hub, which covers some of the four pillars that I talked about; this is a Red Hat CTO office initiative, and they are trying to build the data hub that I described earlier, and the four pillars, in an open way. So there are plenty of resources and lots of excitement, and I hope you guys can participate in that. And with that, I think, thank you.
