From YouTube: OpenShift Commons Gathering Santa Clara 2019 Deep Learning Inference with Nvidia GPUs on OpenShift
Description
OpenShift Commons Gathering Santa Clara 2019
Deep Learning Inference with Nvidia GPUs on OpenShift
Production Deep Learning Inference on Nvidia GPUs
Tripti Singhal (NVIDIA)
Peter MacKinnon (Red Hat)
Tripti Singhal (NVIDIA): Here's a quick agenda of what I'll be talking about today. First I'll give a high-level overview of deep learning, focusing more on the inference side, and then I'll dive a little deeper into the NVIDIA TensorRT Inference Server, which is the product I focus on, and the features that go into it: the overall architecture, the ecosystem, a quick performance slide, and then the demo. After that I'll hand it over to Pete to talk about the inference server on OpenShift. But before I get started...
So, if you're not aware, deep learning is the technique of using massive amounts of data to train neural networks so they can make human-like decisions. You start off with an untrained model and run many iterations of a large dataset through that neural network, and the outcome is a trained neural network that can now make human-like decisions.
Most of this talk is about taking that trained neural network and deploying it into the real world to make decisions on data it hasn't seen before. Training uses the same dataset over and over to learn from, while inference runs on new data. There's also this idea of PLASTER, which stands for programmability, latency, accuracy, size of network, throughput, efficiency, and rate of learning.
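To make the training-versus-inference distinction above concrete, here is a minimal PyTorch sketch (my own toy example, not from the talk): training loops over a labeled dataset and updates weights, while inference is a forward pass over data the network has not seen, with gradients turned off.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Training: many iterations over a (toy) labeled dataset, updating the weights each step.
for features, labels in [(torch.randn(8, 16), torch.randint(0, 4, (8,)))] * 100:
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()

# Inference: the trained network is deployed and runs forward passes on new data only.
model.eval()
with torch.no_grad():
    prediction = model(torch.randn(1, 16)).argmax(dim=1)
```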
This goes some way toward explaining why GPUs are necessary for the inference side of deep learning. As you can see, latency is not the only factor in a successful inference deployment; there are things like accuracy as well. If you think about use cases like autonomous vehicles, you not only need your decisions to be made fast, but also to be highly accurate, and using GPUs for inference delivers on all of these factors. So here are some of the pain points we typically see from end users of these deep learning inference deployments.
The main problem here is inefficiency. The first pain point is only being able to run a single model on a single GPU at a time. If you have a use case like the one on the far left, with an ASR model, a natural language processing model, and a recommendation model all running in your data center, each with a dedicated GPU, and the ASR model's traffic spikes, the rest of the GPUs are left underutilized, which is highly inefficient.
A
Another
another
pain
point
is
having
only
a
single
framework
support
and
this
restricts
teams
to
only
use
one
framework.
So
if
there
are
several
teams
and
they
want
to
develop
and
PI
torch
print,
tensorflow
and
Caffe,
they
would
all
have
to
build
out
their
own
custom
pipeline,
which
is
not
efficient
as
well
and
then,
similarly
with
custom
development,
if
you
have
several
teams
each
doing
their
own
pipeline,
one
for
each
use
case
that
they're
deploying
if
every
single
team
builds
out
their
own
their
own
custom
pipeline.
A
That's
not
efficient,
because
they're
really
they're
all
really
doing
the
same,
underlying
task,
which
is
inference
and
so
having
one
cusp.
One
solution
that
manages
all
these
pipelines
makes
that
makes
that
person
easier.
So
this
is
just
a
quick
high-level
overview
of
the
ten-thirty
inference
server
and
where
it
sits
in
the
ecosystem.
A
So
here
on
the
left,
you
see
the
clients
sending
their
requests
into
some
cloud
application
or
several
applications,
and
it
might
be,
and
then
those
requests
are
sent
to
a
load
balancer
where
those
requests
are
then
tracked,
the
traffic
is
sent
to
the
appropriate
instance
of
the
inference
servers.
So
you
may
have
several
instances
of
this
inference.
Server
and
all
underlying
hardware
is
visible,
including
heterogeneous
GPUs,
which
is
why
we
have
Tesla
t4v,
100
and
P
4
listed
here,
and
this
is
helpful.
A
So
the
inference
server
has
several
features
and
just
to
point
out
a
few
here
in
green
I
mentioned,
have
being
able
to
run
multiple
models.
Concurrently
on
a
single
GPU,
you
can
run
multiple
models
and
multiple
versions
of
the
same
model
at
the
same
time
on
a
single
GPU,
and
this
is
what
really
lets
you
utilize
your
GPU
to
its
maximum
capacity
and
then
another
feature.
Useful
feature
is
dynamic,
batching,
and
so,
when
client
requests
come
in,
they
come
in
without
any
logic
they
come
in
at
batch
size,
one
and
being
able
to
the
inference.
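As a rough illustration of that dynamic batching idea, here is a small pure-Python sketch: individual batch-size-1 requests are collected up to a preferred batch size or a short timeout before one execution runs. This is a conceptual model only, not the inference server's actual scheduler.

```python
import queue
import time

def dynamic_batcher(request_queue, preferred_batch_size=8, max_wait_s=0.005):
    """Group individual batch-size-1 requests into one batch for a single execution."""
    batch = [request_queue.get()]                 # block until at least one request arrives
    deadline = time.monotonic() + max_wait_s
    while len(batch) < preferred_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch                                  # hand the whole batch to the framework backend

requests_in = queue.Queue()
for i in range(5):
    requests_in.put(f"request-{i}")
print(dynamic_batcher(requests_in))               # e.g. ['request-0', ..., 'request-4']
```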
A
The
format's
supported
our
tensor
flow
graph,
def
and
save
save
model.
We
also
have
this
tensor
flow
tensor,
RT
integration,
that's
also
supported,
of
course,
temps
or
RT
plans
and
then
cafe
to
net
dev
through
the
onyx
path,
and
then
the
newest
feature
is
that
we'll
be
announcing
soon
is
the
streaming
API,
and
this
allows
for
support
for
sequence,
models
that
have
input
that
have
state
associated
with
it.
So
use
cases
like
speech,
recognition
and
translation
are
also
supported
now,
so
this
dives
a
little
bit
deeper
into
the
inference
server
architecture,
the
internal
architecture.
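As one hedged example of how a trained model ends up in a deployable interchange format, the sketch below exports a toy PyTorch model to ONNX with torch.onnx.export; the model and file name are placeholders, and this is just one possible route into the ONNX path mentioned above.

```python
import torch
import torch.nn as nn

# Toy stand-in for a trained network; a real deployment would export the actual trained model.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
model.eval()

dummy_input = torch.randn(1, 16)                  # example input that fixes the exported graph's shapes
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["scores"])
```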
A
So
this
green
box
here
represents
the
inference
server
and
at
the
top
you
see
the
client
request
come
through
HTTP
or
G
RPC,
and
so
there's
also
this
Python
and
C++
client
library
that
that
helps
with
this
interaction
between
client
and
server,
and
so
once
client
requests
come
in.
They
go
through
request
and
response
handling,
and
here
you
can
see
it's
a
simple
image.
Here you can see a simple image classification use case, similar to the example I showed before for classifying these images. The request goes through the per-model scheduling queues: for example, if the request needed a ResNet-50 model, it would go to the ResNet-50 queue; if it needed an Inception model, it would go to that queue, and so forth. From there it goes to the framework backend: if the model was in TensorFlow, it would go to the TensorFlow backend; if it was a TensorRT model, it would go to the TensorRT backend, and so on.
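The routing just described, one scheduling queue per model feeding the matching framework backend, can be pictured with a small Python sketch; the model and backend names here are illustrative, not the server's internals.

```python
from collections import defaultdict

# Which framework backend executes each model (illustrative mapping only).
BACKEND_FOR_MODEL = {"resnet50": "tensorrt", "inception_v3": "tensorflow"}

model_queues = defaultdict(list)                  # one scheduling queue per model

def enqueue(request):
    model_queues[request["model"]].append(request)    # a request lands in its model's queue

def dispatch():
    for model, pending in model_queues.items():
        backend = BACKEND_FOR_MODEL[model]
        for request in pending:
            print(f"request {request['id']} -> {backend} backend ({model})")
        pending.clear()                           # results then flow back to the client

enqueue({"id": 1, "model": "resnet50"})
enqueue({"id": 2, "model": "inception_v3"})
dispatch()
```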
From there, the result is sent back through response handling to the client. A few things to notice: metrics are exposed through a separate HTTP endpoint, and at the top you'll see the model repository. This is where models are placed after training is done, already in a format the inference server can load, like the ones I mentioned: GraphDef, SavedModel, NetDef, and TensorRT plans.
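For reference, a model repository for the server of that era looked roughly like the layout below, with one directory per model, a config.pbtxt, and numbered version subdirectories; the model names are placeholders, so check the inference server documentation for the exact file name each format expects.

```
model_repository/
  resnet50_graphdef/
    config.pbtxt          # model configuration: platform, inputs/outputs, batching
    1/
      model.graphdef      # version 1 of the TensorFlow GraphDef model
  sample_plan_model/
    config.pbtxt
    1/
      model.plan          # a TensorRT plan stored under the same kind of layout
```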
It's a bit complicated, but what's important here is that packets belonging to the same sequence need to go to the same batch slot for execution, and this sequence batcher is what has been built out recently in the inference server. Up until now we've been taking a zoomed-in view of what the inference server has internally; now we'll take a step back and see how it fits into the larger ecosystem.
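A tiny sketch of that constraint, again my own illustration rather than the server's implementation: requests carrying the same sequence ID keep mapping to the same batch slot, so the backend sees each sequence's state in order.

```python
class SequenceBatcher:
    def __init__(self, num_slots=4):
        self.num_slots = num_slots
        self.slot_for_sequence = {}               # sequence id -> batch slot

    def assign_slot(self, sequence_id):
        """Return the batch slot for this sequence, reserving a free one on first sight."""
        if sequence_id not in self.slot_for_sequence:
            free = set(range(self.num_slots)) - set(self.slot_for_sequence.values())
            if not free:
                raise RuntimeError("no free batch slots; the request would be queued")
            self.slot_for_sequence[sequence_id] = min(free)
        return self.slot_for_sequence[sequence_id]

batcher = SequenceBatcher()
for packet in ["seq-a", "seq-b", "seq-a", "seq-a"]:
    print(packet, "-> slot", batcher.assign_slot(packet))
```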
Here on the right, the green boxes are the TensorRT Inference Server, and you can have several instances of it running in your data center. On the left you'll see the user's client requests coming in, and then this box with three components: the client API, pre-processing, and post-processing. This is what we like to call the app, whatever it may be.
For the image classification use case I've been using, this is where an image comes in and maybe needs to be resized or cropped so it can be fed into the neural network that's needed for the result. The request then goes to a load balancer that directs the traffic.
At the top, I also mentioned that you may do the training of all your models and put them in some network file storage. A subset of those models, the ones you want to deploy, is then separated out into this persistent volume, which is called the model repository, and that's mounted into the TensorRT Inference Server. From there, the server takes the requests, performs the inference, and sends the results back, and while it's doing all of that, it's also exposing Prometheus metrics over HTTP, which can be connected to an autoscaler for scaling up and down.
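As a hedged sketch of how those metrics might be consumed, the snippet below scrapes the Prometheus endpoint; the port (8002), the /metrics path, and the metric names follow the TensorRT Inference Server documentation of that time, so treat them as assumptions to verify against your deployment.

```python
import requests

# Hypothetical service host; replace with the address of your inference server service.
METRICS_URL = "http://trt-inference-server.example.internal:8002/metrics"

text = requests.get(METRICS_URL, timeout=5).text
for line in text.splitlines():
    # Gauges like GPU utilization and request counts are what an autoscaler would key on.
    if line.startswith(("nv_gpu_utilization", "nv_inference_request_success")):
        print(line)
```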
You'll also notice this outlined box; that's where we've had our collaboration with Kubeflow. The TensorRT Inference Server is now supported in Kubeflow, and the goal is to extend this box to include the load balancer as well, so that's also supported. And here's a quick performance slide.
The key takeaway is that these are several deployments of ResNet-50: TensorFlow FP32 on CPU, which is this blue line; TensorFlow with the same model on GPU; and then the TensorRT plan, which is a highly optimized version of the same model built through the TensorRT framework. You'll notice there's a significant increase in throughput, and this is all under a 50-millisecond SLA. But the more important thing to take away is that the inference server supports CPU, GPU, and multiple model frameworks, like TensorFlow and TensorRT.
This is our flowers demo to show off the inference server. At the top you'll see the flowers client, which is running classifications on these flower images; the flashing light indicates how many images are having inference performed on them right now. At the bottom left you'll see demand and delivered: there are 800 images per second of demand, and it's being met.
Also at the bottom you'll see eight GPUs, and the different colors at the bottom indicate different models; the blue indicates this flowers model. Right now the utilization is fairly low, both across all the small dials, which show per-GPU utilization, and on the bigger dial, which shows overall GPU utilization. So that's what you're looking at, and I'll skip forward a little bit just to show some changes.
What you'll see soon is that we'll increase the demand on the same cluster. We've increased the demand to a thousand images per second now, and all the traffic is being sent to this cluster, which has one GPU supporting each model. You'll notice the spike over here on the chart, where requests for the flowers model have spiked to 5,000, and the two GPUs supporting that model have maxed out their GPU utilization, while the other GPUs, supporting the other models, aren't being utilized at all.
Obviously this is not efficient, and we're not even meeting the demand of 5,000 images per second here. If I skip forward a bit, a typical solution would be to add more hardware to support that model, but you're still left with underutilized GPUs. So what has been done instead is that the traffic is now sent to a different cluster, at the bottom right of the screen.
You'll see that this is a different cluster (they're all Kubernetes clusters, by the way), and each GPU has every single model loaded onto it, hence the multiple colors inside the boxes for each GPU. Now you'll see that it's easily meeting the 5,000 images per second of demand on this cluster, which has the TensorRT Inference Server running on it. All of the GPUs are at about 30 to 40 percent utilization, the overall GPU utilization is around the same, and we're easily meeting that demand.
So that's pretty much it for the demo. I have some resources here, including our data center inference page, where you can learn more about our deep learning inference products, such as TensorRT and the inference server. The TensorRT Inference Server is available as a container on NGC, the NVIDIA GPU Cloud registry, and it's also open source, like I mentioned; the GitHub repo is right there as well, along with material for getting started.
Peter MacKinnon (Red Hat): Obviously the star of the show is the TensorRT Inference Server, and this is a nice, quick little demo that hopefully illustrates how it can be deployed on OpenShift. The structure of the demo is that we have two pods, one for the inference server and another for an inference client, plus an OpenShift service defined for the server.
That's the communication path from the client to the server, and then there's a PVC and PV making use of local storage, where we store some pre-built models for ResNet and Inception. So architecturally, that's what the demo looks like. Again, the communication path, in what we call the pod layer, runs from the client to the service endpoint for the inference server, and the inference server pod is attached to the storage layer where it has the models. Sorry about the "NVIDIA inference server" label on that slide.
It should say TensorRT Inference Server; the label is just a shorthand for this demo, that's my bad. So that's the idea there. In terms of images, the server image is the one created by NVIDIA, and the client image used in this demo is built from the publicly available Dockerfile on the NVIDIA GitHub repo. Then they have a set of models that we can load into the inference server.
It's just a simple set, ResNet-50 and Inception v3, and we run the demo on an NVIDIA V100 16 GB card. This is all using OpenShift 3.11. All right, let's look at our project; again, this "trt" shorthand isn't official nomenclature. We have basically our application here and a pair of pods. Let's have a look at the inference server pod. The logs for it indicate a successful start of the inference server in that pod, and it's actually exposing a couple of endpoints: one for HTTP/REST and the other one for gRPC.
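One simple way to confirm that start-up from outside the pod is to hit the server's health endpoint; the sketch below assumes the HTTP API path and port documented for that generation of the server, so verify both against your server version.

```python
import requests

def server_ready(host="trt-inference-server", port=8000):
    """Return True if the inference server reports itself ready to serve models."""
    try:
        # /api/health/ready is the readiness path documented for this server generation (assumption).
        return requests.get(f"http://{host}:{port}/api/health/ready", timeout=2).status_code == 200
    except requests.RequestException:
        return False

print("server ready:", server_ready())
```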
As Tripti mentioned, those two capabilities are in place, and we expose them through OpenShift. From the logging we can tell that it will basically sit there and poll for models in the storage location; we dropped some models in there, it picked them up, and it went through its setup for serving inference on those models. Okay, so that's the server itself, and that's its service. Let's go back to the pods, go to the client, open a terminal here, and see if it remembers any of the commands.
So let's run it. The image client here is a pre-built C++ client within this client pod. We're going to work against the REST service and run inference on a couple of images. In this case it's a mug, so let's see how it does: coffee mug. [inaudible]
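For reference, an invocation of that example client would look roughly like the sketch below (wrapped in Python here only for consistency with the other snippets); the binary path, flags, model name, and image path are taken from the public example-client documentation of that era and should be treated as assumptions to double-check against your client image.

```python
import subprocess

# Illustrative invocation of the pre-built C++ image_client against the REST endpoint.
subprocess.run(
    ["/workspace/install/bin/image_client",   # assumed install path inside the client image
     "-u", "trt-inference-server:8000",       # assumed service name and HTTP port
     "-m", "resnet50_netdef",                 # assumed model name in the repository
     "-s", "INCEPTION",                       # input scaling / pre-processing mode
     "-c", "3",                               # report the top-3 classifications
     "/workspace/images/mug.jpg"],            # assumed path to the sample mug image
    check=True,
)
```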
And it's quite a bit faster; I'm going to have to scale that back, sorry. Then finally there's a benchmark test. There are various different modes with the clients, so you can do batch inference and things like that. We'll finish off the demo with a performance benchmark, again provided by NVIDIA for us.
Yeah, so we ran the benchmark there, and that's pretty much it. You can also see it through the console. Basically, there's the interaction between the inference server and the models that have been stored for it (that storage could of course be Ceph; for the purposes of the demo it's just local storage), and then the interaction between components in OpenShift and Kubernetes, for which we've put together a client. So that's pretty much it for the demo. Thanks.