Description
NVIDIA GPU Operator: Simplifying GPU Management in Kubernetes
https://github.com/NVIDIA/gpu-operator/blob/master/README.md
Kevin Jones (NVIDIA)
OpenShift Commons Gathering on Data Science
January 28, 2021
https://commons.openshift.org/gatherings/OpenShift_Commons_Gathering_on_Data_Science.html
To find out more about OpenShift Commons, please visit: https://commons.openshift.org
Hello, everyone. My name is Kevin Jones, and I am a product manager at NVIDIA. Today I want to walk through the GPU and Network Operators with you: a quick fifteen-minute news flash on the current state of affairs, where they are today, how they are constructed, and how you can use them on top of your Kubernetes platforms to take advantage of the hardware acceleration of GPUs and SmartNICs.
Today you have the CUDA libraries, drivers, and SDKs that are available, and you also see NVIDIA working on different frameworks to make smart systems more accessible and easier to develop applications for. So we have things like Metropolis for smart cities and retail, Clara for healthcare, Isaac for robotics, and Aerial for the telco work that is being done on 5G today.
You may also have noticed that there are different product lines at NVIDIA with a GX moniker at the end, and I'll explain those really quickly. The AGX line is about small-scale, system-on-chip embedded systems for manufacturing and robotics. The DGX line is about scale-up; remember, EGX is about scale-out and DGX is about scale-up systems. A lot of our recent supercomputer work, with the DGX SuperPODs, is what those systems are being built for.
The HGX line is about cloud service providers, allowing them to provide instances with NVIDIA GPUs in them to their end users, and EGX is the scale-out platform that I've been discussing here today.

Now I'd like to switch gears and start talking about the operators, but before we dig into each of the NVIDIA operators, I really want to touch on the Operator Framework. A while back, Red Hat acquired a company named CoreOS, and CoreOS had made many contributions to the Kubernetes ecosystem.
One of the major contributions they made was the Operator Framework, and Red Hat has continued the work on this really capable framework for deploying Kubernetes-native applications. It's a pattern for how you accomplish this, and you can write your operators in different ways as well. They can be Helm-based, they can be Ansible-based, and they can even be Go-based if you get really complex. Each of those layers ranges from simplicity in how you build them all the way up to very complex application capabilities, and the Operator Framework allows you to do deployment; it allows you to do lifecycle management as well.
The next piece is the drivers themselves. This is the component most administrators are aware of, whether they have done containerization with NVIDIA GPU drivers or run them on a standard host. The driver is really the most critical component for exposing all the capabilities of your underlying GPU to your application layer, and the goal here is to simplify provisioning of the NVIDIA driver with our operator.
Here the application is making a request for one NVIDIA GPU, and that's what Kubernetes will expose to the application.
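As a minimal sketch of what that request looks like (the pod name, image tag, and command here are illustrative, not taken from the talk), the application asks for a GPU through the nvidia.com/gpu extended resource in its pod spec:

```yaml
# Hypothetical pod spec: the container requests one GPU via the nvidia.com/gpu
# extended resource that the NVIDIA device plugin advertises to Kubernetes.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-sample
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-sample
      image: nvcr.io/nvidia/cuda:11.0-base   # illustrative CUDA base image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # schedule onto a node with one free GPU
```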
I also mentioned our Data Center GPU Manager (DCGM) exporter, which enables us to take the telemetry that's being fed back from the GPUs and expose it to Prometheus. That gives the administrators of the cluster visibility into their GPU telemetry, much the same way they're getting visibility into the CPUs that are running in the cluster. So this is a really great way to take advantage of the native tooling that's been adopted in Kubernetes and feed it with our GPU telemetry.
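As an illustrative aside (not shown in the talk), the DCGM exporter serves metrics such as DCGM_FI_DEV_GPU_UTIL over HTTP; in an operator-managed cluster the Prometheus wiring is typically handled for you, but a hand-written scrape job would look roughly like this, with the service name, namespace, and port all assumptions:

```yaml
# Sketch of a Prometheus scrape configuration for the DCGM exporter.
# Service name, namespace, and port are assumptions; 9400 is the exporter's
# usual default, but check your deployment.
scrape_configs:
  - job_name: nvidia-dcgm-exporter
    static_configs:
      - targets: ["nvidia-dcgm-exporter.gpu-operator.svc:9400"]
```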
So, with all of these core components put together and simplified in the operator, we have a great way to expose our GPUs and get the best out of them in our Kubernetes clusters.

Next, let's talk about the roadmap for the GPU Operator. We're working on a number of new features that I've listed on the slide here: things like upgrade management, where we handle driver and kernel updates and handle node reboots if we need to. Those things are improving with each release of the GPU Operator. We're also working on disconnected and air-gapped installations, whether you're restricted by a proxy when reaching certain resources or you have no internet connection at all and have to pull things from custom registries. Those types of environments are very difficult to work in, and we're trying to make it easier to use the GPU Operator there.
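As a hedged sketch of what pointing the operator at a private registry can look like (the key names and registry host below are illustrative and vary between chart versions; they are not from the talk), a Helm values override might resemble:

```yaml
# Hypothetical values override for an air-gapped install: mirror the component
# images into a private registry and point each component at it. Key names and
# the registry host are illustrative; consult the chart's values.yaml.
operator:
  repository: registry.example.internal/nvidia
driver:
  repository: registry.example.internal/nvidia
toolkit:
  repository: registry.example.internal/nvidia
```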
We're also working on security capabilities, like more granular RBAC controls via roles and role bindings for the GPU Operator. NVIDIA is very conscious of security capabilities within our hardware and software stack, and we want to make sure that customers are getting the best and most secure capabilities they can.
There is a getting started document that I've linked to here, and you can also reach out to us with any questions you have.

Now I want to switch gears and start talking about our Network Operator. If you were not aware, NVIDIA acquired a company named Mellanox. In supercomputing, Mellanox has obviously made a very good name for itself, and we want to continue that trend with Mellanox as NVIDIA's networking business unit. So with the Network Operator, we've taken a very similar approach to the one we took with the GPU Operator.
We wanted to simplify the deployment experience so that complex network deployment tasks are taken care of for you by the Network Operator. They're portable across different Kubernetes platforms, and it's a consistent deployment across those platforms. We also wanted to give you some operational efficiency gains, because we are now managing the network at the cluster level rather than at the level of individual systems. The operator itself starts to look at this as a cluster capability rather than as individual system units that have to be configured, and, lastly, we want to put network automation and administration on autopilot.
So how did we do this? When we started, you had the legacy approach, where the Mellanox OFED driver and the NVIDIA peer memory driver were configured on a Linux system by hand or by automation scripting. We then containerized them in the same way we did the device plugin for GPUs: we have a containerized Kubernetes RDMA shared device plugin, and Multus is what really gives us the ability to attach secondary networks to pods.
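For illustration (the network name, CNI type, parent interface, and subnet below are assumptions, not details from the talk), a Multus secondary network is declared with a NetworkAttachmentDefinition such as:

```yaml
# Hypothetical Multus NetworkAttachmentDefinition for a secondary network.
# The macvlan parent interface and subnet are placeholders for the sketch.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-net
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens2f0",
      "ipam": { "type": "host-local", "subnet": "192.168.100.0/24" }
    }
```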
So let's talk about the individual pieces themselves. The OFED driver container is what loads the Mellanox OFED driver into the kernel: it is pre-built for the distribution and the kernel running on the host, and we deploy it onto the nodes based on node labels.
In both the GPU Operator and the Network Operator there are Node Feature Discovery capabilities that go out and label each of the hosts with the hardware they have, as sketched below.
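For illustration (the node name is hypothetical, and the exact label keys depend on how Node Feature Discovery is configured), the PCI-based labels look roughly like this, where 10de and 15b3 are the NVIDIA and Mellanox PCI vendor IDs:

```yaml
# Illustrative node labels applied by Node Feature Discovery; the operators
# key their driver DaemonSets off labels like these.
apiVersion: v1
kind: Node
metadata:
  name: worker-0
  labels:
    feature.node.kubernetes.io/pci-10de.present: "true"   # NVIDIA GPU detected
    feature.node.kubernetes.io/pci-15b3.present: "true"   # Mellanox NIC detected
```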
We expose the container root filesystem to the host to allow kernel module compilation against updated headers, and then we load the kernel RDMA stack and the Mellanox driver stack on container start and unload them on container stop. The RDMA shared device plugin is how you can run RDMA workloads in Kubernetes.
The shared device plugin is the way we let pods perform RDMA, exposing RDMA device files to the container in a shared manner, and you can see in our example that the pod is requesting a single RDMA device and limiting itself to that one RDMA device through its pod spec.
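A minimal sketch of such a pod spec follows; the resource name is defined by the RDMA shared device plugin's configuration, so rdma/rdma_shared_device_a, the pod name, and the image are illustrative rather than quoted from the talk:

```yaml
# Hypothetical pod spec requesting one shared RDMA device.
apiVersion: v1
kind: Pod
metadata:
  name: rdma-test-pod
spec:
  containers:
    - name: rdma-app
      image: mellanox/rping-test   # placeholder RDMA test image
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]        # RDMA verbs typically need locked memory
      resources:
        requests:
          rdma/rdma_shared_device_a: 1
        limits:
          rdma/rdma_shared_device_a: 1
```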
The next piece is the NVIDIA peer memory driver container, which compiles and loads our peer memory client driver into the kernel itself: it loads the nv_peer_mem module into the kernel and unloads it when the container exits.
This is a really great capability for us to be able to do all of this for the administrators, all automated via the Operator Lifecycle Manager.
So where are we at today? As of December, we're looking at Helm deployment; we're using Node Feature Discovery (that's the NFD you see there) to label RDMA-capable nodes; and we're also working on secondary network deployment, so you actually have a secondary network that is RDMA-configured that we can take advantage of in the cluster itself.
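As a rough sketch of what that Helm deployment can look like (these value keys are assumptions based on the chart layout at the time and may have changed; check the chart's values.yaml for your release), you enable the pieces we just walked through:

```yaml
# Hypothetical Helm values for the network operator: deploy NFD for node
# labeling, the Mellanox OFED driver container, the NVIDIA peer memory driver
# container, and the RDMA shared device plugin.
nfd:
  enabled: true
ofedDriver:
  deploy: true
nvPeerDriver:
  deploy: true
rdmaSharedDevicePlugin:
  deploy: true
```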
So that's our fifteen-minute quick news flash on both the GPU and Network Operators coming from NVIDIA. I really appreciate your time today. I hope this information is useful for you and that you explore the GPU Operator and the Network Operator. Feel free to let us know if you have any issues or any feature requests that you want to add. You can find us in the upstream community working on these operator code bases, and we really look forward to hearing from you.