From YouTube: OCB: GPUs on OpenShift with Zvonko Kaiser (Red Hat)
Description
The PSAP (Performance Sensitive Application Platform) team has developed the Special Resource Operator (SRO), a template to enable hardware accelerators on OpenShift. Besides NVIDIA, we have enabled, and are still enabling, other vendors as well. For this installment, we are going to talk about SRO and its inner workings. We will conclude the talk with a demo and how SRO relates to the official NVIDIA GPU operator.
A: So without further ado, I'll let Zvonko introduce himself, talk about the topic, and maybe show a demo. If you have any questions, we'll have a live Q&A session at the end, so feel free to add your questions in the chat wherever you're watching this live stream from, and we'll relay them back here. So with that, Zvonko, take it away.
B: This session will cover some of the historical bits and pieces we encountered along the way while enabling GPUs, first on bare metal and then on OpenShift. We will explain what we've done, what SRO is, and how we used SRO to enable not only GPUs but also other accelerator cards in OpenShift. So without further ado, let's get started with what we'll be discussing today. We will cover a lot of topics.
B: A hook is nothing else than a callback during the container lifetime: we have a prestart hook, a poststart hook, and a poststop hook, and NVIDIA is mainly using the prestart hook to enable the GPU. Hooks can be used to extend a container runtime with functionality that the runtime does not have; in this case, to enable a GPU in a container. What the NVIDIA prestart hook mostly does is bind-mount devices, binaries, and libraries from the host into the container. This way, it makes sure that the host and the container are in sync regarding the versions running on the host and in the container. So the prestart hook configures the container to use GPUs.
B: Everybody who has mounted any volumes or files from a host into a container knows that something happens with SELinux in that regard. So I want to illustrate what really happens when you do mounts from the host into a container, because that is a crucially important point for the NVIDIA prestart hook: it is getting files from the host into the container. SELinux is nothing else than a labeling system. On the host, everything with a blue label here is the host context.
B: The blue labels, or the blue colors, know how to talk to other blue colors, how to talk to blue processes, and how to work with blue files. On the other side, all the red stuff is the container's SELinux context: everything red knows how to read red labels, how to read red files, and how to work with red processes.
B: So what you are essentially doing, if you mount files from the host into the container, is introducing blue labels into a red domain. The red labels do not know how to handle those blue labels, and that is why you always get a permission denied if you are not dealing with SELinux on a RHEL system.
B: One thing people do is run the container privileged, but they don't really know what happens when they run the container privileged. The crux here, now that you see the container and the host are both blue, is that a privileged container can do everything with the files and processes on the host as well. Containers running privileged have a special SELinux type, spc_t, and spc_t can interact with any labels from the container and the host.
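As a minimal sketch of the pattern being cautioned against here (the image name is hypothetical), this is all it takes to put a container into the all-powerful spc_t domain:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: privileged-example
spec:
  containers:
  - name: app
    image: registry.example.com/some-image:latest  # hypothetical image
    securityContext:
      privileged: true  # the container process now runs as spc_t and can touch any SELinux label on the host
```

On OpenShift this additionally requires a service account bound to the privileged SCC, which is exactly the kind of broad grant the talk argues against.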
B
What
then
people
are
doing
is
relabeling
host
files,
but
this
means
you
are
now
introducing
red
labels
into
the
blue
domain,
which
can
break
the
host
content,
and
this
is
one
worry
of
nvidia
if
we
are
changing
the
devices
on
the
host,
that
it
breaks
things
on
the
host
and
we
have
enabled
the
container.
But
now
the
host
is
broken.
B
So
the
solution
for
this
is
write
a
linux
policy
to
introduce
new
types.
What
nvidia
has
done
for
the
devices,
libraries
and
and
other
files?
That's
mounted
in
the
container
and
write
a
selling
policy
to
allow
the
container
to
read
these
special
files
or
special
entities
from
the
host.
This
is
illustrated
by
the
purple
labels.
It
can
be
read
by
a
host
and
it
can
be
read
by
the
container
and
the
container
has
to
not
run
any
more
privileged.
B
I've
posted
here
the
s
linux
policy,
we've
written
for
nvidia,
to
enable
to
enable
that
gpu
in
the
container,
an
updated
blog
post
about
how
to
enable
nvidia
gpus
in
containers
of
bimethyl
on
on
rail8,
with
all
the
updated
sl
linux
policy,
for
route
7
and
regulate,
and
for
people
who
are
curious.
How
a
simple
pre-start
hook
works.
I've
also
included
here
directory
for
a
simple.
B
I
call
it
oci
decorator,
it's
a
preset
hook,
which
is
backed
by
a
config
file
to
introduce
which
can
be
used
to
mount
devices,
files
and
other
libraries
from
the
host
into
a
container.
B
So
if
we
are
taking
a
default
install
with
free
master
nodes
and
free
cpu
nodes,
we
can
use
a
machine
set
to
scale
a
cluster
with
your
gpu
node.
So
just
you
can
just
use
a
cpu
machine
set
that
you
have
already
in
your
cluster
installed
change
the
instance
type
scale
it
up
and
you
have
a
gpu
node.
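A heavily abridged sketch of that idea, assuming an AWS cluster; the MachineSet name and instance type are hypothetical, and in practice you would copy a full existing worker MachineSet and change only the name and instanceType:

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: mycluster-gpu-us-east-1a        # hypothetical: copied from an existing CPU MachineSet
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machineset: mycluster-gpu-us-east-1a
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-machineset: mycluster-gpu-us-east-1a
    spec:
      providerSpec:
        value:
          # ...remaining provider fields copied unchanged from the CPU MachineSet...
          instanceType: p3.2xlarge      # the one real change: a GPU instance type
```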
B
The
problem.
We
have
here
is
we
have
now
a
heterogeneous
cluster.
From
the
openshift
point
of
view.
We
have
three
workers.
Openshift
does
not
know
that
we
have
a
gpu
node.
It
knows
that
we
have
some
cpu,
it
has
only
the
notion
of
a
worker.
So
what
we've
done
in
this
case
is
yeah,
let's
first
zoom
into
the
gpu
node,
to
show
some
features
that
you
might
have
on
a
node.
B: We can use those features, for example, for optimized workloads. If you are optimizing your workload for AVX-512, you want your workload to run on an AVX-512 node. If you have GPU workloads, which are special in this case, you want to run your GPU stack only on GPU nodes; you don't want to create a DaemonSet where the DaemonSet occupies the CPU nodes as well as the GPU nodes.
B
So
what
we've
done
is
we
introduced
a
a
software
which
is
called
node
feature
discovery
we've
written
an
operator
for
it,
so
it's
available
in
openshift
and
what
does
node
feature
discovery
does
is
exposes
node
features
as
labels.
So
we
have
labels
for
cpu
flags
for
the
kernel
for
operating
system
and,
as
you
see
on
the
right
side,
pci
10de
10d
is
the
pci
vendor
id
for
nvidia.
So
we
have
now
a
label
where
we
can
steer
either
with
a
node,
selector
or
port
affinities
and
anti-affinities
or
to
the
right
node.
B
Optimized
workloads
can
now
be
placed
on
the
right
node.
As
I
said,
gpu
stack
on
the
gpu
node
avx
512
optimized
workloads
on
the
avx
500
round
nodes.
Why
is
this
important
workloads
that
are
optimized
for
abx
512
will
not
run
on
any
other.
You
will
just
get
a
legal
instruction,
so
you
want
to
make
sure
that
your
port
lands
on
the
correct
node.
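For example, a minimal pod sketch (hypothetical image) that uses the NFD vendor label to land only on nodes with an NVIDIA PCI device:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-stack-pod
spec:
  nodeSelector:
    feature.node.kubernetes.io/pci-10de.present: "true"  # NFD: a PCI device with vendor ID 10de (NVIDIA) exists
  containers:
  - name: app
    image: registry.example.com/cuda-app:latest          # hypothetical image
```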
B
I've
added
here
some
links
for
node
feature
discovery
and
how
we
are
using
also
nfd
for
multi-architecture
image,
builds
optimized
low-level
libraries
for
openshift
and
how
to
steer
those
libraries
and
workloads
to
nfd.
B
So
now
that
we
have
bootstrapped
our
cluster,
we
know
where
our
gpu
noticed.
We
need
to
think
about
how
to
enable
this.
B
So
we
started
thinking
about
how
we
can
support
immutable
hosts
since
openshift
3.8
and
we've
done
some
experiments
on
atomic
and
what
we've
come
up
is
a
thing
called.
A
driver
container.
A
driver
container
is
not
only
a
delivery
system
for
kernel
modules,
but
it's
also
a
a
image
that
handles
the
drivers
and
starts
demons
that
are
needed,
create
some
sys
controls
and
all
the
things
a
privileged
container
would
need
to
do
to
enable
the
hardware.
B: We run a small validation step, the driver container validation: a small workload that uses the accelerator just to verify that the drivers are loaded and that we can access the hardware. In this case we run CUDA vectorAdd. The nice thing about CUDA vectorAdd is that it allocates memory, so we know that memory is working, and it does some computation, so the compute cores are working. After running CUDA vectorAdd, we can be sure that the drivers are installed correctly.
B
Next
step
is
device
plugin
the
device.
Plugin
is
the
piece
that
exposes
the
accelerator
to
the
cluster
as
a
extended
resource,
so
pods
can
allocate
an
extended
resource
and
can
be
scheduled,
but
that's
what
we
are
doing
as
a
next
step.
The
device,
plugin
validation,
is
simply
the
same
workload
before
by
just
allocating
a
extended
resource.
So
we
want
to
make
sure
that
the
extended
resource
allocation
works
and
again
that
we
can
run
a
gpu
workload
on
the
cluster.
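A minimal sketch of such a validation pod; the image tag is illustrative, any CUDA vectorAdd sample image works:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.2.1  # illustrative tag
    resources:
      limits:
        nvidia.com/gpu: 1  # the extended resource advertised by the device plugin
```

The scheduler only places the pod on a node where the device plugin has advertised nvidia.com/gpu, so a successful run validates both scheduling and the driver stack.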
B
We
have
enabled
the
hardware
we
exposed
it
to
the
cluster.
The
next
step
is
monitoring,
so
we
are
setting
up
prometheus
and
grafana,
adding
metrics
a
custom,
note
exporter
that
exposes
gpu
metrics
and
we
are
also
adding
alerts,
so
it
is
visible
in
the
cluster.
If
something
happens,
if
it's
overheating
malfunctioning
and
stuff
like
that,
when
monitoring
is
done
as
an
optional
step,
we
can
deploy
a
siteguard
container
for
nfd.
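A sketch of such an alert as a PrometheusRule; the metric name assumes a DCGM-style GPU exporter and the threshold is arbitrary:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: openshift-monitoring
spec:
  groups:
  - name: gpu
    rules:
    - alert: GPUOverTemperature
      expr: DCGM_FI_DEV_GPU_TEMP > 85   # assumed exporter metric; adjust to your exporter's metric name
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "GPU on {{ $labels.instance }} is running above 85C"
```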
B
Nfd
has
a
hook
mechanism
where
you
can
exploit
nfd
to
to
extend
the
labeling
system
with
your
custom
sources,
and
in
this
way
we
are
using
the
sidecar
container
for
nfd
to
expose
more
sophisticated
features
of
a
gpu
like
the
gpu
type,
the
gpu
name,
the
memory,
the
cuda
cores,
firmware
version,
driver
version
and
stuff
like
that.
So
in
the
case
of
aiml,
you
want
to
use,
for
example,
a
v100
for
training
and
want
to
use
a
t4
for
inferencing.
B
So
with
the
feature
discovery
and
those
those
exposed
new
labels
with
a
prefix
like
nvidia.com,
you
have
more
fine-grained
scheduling
to
see
your
workloads
to
a
specific
gpu
nodes.
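For instance, with a label such as nvidia.com/gpu.product exposed by the sidecar (the exact value shown is hypothetical), a training pod can be pinned to V100 nodes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  nodeSelector:
    nvidia.com/gpu.product: Tesla-V100-SXM2-16GB  # hypothetical label value from the NFD sidecar
  containers:
  - name: train
    image: registry.example.com/trainer:latest    # hypothetical image
```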
B: Every accelerator stack is different: some need more features, some need fewer, but SRO is extensible enough that you can add as many states as you want, or just leave it at the driver container. SRO is also capable of handling several driver containers for several vendors, and it really depends on the accelerator stack what is actually needed and what needs to be deployed. We will also see how SRO can handle updates in an OpenShift cluster.
B
And
other
stuff
directly
into
the
manifest
you
are
exposing
and
all
of
this
I
see
it
sro
as
a
small
state
machine,
because
if
you
are
deploying
the
driver
container,
it
makes
no
sense
to
deploy
the
device
plug-in
in
parallel,
because
it
will
just
not
work.
We
have
some
parallelism.
We
are
steering
everything
with
labels,
so
in
sro
we
have
a
a
sequential
approach
and
then,
where
it's
capable,
we
are
also
running
stuff
in
parallel.
B
So,
for
example,
when
we
are
deploying
the
device
plug-in,
we
are
running
the
monitoring
feature
discovering
and
other
parts
in
parallel
to
enable
it
in
sro
and
sro
has
been
used
with
several
vendors.
Now
prime
example
is
now
nvidia,
which
we
have
a
solar
flare,
also
used
sro
to
enable
their
hardware,
and
I
will
get
back
to
another
use
case
where
we
use
sro
as
a
visibility
study
or
as
a
poc
vehicle
to
enable
hardware
on
on
openshift.
B
Besides
that,
sro
also
supports
some
more
configuration.
We
can
hit
hard
or
soft
petitioning.
I
will
come
back
to
that
later.
We
can
choose
the
driver
versions,
we
can
use,
and
some
other
configurations
for
driver
container
building
and
all
of
those
states
are
driven
by
custom,
manifests
that
are
in
a
conflict
map.
So
during
the
run
time
of
sro,
you
can
live
edit.
B
Let
me
get
into
detail
how
sero
is
dealing
with
updates.
So,
let's
start
with
the
normal
use
case,
we
have
deployed
nfd
and
sro,
so
the
first
step
is
nft
detects
the
kernel
version
and
labels,
the
node.
We
have
a
worker
node
kernel
running
for
18080,
the
nft
worker
diem
set
detects.
It
sends
a
message
to
the
nfd
master
and
the
nfd
master
through
the
kubernetes
api
labels,
the
node
and
says:
okay,
I'm
running
with
kernel,
418
080..
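After this step the node object carries a label like the following (the node name is hypothetical; the label key is NFD's standard kernel feature label, and the value is abbreviated):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: worker-gpu-0   # hypothetical node name
  labels:
    feature.node.kubernetes.io/kernel-version.full: 4.18.0-80   # written by the NFD master
```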
B: The next step is that SRO reads those labels, uses BuildConfigs to build a driver container for 4.18.0-80, and pushes it to the local image registry.
B: So what happens if an update comes in? NFD will, of course, detect that we have a new kernel version and will relabel the node. SRO will detect this mismatch and use the new information: it gets injected into the BuildConfig, which effectively says, okay, now build me the driver container for this specific kernel version, and the result is pushed again to the image registry.
B: SRO will then update the GPU driver DaemonSet with the new version; the new image will be pulled by the driver DaemonSet, which is restarted, and the complete stack is restarted as well, so that all the changes that come with the new driver are applied to the complete stack that is already deployed.
B: The Special Resource Operator is a way to enable special resources the OpenShift way. I've included several links here; the first link is the GitHub repository.
B: The next link is about using entitled image builds to build driver containers with UBI on OpenShift, which is what we use to build the driver containers so we always stay in sync with the kernel versions and the user-space libraries, on top of tested, bug-fixed, and CVE-checked UBI images. I've written two blog posts, part one and part two, for those curious about the inner workings of SRO. Part two really covers how the API works, how OpenShift and Kubernetes work internally, and which SRO building blocks we provide to enable hardware.
B: The last link covers how to deploy the NVIDIA GPU operator on OpenShift and use the cluster autoscaler to scale the RAPIDS image. For those curious about the relation between the NVIDIA GPU operator and SRO: the NVIDIA GPU operator is a fork, a one-to-one copy, of SRO. After NVIDIA picked up SRO, they of course relabeled it and extended it for their needs, but mostly it works like I just described.
B: So what else can we do with GPUs? We have the initial GPU cluster configuration with one GPU node. What works today is using the cluster autoscaler for on-demand GPU nodes: if you have, for example, a pod pending with the extended resource nvidia.com/gpu, you can create a cluster autoscaler for specific MachineSets to autoscale GPU nodes. We are also currently working on enabling HPA to work with GPUs.
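A sketch of the autoscaling piece, reusing the hypothetical GPU MachineSet name from earlier:

```yaml
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: gpu-machineautoscaler
  namespace: openshift-machine-api
spec:
  minReplicas: 1
  maxReplicas: 4
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: mycluster-gpu-us-east-1a   # hypothetical GPU MachineSet
```

Combined with a ClusterAutoscaler resource, a pending pod requesting nvidia.com/gpu triggers a scale-up of this MachineSet.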
B: What we want with hard partitioning is to schedule only GPU pods on the GPU nodes, without having any CPU pods interfering with our GPU nodes. The other way is soft partitioning: we allow CPU and GPU pods on the GPU nodes, but with priorities, so any GPU workload, any pod that comes in with a high priority versus a low priority, is first to run on the GPU node rather than a CPU pod.
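A sketch of both patterns together; the taint key and PriorityClass name are hypothetical, things you would define yourself:

```yaml
# Soft partitioning: a custom priority for GPU workloads.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-high-priority   # hypothetical name
value: 1000000
globalDefault: false
description: GPU workloads run first on GPU nodes
---
# Hard partitioning: the pod tolerates a taint applied to the GPU nodes,
# e.g. via: oc adm taint nodes <gpu-node> nvidia.com/gpu=:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  priorityClassName: gpu-high-priority
  tolerations:
  - key: nvidia.com/gpu     # hypothetical taint key
    operator: Exists
    effect: NoSchedule
  containers:
  - name: app
    image: registry.example.com/cuda-app:latest  # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1
```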
B
We
can
also,
of
course,
combine
those
patterns
of
partitioning
combining
both
that
we
are
only
running
gpu
nodes,
but
now
we
have
also
high
priority
and
low
priority
running
on
those
gpu
nodes.
B
Another
thing
that's
working
with
gpus
is
coders,
so
you
can
have
coders
per
namespace,
multiple
namespaces
for
cpu,
mem
and,
of
course,
for
extended
resources
like
the
gpu.
With
all
these
features,
you
can
go
on
and
maybe
create
even
some
more
roles.
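A minimal quota sketch for a hypothetical namespace, capping CPU, memory, and the GPU extended resource:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a                # hypothetical namespace
spec:
  hard:
    requests.cpu: "16"
    requests.memory: 64Gi
    requests.nvidia.com/gpu: "4"   # extended resources can be capped per namespace too
```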
B
You
can
have
clustering
nodes
with
specific
roles
for
gpu
training
or
for
gpu
inferencing,
depending
on
your
gpu
type
or
what
or
what
you're
exactly
want
to
do
in
your
gpu
cluster
by
default,
the
infra-only
parts
and
master
nodes
are
already
tainted,
so
you
could
also
have
things
on
your
cpu
nodes
and
priorities.
B
The
other
thing
I
want
to
mention,
if
you
have
several
gpu
nodes,
you
want
to
have
them
interconnected
with
a
high
high
speed
interconnect.
So
this
is
where
maltese
comes
into
place
and
for
those
people
who
don't
know
what
maltese
is
maltese
provides
secondary
interfaces
to
the
parts,
so
you
can
have
a
separate
data
plane
for
your
gpu
node.
So
that
is
important,
worry
that
you
are
interacting
with
the
api.
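A sketch of such a Multus secondary network, assuming a macvlan attachment on a hypothetical host interface:

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: gpu-data-plane   # hypothetical name
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens2f0",
      "ipam": { "type": "host-local", "subnet": "192.168.100.0/24" }
    }
```

A pod joins this data plane by adding the annotation k8s.v1.cni.cncf.io/networks: gpu-data-plane; its primary interface stays on the cluster network.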
B
So
with
the
gpu
nodes
and
nv
stack,
you
already
have
something
like
a
separate,
compute,
plane,
independent
and
out
of
reach
of
the
api
you're,
just
running
your
workload
on
the
gpus
and
now
not
to
use
the
internal
network,
but
the
dedicated
network.
You
can
use
multis
and
what
we
are
currently
working
on.
Is
we've
used
sro
to
show
and
make
a
poc
how
to
enable
rdma,
completely
containerized
in
in
openshift
with
running
either
over
infiniband
or
ethernet.
B
So
all
the
pieces,
like
the
moffett,
stack,
the
gpu
direct
driver
and
all
other
pieces
that
we
need
for
gpu
interconnects
are
containerized
and
running
on
reddit
cores,
the
company
or
the
vendor.
We
are
currently
working
on
used
again
sro
as
the
driving
vehicle
to
test
this
poc,
and
now
they
are
implementing
on
top
of
sro,
and
with
this
with
this
template,
their
own
operator
to
enable
rdma
or
infinite
or
ethernet
or
or
gpus.
B: I've also included a link to a Google doc on how to enable GPUs with OKD 4.5. As for the future work we are doing: one piece is GPUDirect, to enable fast interconnects, and the other is MIG support. MIG support is a feature of the new GPUs for slicing up the GPU into several distinct pieces: a big GPU can be sliced into up to seven pieces, and those seven slices are recognized by OpenShift as seven dedicated GPUs.
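MIG slices then surface as their own extended resources, so a pod would request a slice rather than a whole GPU; the resource name below follows NVIDIA's MIG naming scheme and is an assumption for illustration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-workload
spec:
  containers:
  - name: app
    image: registry.example.com/cuda-app:latest  # hypothetical image
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1  # one 1g.5gb slice of a big GPU (assumed resource name)
```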
A: Indeed we do. Zvonko, that was a really great presentation. Someone else is asking: it's available on OKD; is it also available on top of FCOS? I'm assuming that's a CoreOS operating system, but I could be wrong.
B: That's Fedora CoreOS. I just switched to my terminal: oc get clusterversion.
B
We
have
here
a
okd
4.5
running.
We
don't
use
s
0
here
we
can
use
the
nvidia
gpu
operator
here.
I've
included
the
instruction
how
to
use
it
oc
get
part.
We
have
all
the
usual
suspects
on
okd
running
and
if
I
do
aoc
get
notes,
this
is
my
last
oc
describe
note.
B
Rep
10
de.
I
just
want
to
show
that
nfd
works
out
of
out
of
the
box.
We
are
currently
working
on
edit
to
the
community
operators
on
operator
hub
so
that
it
shows
up
in
operator
for
opd
as
well.
So
we
have
a
gpu
in
here
and
if
we
grab
for
the
extended
resource,
we
can
see
that
we
have
one
gpu
in
this
cluster
zeros
are
allocated,
but
we
have
one
available
and
we
are
running.
B
Okd,
let's
see
debug
node.
B: Looking at the OS release, we are running Fedora CoreOS 32 here. We are working with NVIDIA to add Fedora support to their driver; currently we have to build our own driver container and push it to a registry, but we are working with NVIDIA to add this to the official repositories.
A: Great. So we also have a question from James; he's asking: is that AMD MxGPU?
A: I guess we could answer both: are we engaging with AMD right now, and if not, is there a plan to?
B
I'm
not
using
any
amd
gpu
but
amd
has
all
the
pieces
I
described
here
already
posted
they
have
a
pre-start
hook.
They
have
the
device
plug-in,
they
do
not
have
any
metrics,
they
don't
have
any
monitoring.
So
technically
it
should
work.
There
were
some
discussions
on
enabling
amd
here.
A: Okay, thank you. So we have a question here; someone's asking: so if I understand correctly, we do not change the Red Hat CoreOS or Fedora CoreOS host. The operator will install a driver container and push it to the OpenShift/OKD default registry, and that enables the GPUs on OKD. Is that correct?
B
I
just
have
that
question
yeah.
I
understand
the
question
so,
okay,
if
what
we
wanted
from
the
beginning,
even
when
started
with
atomic
on
openshift
3.8,
we
didn't
want
to
touch
the
host
at
all,
so
we
are
not
doing
anything
on
the
host
operating
system.
Why
we
want
to
make
updates
as
much
if
possible.
So
we
don't
want
to
write
somewhere
to
some
suspicious
directories
like
slash,
opt
and
then
clean
up
afterwards.
B
If
we
reboot
the
node,
we
don't
want
to
have
any
traces
of
anything
on
the
node,
so
everything
is
in
the
container.
S0
pushes
us
to
the
internal
registry.
Nvidia
has
used
a
different
path.
They
are
building
their
drivers
on
demand
in
the
driver
container.
So
if
you
are
deploying
the
nvidia
gpu
operator,
it
will
build
the
driver
on
demand
right
away,
as
I
pull
it
on
the
cluster.
A: All right, thanks for that, Zvonko. We also have another question from YouTube; Daniel's asking: are kernel headers included in Red Hat CoreOS?
B
I've
posted
one
of
the
links
on
nfd
cloud
on
the
nfd
no
teacher
discovery
page
is
the
link
how
to
entitle
your
cluster,
how
to
build
red
hat
software,
so
you
need
a
subscription
to
get
kernel,
devil,
kernel,
headers
and
kernel
core,
that's
needed
for
a
kernel
driver
in
the
red
hat
chorus
case.
You
need
an
entitled
cluster
to
get
access
to
this
to
the
software
on
federal
core
s.
We
don't
need
it,
there's
no
entitlement.
We
can
use
it
right
away.
B: One aspect is the granularity: fine-grained scheduling on the type of GPU and the features of the GPU. That would be, say, targeting the workload at V100s or T4s. The other aspect is the priorities. For every pod you can set a priority that you define yourself, and if you are using taints, only GPU workloads will run on a specific GPU node that carries those taints; we are excluding the CPU workloads if you are using taints.
B: That would be the hard partitioning part, where we repel all CPU workloads; and then inside a GPU node, or namespace, or however you cluster your GPU nodes, you would have priorities, and you can add as many priorities as you want. I just included high and low, but you can have more fine-granular priority classes, from one to ten or one to five; it really depends on what you want to achieve. Higher-priority pods, of course, run first and will evict lower-priority pods.
B: You can be sure that your high-priority pods run first on a GPU node.
A: Thank you. So here's another one, from Sandrino. He says a month ago he tried the NVIDIA operator on a test cluster, and the NVIDIA operator was trying to fetch kernel headers from the extended update support (EUS) repositories, and the entitlement does not cover EUS. So: does the flow that you showcased today solve the EUS issue?
B: Another enhancement we are working on is to provide a simpler way to build driver containers. This is currently in the works: hiding all those nasty details from customers, like where to fetch the kernel headers, whether to enable the extended update support repositories, or whether to fetch them from somewhere else. We are building something like a base image for customers so that they can easily build driver containers without thinking about which kernel version they are running or where to get the dependencies.
A: Thank you, Zvonko. I'm just checking if we have any other questions. W says thank you for clarifying the question. And, oh, we have another question here: will the feature discovery sidecar enable vGPU, that is, scheduling on something that is not a whole GPU?
B: If you're running, for example, OpenStack and install the vGPU compute server from NVIDIA, it will work to pass vGPUs through into VMs, and then, if you're installing OpenShift on OpenStack, you just need to install the NVIDIA GPU operator and it will work. The vGPU piece is on a different level: it's not on the OpenShift level, it's on the infrastructure level.
A: Okay, all right, thank you. So we also have another question here: is the scheduler aware of the GPU load?
B: You could do that. For example, SRO has sample implementations of the metrics and alerts. We alert the customer, for example, if the GPU is idle and they're wasting money, or if the temperature is too high and they have a problem with cooling. You could do the same thing for pressure: you could take a specific metric that tells you how much memory is used or how many cores are used, and expose this as an alert.
B: If a given threshold is exceeded, you can alert the cluster administrator, or the user who is sitting in front of the console, that, say, 99 percent of the GPU is used. But this will not otherwise be fed back into scheduling decisions.
A
Okay,
so
is
it
possible?
Is
it
possible
for
the
ports
to
share.
A: Okay, okay, yes. We'll jump back to the GPUs again: so, vSphere, if one enables vGPU licensing, then OpenShift, via the NVIDIA GPU operator...
A: I'm sorry, let me... I'm trying to understand this question as well. So: if one enables vGPU licensing, then OpenShift, via the NVIDIA GPU operator, will be able to see this?
B
Yeah
yeah
everything
that
is
visible
by
the
operating
system
and
exposed
as
a
pci
device,
which
is
done
in
vsphere
or
in
openstack,
which
is
essentially
like
something
like
a
pass-through,
pci
pass-through
and
the
operating
system.
That
is
deployed
on
top
of
this
infrastructure.
B
B: And if the GPU is detected like a normal device, which I suppose it is, the NVIDIA GPU operator will work.
A: Okay, okay, that's clear. Give me a second, let me see if we have any more questions coming in.
A: Okay, I think we're clear. Thank you so much for taking the questions. (B: My pleasure.) Yeah, that was a very useful session, especially right now with GPUs being such a hot topic. I'm just going to give it a few more seconds to see if we're going to get any more questions.
A: Okay, we're clear. So thank you so much; this was a very great and timely session, always insightful, always educational. Huge shout-out to Chris Short as well for promoting this session and making the live stream very seamless. And I guess with that, we'll let you all get back to your days. Thank you so much for joining.