A
The minute just rolled over, so I'll go ahead and introduce you. I know I didn't get a chance to practice your name before we started, but yeah.
B
Okay, yeah, I'll get started. Thanks, everyone, for joining. Good morning, good afternoon, good evening. After we have heard many hardcore topics, let's discuss something more relaxing.
B
Today I'm going to discuss the Kubernetes performance visualizations at NVIDIA. First, some background information: KubeVirt is a core component in NVIDIA's cloud platform stack. Basically, we use KubeVirt to provide VM resources to the cloud gaming service. When we used KubeVirt, we found an issue, which is a lack of visualizations.
B
This makes it very hard to debug and triage issues. Usually, when we have a problem, we can only use command-line tools like kubectl to get the information, which is inefficient and not intuitive. Also, our cloud team is very big, so there are multiple teams: a cloud service team, a validation team, and SRE teams. And external customers usually find it very difficult to understand how Kubernetes works and how our platform side works.
B
Some important components, like the work queue, need to be visualized.
B
At NVIDIA we have different ways to visualize our platform: we have metrics-based visualizations, we have logs, and we have traces. Specifically for KubeVirt, we chose a Prometheus- and Grafana-based monitoring stack to visualize the KubeVirt metrics. Last year, upstream introduced some very interesting metrics for us to visualize.
B
We also have a VM phase count metric. With it we can know how the VMs are distributed inside our zones, like how many of them are in the Running phase and how many are in the Pending, Scheduling, and Scheduled phases. There is also a Kubernetes work queue metric.
B
The overall layout is: the lower the panel, the more detail it contains. On the top we have a user manual panel. This is mainly for external customers, because they don't have a very good understanding of KubeVirt. Then we have a performance indicator dashboard.
B
One of the most important indicators is the VM creation time. This is basically the most important indicator for KubeVirt, because on the cloud service side, what they care about most is how fast a VM can be created.
B
We put this indicator on top. It tracks the end-to-end time from VM creation to Running, and it also contains some stats for each phase, like the average and the 75th, 95th, and 99th quantiles of the time to reach a specific phase. The graph here tracks the time it takes from the previous phase to the Running phase.
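As a minimal sketch of where those quantile stats come from, assuming Prometheus-style cumulative histogram buckets: the panel's 95th/99th quantiles are linear interpolations inside the first bucket whose cumulative count covers the requested rank, which is what Prometheus's histogram_quantile() does. The bucket bounds and counts below are invented for illustration.

```python
# Sketch: estimate a quantile from Prometheus-style cumulative
# histogram buckets, the way histogram_quantile() does.
# Bucket upper bounds (seconds) and counts are illustrative, not real data.

def histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count), sorted by bound;
    the last bound may be float('inf')."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # no upper edge: return last finite bound
            # linear interpolation inside this bucket
            return prev_bound + (bound - prev_bound) * (
                (rank - prev_count) / (count - prev_count)
            )
        prev_bound, prev_count = bound, count
    return prev_bound

# e.g. time-to-Running for one phase: 1000 VMs total
buckets = [(1, 10), (5, 700), (10, 950), (30, 990), (float("inf"), 1000)]
p95 = histogram_quantile(0.95, buckets)
```

Because only the bucket edges are known, the quantile is an estimate; that aggregation is also why per-VM detail is lost, as discussed later.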
B
Basically, this panel visualizes the KubeVirt VMI phase transition time from creation seconds metric, so it displays the time it takes to reach a specific phase. In general it is a stacked bar histogram: we stack the different phases at the same timestamp together, so it is compact and intuitive for checking how long each phase takes. The most important usage is to check which phase spends the most time.
B
Then we know that phase has some issues and may need triage and a deep dive. In this graph, different colors represent different phases: yellow represents the time it takes to reach the Pending phase, light blue the time to reach the Scheduling phase, dark blue the time to reach the Scheduled phase, and green the time to reach the Running phase.
B
In this graph we find that the dark blue bar is always the longest, which means it takes the most time for the VMs to go from the Scheduling phase to the Scheduled phase. In other words, the VMs spend a long time in scheduling.
B
Basically, this graph provides very useful information on the health of each phase.
B
Actually, there is a shortcoming of the breakdown graph: the data in the transition time metric is aggregated through 95th-quantile calculations, so we lose some of the information from the metric. Information about individual VMs is lost during the 95th-quantile calculation. If you look at this graph, each bar represents the 95th quantile across all of the VM creations; it doesn't contain the information of individual VMs.
B
So this heat map panel contains many cells, and each cell represents a bucket in the histogram metric. The brighter the cell is, the bigger the number in that bucket. In this example, if we hover over this cell, we see that the count of this bucket is 1.18k, which means there are many VMs inside this bucket.
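As a rough sketch of what each heat map cell shows: the raw histogram series is cumulative, so the per-cell count is the difference between adjacent buckets, and the brightest cell is simply the largest difference. The bounds and counts below are invented for illustration.

```python
# Sketch: turn Prometheus cumulative histogram buckets into the
# per-bucket counts a heat map cell displays (brighter = bigger count).
# Bucket bounds (seconds) and counts are invented for illustration.

def per_bucket_counts(cumulative):
    """cumulative: list of (upper_bound, cumulative_count), sorted by bound."""
    cells = []
    prev = 0
    for bound, count in cumulative:
        cells.append((bound, count - prev))  # count falling in this bucket only
        prev = count
    return cells

cumulative = [(5, 300), (10, 1480), (30, 1500), (float("inf"), 1510)]
cells = per_bucket_counts(cumulative)
brightest = max(cells, key=lambda c: c[1])  # the cell drawn brightest
```

With these invented numbers, the 10-second bucket holds 1.18k VMs, so its cell would be the brightest in the column.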
B
Let's say that bucket is 10 seconds. This heat map tells us that there are many VMs with a phase transition time of 10 seconds. We also found that this other row is very bright, which means there are also many VMs with a phase transition time of, if we check the y-axis, five seconds. So you can see that previously most VMs had a transition time of five seconds, but now most VMs have a transition time of 10 seconds.
B
So this is basically very helpful. Another usage is that it can help us check for outliers. If many VMs land in the five-minute cell, or even the ten-minute cell, it means we have many outlier VMs with a very unhealthy transition time; the transition time is too long and it needs a deep dive and triage.
B
We also have VM distribution panels. One of the very important indicators is how the VMs are distributed across the phases: how many VMs are in the Running phase, how many are in the Scheduled phase, and how many are in the Scheduling and Pending phases. So if we check this graph and find that the count of running VMs is very large...
B
We know this zone is in general healthy, because most of the VMs are in the Running phase.
B
So
if,
in
some
conditions
that
we
found
that
the
the
running
vms
drops
a
lot
and
the
failed
vms
or
the
pending
vms,
the
count
of
them
increases
a
lot.
We
know
the
zone
may
be
in
an
unhealthy
state,
so
the
sru
team.
We
know
this
zone
is
not
healthy,
so
they
may
take
the
zone
offline
for
quick
maintenance
or
of
detox
deep
dive
and
triage.
B
We also have totals for succeeded and failed VMs. This can help us see how the health of the zone evolves over time. If we find that the failed VM count keeps increasing very fast, it means the zone may not be healthy and may need maintenance or something like that.
B
In general, there are many interesting use cases for this KubeVirt performance dashboard. The first typical use case is the heavy cloud gaming workload. This is the most typical and most important use case. Basically, NVIDIA's cloud gaming service often runs into high-load situations, like at night or on holidays.
B
Many
users
will
come
to
start
play
games
at
the
same
time.
In
this
case,
many
vms
will
be
created
in
very
short
time.
This
will
have
very
large
pressure
on
the
cloud
platform
side,
the
cloud
service.
Basically,
they
will
use
kibana
to
trick
track
vm
resource
informations
like
in
this
graph.
They
will
track,
they
have
a
vm
pool,
so
they
track
the.
How
many
vms
are
there
in
the
vm
post?
So
in
this
graph
we
see
that
under
some
very
high
load
situations,
the
vm
pool
will
drop
a
lot.
B
This
may
be
possibly
due
to
some
load
test
off
or
some
high
load
use
cases.
In
this
case,
the
vm
will
will
drop
a
lot
so
service
side
cloud
service.
I
use
kibana
to
track,
and
in
this
case,
on
the
platform
side,
we
need
bet
better
visualizations
as
well
to
to
triage
the
issue
on
our
side
to
make
sure
that
our
cloud
platform
side
is
not
the
why
the
problems
occurs.
B
So,
basically,
the
phase
transition
time
is
very
suitable
to
do
some
and
analyze
on
platform
side
like
so
this.
This
graph
is
captured
at
the
same
time
as
this
graph,
so
so,
on
the
server
side,
we
see
that
there's
a
vm
pool
drop
at
around
4
00
a.m.
B
So,
in
this
graph
the
platform
part
first
transition
time
breakdown
graph.
We
we
found
the
same
problems
happen
at
the
like
at
around
the
same
time.
It
also
happens
around
the
4
a.m.
We
found
that
the
the
light
blue
bar
increases
very
fast,
and
also
we
see
that
the
count
of
running
vms
decrease
a
lot.
So
this
is
inconsistent
with
the
service
side
graph,
so
how
what
this
graph
tells
us.
B
So,
since
the
the
scheduling
bar
is
very
very
long,
so
it
means
it
takes
most
time
spent
most
time
to
reach
a
scheduling
phase.
So
in
kubernetes,
if,
if
a
vmi
spent
a
long
time,
you
reach
scheduling
phase,
it
means
usually
it
is
keep
in
the
pending
phase.
So
usually
it
is
probably
due
to
the
lack
of
system
resources
like
lack
of
gpus
or
laptop
memories
that
cause
this.
B
So,
with
this
graph
we'll
quickly
know
what
may
be
probably
the
issues
and
they
can.
We
can
resolve
these
issues
like
by
adding
more
system
resources
or
cleaning
up
some
orphan
node
that
take
up
most
resources,
so
in
general
this
helped
us
to
triage
the
issues
in
the
cloud
gaming
workload
and
platform
side.
B
We also found that whenever there is a large VM deletion operation, it may cause the virt-controller to panic.
B
So
whenever
the
vert
controller
to
is
panic,
it
cannot
expose
any
matrix.
So
on
the
graph,
we
will
see
that
the
the
graph
is
interrupted,
because
the
word
controller
is
panic
and
it
cannot
expose
any
matrix.
B
So
with
this
graph,
it
is
easy,
very
easy
for
us
to
detect
some
situations
like
the
vertical
controller
panic,
and
we
also
know
why
it
is
panic
and
we,
and
if
we
check
the
log,
we
can
find
why
it
is
panic.
B
So
the
panic
for
this
is
due
to
the
deleted
final
state
unknown
object.
It's
not
properly
handled
that
causes
an
a
panic.
So
usually
we
need
to
add
some
error
handling
logic
to
make
sure
that
it
not
cause
very
severe
issues.
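For context: in Kubernetes's client-go, when an informer misses the final delete event, the delete handler receives the object wrapped in a cache.DeletedFinalStateUnknown tombstone rather than the object itself, and code that assumes the raw type can crash. The defensive unwrap looks roughly like this, transliterated to Python for illustration (the class and handler names here are stand-ins, not the real API):

```python
# Sketch of the tombstone-unwrapping pattern from client-go delete
# handlers, in Python. DeletedFinalStateUnknown / on_delete are
# illustrative stand-ins for the Go types involved.

class DeletedFinalStateUnknown:
    """Stand-in for the tombstone the informer cache emits when the
    final delete event was missed; it wraps the last known object."""
    def __init__(self, key, obj):
        self.key = key
        self.obj = obj

def on_delete(obj):
    # Without this unwrap, treating the tombstone as the resource
    # object is the kind of mistake that triggers a panic during
    # bulk deletions.
    if isinstance(obj, DeletedFinalStateUnknown):
        obj = obj.obj
    return f"cleaned up {obj['name']}"

vmi = {"name": "vmi-42"}
tombstone = DeletedFinalStateUnknown("ns/vmi-42", vmi)
```

The handler then behaves identically whether it gets the object directly or via a tombstone.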
B
So,
in
summary,
the
newly
introduced
phase
transition
matrix
provides
a
very
good
approach
to
visualize
the
covert
performance.
It
can
be
used
by
external
customers
to
quickly
overview
the
behavior
of
kubernetes.
So
they
don't
need
to
deep
dive
into
the
details
or
keep
asking
us
about
how
whether
the
the
platform
is
healthy.
You
can
use
just
use
the
indicator
panel
to
check
whether
the
zone
is
healthy.
B
It can also be used by developers to detect hidden bugs, because the Grafana dashboard is real-time, so it is very suitable for checking when and where a problem happens. Then, to dive deeper, they can check the logs in Elasticsearch to triage the problem.
B
Currently, the dashboard is staged in NVIDIA's cloud platform production environment, and it helps with visualization in the heavy cloud gaming workload scenario.
A
Thanks for your presentation; I really liked the case studies you went through there. We do have some questions in the chat. First off, Andre was asking if NVIDIA is working on live migration of VMs with GPUs.
B
Yeah, I think there may be some design and discussion ongoing, but as far as I know, we currently don't have it yet. Usually we need a maintenance window to achieve this, actually.
A
Okay. And Andre also asks whether you can monitor the GPU temperature with Prometheus and your plug-ins.
B
Yes, I think there are actually several open-source exporters, like the NVSM exporter and the SNMP exporter; I think they can be used to monitor GPU temperature, and I think it is already implemented in our platform. Basically, nvidia-smi exposes some interfaces, and there are Golang SDKs for us to write a quick exporter that retrieves the temperature information and then exposes it as Prometheus metrics. I think it is doable, yeah.
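As a sketch of that idea: nvidia-smi can emit machine-readable CSV (e.g. `nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader`), so a quick exporter only has to parse that output and render Prometheus exposition text. The metric name gpu_temperature_celsius below is made up for illustration.

```python
# Sketch: turn `nvidia-smi --query-gpu=index,temperature.gpu
# --format=csv,noheader` output into Prometheus exposition text.
# The metric name gpu_temperature_celsius is illustrative.

def parse_temps(csv_text):
    """Map GPU index -> temperature in Celsius from nvidia-smi CSV lines."""
    temps = {}
    for line in csv_text.strip().splitlines():
        index, temp = [field.strip() for field in line.split(",")]
        temps[int(index)] = float(temp)
    return temps

def render_metrics(temps):
    """Render one gauge sample per GPU in the Prometheus text format."""
    lines = ["# TYPE gpu_temperature_celsius gauge"]
    for index, temp in sorted(temps.items()):
        lines.append(f'gpu_temperature_celsius{{gpu="{index}"}} {temp}')
    return "\n".join(lines)

sample = "0, 41\n1, 56\n"   # example nvidia-smi output for two GPUs
metrics = render_metrics(parse_temps(sample))
```

A real exporter would run the command periodically and serve this text over HTTP for Prometheus to scrape.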
A
We have a number of questions interested in how your user load works out. One of them, from Daniel, was: what is the average time that users will tolerate until there's some sort of churn?
B
If we cannot provision a VM, then we cannot start streaming, and then users cannot play games. So it is very obvious, and every user can observe it: if the game cannot start, it leads to a very long loading dialog.
B
The maximum number of concurrent VM creations, I think, is in the hundreds, and sometimes it may reach thousands, yeah. Our colleagues Ryan and Fam may have some better data as well, yeah.
A
And do you plan to make the dashboards publicly available, SJ asks?
B
I think currently it is mainly for internal use, so we may not make it publicly available yet, because it is deeply coupled with our zone configurations. So maybe it's hard for external customers to use.
A
Okay,
chris
has
a
final
question
or
a
question
I
think
in
terms
of
measuring
the
the
average
and
maximum
number
of
concurrent
requests
like
we
were
talking
about
earlier.
You
kind
of
group
those
by
minute
by
second
like,
what's
your
time
bucket.
A
Somebody else is sharing a Grafana dashboard; Marcelo was mentioning one that's useful, but.