A
Okay, so welcome everybody to the last session of the first day of KubeVirt Summit. By the way, all these sessions are being recorded, and as soon as we can edit the video we will get it up on the KubeVirt YouTube channel in case you missed anything. For our last session of the day we have Fan Zhang, who's going to be talking about...
B
Thank you. Yeah, hi everyone, thank you for joining my talk. My name is Fan. I'm a software engineer at NVIDIA, working on delivering globally deployed, massive-scale GPU cloud services as a foundation for some challenging workloads like cloud gaming, AI and machine learning, and other GPU-accelerated workloads at a large scale. So today I'm going to talk about a few bugs and findings from the VMI churn in our practice.
B
Okay, so yeah, I would like to start by talking about how KubeVirt is used in real NVIDIA projects, and something about our use cases that upstream doesn't cover. So NVIDIA is leveraging KubeVirt as the core part to build cloud native infrastructure and services on our on-premises data centers to support our global, multi-tenancy workloads, like streaming gaming.
B
We want KubeVirt in our stack to be resilient and reliable, and the workloads on it to be managed in a pure cloud native approach. So now take the streaming gaming use case as an example. The streamed games must run on a Windows virtual machine, the backend services must run on Linux virtual machines, and our infrastructure services are built on top of Kubernetes. So basically all of them must be running isolated.
B
We operate on the VirtualMachineInstance (VMI) objects directly. The workload runs in a highly intensive, dynamic manner: VMs burst with creation and deletion every minute, and most have a lifetime of no more than two hours, often even less. However, some critical services are expected to run for a long time.
B
So
some
big
qbs
in
the
skill
wait
is
bad
burst,
creation
rate
at
least
600
vms
per
minute,
and
normally
we
are
running
a
big
kubernetes
cluster
with
over
600
bare
metal
nodes
and
over
1
000
of
mis
are
running
every
minute.
B
So that's our use case. So let's move on to the bugs.
B
Okay, yeah, so here's the bug list. The first one is a VMI stale status issue. In our practice we noticed a couple of times that the launcher pod was removed from the Kubernetes cluster, but the VMI objects, the VirtualMachineInstances, were still in Running status.
B
So we looked into the logs of the Kubernetes and KubeVirt components, and we found that there were two issues behind it. The first one: virt-handler failed to sync the domain cache during a crash or termination.
B
A
reboot
rebooted
vert
handler
has
to
re-sync
with
the
api
server
and
the
rebuilt
of
the
local
informal
cache,
unlike
vmi,
which
is
a
persistent
fcd.
The
domain
informer
cache
is
completely
lost
during
the
word
handler
crash
or
termination
word
handler,
handles
resync
by
listing
launcher
circuit
files
in
the
host
path
and
adding
back
to
the
domain
format.
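A minimal sketch of that relisting step, with an assumed directory layout and socket suffix rather than KubeVirt's exact paths:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// relistLauncherSockets rebuilds a non-persisted cache after a restart by
// listing the launcher socket files left behind on the host, the way the
// talk describes virt-handler repopulating its domain informer. The base
// directory and the ".sock" suffix are illustrative assumptions.
func relistLauncherSockets(baseDir string) ([]string, error) {
	entries, err := os.ReadDir(baseDir)
	if err != nil {
		return nil, err
	}
	var sockets []string
	for _, e := range entries {
		if !e.IsDir() && strings.HasSuffix(e.Name(), ".sock") {
			sockets = append(sockets, filepath.Join(baseDir, e.Name()))
		}
	}
	return sockets, nil
}

func main() {
	socks, err := relistLauncherSockets("/var/run/kubevirt") // hypothetical path
	if err != nil {
		fmt.Println("relist failed:", err)
		return
	}
	for _, s := range socks {
		fmt.Println("re-adding to domain cache:", s) // placeholder for the informer add
	}
}
```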
B
Then, in the default worker process, it calculates the accurate VMI status based on the domain and the VMI existence.
B
If the pod is gone and the domain is unresponsive, the VMI should first be updated to the Failed status by virt-handler, and then virt-controller can take over this VMI for the subsequent finalizing and deleting. Kubernetes is a distributed system, so there might be multiple controllers running and trying to write to the same object simultaneously, or an object may be manipulated concurrently. So the Kubernetes API implements multi-version concurrency control to orchestrate the concurrent write operations. To update or re-sync an object, the controller must use the latest resource version. However, we noticed that the resource version of the VMI easily gets lost, and the current code base of virt-handler does not cover the situation when the resource version is empty. So for this issue, I'm going to add a fix to get the latest resource version and update the object in this case, and to add the get permission to virt-handler's RBAC.
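For reference, the standard client-go shape of that kind of fix looks roughly like this; a Pod stands in for the VMI and the label key is illustrative:

```go
package fixsketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// updateWithFreshVersion shows the standard optimistic-concurrency pattern:
// fetch the latest copy of the object inside the retry loop, mutate it, and
// let RetryOnConflict re-run the closure if another writer won the race.
func updateWithFreshVersion(ctx context.Context, cs kubernetes.Interface, ns, name string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		pod, err := cs.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err // includes the case where the object is already gone
		}
		if pod.Labels == nil {
			pod.Labels = map[string]string{}
		}
		pod.Labels["example.io/phase"] = "failed" // illustrative mutation
		// The Update carries the resourceVersion from the Get above, so the
		// API server's MVCC check sees a current version, never an empty one.
		_, err = cs.CoreV1().Pods(ns).Update(ctx, pod, metav1.UpdateOptions{})
		return err
	})
}
```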
B
The second issue: a pod is evicted on the reboot of the node, or if the node's NotReady status lasts more than five minutes, the pod will be evicted. Virt-handler has a periodic routine that checks the stale launcher sockets and marks unresponsive ones by creating a launcher unresponsive file in the same directory as the launcher socket. However, if the pod volume is completely removed on the host, there isn't a directory that can be written to, so the same error message happens repeatedly.
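A small sketch of that marking step, with the missing-directory guard that the bug calls for; file and path names are illustrative, not KubeVirt's actual watchdog code:

```go
package watchdog

import (
	"fmt"
	"os"
	"path/filepath"
)

// markUnresponsive drops a marker file next to a stale launcher socket, as
// the periodic health check in the talk does. The guard is the point of the
// bug: if the pod volume (and so the socket's directory) was removed along
// with the evicted pod, there is nothing to write to, and retrying the
// write forever just repeats the error.
func markUnresponsive(socketPath string) error {
	dir := filepath.Dir(socketPath)
	if _, err := os.Stat(dir); os.IsNotExist(err) {
		return fmt.Errorf("socket directory %s is gone; nothing to mark", dir)
	}
	marker := filepath.Join(dir, "unresponsive") // hypothetical marker name
	return os.WriteFile(marker, nil, 0o644)
}
```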
B
So
the
reason
for
this,
I
would
suspect
that
it
might
be
relating
to
the
ghost
record.
The
caching,
the
launcher
socket
pass
was
not
cleaned
up
properly.
B
That's
why
we
go
to
the
next
issue,
so
this
is
a
very
interesting
bug.
We
captured
this
when,
when
vmi
stuck
stuck
in
scheduled
status,
but
the
launcher
part
failed
with
an
error
showing
the
computer
come,
the
compute
container
was
terminated
whatever
this
vmware
is
recreated
or
we
tried.
The
result
was
the
same.
B
Yeah, let me talk a little bit about the ghost records as background. Each newly created virt-launcher pod needs to provide a launcher socket and register it into the /var/run/kubevirt-private ghost-records path, keyed by the VM UID, for caching. And every time virt-handler is started, it will read all the files in that path into its cache. The ghost record is used for guaranteeing that a VM's local data is cleaned up.
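As background, a hypothetical reduction of that ghost-record registration; KubeVirt persists these as files under a host path so they survive a virt-handler restart, and a map stands in here:

```go
package ghostrecords

import (
	"fmt"
	"sync"
)

// GhostRecord mirrors the structure described in the talk: one entry per
// VMI, remembering the launcher socket path and the VM UID.
type GhostRecord struct {
	SocketPath string
	UID        string
}

// Cache keys records by "namespace/name", the same key a rescheduled VMI
// with the same name will reuse.
type Cache struct {
	mu      sync.Mutex
	records map[string]GhostRecord
}

func NewCache() *Cache {
	return &Cache{records: map[string]GhostRecord{}}
}

// Add registers a launcher socket. A leftover entry under the same key but
// a different UID is exactly the stale state that blocks the bug's
// rescheduled VMI.
func (c *Cache) Add(key, socketPath, uid string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if existing, ok := c.records[key]; ok && existing.UID != uid {
		return fmt.Errorf("ghost record for %s already exists with different UID", key)
	}
	c.records[key] = GhostRecord{SocketPath: socketPath, UID: uid}
	return nil
}
```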
B
But
when,
while
word
handler,
we
initialized,
even
if
the
vmware
is
deleted
from
the
ipcd,
how
we
debug
this
issue
and
the
founder
the
root
cause.
There
are
some
clues.
First,
the
issue
happened
after
node
reported
from
an
already
situation.
B
The virt-handler pod was terminated by the kubelet during the same period as the NotReady status. So that's the first clue. Second, from the terminated compute container's log we saw it was "timeout waiting for domain to be defined", and from the virt-handler logs we saw that the error message was something like "unable to create virt-launcher client connection: ghost record already exists with a different UID".
B
This VMI is for critical services and is required to be deployed on one specific node; also the VM name and the namespace are specified. So it means the key of this ghost record is always the same, and this VMI will always be scheduled on the same node. Okay, so checking back on the timestamps: virt-handler was terminated on the node while the node was suffering from NotReady for more than five minutes. So the previous launcher pod was evicted, but the local data was not cleaned up; virt-handler did not have the chance to go through a successful cleanup process. The ghost record was a stale one. After virt-handler rebooted, every time a new VMI with the same key, namespace-slash-VMI-name, was spawned, virt-handler, using this same key, would pick up the stale ghost record in the path. So the VMI would never be processed, and the connection could not be built. That's why we saw the container fail with the timeout waiting for the connection.
B
So
looking
into
the
code
base,
I
think
the
fix
will
be
adding
a
cleanup
logic,
but
it
goes
to
record
this
could
be
done
by
extend
extending
the
logic
of
the
cleanup
when
deleting
the
old
domain
systems.
So
this.
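Sketching that direction against the hypothetical Cache shown earlier; the method name and semantics are assumptions about the fix's shape, not KubeVirt's actual cleanup code:

```go
// CleanupStale extends the Cache sketch above in the direction the talk
// proposes: when old domain state is cleaned up, also drop any ghost
// record left under the same key with a different UID, so a rescheduled
// VMI with the same namespace/name is not blocked forever.
func (c *Cache) CleanupStale(key, currentUID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if existing, ok := c.records[key]; ok && existing.UID != currentUID {
		delete(c.records, key) // stale record left by the evicted pod
	}
}
```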
B
I think this is a good example that we should be thinking about some corner cases, especially when a failure happens on the KubeVirt components: how do we handle the stale ghost records or old socket files?
B
Okay,
yeah,
okay,
so
the
editing
video
our
workload
could
be
very
intensive
and
at
a
large
scale.
So
we
are
experience
experiencing
something
that
hasn't
been
covered
upstream.
So
today,
I'm
going
to
talk
about
one
thing
is
at
the
largest
scale:
for
example,
one
thousand
vmis
deleting
a
lot
of
bmis,
can
cause
world
controller
to
panic
before
we
expand
the
root
cause.
Let
me
step
back
a
little
bit
and
look
more
abstract
on
how
kubernetes
controller
works
and
why
they
choose
the
event
event.
Here are the two ways to detect a state change for an event in the real world. One is the edge-driven trigger, which means that at the point in time the state change occurs, a handler is triggered. For example, the pod was Pending, and suddenly the pod is Running. So this is edge-triggered; it's not like polling.
B
The
second
one
is
the
level
trigger
level
triggers
means.
The
state
is
when
the
state
is
checked
that
the
regular
the
state
is
checked
at
the
regular
intervals
and
if
something
or
certain
conditions
happens
or
met,
then
the
handle
of
the
controller
is
a
trigger,
so
level
trigger
is
a
form
of
reporting.
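A toy Go sketch of the two styles, with a hypothetical getState function and event channel, makes the trade-off concrete:

```go
package triggers

import "time"

// pollLevel is level-triggered: the state is checked at a fixed interval,
// so reaction latency is bounded by the polling period.
func pollLevel(getState func() string, interval time.Duration, handle func(string), stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			if s := getState(); s == "Running" { // condition checked at poll time
				handle(s)
			}
		}
	}
}

// watchEdge is edge-triggered: the handler fires on each change event
// exactly when it is delivered, with no polling interval in between.
func watchEdge(events <-chan string, handle func(string)) {
	for e := range events {
		handle(e)
	}
}
```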
B
Noticing changes depends on the interval of the polling and how fast the API server can answer. So if many async controllers run simultaneously, the system will take a longer time to reach the desired state. On the contrary, edge triggering is much more efficient with many objects.
B
Worker threads in the controllers process the events. So the Kubernetes controller is designed based on the edge trigger, which we also call event processing. Yeah, let's refresh how a Kubernetes controller works. A Kubernetes controller has two main components: the informer and the workqueue. Informers have watchers under the hood to watch for changes in the current state of Kubernetes objects and send events to the workqueue. Then the events in this workqueue are popped by the workers to process, together with the informer's cache.
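As a reference for that shape, here is the standard client-go informer-plus-workqueue wiring, with a Pod informer standing in for the VMI informer; the function name is illustrative, not KubeVirt's exact code:

```go
package controller

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// buildController wires the two halves described above: an informer that
// watches the API server, and a workqueue that worker goroutines drain.
func buildController(cs kubernetes.Interface, stopCh <-chan struct{}) workqueue.RateLimitingInterface {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	factory := informers.NewSharedInformerFactory(cs, 10*time.Minute)
	informer := factory.Core().V1().Pods().Informer()

	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key) // workers pop namespace/name keys, not objects
			}
		},
		UpdateFunc: func(_, newObj interface{}) {
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key)
			}
		},
		DeleteFunc: func(obj interface{}) {
			// This key func knows how to unwrap DeletedFinalStateUnknown
			// tombstones; see the panic discussed below.
			if key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj); err == nil {
				queue.Add(key)
			}
		},
	})
	factory.Start(stopCh)
	return queue
}
```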
B
For
example,
delete
function
is
called
when
an
existing
resource
is
deleted.
It
gets
the
final
state
of
the
resource
if
it
is
known.
Otherwise
it
will
get
an
object
of
the
object,
type
that
delete
final
state
unknown.
B
So actually we also observed that, at a larger scale, edge-triggered events like delete have a higher chance to be missed by the watcher. When the delete event is missed, a DeletedFinalStateUnknown object is added to the DeltaFIFO of the VMI informer. But virt-controller picks up the object and blindly attempts to assert it to the VMI type, which was causing our runtime panic. So that is the root cause, and the fix is easy.
B
We
added
this
search
before
in
red
controller.
Every
time
word:
controller
trying
to
assert
the
object's
type.
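The canonical form of that check, from the client-go tombstone pattern, looks like this; a Pod stands in for the VMI type, and the unchecked single-value assertion obj.(*corev1.Pod) is what would panic at runtime:

```go
package controller

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// onDelete shows the standard tombstone-aware delete handler. When a watch
// event is missed, the informer delivers a DeletedFinalStateUnknown wrapper
// instead of the real object, so the type must be checked before use.
func onDelete(obj interface{}) {
	pod, ok := obj.(*corev1.Pod)
	if !ok {
		tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
		if !ok {
			fmt.Printf("couldn't get object from tombstone: %+v\n", obj)
			return
		}
		pod, ok = tombstone.Obj.(*corev1.Pod)
		if !ok {
			fmt.Printf("tombstone contained unexpected object: %+v\n", tombstone.Obj)
			return
		}
	}
	fmt.Printf("handling delete of %s/%s\n", pod.Namespace, pod.Name)
}
```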
B
All of these issues have some sort of relationship to pod crashes, node NotReady, unavailable networks, and I/O problems. So we're on the way to proposing some fault injection solutions for KubeVirt. To do this, we are running some scripts to randomly inject faults, but that's not enough. We have been experimenting with and investigating some chaos engineering tools, like Chaos Mesh, to do it. Hopefully we will have another talk on this.
B
The last topic is debugging. Debugging is hard and painful, and there are many challenges. The first one, as I see it: some tricky problems are not easy to reproduce, and this is a major blocker for debugging. Second, a bug may only be fired under some particular criteria, but we don't know the root cause most of the time, so we cannot reproduce it for debugging.
B
It is also very hard to capture all the information needed to debug. For example, the component logs are not sufficient to debug why the QEMU process terminated unexpectedly. We needed to search for every piece of clue to find the root cause. So yeah, okay, that's all for my talk today. This is my contact information, so feel free to shoot me a message and talk about anything.
A
Yeah, cool, okay, awesome! If you want to stop sharing your screen, we'll see if we have any questions.
A
So,
thank
you
very
much
for
that.
The
so
folks
have
any
questions
comments
about
fawn's
experience.
B
So,
as
far
as,
if
I
remember
correctly
in
video
gpu,
because
we
are
using
the
bare
metal
virtualization
platform
that
doesn't
support
the
migration
of
the
virtual
gpu,
so
that's
not
the
case
in
our
that's,
not
the
something
we
take
care
of
in
our
platform.
Also,
in
our
use
case,
we,
when
we
support
the
vmi,
the
vmi
running
spin
out
very
quickly
and
the
lifetime
is
very
short,
so
there
isn't
a
need
to
do
the
do
the
migration.
A
Okay,
cool
actually
wants
to
know
what
kind
of
monitoring
and
alerting
you
used.
B
We
use
the
promises
and
a
lot
of
exporters
to
to
grab
the
information
and
logs
from
the
cluster
and
pointing
to
the
dashboard.
So
that's
that's
a
major
tools.
We
are
using.
B
The
gpu-
that's
that
that's
nvidia,
has
a
lot
of
tools.
Nvidia
has
a
tool
to
monitor
the
gpus.
I
think
it's
africa's
them.
You
can
get
checked
on
the
nvidia
gpu
online.
B
Well, what VMI numbers could we scale to after fixing the panic issues? Yeah, so after fixing this problem, we can support over one thousand VMIs concurrently running in the cluster, so every minute there are hundreds of VMs created and deleted.
A
Yeah,
so
chris
is
commenting
that
nasa
is
actually
using
water.
Cooling
for
their
nvidia.
Gpus
really
need
to
answer
that,
and
I
guess
I
guess
do
you
want
to
share
your
contact
information
slide
again
sure,
just
because
andre
wanted
that.