From YouTube: SIG - Performance and scale 2021-09-16
Description
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.qs7aweajr18k
A
Okay, let's start with the first item. There are some new bugs that I reported pretty recently, and they're kind of interesting, so we're going to talk about two of them. There are actually some other related items I saw, but I want to investigate a little bit more before I post those issues, so we'll just talk about the two that I think I understand the most. Just to show some context on this:
A
We're doing some testing internally at pretty large scale, and we're seeing virt-controller panic when we delete a bunch of VMs. Not a ton, maybe a few hundred or so, but at that scale there are a lot of events occurring, and with that many events some of the edge-triggered events like deletes can be missed by the watch. Here's how the controller handles this:
A
The informer puts a special key on the queue, and that key is a different type than what virt-controller expects. It expects a VMI type, tries to do a type assertion on it, and that actually causes a runtime panic.
A
What actually ended up happening is that two of our controllers would basically alternate between panics, and it was hard to tell whether, or exactly how, it eventually healed. Perhaps there were enough events that the deleted final-state-unknown key was eventually flushed and the panics went away. It's hard to really say, but eventually it does heal.
A
So
there's
a
pr
open
to
to
fix
this
one.
That's
that's
here
and
and
roman
has
already
reviewed
it.
So
that's,
that's
one.
Are
there
any
questions
on
this
one
I'll
go
to
the
second
one.
If
there
aren't,
this
was
kind
of
neat.
A
Okay, so about this key: there are multiple object types, and we actually handle it for the other objects we watch; pods and data volumes have handling for this. I think it was just virtual machine instances that were missing it, and it only occurs in rare cases. I've never seen this before; it just showed up at this kind of scale, and it doesn't occur all the time.
A
So I don't know exactly what it is, but I think we just hit a point where we started missing some deleted events, and then it started to pop up.
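[For context, this "deleted" key is the DeletedFinalStateUnknown tombstone that client-go hands to a delete handler when the watch missed the actual delete event. A minimal sketch of the usual handling pattern, with illustrative names and import paths rather than the exact KubeVirt controller code:]

```go
package controller

import (
	"fmt"

	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"

	v1 "kubevirt.io/api/core/v1"
)

// deleteVMI sketches a DeleteFunc that tolerates missed delete events.
// When a delete was missed, obj is a cache.DeletedFinalStateUnknown tombstone,
// not a *v1.VirtualMachineInstance, so a bare type assertion would panic.
func deleteVMI(obj interface{}, queue workqueue.RateLimitingInterface) {
	vmi, ok := obj.(*v1.VirtualMachineInstance)
	if !ok {
		tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
		if !ok {
			fmt.Printf("couldn't get object from tombstone: %+v\n", obj)
			return
		}
		// The tombstone wraps the last known state of the deleted object.
		vmi, ok = tombstone.Obj.(*v1.VirtualMachineInstance)
		if !ok {
			fmt.Printf("tombstone contained unexpected object: %+v\n", tombstone.Obj)
			return
		}
	}
	// Re-queue by key as usual once we have a real VMI object.
	if key, err := cache.MetaNamespaceKeyFunc(vmi); err == nil {
		queue.Add(key)
	}
}
```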
A
Okay, let's go to the second one. This is actually from the same incident, the same scenario of deleting VMIs at large scale. In the virt-controller logs there are tons of these StatusReason: Invalid errors, which is a 422 error: basically the request you're making is well-formed, but the server is not processing it. I was looking around a little bit on this.
A
At this point virt-controller doesn't own some of these objects anymore; virt-handler has them, so virt-controller is trying to do a patch on the status field. It does that for the conditions and for this activePods field under status.
A
My thought is that it's this activePods field: something is failing when we try to patch it, it's not working, so we get these 422s. It isn't harmful, it's just that there are tons of them in the log, and since we're cleaning up the object anyway it would be nice to get rid of them. Here's an example of the error.
A
You can see it right here from just a few different objects, but it completely fills the entire log.
A
Yeah, and this is what I think is the code behind the error; you can see the matching error here. We try to do this patch with these patch bytes, and the two things I was mentioning are on the status field: we're patching some conditions and we're patching this activePods field.
A
I haven't fully tied it together, but it only really showed up when we were deleting. I hadn't noticed this prior, and it seemed like it only occurred at large scale; something just seemed off. We've done some other error-code analysis and we haven't seen this many 422s before, and now there's sort of an explosion of them.
C
A
No, I don't know who's rejecting it. It could be the validating webhook; I don't know, I couldn't tell.
D
A
Maybe it's in the API server, on the kube-apiserver side; I don't know. The reason I posted this one is that I think I have a somewhat better understanding of the first one than of this one, and there's still some more investigation to do to pinpoint it exactly. I can see the error.
A
I can see what we're trying to do, but, like you said, I don't know who's rejecting it, and I don't know how often we're doing this. I also don't know why; that wasn't quite clear.
B
Here's my theory. In this specific instance it's a patch that we're doing, correct? Yeah.
B
Yes. In our patch we're doing a JSON patch, and we have a test condition followed by the actual replace or add or remove, whatever we're wanting to do. The test condition says: here is the way we think this struct should look, based on what's in our informer and the information we have, and here's what we want to change. If that test condition fails, the whole thing fails, which is saying that the reality is the thing you're trying to patch no longer looks like what you think it looks like.
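[A minimal sketch of the kind of JSON patch being described, with made-up values; the exact paths and the way virt-controller builds them may differ. The point is that if the "test" operation no longer matches what the API server stores, the whole patch is rejected, which is what surfaces as the 422/Invalid errors above:]

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is one RFC 6902 JSON-patch operation.
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value,omitempty"`
}

func main() {
	// Hypothetical old/new values; the real controller derives these from its informer cache.
	cachedActivePods := map[string]string{"5c3b6c97-uid": "node-a"}
	desiredActivePods := map[string]string{}

	ops := []patchOp{
		// "test" guards the patch: it must match what the server currently stores...
		{Op: "test", Path: "/status/activePods", Value: cachedActivePods},
		// ...otherwise the following replace is never applied and the request fails.
		{Op: "replace", Path: "/status/activePods", Value: desiredActivePods},
	}

	patchBytes, _ := json.Marshal(ops)
	fmt.Println(string(patchBytes))
	// patchBytes would then be sent as a JSON patch (types.JSONPatchType) against
	// the VMI. If the informer data used to build the "test" value is stale, the
	// server fails the test and rejects the whole patch.
}
```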
B
What could be happening is that our informers are behind: we're constantly trying to make a change on the VMI, but the information we're using to make that change is inaccurate because the informers haven't caught up to reality yet. I made an optimization for the update path in the VMI controller that says: don't try to update this VMI again until we've seen the previous update arrive. I didn't do that for the patch, so it could be the same problem.
A
Sorry, what was that one? I don't remember the error code for it now, but I remember it.
B
But it was specific to the update API call, not the patch one. We do both, and I only addressed the instance where we're doing an update; I did not address the one where we're doing a patch, so it could be the same problem.
B
It's
just
manifesting
itself
differently
because
in
the
case
of
an
update,
for
example,
we're
going
to
fail
in
the
back
end,
because
the
updates
revision
or
whatever
is
not
accurate,
but
in
the
case
of
the
patch
we're
rejecting
the
update
or
the
patch,
because
the
patch
condition
is
failing
that
we've
supplied
so
we're
going
to
get
a
different
error.
I
think,
but
the
same
cause
could
be
invoked
for
both
like
it
could
be
the
same
underlying
cause
that
we're
trying
to
modify
an
object
and
the
object
that
we're
trying
to
modify.
B
A
Yeah, I'm trying to find your patch, if I have it here somewhere. Oh, "reduce VMI collisions", this is the one, yeah.
A
Yeah, that sounds plausible to me. Okay, 409, I said 429, okay, so yeah, it sounds possible. For additional context, like I said, this is happening at the same time the panic is going on, so we're restarting virt-controllers a lot and the informers are catching up a lot. That kind of led to another thing I wanted to explore
A
a little more, that is: the time it took for virt-controller to catch up. It took a while, because there are a lot of requests where it's like, hey, we want to update or do something with this object, but we're behind. There's a ton of those.
A
So yeah, that sounds like a good point. I'll tag this in here just as a reference, so we have it marked. Okay.
B
And the fix is relatively simple: we just have to follow the same expectation logic. It's only maybe 20 lines down that we're doing it for the update; we just need to do something similar for the patch. I need to look at it a little bit to make sure we're only issuing the expectation when the patch is actually going to change something, but I think that's the way it would work.
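[A rough sketch of what gating the patch behind an expectation could look like. The expectation tracker here is a toy stand-in written just for illustration; the actual fix would reuse the controller's existing expectations machinery:]

```go
package controller

import "sync"

// pendingChanges is a toy expectations tracker: per VMI key, it remembers
// whether we are still waiting to see our own last change come back through
// the informer.
type pendingChanges struct {
	mu      sync.Mutex
	pending map[string]bool
}

func (p *pendingChanges) satisfied(key string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return !p.pending[key]
}

func (p *pendingChanges) expect(key string)  { p.mu.Lock(); p.pending[key] = true; p.mu.Unlock() }
func (p *pendingChanges) observe(key string) { p.mu.Lock(); delete(p.pending, key); p.mu.Unlock() }

// patchStatusIfNeeded only issues the patch when we are not waiting on a
// previous change, and only records an expectation when the patch would
// actually change something. doPatch stands in for the real API call.
func patchStatusIfNeeded(p *pendingChanges, key string, patchBytes []byte, doPatch func([]byte) error) error {
	if !p.satisfied(key) {
		// Our previous change hasn't shown up in the informer cache yet, so
		// anything we compute now is likely stale; skip this round.
		return nil
	}
	if len(patchBytes) == 0 {
		return nil // nothing to change, don't record an expectation for a no-op
	}
	p.expect(key) // record before sending, so the informer event can clear it
	if err := doPatch(patchBytes); err != nil {
		p.observe(key) // the patch never landed, drop the expectation
		return err
	}
	// The informer's update handler would call p.observe(key) once the patched
	// VMI is seen, re-enabling further patches for this key.
	return nil
}
```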
A
Anyway, cool. So those are two new bugs, and like I said there are some other ones I was thinking of. Like I briefly mentioned, the time it takes for virt-controller to catch up could be another one, but I want to get a bit more measurement on that, a rough amount of time and some more logs, to get a better picture before I file an issue. It's something I've noticed, though.
A
Okay, all right, let's go to the next bullet point: reduce memory overhead for the launcher. This is a discussion you wanted to have here. We have a few issues open relating to all of this.
A
And I want to kind of consolidate here. Let's see, where's the first one? Okay, this one: "reduce memory overhead of launcher" is the first one, and the ones I thought we had some overlap on were this one, removing the monitoring process from the launcher to reduce its memory footprint, and then also the profiling of the control plane under high load. So my point is, I kind of wanted to talk about the goal here and see if we can outline some of the tasks. This sounds to me like one of the tasks that could actually go in here as a possible optimization for launcher, but I also want to talk about some of the others. Is it Daniel? Is that who this user is?
A
Hey, do you want to elaborate a little bit on some of your goals with this? Maybe we can enumerate some of them and see where we can go. Yeah, go ahead.
F
So the context is that there are the Outreachy internships, which are paid open-source community internships for diversity in tech, and we thought we need some projects the interns can work on, and this topic has been around for a long time. So we thought, why not do it? We don't really have a set path forward.
F
We just know that we can probably reduce the size with a separate binary for this, or if we rewrite it in a different language, and the other option would be to remove the forking altogether. But I guess that needs to be discussed, whether we can keep the guarantees that the forking process is providing.
A
C
G
Yeah, exactly. So I've tried to estimate removing this virt-launcher forking, and when I went back to when it was introduced, there was some special case that I think it covers. I think it's for container disks. I haven't really used this feature and I'm not sure we ever use it on our side.
G
So
we
thought
that
if
it's
not
critical
for
us,
we
could
just
just
remove
it,
but
I
think
if
it's
in,
if
it's
yeah
the
the
problem
with
with
forking,
is
that
I
I
don't
see
that
that
it
really
does
what
what
what
it's
supposed
to
do
because
like
there
is
always
a
chance
that
that
this
and
this
monitoring
process
can
go
away
like
I
don't
know
it
could
be,
killed
or
or
whatever,
and
then
there
there
is
no
monetary
process
and
there
is
no
one
to
to
clean
up
or
or
wait
for
for
those.
G
So
I
don't
see
really.
I
don't
really
see
a
reason
why
we
can
just
do
it
in
in
like
special,
go
routine
or
or
something
like
that,
that
that
would
do
this
like
watching
and
yeah
like
as
daniel
mentioned
it
this
this.
This
monitoring
just
just
adds
a
lot
of
memory
and
and
to
overall
so
yeah
that
should
be
discussed.
A
Okay, so maybe we can define some of this. How do we want to talk about the reason for the forking? Does anyone want to speak to that, in terms of what the benefit of it was? Okay.
D
Okay, great. So I want to add that the main purpose of it is as a precaution: it does very little by intention, which means it's very unlikely that this process crashes, while our virt-launcher process, the one which is actually talking to libvirt and to QEMU and so on, does a lot more things. If that crashes and it's our main process, the container is down and all the other processes would be stopped immediately. So yeah, it's a precaution.
A
Right,
so
we
don't
want
it
as
pit
one
right.
We
wanted
something
else
so
that
we
just
don't
immediately
once
it
fails.
We
just
go
away
like
so.
In
other
words,
we
it
makes
like,
because
I
I
understand
like
like
we
want
some
sort
of
something
else
to
to
to
be
there.
So
I
guess
the
question
is
like,
so
you
knew
you
said
created
another
go
routine
like
if
we
have.
A
B
Excuse me, sorry. If the virt-launcher pod crashes today, the point of the PID 1 and the forker is that we do some graceful cleanup of the QEMU process if it's still around, so we'll attempt to shut it down in a way that's not going to cause disk corruption and things like that. I think maybe that's the underlying reason why we even have a catch-all kind of thing like that. Originally it was a bash script.
B
What
you
said,
yes,
that's
different,
so
originally
we
had
a
bash
script.
That
would
do
this
kind
of
clean
up.
It
was
really
unwieldy
and
we
just
made
a
function
and
vert
launcher
that
would
do
it
and
then
we
just
decided
to
fork
launcher
from
vert
launcher.
I
think
we
can
achieve
the
same
thing
in
a
go
routine
and
have
it
catch
panics.
So
very
first
thing
we
do
when
burnt
launcher
starts,
is
we'll
create
this
go
routine,
we'll
have
it
catch
panics.
B
A
B
We don't have to fork; I think we can do it in a goroutine. When a panic occurs, we have the opportunity to still execute something. I think that will work, but maybe somebody else has more thoughts on that. If we don't do that, the other alternative is to create a really stripped-down forker.
B
That
only
does
exactly
what
we
need
that
fork
logic
to
do,
rather
than
loading
everything
that
vert
launcher
needs
because
there's
just
a
lot
of
dependencies
and
everything
to
get
started
with
vert
launcher
and
if
we
just
have
a
really
really
small,
thin
binary,
that's
just
in
charge
of
launching
burnt
launcher
and
then
sharing
a
vert
launcher
exits
that
the
commuting
process
is
torn
down
as
gracefully
as
possible.
Then
that's
great
and
the
result.
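[A toy sketch of what such a thin forker could look like, assuming the cleanup is just "signal whatever is left in the child's process group"; a real one would need the actual QEMU shutdown sequence, timeouts and escalation to SIGKILL:]

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Run virt-launcher as a child; the path and args here are illustrative.
	cmd := exec.Command("/usr/bin/virt-launcher", os.Args[1:]...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	// Give the child its own process group so we can signal the whole group later.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}

	err := cmd.Run() // blocks until virt-launcher exits, cleanly or not
	if err != nil {
		log.Printf("virt-launcher exited with error: %v", err)
	}

	// Best-effort cleanup: ask anything left in the child's process group
	// (e.g. a lingering QEMU) to terminate gracefully.
	_ = syscall.Kill(-cmd.Process.Pid, syscall.SIGTERM)

	if err != nil {
		os.Exit(1)
	}
}
```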
B
It's the same thing we do in all of our controllers, isn't it?
A
Okay, so do you think you have a path forward, with what was brought up about recover? Do you think that makes sense? Do you want to explore that?
B
Look at the recover command in Golang. I think we set a recover at the very top level of the hierarchy that will basically be a catch-all for everything; it's the very first line we have in the code. I would think that anything that causes a segfault from there would get caught. I could be wrong, but that's my expectation. It's not mine? Okay, maybe I'm totally wrong there, we should investigate that. I'm only like 50% sure now.
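[A minimal sketch of that idea; cleanupQemu is a hypothetical stand-in for the graceful teardown the forker does today. One caveat worth checking during the investigation: recover only works from a deferred function and only catches panics raised in the same goroutine, so it can only ever be best effort, as noted below:]

```go
package main

import (
	"fmt"
	"os"
)

// cleanupQemu is a hypothetical placeholder for gracefully tearing down the
// QEMU process before virt-launcher goes away.
func cleanupQemu() {
	fmt.Println("tearing down qemu as gracefully as possible")
}

func run() {
	// recover must be called inside a deferred function; it catches panics in
	// this goroutine only. SIGKILL/OOM kills or crashes of child processes are
	// not covered, which is why this remains best effort.
	defer func() {
		if r := recover(); r != nil {
			fmt.Fprintf(os.Stderr, "virt-launcher panicked: %v\n", r)
			cleanupQemu()
			os.Exit(1)
		}
	}()

	// ... normal virt-launcher work would run here ...
	panic("simulated crash") // trigger the cleanup path for demonstration
}

func main() {
	run()
}
```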
A
D
F
A
Yeah, if they don't mind. Well, so we could have something to recover from the panics, I guess, but like we said, that doesn't catch everything, right? What are the other cases we want to handle, like what if QEMU just crashes?
D
B
It's best effort. It's all best effort; it's just to maintain data consistency, I think.
A
Okay, does that make sense to you? Does that kind of satisfy this issue for you? You think you have it? Yeah.
A
Okay, all right, let's bubble up a little bit higher. We have this reduce-memory-overhead issue, and I think this is probably one of the items that can reduce the memory overhead there. So I guess what we could do is cross-tie the issues, and this could be one of them, but there are probably more. Maybe we can also do the profiling that we want to do.
A
Okay, all right, let's go to Marcelo with another evaluation report.
E
Yeah, so I ran the updated one with the main repository this week. It's running from 100 to 800 VMIs in the three clusters, and, I would say, maybe not to go into too much detail here right now, but something that's interesting here is from 800 VMIs.
E
Yeah, the first image now. Okay, the VM creation time. It has 600 and 800 as the two last runs, and the first one took about five minutes at the 95th percentile, you know, the worst case to create the VMI, and then with 800 it jumped to 10 minutes, so double the time to create the VMI. And it was not double the number of VMs, it went from 600 to 800. So I would say the creation here is not scaling very well.
A
Oh, I see it, okay. Six hundred, four hundred, two hundred, okay. So you're saying we double the time from six hundred to eight hundred: the slowest VMIs take ten minutes instead of five, so five minutes slower for the last two hundred roughly, or for the slower ones in that last run. Okay.
E
A
E
Yeah, we still have this stuck thread in the work queue that will need some investigation later; we can check that in the...
A
E
Yeah, but you see that it doesn't grow too much when we have more VMs being created; from 600 to 800 the difference is small. So it looks like it's not because of the scale of the VM creation, but because something is stuck in the code. That's the definition of unfinished work, the metric from Kubernetes: when this time grows, it means some threads are stuck. Not necessarily that that's what's happening in the code, but it might be that something is just running very slowly, something like that.
A
Okay, so we might be able to catch these with profiling, right? Maybe that's something we can look at. All right, I think this is a good one to create an issue about, and we can see if we can locate these stuck threads when we do some profiling.
A
Okay, let's see what else. Yeah.
E
Also, the work queue add rate, down here. I think we already have an issue for that, but it's still the case that the VM disruption budget controller is the most intensive one. We had some discussion about that previously here, and I don't know who mentioned that this controller, the disruption budget one, shouldn't actually be that intensive.
D
E
B
E
B
It shouldn't be; it shouldn't be creating that much work. Yeah, it'd be interesting to profile that. It's creating pod disruption budgets for virtual machines that have evictionStrategy: LiveMigrate, to ensure that the VMI can't be torn down while we're trying to drain a node, so we're forcing the eviction to fail and creating a live migration as a result.
G
B
A
Okay, all right, I can add this as another picture to that issue then. And then this one is kind of interesting: we have some virt-handlers that are a little bit higher.
A
You have a virt-controller up there too. What's virt-controller-vmi versus virt-controller-node? Is this just virt-controller, but different informers or different control loops or something inside virt-controller?
A
C
E
A
From the smallest to the largest, and they're all 10 seconds.
A
Let's see, so virt-controller-node gets pretty large, and then second place is virt-handler-vm, which looks like it's down here, pretty tiny. So virt-controller-node has a big retry rate.
A
Yeah, this one might be a good picture to go into the efficiency issue as another data point.
C
I think I've seen that retry rate before. I think it's when both virt-controller and virt-handler want to update the node object with labels, so whoever wins, the other has to back off and retry, or something along those lines. I saw lots of retries as well, anyway.
A
I'd have to look at our diagram again, and we've talked about that there, but that is kind of interesting. I mean, this is three nodes, yeah?
E
A
Yeah, with 100 nodes, if we're seeing this, it could jump quite a bit.
A
Okay,
I'm
going
to
add
I'll.
Add
these
to
the
to
that,
like
catch-all
card
that
has
like
the
efficiency
of
the
control
loop
efficiency,
whatever
it
is,
that
I
remember
know
what
it's
called
that
shoes
I'll
find
it
on
that
we
have
it
in
six
scale
document
but
I'll
add
those
to
it,
but
it
has
additional
data
points.
Okay,
unfinished
work.
A
We kind of saw this earlier; I know we already talked about it. So we have a bunch of unfinished work. The node and VMI controllers do not have much, but still, it's 37 seconds, eight minutes. I guess this is the total count, right? The total amount of time spent waiting or stuck; it's an accumulation, not just one thing. I guess it just depends how the metric is put together.
A
Okay, memory. We jump up here.
A
I wonder, Marcelo, after you delete right here, how long this takes to go down, what it looks like over time. We can see a slight dip here, but it's still kind of high and we've deleted the VMs at this point. Where's the count, where's our VMI count to compare, right here?
E
What seems to be increasing is the purple one, actually. We can...
A
Okay,
I
was
looking
at
them
as
if
they
were
they
were
combined,
but
I
think
the
purple
one
is
the
one
like.
So
it
goes
at
a
max
of
484,
but
it's
actually
like.
We
see
good
peaks
and
it
looks
like
it
comes
down
to
pretty
close
to
what
baseline
was
so
that
actually
looks
fine,
because
these
are
like
again
we're
saying
these
are
stacked.
Then
this
is
then
they're
all
doing
that
some
of
these
aren't
as
well,
but
it
looks
like
they
eventually
get
there.
B
Memory being what? Is that metric the memory of the process, or is it what Golang reports? Memory is weird.
A
E
B
So threads get spun off by the Go runtime depending on how many goroutines there are; is that accurate? I think that makes sense to me, right?
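[A quick way to sanity-check that from inside the process is to dump the Go runtime's own numbers and compare them with the container metric; a minimal sketch (the dashboard value may well be container RSS, which is not the same as any of these):]

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	// HeapAlloc is live heap memory; Sys is the total memory the Go runtime has
	// obtained from the OS, which includes goroutine stacks and spare heap.
	fmt.Printf("goroutines:      %d\n", runtime.NumGoroutine())
	fmt.Printf("threads created: %d\n", pprof.Lookup("threadcreate").Count())
	fmt.Printf("heap alloc:      %d bytes\n", m.HeapAlloc)
	fmt.Printf("runtime sys:     %d bytes\n", m.Sys)
}
```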
D
B
E
B
B
E
And as I mentioned before with Ryan, the etcd latency, you see it's fine, everything is under 10 milliseconds. I think you were doing some evaluation where there was a huge spike in etcd latency before, weren't you?
F
E
It's related to the previous one, also for the storage operations. I have another file showing what happens when I increase the rate limit, actually the burst and the queries per second. We can see here that, as I mentioned, deleting the VMs in this experiment had very low performance: when I delete the VMs it sometimes takes hours, or they don't delete at all, so I need to force the deletion of the VMs later.
E
By hand, you know. And this is directly related to the number of storage operation errors; you can see them in the figure on the right, the storage operation errors. The VMs that we're creating are using ephemeral volumes with emptyDir; maybe different volume types have different performance, but that's what we have here, and there are a lot of emptyDir errors where it gets stuck trying to unmount, and it doesn't delete the pod and it remains forever.
E
I don't know exactly what's related to that, but I actually don't see any VM deletion problem anymore after increasing this rate limit. So maybe we can go to the next link.
A
F
E
Yeah, I'm going to show it, okay. If you see here... no, this is not the one either. Sorry, it's not displaying very well.
A
E
Just a little bit more... oh yes, okay. So this is the configuration that I changed: I increased the burst to 100 for these components here, and the queries per second to 50. I think the defaults were five and ten, something like that, before. Maybe a question for Roman: are there other components here that we should also tune, or are these all the components? And the other question is, when I changed that, only virt-controller rebooted, well, restarted.
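[For reference, those two numbers map to the QPS and burst of the client-go rate limiter that each component's API client uses; a minimal sketch of what raising them means at the client level, using the same values mentioned above (how KubeVirt wires this through its own configuration is a separate question):]

```go
package main

import (
	"fmt"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load a kubeconfig; in-cluster config would be tuned the same way.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	// client-go throttles every request from this client: QPS is the sustained
	// rate and Burst is how far it may briefly exceed it. The historical
	// defaults (5 QPS / 10 burst) are what show up as client-side rate limiting
	// once a controller has a lot of work to do.
	config.QPS = 50
	config.Burst = 100
	fmt.Printf("using QPS=%v burst=%v\n", config.QPS, config.Burst)

	// Any clientset or KubeVirt client built from this rest.Config inherits the limits.
	_ = rest.CopyConfig(config)
}
```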
E
The pod restarted, but virt-handler didn't restart, so I don't know if it actually got the configuration, or if it doesn't need to restart to pick up this configuration anyway. That was my doubt when I applied it.
E
Yes. I didn't have time to put all the figures here, but the point is that it didn't change the VM creation time, and it also didn't remove all of the rate limiting. With this configuration it improved: it's not 500 milliseconds anymore, it's actually 250. If you go to the other figure, you see it's still there.
E
Exactly. So, surprisingly, I would say... I don't know what the relationship is there, but it's interesting, I would say.
A
I see, and yeah, here we got none, and then while it's over here we get quite a few, okay. And also by comparison, yeah, okay, I see, interesting. So yeah, it's a thinner graph, interesting, and none for that rate. Okay.
A
I wonder if it was literally one of these that did this. I'm guessing it's maybe the handler.
A
Cool, okay. And then did you want to talk at all about this one? This was the other one I saw, this one, I think.
E
Yeah, this is the one, yeah, let me go through it very quickly. This was the experiment of trying to run 500 VMs, and it was actually hard to do. I increased the timeouts for QEMU that we had before, and then many other things, changed the number of devices and so on, but in the end I did create 500 VMs at once, and actually the VMs...
E
It's
when
it's
get
like
close
to
the
cpu,
very,
very
close
to
80
or
90
percent
of
the
cpu
utilization
in
the
old,
and
also,
I
would
say,
not
also
eight
or
nine
percent
of
memorization
and
all
things
get
very
nasty.
So
the
operations
start
to
kill
you,
even
though
it
has
one
giga
of
memory.
You
know
you
know:
version
systems
start
to
kill
some,
some.
No
some
containers
and
the
via.
So
the
the
the
operation
system
also
is
very
slow.
E
It's
it's.
I
saw
I
log
into
the
node
and
even
though
it
drops
the
cpu
utilization,
for
example,
250
something,
but
it's
everything
is
very
slow.
I
see
a
lot
of
I
animation
interrupts.
You
know
calls
in
the
kernel
and
and
some
containers
being
killed,
but
I'm
not
I'm
not
sure
exactly
what's
happening,
but
when
it's
the
it's
saturated,
the
node,
it's
get
like
unstable.
That's
the
what
I'm
saying
I
also
test
with
three
different
runtimes:
the
docker
container
g
and
cryo
and
container
d
had
better
performance.
E
It's
the
cryo
and
docker
were
tying
me
out
to
create
the
containers.
With
far
you
know,
much
less
could
create
less
vms,
less
pods,
I
would
say,
and
and
then
I
could
create
safety
550.
E
A
So this creation time, Marcelo, that we're seeing here: it looks like fifty, a hundred, two hundred, and then from two hundred all the way to four hundred it's almost the same. It's like we hit a threshold here and then kind of leveled off, which is kind of interesting.
E
B
Let me double-check this, yeah, so if that's...
A
B
We can add more buckets there, or revise the buckets. I based them on what I thought would be realistic; well, that's an indication.
B
I mean, what's more data going to give us here? It's pretty terrible if it takes 10 minutes to get a VMI to running. And is that the p99 or something, or is that the average?
B
Okay, okay. Well, that's terrible. So I don't know if it matters whether we add more buckets; we need to figure out why it's taking so long.
A
Whatever we call that, I think this would be another one. I'll add it, if I don't have it already, to our profiling list, Marcelo, as a thing to look at, since you're kind of doing this already. It would be cool, when you hit right here, I mean, this seems like it's going to be exponential; I'm guessing we're going to be up here.
A
Yeah, you can actually see that this peak is not much of a difference, which is interesting, and then we basically double. This unfinished work metric is really an interesting one; maybe I'll attach these unfinished-work ones to that profiling issue. I think getting this information is going to be really helpful, it might give us some good stuff.
A
Okay, pretty cool. There are a lot of good charts in here, thanks Marcelo. All right, any last finishing words? We're at time, if anyone wants to bring anything up.
E
Yeah, just a very quick update on the continuous performance evaluation jobs there. The job's metrics were not being collected. I'm working on that with Frederico; we are debugging the configuration to see what's happening, why the metrics were not being collected. What we have is a global Prometheus that's running in the cluster, and then there is another cluster that is created for the job to run.
E
You
know
the
tasks
and
that
also
creates
a
promises,
and
this
global
parameters
needs
to
collect
the
metrics
of
this
local
prometeus
and
it
was
not
being
collected
so
so
so
we
don't
have
results
for
that,
and
the
new
graffana
dashboard
is
there.
I
think
maybe
we
will
have
it
married
soon.
So
let's
hope
for
that
yeah.
A
Awesome, yeah. I was going to say I have some ideas of what we could add to it, so that's awesome. Okay, all right everybody, have a good day, talk to y'all later, bye.