From YouTube: SIG - Performance and scale 2021-06-10
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit
A: One of the topics I wanted to talk about, and it was particularly interesting because it's something we were actually looking at in the last few days, something we were dealing with. So, just to give some background: we were doing some load testing recently around the 700 VM mark. We were doing it between releases, because we were looking to upgrade. We were running 0.24 and we were moving to 0.35, and we were noticing some latency with API calls. This is a partial snapshot of the Kubernetes API call latency, and you can see that there's a huge difference between 0.35 and 0.24: with 0.24 there's basically no activity, and then with 0.35 there's actually quite a bit of activity, to the point where, when we hit 700 VMs (I think that's roughly about 200 nodes), the LIST API call latency explodes up to a minute. I think it even goes higher as soon as you go higher; I don't know, in this graphic it just caps at a minute, but everything balloons. Let me even zoom in more: you can see how UPDATE balloons to four seconds. Normally the baseline, you can see it down here along with some of the other ones, is in the milliseconds, and it explodes quite a bit, as opposed to 0.24.

I thought this was interesting and figured I'd mention it, because we don't know the cause, what it was in 0.35 that caused this. But it was interesting to actually measure this and then see a major impact, because one of the goals we set, at least for this SIG, was to try and stay at less than one second of latency for the API calls; that's generally the goal that Kubernetes has set. So it was interesting to see this. And I've even provided more information: we see from events that the virtual machine instance events are exploding. The number of events is astronomical compared to other things; there are very few pods, very few of anything else, and there are just tons and tons of LIST calls coming from the virt-handlers that cause this. So I figured I'd mention it in case people had some ideas, or we could talk about some of the details; I'd love to.
B: The watcher, yeah. I think even you did the fix, or was it Marcus? Could be, too.
C: So you said you pulled it in. Were you able to rebuild and then deploy? How do we know that this bug fix made it into your test environment?
A: Yeah, so we did, we pulled it in. We then built with it and deployed with it.
C: And are we certain that the build manifests are referencing the newly built containers that you had? I guess I'm just trying to make sure that there wasn't a scenario where we rebuilt everything, but the manifests still reference an old container version or something like that, instead of the one that you all built.
A: Yeah, we went through the normal build process on this, and yeah, I'm pretty sure it's in there; I'm pretty confident it's in there. So let me see, where is it... here it is, this one. Is this the one you're talking about? This was the one we kind of backported.
A: Okay. Kevin, are you asking me?
A: Yeah, sorry, your question was: you want to see what else slows down?
F: Yeah, which processes: do some processes have a huge load, does garbage collection explode, on a lot of the machines? Like, where the slowdown might come from.
A: Where were they slowed down, too? So, all I have here is just the API call latency, so everything was affected in the cluster; because this is the kube API, everything was affected. I mean, I'm not sure I'm answering the question, but I don't have the metrics for every other load. The only thing that changed in this cluster was launching a lot of VMs really quickly.
C: Do we know where the list calls are coming from? In that metric that you're looking at, there should be a way to say what pod the requests are coming from, so maybe we could do a filter over it.
E: Yeah, so we have the metric, so maybe we can just see which metric you are actually checking, and also which labels; then we can check which labels the metric has, and maybe we can figure out some more information.
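For reference, a minimal sketch of how that kind of filtering could be done programmatically: it queries Prometheus for the 99th-percentile LIST latency per resource using the client_golang HTTP API. The Prometheus address is a placeholder, and the exact label set on apiserver_request_duration_seconds depends on the Kubernetes version, so treat the query as an assumption to adapt.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder address; point this at the cluster's Prometheus.
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// 99th-percentile LIST latency per resource over the last 5 minutes.
	query := `histogram_quantile(0.99,
	  sum(rate(apiserver_request_duration_seconds_bucket{verb="LIST"}[5m]))
	  by (le, resource))`

	result, warnings, err := promv1.NewAPI(client).Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```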
A: Yeah, hold on, it's going to take me a little bit to get to some of this. So let me go to the first one. You wanted to see some of the events, so we can see... like, from the API server, I can look at some of the list calls. Is that what you're looking for?
A: Yeah, let me... there you go.
C: We can maybe follow up on this.
C: Yeah, we'll find out. That's certainly a good finding; I think that would be an interesting thing to present as well, sometime in the future.
A: Oh, you think it doesn't have... is that with that list call, you're saying?
C: Well, we're going to get a list of virtual machine instances due to the informer on all the nodes, so you'll still see...
B: Yeah, yeah, that's true. You're right, yeah. There is.
C: For sure, right. We're seeing multiple lists; this is all within a second.
A: I think, from our estimates, we'd see a decent amount from handler, just because there are so many nodes and so many VMs; all together it just kind of combines into this symphony, a deluge of requests.
B: But as soon as it would then retry, retry until it finally reaches the node, the API server, then it says: okay, I have to do a list again, because I don't know the actual state. But that should be it, yeah.
A: Yep, yeah, I agree. Okay, I guess we don't need to spend time debugging here; if we have some issue about it, we can talk about it offline on Slack and see if we can find it. But yeah, I agree with that conclusion: the number of lists we're seeing is sort of surprising.
A: There shouldn't be that many anyway. Okay, but I'll leave that in here, because we can follow up on it, circle back, and see what exactly we're missing with this, or what's going on. Okay. So, let's move on to some of the other work that's going on. We have a few things. Let's start with this one, measuring performance; this one's... oh, actually, this one, the work that you're doing, David, in this PR.
A: So Dave is working on VMI phase transition times. Do you want to talk a little bit about this? I don't know how much you want to highlight from here, or if you want to have some discussion.
C: Sure, I'll just highlight my goal with this real quick. The goal is, I wanted a way, when we're doing these stress tests, to be able to monitor... sorry, I'm a little bit distracted, my daughter wants my attention. One second, give me like five seconds.
C: I wanted a way to monitor the total time until Running when we're doing these stress tests. The way I was looking at it, I just wanted to track the time between creation and Running and then be able to look at the outliers. So I wanted to see the p95 outliers for the time between creation and Running, and when I do a stress test, I see that p95 go up, similar to if you're doing a stress test on an HTTP server.
C: You would see the request latency go up; here we can see that it's taking longer and longer for identical VMs to go from the creation state to the Running state. So I can do that with the gauge, and that was the original idea I had. Through the discussion we began talking about some more advanced ways of getting something similar that might give us finer granularity of detail on the inter-phase transitions, so between, for example, Scheduling and Scheduled. That's not reflected in my creation-to-Running time, but with a histogram...
C: ...we can track the transition time between every single phase as well, which gives us a finer-granularity view of exactly which phases, at least, we're spending the most time between.
C: With that histogram, I found that it was really difficult to get to my ultimate goal of just the p95 of creation to Running. You can get something pretty similar, kind of; it's just a lot of calculations, and it's presented a little bit differently, and I was never really quite satisfied with the histogram for my goal. But I see the usefulness of the histogram. So I think what makes sense, and what I've landed on, is that both the histogram and the gauge would make sense, so we would get both metrics.
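A minimal sketch of the two metric shapes under discussion, using Prometheus client_golang; the metric names, labels, and buckets here are hypothetical, not the ones KubeVirt actually ships.

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Gauge: the most recently observed time spent reaching a phase.
	phaseTimeGauge = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "vmi_phase_transition_time_seconds",
			Help: "Last observed time spent reaching the given VMI phase.",
		},
		[]string{"phase"},
	)

	// Histogram: the full distribution of every phase transition.
	phaseTimeHist = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "vmi_phase_transition_seconds",
			Help:    "Distribution of time spent between VMI phases.",
			Buckets: prometheus.ExponentialBuckets(0.5, 2, 10), // 0.5s up to ~256s
		},
		[]string{"phase"},
	)
)

func init() {
	prometheus.MustRegister(phaseTimeGauge, phaseTimeHist)
}

// RecordPhase would be called by the controller when a VMI enters a phase.
func RecordPhase(phase string, elapsed time.Duration) {
	phaseTimeGauge.WithLabelValues(phase).Set(elapsed.Seconds())
	phaseTimeHist.WithLabelValues(phase).Observe(elapsed.Seconds())
}
```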
C: What I'm curious about, and the discussion that would help me right now, is understanding how people would use the histogram, because I'm still not completely seeing the value. I think I might be seeing the value, but how would people use this histogram, and what would it give them that's interesting? Because it's not obvious to me.
E: Right, so one of my comments here about the histogram is: with the gauge, you only get the last metric value that was reported. So, for example, consider that you have a 30-second scrape interval, or even longer, depending on the cluster; maybe you don't want to have Prometheus scraping all the time.
E: Say everything is at one minute, for example. Then you have one minute, and when you scrape, you just get the last metric value; you don't get all the VM transitions or creations that happened in between. The histogram actually shows everything: you get all the VM phase transitions in the histogram, and not just the last value that's there. So yeah, that's why I mentioned that.
C: It will always hit Running, because Running, assuming a VM stays up longer than the scrape interval, will eventually be hit. So for the Running condition, it probably gives me what I'm looking for; the individual phase times are less accurate, because we'll miss them. That makes sense to me. Okay, so really, my metric is only valuable, I would say, for Running.
A: What about... a histogram is one way to represent the data, right, and the gauge is sort of another way, right? Can't you still... so what this is showing is a representation of a gauge, right? Can you have multiple lines on this? Sure, so representing each phase.
C: I mean, the problem is that we'd miss it; that's what he was just pointing out, because of the scrape interval. We'll get Running, because Running is going to last longer than our scrape interval.
A: The only things you'd supply are the phase and the name, and we just use those to differentiate. So we'd have a line for each phase or something. Maybe we don't need the name; we just have one for everything, and then we have a phase to differentiate, and we just record, and eventually we record into the same gauge, but then Prometheus just scrapes it all and gets a few different metrics from a single gauge.
A: Yeah, because then... isn't it that the histogram is just one data format and this is another data format? Then we can choose; we could have both. We could technically report both if we want to, I guess, depending on how we want to look at it. But I think we can get the data either way. I mean, like I said here, you can have a histogram-backed one and a gauge-backed one, either one.
C: The histogram works with buckets; that's the thing that kind of threw me a little bit. We're only getting the level of detail down to which bucket the metric falls within, so you're not getting exactly how many seconds it is; you're getting which bucket it landed in, you're getting a count. So let's say it took 30 seconds to transition from Scheduled to Running: you're not necessarily getting the value of 30 seconds recorded, you're getting whichever bucket that fell into in the histogram, which, maybe, let's say that was...
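A tiny illustration of that bucket behavior, with arbitrary example buckets:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	// Example buckets, in seconds; values here are arbitrary.
	h := prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "example_transition_seconds",
		Help:    "Illustration of histogram bucket semantics.",
		Buckets: []float64{1, 5, 30, 60},
	})
	prometheus.MustRegister(h)

	// A 30s observation bumps the cumulative le="30" and le="60" bucket
	// counters, adds 30 to _sum and 1 to _count; the raw 30-second value
	// itself is not kept, only which buckets it fell under.
	h.Observe(30)
}
```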
E: Yeah, but the quantile is computed over the samples that you have in Prometheus, and with the quantile on the histogram it will be, you know, over all of them; we don't miss any, you know.
E: Yeah, so... and I'm not sure I understand what the problem is with the histogram, for the data representation that you guys mentioned before.
F: The problem I have with... sorry.
C: I'm getting the create... I'm not getting from creation to Running, I'm getting it between the different phases. It's a different metric, though.
B: So, I mean, you could also create a histogram from creation to Running. But yeah, if we talk about having just the phase itself, then you would get the quantile for each phase, which I personally think is very appropriate for scale testing.
B: So that's always the point I don't get. I mean, with the gauge versus the histogram: if it's really just for the Scheduled and Scheduling phases, it's really just per phase, but we just have a few, and the data which you get is exactly what I would consider to be the thing we want to see, but obviously not for the...
C: Well, I'm just thinking about what the thing is that we're tracking in real life. If somebody posts their VM, what they're impacted by is the time from when they post it to the time that it becomes available. So I just want to track exactly that. I don't want to interpolate or have some sort of interpretation of what that might be; I just want to actually track it.
E: Yeah, yeah... I would say that's true, so maybe you should have a histogram for just the whole thing, you know, from create, from submission, to Running, and we don't have the phases in between, yeah, because that...
B: And so, for instance, you wouldn't care if, when starting 1000 VMs, there are some cases where it then takes some 200 seconds to start; it doesn't really matter if it's 200 or 100 when you say it should be below 50. So...
A: We can see that there was an increase in the time it took to get to Running, and then with the histogram here we can see that over time we have a lot of really slow ones that are taking this... I think they're both valuable. They're both valuable.
B: Right. All I want to repeat again is: if you create a histogram, too, in addition to the histograms per phase, one where you go from creation to Running, the histogram is also collecting the sum and the count, so you get exactly the same thing with it, too.
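That point can be made concrete with query strings against the hypothetical histogram from the earlier sketch: the _sum and _count series it automatically exports recover the same aggregate a gauge would give.

```go
package metrics

// These PromQL strings assume the hypothetical metric name used in the
// sketch above; they are illustrations, not shipped queries.
const (
	// Mean time spent per transition over the last 5 minutes,
	// reconstructed from the auto-exported _sum and _count series.
	meanTransition = `sum(rate(vmi_phase_transition_seconds_sum[5m]))
	                / sum(rate(vmi_phase_transition_seconds_count[5m]))`

	// p95 per phase, estimated from the bucket counters.
	p95PerPhase = `histogram_quantile(0.95,
	  sum(rate(vmi_phase_transition_seconds_bucket[5m])) by (le, phase))`
)
```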
C: I don't know yet; I need to play around with it. It's kind of frustrating to me, just because I know exactly what I want.
C: As far as what I want, I want to see the performance increase, and I know how to represent that, and I feel like I'm going through a lot of hoops to get what I want, just in a different way, and I don't entirely understand why it's useful. Like, I understand why it presents the information in a different way with a histogram, and how you can get some interesting, more detailed results, especially with the per-phase transitions and things like that.
C: Yeah, I see the histogram as definitely being useful; I want to make sure I'm not shooting it down. I think it's like the next step, or maybe it's the first step for some people. But when I'm running this, I want to see, okay, latency increased, and the histogram is going to directly correlate that to what phase we spent the most time in, and that's all great. I like both views, yeah.
C: ...helps for the view that I currently have? Is it performance-wise, like from a metric-collecting perspective, or...
B: It would reset, and Prometheus would realize that it did a reset.
C: Okay, I'll mess around with it a little bit more and see if I can get the kind of results that I'm looking for.
A: Okay, cool, thanks. Okay, let's go to another one. So, this is on the mailing list: Fan proposed this, how to improve performance in the virt-controller. There's a link to the thread, and he put together a pull request.

It's a little bit of an example of what he's looking to do. I don't know, Fan, if you're here, but I think it would be good to kind of talk about some of the ideas. Maybe we can learn some things from this, because Fan saw some improved performance from it. Fan, are you here? Yeah? Oh, hey, okay, yeah.
D: Yeah, yeah, thank you. So, basically, it has three topics. One thing is, I want to reduce the enqueues that happen in the virt-controller. The third...
A: One second, one second; just to preface this a little bit more. So, in the virt-controller, we have our... oh, I wish I had the picture for that, but okay. So, in the virt-controller, we have a reconcile loop we're going through; we're doing an update, and that's where this updateStatus is, and there's this informer. That's kind of the background on this. I'll find the picture and put it in the background.
D: Yeah, so basically this is what we observed in practice: the latency between the pod creation and the VMI creation. When I looked into the printed-out logs for the pods and the VMI status updates in the virt-controller, you can see a lot of enqueues happening in the virt-controller, even though these enqueues are not related to the status updates. So when I reduced the number of enqueues, by enqueueing only the events...
D: ...the events relating to the status changes, the queue length here was reduced very much. So that's the first part of this proposal: I want to reduce the enqueues that happen, so we only enqueue when something like a pod create or delete, or an update to the VMI status, happens, or something like that. And the second thing is... so, right now, yeah.
D: I agree that using the queue and worker-queue model is good for concurrent processing, but I think we should have some supplement for pre-processing, like "rolling the ball" fast before enqueueing to the worker queue. That's the purpose; that's the third part of this proposal, or the second part of the proposal. So, when we create a pod, the created pod can trigger... yeah, when we have a VMI event happen, this will...
D: This will trigger a pod-create event, so before enqueueing that, we can use this; and when some failure happens, we can still enqueue this event to the worker queue anyway. So this is the basic idea of the "rolling ball": we still keep the major logic of the worker queue, but we just speed up the processing. So that's basically the logic.
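A rough sketch of the first idea, filtering in the informer event handler so that only status-relevant pod changes get enqueued. This is hypothetical illustration code, not the actual virt-controller handler; podIsReady and the exact relevance test are assumptions.

```go
package controller

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

type Controller struct {
	queue workqueue.RateLimitingInterface
}

// podIsReady is a hypothetical helper checking the PodReady condition.
func podIsReady(pod *corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}

// updatePod is the informer UpdateFunc. Instead of enqueueing on every
// pod change, it drops updates that cannot affect the VMI status.
func (c *Controller) updatePod(oldObj, newObj interface{}) {
	oldPod := oldObj.(*corev1.Pod)
	newPod := newObj.(*corev1.Pod)

	// Periodic resyncs hand back identical objects; nothing to do.
	if oldPod.ResourceVersion == newPod.ResourceVersion {
		return
	}
	// Skip changes that leave phase and readiness untouched.
	if oldPod.Status.Phase == newPod.Status.Phase &&
		podIsReady(oldPod) == podIsReady(newPod) {
		return
	}
	if key, err := cache.MetaNamespaceKeyFunc(newPod); err == nil {
		c.queue.Add(key)
	}
}
```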
C: If I can ask a question: what I don't understand here is, the work is still being done. The logic is still being processed in the exact same process; we're just moving it around.
D: Yeah, the basic issue is that the work queue has a very long queue length. As I look into the queue length, you can see hundreds of events. So if we don't change the threads, using the default, for example, then in practice we always have hundreds of VMIs, hundreds of keys in the queue; it's keeping hundreds of keys in the queue for a creation of 500 VMs.
D: These keys in the queue come from the pod events and the VMI events, so they're mixed.
D: So, as I illustrated in the issue: when we created a VMI, the associated events, like the pod updates and the VMI updates, queued up. They are not sequential as the keys get distributed, so in the next round, waiting for an available worker to pick up this key would take a while. That's where the latency happens.
C: I see that the queue backs up, and yeah, you're right, that's caused by us enqueueing from lots of different places; we're looking at the pods and the virtual machine instances and stuff, and lots of places can put something on the queue. I'm more interested in why the worker queue isn't able to keep up with that. Given the fact that we're seeing it backed up, I'd like to understand why our worker queue is inefficient.
C: Even when we expanded it, when we gave it lots more threads: what's going on in our worker queue that means it's backed up? Because I would think, for example, if we have, let's say, 100 VMI keys queued, we should be able to chew through that in, like, milliseconds. It should be nothing, especially when no API calls are being...
B: ...invoked, yeah. And especially considering that, for instance, when we look at the enqueues coming from pods, we have to consider that at the same time that we seem to collect all this backlog in the queue, Kubernetes was able to process the VM, update it, and even send the watch update to us so we enqueue it, and we are behind. So this is really weird.
C: We could be doing something silly, like making an API call every time a reconcile is popped, and if we...
C: ...that means that the work queue is probably doing something inefficient. I mean, I think that's really valuable; I question more where we're targeting our efforts on improving the performance. I would want to understand more details about the actual work-queue execution and where it's slow, and perhaps even get some pprof on that during a stress test, instead of moving logic out of it. Because I understand that that improves things, but I don't know why.
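Getting pprof out of a running controller during a stress test is the standard Go pattern; a minimal sketch, with the port and wiring as placeholders:

```go
package debugging

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

// startProfiler exposes the built-in Go profiler on a local port.
func startProfiler() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
}
```

A 30-second CPU profile can then be captured during the test with `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30`.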
E: Right, I see it like hiding the problem, you know: you are bypassing something, and then you just don't see the problem anymore. And also, I think bypassing the work queue is like removing the whole fundamental idea of Kubernetes: that you have things asynchronous, send them to the queue, and process everything asynchronously.
E: I think it's going in a different direction from what Kubernetes suggests to do.
A: Would you say this... so we'll kind of break down some of these. So, like, number three: if we go from, say, Pending... like, are we saying we shouldn't skip phases? Is that what we're saying, or...
D: Oh, okay, yeah, sorry. So in the description, number three, I'm talking about the updateStatus. Currently, for example, when the VMI is created, it automatically goes to unset, so it's waiting for the next available worker to pick it up, check the VMI status, and move it forward to the next stages. So consider the case where the pod gets to running very quickly, and by then there isn't a worker...
D: If there isn't an available worker when the pod has updated the VMI, the VMI is still in unset, waiting for the next available worker to pick it up. In the current logic, it only checks whether the pod exists and is not ready, so it will go into Scheduling, not Scheduled, and then it also needs to wait for a third available worker to pick it up and process it into Scheduled. So the latency happening is t1 plus t2; this is the latency in the virt-controller.
B: See, I mean, we have a limited amount of workers, and it has to go some places, so of course we increase the API load when we have more phases and everything. But, as you see, the API server as such seems to be perfectly happy: it can start the pods fast, and the API server reports the updates to the watches fast.
B: So you should see a delay when you start more VMs, and if you start a lot of them at the same time, you should maybe see quite some delay for a very short amount of time, but not over such a long period. It really looks very weird, from a logical point of view, in our controller.
A: What is it with the work queue that could possibly be causing it to be slow? I think maybe one thing that would be useful is if we could see a diagram of the keys and how they balloon, and see, okay, over time, how quickly they're being processed, how many there are relative to the number of VMIs.
A: I think that would help, maybe a picture of that, and then maybe we can start looking at some of the code snippets. How does that sound for a next step?
B: Yeah, I'm not sure; there should be some metrics for controllers already. I'm not sure if we expose all the controller metrics which are built in by default; you should see some. But at least you would need some measurement points regarding the sync logic and the updateStatus logic, and you would probably want to watch the REST API calls, the PUTs, to see whether they take long or not, stuff like this.
A: Yeah, I think this was one of the... one of our goals we wanted to get to.
B: ...collects how long the key itself takes, or how long the processing of the queue entry takes in our logic, since we have this controller framework. If not, it should be pretty trivial to add one or two measurement points in the controller itself, and then you can probably see at least immediately in which parts of the controller, roughly, the time is spent, yeah, because that...
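A sketch of the kind of measurement point being suggested: wrap the existing per-key sync with a timer histogram. The metric name and the Controller wiring here are hypothetical.

```go
package controller

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/client-go/util/workqueue"
)

// syncDuration is a hypothetical measurement point, not an existing metric.
var syncDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "virt_controller_sync_duration_seconds",
	Help:    "Wall time spent processing one work-queue entry.",
	Buckets: prometheus.DefBuckets,
})

func init() {
	prometheus.MustRegister(syncDuration)
}

type Controller struct {
	queue   workqueue.RateLimitingInterface
	execute func(key string) error // the existing per-key sync logic
}

func (c *Controller) processNextWorkItem() bool {
	key, quit := c.queue.Get()
	if quit {
		return false
	}
	defer c.queue.Done(key)

	// Time the sync so slow spots show up per queue entry.
	start := time.Now()
	err := c.execute(key.(string))
	syncDuration.Observe(time.Since(start).Seconds())

	if err != nil {
		c.queue.AddRateLimited(key) // retry later
		return true
	}
	c.queue.Forget(key)
	return true
}
```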
A: We can look at doing something like that, kind of looking at implementing some of these metrics and trying to get some of this information. Or, if you want, look at the faster way and kind of just generate it again, and generate a graph from whatever data you can gather, whatever you think. But we definitely do want to get to these eventually, get some of these working.
D: Yeah, I have some tests on my local development machine. I just print out some logs for each... I just directly print out the log in the code, so I can watch the log file of the controller to see how the queue works and how the event handler works. For example, in mine I can see that, for each VMI update, the virt-controller is called 30 times.
D: There are 30 events, or 30 actions, on the VMI update, and 70 percent of them are enqueue activities. So when...
D: Oh yeah, so that means every time the event handler is called, it will enqueue to the worker queue, right? That's what we call the enqueue activity, so the enqueue happens 30 times.
B: Yeah, so... but that should not necessarily be a problem. The part which we have to get right is that we don't do any REST calls when they're not necessary, and that we don't hot-loop on status updates, like changing timestamps or whatever. Did you see something there like that, where we are really doing status updates of the VMI which should not happen?
D: Oh, I think... yeah, that's my point. Currently, we do nothing except enqueue into the worker queue and wait for the main logic to process. So if I reduce the number of enqueues, so that we just enqueue when something relating to the status happens, we can still update the VMI status, but the queue length is reduced very much. So we do need to...
D: We don't need to enqueue so many keys for the same VMI. And also, I see that, in the current state, when we start creating VMIs, for 500 VMIs the queue length grows very fast, and we have to wait for a while, like three minutes, for a VMI to be picked up and updated, starting the update.
A: Yeah, so that's one of the problems, with the queue length and how slow they are. And then the other thing you mentioned, and maybe this is, like: do we need to re-enqueue every time a pod is updated? Like, for example, what you're doing here; is this something that we want to run through all the... yes, yes, we do, every single time, even on a scheduling event for a pod.
B: Yeah, I mean, you would have to embed here the logic of when something is interesting or not, which means that you have to keep it in sync with the main processing logic. If they are looking for scheduling conditions and want to update them on the VMI status, you would have to keep that in sync here, too, to really enqueue it because it's interesting; and you actually do not. And you would also have to look at the VMI, whether the VMI has the data already or not.
B: So it's really hard to tell here, and this should really not be the thing which we look at right now. There are other things with the callbacks, which I'm happy to explain in detail if there's interest, regarding goroutine scheduling, where we have issues if we do too much in the callbacks due to the lock mechanism. But the key thing for me is this.
B: I can understand if things become slightly slower if we really start a lot of VMs. But what I did not see so far is an indicator of how long we spend on processing the VMI, for instance, even if we don't have to do anything. Because it should just not matter too much whether we enqueue it 10 times or 20 times when it doesn't have to do anything with the objects; those are pretty much all in-memory actions.
B: We do a few memory lookups, we do a few comparisons of objects, and we decide, okay, we don't have to do anything; and that should be really fast. Again, think about the fact that the kubelet and the API server are doing, at the same time, all the work with the same model, and giving us all the updates on those objects, and they really did something there, including REST API updates, and we are supposed to do pretty much nothing on many of them, but they have the same model.
E: Did you check the CPU utilization of the node that is running the controller? Maybe you're getting saturated, you know; I don't know, just guessing. You know, 100 percent of the CPU. Because if you are processing many things, for example, when you do the bypassing you decrease the CPU utilization, and then you see things going fast.
B: This is actually a very good hint, because I think we are not setting CPU limits or CPU requests in this old release by default, and you may end up in the best-effort quality-of-service class, which means that you may share the CPU with a lot of other processes. And what Kubernetes does here is, it gives the process just a very small amount of CPU time very often; if nothing else is there, you get a minimum amount of time.
A: Yeah, okay, I'm writing a few things down. So: let's diagram the queue length, let's measure the cost of each enqueue, let's do pprof of the virt-controller, and then let's look at... what was it you wanted? The CPU...
B: CPU requests, okay, CPU requests. You should set one; you can do that pretty easily by specifying some defaults on the KubeVirt namespace, you know, some request defaults. Okay, then you get them immediately there. Just set them to something reasonably high, to just rule it out. Let's see.
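The namespace default being suggested can be expressed as a LimitRange; here it is sketched with the client-go types, where the namespace and the 500m value are placeholders, just to rule the QoS issue out:

```go
package defaults

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// defaultCPULimitRange gives every container without an explicit request a
// default CPU request, which moves the controller pods out of the
// best-effort QoS class.
func defaultCPULimitRange() *corev1.LimitRange {
	return &corev1.LimitRange{
		ObjectMeta: metav1.ObjectMeta{Name: "cpu-defaults", Namespace: "kubevirt"},
		Spec: corev1.LimitRangeSpec{
			Limits: []corev1.LimitRangeItem{{
				Type: corev1.LimitTypeContainer,
				DefaultRequest: corev1.ResourceList{
					corev1.ResourceCPU: resource.MustParse("500m"), // placeholder
				},
			}},
		},
	}
}
```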
A: Okay, all right, so we have some options. Okay, I think those will be the next steps for this one, okay, and we can kind of circle back on that. So there's a mailing list thread; maybe during the week we can use the mailing list thread to see what we find, and then, in two weeks' time, we can talk about some of the findings for this one. Okay, cool. All right, we're at time.
A: So, in the last few seconds here: are there any other open items, anything else anyone wants to bring up that isn't listed here?