From YouTube: SIG - Performance and scale 2021-07-29
Description
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.twy9rph886f0
A
Okay, let's get started. First up on the agenda we have the perfscale audit tool. I think this one is mine. What do we have left here? Does everyone feel pretty decent about it?
A
If we look at the description, I updated it with a workflow that shows how we can give the tool a start time and an end time, capture the results during that window, and then compare those results to thresholds. So we give it an input config with our desired thresholds.
A
So we want to know that we get a VMI to the Running state within X amount of time, and then, when we get the output, we'll see both what the p95/p99/p50 results were and whether they met our threshold. We could use that for our periodic testing and things like that. Does that satisfy everyone, how that's all laid out there? Any comments or concerns?
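
As a rough sketch of the kind of input config and output being described, assuming hypothetical field and metric names rather than the actual perfscale-audit format:

    package main

    import (
        "fmt"
        "time"
    )

    // InputConfig is an illustrative stand-in for the audit tool's config:
    // a time window plus the thresholds the percentiles must stay under.
    type InputConfig struct {
        StartTime  time.Time
        EndTime    time.Time
        Thresholds map[string]float64 // e.g. "vmiCreationToRunningP95Seconds": 60
    }

    // Result is what the tool would report for one measurement.
    type Result struct {
        P50, P95, P99 float64
        ThresholdMet  bool
    }

    // evaluate compares a measured percentile against its configured threshold.
    func evaluate(p95, threshold float64) Result {
        return Result{P95: p95, ThresholdMet: p95 <= threshold}
    }

    func main() {
        fmt.Printf("%+v\n", evaluate(42.0, 60.0))
    }
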
B
Yeah, it looks good to me. My only concern now, and it's not to block the PR, so I think it should be merged, is with the new tool. I don't know how new it is, but the one that you just marked? That's the second one, the cluster profiler.
B
I think they share similarities in how they work, you know, the performance framework, and I just made some comments in the KubeVirt cluster profiler. Maybe it would be better to have everything in only one place, you know, instead of having a lot of different tools spread around.
A
Yeah, I'm fine with that. So maybe let's jump to that KubeVirt cluster profiler PR. That's the second one on our agenda, and it's kind of related to the audit tool. This cluster profiler does something a little bit different than the audit tool right now, and maybe there's a way to converge this behavior. Certainly we could make the perf audit tool do both types of behavior, but I'll give a brief rundown of the differences real quick.
A
So the audit tool takes a time range, goes to Prometheus, pulls data through that time range, and then compiles some results and compares them to a threshold. So it's something that can be done retroactively.
A
The profiler is something that we run, and it actually triggers things like tracing in our actual components to begin capturing data. It has to be started at the very beginning, before the stress test starts, and it has to be stopped and dumped at the end of the stress test, and this gives us things like pprof dumps for CPU and memory.
A
So we can figure out where we spent the most time in execution. And I added something that I think is a little bit controversial, where I'm counting what API resources we're calling and the actions. So, for example, I'm counting how many times every component calls a list or a get on a specific resource like a pod, and I'm aggregating all of those into a report, and that's something we can get in the dump as well. So that part, possibly, we could put in Prometheus.
C
So I really like the pprof part, I've said it before. I've been looking at distributed tracing the last two days and I'm still at it; the pprof part might actually be more useful than that to some extent. For the request metrics, I thought we already had those metrics somehow, or that we collect them; I asked about that in the comments.
C
I think we could use the metrics we already collect. The difference might be that they are not as granular or not as up to date. I wouldn't pull them from Prometheus here; I don't think I would mix the two tools, with one pulling from Prometheus. The only thing I would change is that we already have the counters in our code, and we might be able to read the counts we already have instead of building new counters, but this tool seems fine like that.
B
Yeah, so if it's only doing the pprof part, it makes sense, if it's only doing this profiling. However, I saw the HTTP metrics being collected, and I think those metrics, at least some of the metrics that David is trying to count, already have that information.
B
I listed two metrics there, but also in the KubeVirt code there are these reflector metrics that also count the list and watch calls, although I cannot see the name parameters for some reason; I need to double check why that's not being exposed. But we have, or we should have, some metrics there showing these API requests, and I like the idea of having a report showing many things.
B
I also commented in the cluster profiler PR that pprof has two different APIs. We can use one, which is what you use, David, and another one exposes it as an HTTP API. In that case you can also determine, for example, the amount of time that you want to profile, like you are doing with the audit tool, you know, 10 minutes, whatever you're doing, and also do some live analysis during the execution of the code. It has a user interface, so you can check that while it's running, which might be interesting also.
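
For reference, the HTTP flavor being described is the standard library's net/http/pprof handlers; a minimal sketch of exposing them (the port and the profile duration below are just examples):

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
    )

    func main() {
        // Profiles can then be pulled on demand while the process runs, e.g.
        //   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=600
        // for a 10-minute CPU profile, or /debug/pprof/heap for memory.
        log.Fatal(http.ListenAndServe("localhost:6060", nil))
    }
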
A
I'll take a look at the net/http/pprof profiling that you've pointed out; I haven't looked at that in detail. I think my initial reaction to that is how to access these endpoints. Right now I can get ingress through the Kubernetes API server and through our subresource endpoint and actually retrieve the runtime profiling data. For the real-time data, I don't know how we'd aggregate that as easily, but that's something I'll look into, I think.
A
For the sake of this conversation, it sounds like we're in agreement that pprof is useful, and somehow we should make that simpler to use.
A
All right, so we can probably table that and say we're going to do something there somehow. When we look at this profiler tool as it is now, we have pprof, and then we also have this HTTP stuff that I'm doing, and the HTTP stuff is a little bit controversial because it appears like it could be a Prometheus metric, and I've been thinking about that as well.
A
I don't know how to get the data from an existing metric. What I want is to know exactly what API calls occurred on our specific resources from our components, and I've been unable to get that from one.
A
All I want is to know exactly what API calls come from every component within our control plane over a specific period of time. So if I look at every single API call, the individual counts that have occurred for each one, and we see that, for example, we're calling 10,000 lists on PVCs and that's not expected, I want to go and investigate that. Or if we see that there are new list or new get calls occurring, I want to be able to go and figure out where that's occurring and why. So I want to know exactly what component it came from, and I want it identified as unexpected.
B
Right, so yeah, we can discuss it, and it will actually be useful for me also. So, for example, in apiserver_request_total you have the verbs, so you can see at least the watch calls, and then you have the groups, and that shows what's actually being called. So, for example, if pods or virtual machines are being called, it will show the resources, the components, and it's a histogram, so it has counts; you can count that.
C
The metric you're talking about is the apiserver one?
B
The apiserver one, and also we have this, the rest client one, yeah, that should...
C
The apiserver one is not the one he was looking for, because that's any request made by anybody, and I think what he meant was the requests we make as the control plane, and that would be the rest client metrics. But I'm not sure if we use them consistently everywhere; we should have something like that, I saw something like that. We have the rest client metrics for the calls we make, but I'm not sure they are really in every one of our clients.
A
Let's say they are: how does that give me the per-verb, per-resource information? How would I gather that?
C
I don't have a cluster right now, it's booting, so I can't try. Wait, yeah, I'll send you something close to it. Let me check.
C
Yeah, the rest client metrics: I also had a problem with the structure of them, in that I couldn't get exactly what I wanted; some resources were not a part of it. I had some confusion with them as well. Right, where are the metrics? I can't find the list anymore.
C
I think it's /api/v1, etc., if I remember correctly.
A
Yeah, so there's no URL there. So we know that lists have become increasingly latent, but we have no idea which ones, and we can't individually count what URLs are there. That's actually tricky, because it's difficult to figure out exactly what the resource was and whether it was a...
B
Yeah, we didn't hear you. I just put an example here in the chat: the URL actually has the resource. Here I have an example where it has the pods, so it's listing the pod endpoints, and whatever other resource has actually been called, and it's counting per component and...
A
A lot more, like more histograms, right? So it...
A
Yeah, it's tough. I had to really expand it, and this is also a...
A
There's no... these are the callbacks that we are...
A
I can create a new metric that gives me exactly what I want. It would be a histogram.
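
A minimal sketch of what such a purpose-built metric could look like with client_golang; the metric name and labels here are made up for illustration, not an existing KubeVirt metric:

    package main

    import "github.com/prometheus/client_golang/prometheus"

    // apiCallCount counts control-plane API calls per component, verb and
    // resource. (The discussion mentions a histogram; a counter is the
    // simplest shape if only per-label counts are needed.)
    var apiCallCount = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "kubevirt_component_api_calls_total",
            Help: "API calls issued by KubeVirt components, by verb and resource.",
        },
        []string{"component", "verb", "resource"},
    )

    func main() {
        prometheus.MustRegister(apiCallCount)
        // e.g. recorded from a round-tripper wrapper in each component:
        apiCallCount.WithLabelValues("virt-handler", "list", "pods").Inc()
    }
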
A
One thing that's also difficult here for me to understand is how we would... so I want to use this data, however we get it, to determine when unexpected API calls are occurring, whether that's the frequency of the calls or whether new ones have been introduced that we don't expect to happen during a test.
A
How would I get the things that I don't know about? I guess that's what I'm trying to ask: when I request the data, it would be a histogram with a label that I'm not expecting. So how would I get information about the thing that I don't expect?
C
You would probably pull the metric for all components and then group by component and resource, or by component, resource and verb, and then you will see the numbers that you can compare to the previous run. You can graph that, and then you'd actually see that, say, the gets for pods were much lower in the last run than the gets are now by virt-handler.
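
A sketch of that kind of grouped query against Prometheus, reusing the hypothetical metric name from above (the Prometheus address, time window and label names are assumptions):

    package main

    import (
        "context"
        "fmt"
        "log"
        "time"

        "github.com/prometheus/client_golang/api"
        promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
    )

    func main() {
        client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
        if err != nil {
            log.Fatal(err)
        }
        promAPI := promv1.NewAPI(client)

        // Sum the call counts over the test window, grouped by component,
        // resource and verb, so two runs can be compared side by side.
        query := `sum by (component, resource, verb) (increase(kubevirt_component_api_calls_total[30m]))`

        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        result, warnings, err := promAPI.Query(ctx, query, time.Now())
        if err != nil {
            log.Fatal(err)
        }
        if len(warnings) > 0 {
            log.Println("warnings:", warnings)
        }
        fmt.Println(result)
    }
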
A
Okay, so here's what I think I'll do for now with this profiler tool: let's forget the HTTP stuff and I'll go with pprof. I might clean it up, and if we think this is a good idea, then I'll probably introduce the tool just with pprof for now, aggregating that. For the HTTP part, I'm going to experiment with a new Prometheus metric that gives me exactly what I want, do whatever I need to do to get what I want, and then we can look at what I've done and figure it out.
A
Yeah, I think it would be good. What do you think? So we talked about tool convergence and things like that. I don't know if virtctl is even the right place for this. Should I just create...
A
I don't know if merging it into perf-audit is accurate either, since that's a tool that retroactively looks at this information. I can create a new tool that's just the cluster profiler, similar to perf-audit, and then in the future, if we ever decide to converge those two, like the audit tool and the profiler tool, together, we could. Since it's really just pprof, I could even call it pprof-aggregator or something, I don't know, something that's just specific to that.
C
I wouldn't have minded it in virtctl, but I think a small separate tool for now would also help. Also, in the back of my head I still play with this idea of it being useful for other projects, and if it's a small tool, it's easier to show off; you don't need virtctl. And if we want to put it into virtctl, we can just include it there, then it's just a command. If you use cobra as well, you just register it and you have it there too.
C
The profiler is mostly a dev tool you use while working on the code. The data you get from the profiler is hard to compare run by run. There is little value in collecting it per test run, for example, and pitting the runs against each other, because profiling data is hard to compare, but yeah.
A
Okay, all right. So just to summarize: the perfscale audit tool that I have now, it sounds like we feel comfortable with that and the workflow that I have there. So can we...
C
Yeah, I spent last Friday, and then I think yesterday, because I was sick in between, looking at the goroutines; more on that a bit below. During that I remembered that we talked about tracing in the past, and I was feeling kind of excited about it, so I wanted to give it a try and see how complicated it is to add Jaeger tracing and OpenTracing to virt-handler, and it was fairly easy.
C
Sadly, I don't have it running, because my Azure cluster got killed and it's rebooting, but yeah. After adding that and installing Jaeger in my OpenShift, every run of virt-handler's execute method got a trace I could look at. I saw nice charts of how long each run takes and where it spends time.
C
So it shows that virt-handler is looking at the virtual machine "test", and that it spent so much time reading from the cache, updating it, doing this and that. Compared to the profiling it's more work to add, because you actually need to annotate each function to do that, and for some of them you also need to pass a Go context, so I had to change some signatures to have the possibility to create spans.
C
So it's a bit more work than the Go profiling, but it allows us to do it on the fly and add metadata. Compared to pprof, where you see how much time is spent in a single function in general, this shows you how much time is spent in a single function for this run, which can be useful to debug those scaling issues we saw or any errors that occur. I'm not sure yet how much more it will bring us if we have the pprof stuff, so I'll keep looking at it. virt-handler was fairly boring; I only instrumented it for the tracing so far, and it was hard to get in there because you always need to put the context in, and almost nowhere in our code do we propagate context so far, so it's quite some work sometimes, but yeah.
C
Yes, so with the context, you start with your root context and you can pass it through your entire call stack, basically wherever you need it, however far deep you need it. If you change it, you get a new context based on it, so it's kind of immutable, and what tracing does is use that context; context is very useful for passing stuff along down the tree, so tracing uses the context to ride on it. Another way would be...
C
If you don't want to pass context, you could pass the spans directly and build the tree that way, but with context you already have that. So whenever you call a function, you pass it a new context based on your previous context, with the new parent span and parent trace you have, and this way you get this tree of calls that you can then annotate.
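
A minimal sketch of that pattern with opentracing-go; the function names here are placeholders, not the actual virt-handler code:

    package main

    import (
        "context"

        "github.com/opentracing/opentracing-go"
    )

    // execute stands in for a reconcile entry point that owns the root span.
    func execute(ctx context.Context, key string) error {
        // Start a child span of whatever span is already carried in ctx
        // (or a new root span if there is none) and get a derived context.
        span, ctx := opentracing.StartSpanFromContext(ctx, "execute")
        defer span.Finish()
        span.SetTag("key", key)

        return updateCache(ctx, key) // pass ctx down so callees add child spans
    }

    func updateCache(ctx context.Context, key string) error {
        span, _ := opentracing.StartSpanFromContext(ctx, "updateCache")
        defer span.Finish()
        // ... real work would happen here ...
        return nil
    }

    func main() {
        _ = execute(context.Background(), "default/testvm")
    }
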
C
Context is both a cancellation mechanism in concurrent Go, but it's also a method of passing stuff along, building on each other. I've used a logging library that also did that. So instead of calling log.Log like we do, which some people consider a bad practice because our logger is a global variable of sorts, it passes the logger through the context, and you can call something like "log from context" and then add, say, the VMI name, and if you pass the same context to the next function, it also has this VMI name in there.
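
A generic sketch of that logger-in-context idea using only the standard library (not the specific library being referred to):

    package main

    import (
        "context"
        "log"
        "os"
    )

    type loggerKey struct{}

    // withLogger stores a logger in a derived context.
    func withLogger(ctx context.Context, l *log.Logger) context.Context {
        return context.WithValue(ctx, loggerKey{}, l)
    }

    // loggerFrom retrieves it again anywhere further down the call stack.
    func loggerFrom(ctx context.Context) *log.Logger {
        if l, ok := ctx.Value(loggerKey{}).(*log.Logger); ok {
            return l
        }
        return log.Default()
    }

    func handle(ctx context.Context) {
        // Any function receiving this ctx sees the same enriched logger.
        loggerFrom(ctx).Println("processing")
    }

    func main() {
        l := log.New(os.Stdout, "vmi=testvm ", log.LstdFlags)
        handle(withLogger(context.Background(), l))
    }
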
A
Yeah, so that's pretty interesting. I think it's going to be really invasive, yeah. I'm not opposed.
A
It seems more useful for, like, an engineer who's trying to understand a live operation. So let's say I work at Uber and I need to understand why this specific customer's request keeps getting some error, or something like that, or where we're spending the time. I can look at the tracing, look at the span of that request, see it trace through our entire actual operation, and gain an understanding, live, of how that occurred.
A
Whereas with pprof, we're just running this on our dev clusters and running stress tests that aren't live, so we have a little bit more flexibility. I guess what I'm getting at is that we don't necessarily need something like that.
C
It seems more... it is more an operational observability tool than, let's say, a debugging tool, or it can be a debugging tool, but yeah. It's more useful if you experience a problem in the cluster and you can look at it and see, say, why a VM is not spinning up, and you can see it's waiting: the operator is waiting or stuck or taking forever in the reconcile loop, and you can maybe see why.
B
But that's not what's happening here, so we want to actually trace virt-handler, for example, internally and see what's happening inside it, and pprof covers that, maybe. So yeah.
C
The difference is granularity: pprof shows you how much time Go spends in a function, and OpenTracing shows you how much time it spends in every call of this function.
C
And it can give you the metadata, like: it spends more time in calls reconciling this single VMI, while the other one just tells you that in general a lot of time is spent in there. So it's more operational, more for when you have a running cluster and you want to see what's going on, while pprof is again more for us while developing. So yeah, I'm also still not sure how useful it will be.
C
And regarding the overhead...
C
Right, the calls themselves and the functions are not the problem. It's the recording that you enable or disable: there's a collector running in the background that collects the spans, and that's the part that can slow things down. You can tell it to only sample, like, every second request, or based on probability and stuff like that, and then it gets faster again.
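
A sketch of configuring that sampling with the jaeger-client-go config package; the service name and sampling rate below are only examples:

    package main

    import (
        "io"
        "log"

        "github.com/opentracing/opentracing-go"
        jaegercfg "github.com/uber/jaeger-client-go/config"
    )

    func initTracer() (io.Closer, error) {
        cfg := jaegercfg.Configuration{
            ServiceName: "virt-handler",
            Sampler: &jaegercfg.SamplerConfig{
                // Record roughly 1 in 10 traces; Type "const" with Param 1
                // would record everything and cost the most overhead.
                Type:  "probabilistic",
                Param: 0.1,
            },
        }
        tracer, closer, err := cfg.NewTracer()
        if err != nil {
            return nil, err
        }
        opentracing.SetGlobalTracer(tracer)
        return closer, nil
    }

    func main() {
        closer, err := initTracer()
        if err != nil {
            log.Fatal(err)
        }
        defer closer.Close()
    }
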
A
All right, any other thoughts on this tracing thing?
B
Yeah, maybe some PR was merged. The point is, since we are not tracking PRs, unfortunately I don't know what fixed it, so I don't know. But it's kind of good and bad: bad because we don't know, but good because we don't have this problem anymore. So yeah.
B
Okay, yeah: this was the REST call request, you know, the API server requests, but only for the convert calls, and this is the duration.
B
Isn't it? Because I thought it was very, you know, super high, one minute here. However, I see the same thing in the Kubernetes calls, I don't know. I was expecting to see this as just too high. This is only for the read, watch and list operations.
B
Yeah, so we still see some, you know, 404 and 409 requests here, but they're very low, so it should be fine, and I got this new metric, which... okay. So if you guys want to discuss something here...
B
Yeah, I need to update this new Grafana dashboard. Okay, I keep updating it to improve it, so you probably don't see the current one there, yeah. This is the number of VMIs now. Although it's only showing the running ones here, it also shows when something fails; here are the numbers of VMIs that are running and failed. I omit the ones that are zero, that's why you don't see them.
B
The legend, sure. Anyway, yeah, regarding the rest rate limiter duration: this is a metric that Roman enabled, I think. This is interesting, and this metric shows how long the rate limiter waits until, what's the best way to say it, until the request is allowed to go through. So it's the...
B
Yeah, I think it's held and then, okay, and then, you know, permitted to execute. So it means that with the PR that Roman created, now we can, you know, increase the rate limit, and these numbers here should be smaller. In the code I see two thresholds: one, which was 50 milliseconds, and another one, the long-running threshold, which is one second. So, just analyzing this...
B
It
means
for
me
at
least
that
all
the
requests
should
be
under
50
milliseconds,
isn't
it
and
what
we
are
seeing
here
is
way
higher
and
definitely
we
need
to
increase
this.
You
know
character
seconds
things.
That's
a
problem
enable
that,
especially
for
which
controller
and
vert
handler
here
and
so
on.
So
this
is
the
new
metric
and
it's
it's
showing
that
we
have
some
problem
here
and
with
the
aromas.
Pr
probably
will
fix
that
or
something
that
we
need
to
fine
tune
here.
Yeah.
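
For context, the client-side rate limit being discussed is the QPS/Burst setting on the client-go rest.Config; a sketch of raising it (the numbers are only illustrative, not what KubeVirt or the PR actually uses):

    package main

    import (
        "log"

        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
    )

    func main() {
        cfg, err := rest.InClusterConfig()
        if err != nil {
            log.Fatal(err)
        }
        // client-go defaults to QPS 5 and Burst 10; when a component exceeds
        // that, requests queue up in the client-side rate limiter, which is
        // exactly the wait time the new metric exposes.
        cfg.QPS = 20
        cfg.Burst = 40

        if _, err := kubernetes.NewForConfig(cfg); err != nil {
            log.Fatal(err)
        }
    }
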
B
That's true, and maybe the instance, you know... yeah, I think maybe it already has it, yes, but yeah. By the way, that's why I got confused before, because these metrics here actually show the components, don't they, not just the rest rate limiter. So Roman is enabling this, and this metric is actually in the same file where we enable the rest client request latency seconds metric, and Roman did a similar metric, but also enabling the name of the component.
B
Yeah, so the other thing that I want to discuss here: actually, I don't know, David, if you saw that I marked you in this document, and yeah, I don't know if you're receiving an invitation for that. Anyway, just to be sure you guys get notified if I do something like that.
A
I did, yeah. Let me... can you hit that arrow on the... or just do that.
B
Basically, what I see here, if you see, maybe I can just comment here: there is some mismatch between the VM creation time and the VM phase transition latency. See, the transition latency is way lower. I understand that we will have some mismatch here, because the transition latency is some general aggregation, but not that much, is it? This is because it's missing the Scheduling phase: it has Running and Scheduled, but not the Scheduling phase. So I don't know whether it's gone or it's just not collecting that.
A
I see. I guess I know you would expect to see Scheduling, but is this phase transitions from one phase to another, or is it phase transitions from creation, this specific graph that we're looking at?
B
What's the baseline for the scheduling time phase? Because, you know, I was expecting creation to Scheduling, then Scheduling to Scheduled, and then Scheduled to Running, yeah.
A
I don't know, it is possible we might be missing it if it occurs really quickly. But Scheduling to Scheduled seems like it should take a while, though, and that should be the one that takes a while.
B
So, first of all, I checked all the phases that are in this metric, and there are only three: Scheduled, Running and Succeeded, because after I delete them they get to Succeeded. Actually, I'm not showing the deleting here, but it takes a lot of time also.
A
Okay, yeah. That takes a long time because we're actually waiting for the pod to terminate, which could, depending on how the virtual machine was created... there's a transition, or I'm sorry, a grace period where we wait.
A
But in Pending we're going to create the pod, then it's going to go to Scheduling, and then once the pod is running it's going to go to Scheduled.
A
Yeah, I would expect to see Pending, yeah. I will investigate this. I felt like I saw them all when I tested it, but I only tested the VM creation time one. So I did Running and Scheduling and layered them all together, and it actually gives a really similar result to the VM phase transition latency. You could just use the creation time as well.
A
Anything else quickly that anyone needed to bring up that we can't carry over until next week, anything blocked or anything like that?
C
Yeah, it's the last item: the virt-handler resource leak investigation, okay.
B
Just regarding this, you know, leaking of the goroutines: I was thinking, you mentioned that it's hard to close some of the routines because we lose track of them. I was thinking, maybe if we had all the goroutines with a timeout by default, we could solve these problems.
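
A minimal sketch of that suggestion using only the standard library: a worker goroutine driven by a context with a timeout, so it is guaranteed to exit eventually (the timeout value and worker are placeholders):

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    // watchSomething stands in for a long-running worker goroutine.
    func watchSomething(ctx context.Context) {
        for {
            select {
            case <-ctx.Done():
                // Either the parent cancelled us or the timeout expired,
                // so the goroutine cannot leak forever.
                fmt.Println("worker exiting:", ctx.Err())
                return
            case <-time.After(time.Second):
                // ... periodic work would happen here ...
            }
        }
    }

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
        defer cancel()

        done := make(chan struct{})
        go func() {
            watchSomething(ctx)
            close(done)
        }()
        <-done // the worker returned because its context expired
    }
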
C
It's not that it's hard to close them; the part that was hard for some of them is to find out whether we close them. The code is a bit mixed up between where they get created and where they get closed. There is closing for a lot of them, but it's hard to see if there's a code path that doesn't close them, and I'm trying to investigate those too.
C
And about what I shared before: it would help a lot if we do tests like that. Also, if we add performance tests to our code, it would be great if we could put the exact specs we need to run those tests somewhere, because I was trying to run the density test and I didn't know what I need, like how my cluster has to look.
C
I think it's in here somewhere, and now I see it, but if we add perf tests to the test suite, it should tell us somehow what you need to do to run them.
B
You know, the CPU that is requested versus the CPU that is used in the components: the CPU request is way lower than the CPU that's being used for virt-api and virt-controller, and we have an alert for when that's happening in our KubeVirt code. It has probably already been raised a lot of times and no one has seen it, but it's something that maybe we need to improve, you know, fix the CPU requests.
C
The request should be the smallest amount the controller needs to run. What we would do otherwise is either increase them at runtime with an autoscaler or set limits with an autoscaler, and we have a story for investigating that, because if we increase them now, some people could not run it anymore.
A
We
don't
want
to
set
a
limit
because
we
might
get
yeah
it's
more
important
for
memory
that
we
don't
sell
limit,
but
we
don't
want
any
limits.
Requests
that
means
discussion
if
we
go
over.
The
problem
is
that
we
can
be
evicted
in
certain
scenarios
and
rescheduled
other
places,
and
we
don't
necessarily
want
that
to
happen
either.
So
we
want
our
request
to
be
accurate,
yeah
and
if
we
are
consistently
over
it,
that's
not
a
great
thing
yeah
we
should.
We
should
understand
that
a
little
bit
better.
C
I mean, we shouldn't set it higher than the minimum, because if you run it on a minimal cluster, you couldn't schedule it if you didn't have the resources. That's why we should look, or we discussed before that we should look, at the Kubernetes horizontal and vertical pod autoscalers to change that when the demand increases, so the autoscaler would see: okay, this operator is using so many resources, let's give it more requests, so it's reflected.
B
Yes,
I
have
some
comment
about
that,
so
it's
but
okay.
We
don't
have
too
much
time
to
discuss
that
guys,
but
there
is
another.
I
I
put
like
a
link
in
the
to
discuss.
That
actually
is
the
last
link
here
that
I
call
slos
slis,
so
in
just
doc,
man
so
cooperver
actually
kubernetes
has
some
analysis
is
the
maximum
number
of
vms
that
they
should
that
they
support
per
node
and
that's
they.
They
they
don't
overload
the
kubelet.
B
For
example,
we
don't
have
this
kind
of
analysis,
I'm
planning
maybe
to
to
to
arrive
with
jonathas
in
our
experiments,
and
this
should
show
us
the
these
limits.
Isn't
it
so
that
we
want
to
have,
for
example,
let's
assume
that
maybe
the
best
limit
that
we
have
it's
160
vms
per
node,
and
then
we
can
check
here.
B
You
know
supporting
106
vms
per
node.
What's
the
limit
that
deferred
handler
the
cpu,
the
minimum
cpu
request
that
the
version
handler
should
have
for
for
this?
You
know
things
that
we
support
and
then
we
keep
with
this
sure
vendors.
A
For
different,
like
cluster
sizes,
on
the
keyboard
cr,
if.