From YouTube: SIG - Performance and scale 2021-10-07
Description
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.hmx9wqksaqdy
A
Okay, it's October 7th! This is SIG Performance and Scale. Everyone, please add yourself as an attendee and add agenda items; feel free to add items as we go through. Okay, the first thing I have on this list is bugs, but we don't need to start with that; I was kind of hoping to start with something else. David, do you want to talk about your proposal at all? Do we have anything that you want to bring up with that?
B
First, for the virtual machine pools: I think we're just wrapping it up, so I would encourage anyone who is interested in this to definitely go and look at the proposal. I guess we can post that in the notes again as well, or I can do that. My goal is to have this merged as soon as possible, and the next step will be that I'll try to go ahead and create a PR that lays the foundation for all of this.
A
Okay, yeah, we talked about this one last time. This was the removal of the VM config API. Let's see, there are a few comments here; looks like some text.
B
Yeah, so Roman and I discussed what you're looking at briefly today, and we'll probably sort this out today or tomorrow; it's pretty minor. His comment here is: when we look at the policy selection of virtual machines, either for scale-in or for update, how do we add some kind of basic optimization? So if we want to select VMs that are shut down or paused first, before we actually touch active virtual machines, how would we do that? I think that would fall either under the random-based policy, or perhaps under something called a default policy, which would be: we take virtual machines at random, following a kind of tiered ordering. So, do we have any VMs that are shut down? All right, take a random selection of one of those for a scale-in. All right, we've exhausted that tier, and so on.
A
Okay, so I guess the assumption is that we'll have some sort of optimization where we use the running status of the VMI to make a selection, maybe before any of these. Or would it be...?
B
That's my take, because I think that's the expectation, and I think this would only apply to a base policy whose selection is kind of random. If you select a base policy that isn't random, so you say you want oldest-first or newest-first, then we don't really have any optimization we can do there, because you've told us exactly what you want. But if we have a policy like random, or we just call it default or something like that, then we have some leeway to do some more user-friendly optimizations that people would actually like, and it wouldn't hinder people who didn't expect it either: if they're choosing random, for example, they're going to get VMs that are potentially active or shut down, so if we give a little bit of help there, maybe that's not terrible. Okay.
A
That would be good to have, I think. And a description in the base policy, if you don't have it already; I think that would make a lot of sense. Okay, sure. And then the only other ones I scrolled by are these two. I haven't gone through the doc, but I think you've made some changes to it that these need to reflect, so I'll have to go through and check this. And then there was this one for some metrics, yeah.
A
Do you want to talk about this one, some ideas that we have around metrics?
B
Yeah, I actually meant to respond to this. I think I even typed it all up, and then I decided that we should talk about it in the meeting, and I totally forgot, yeah. So thanks for bringing that up. I don't have a great sense of exactly what will be needed quite yet; I thought of a few. Let me see what you had... oh yeah, let's go back through these.
A
I was trying to consider a little bit of both: the perspective of what I'd like to see if I were running this in production, and then a little bit of how we can integrate with some of the work we've already done with the performance testing. So the first one is the number of VMs that were started. We have this expectation that the pool is going to be some size, and we want to know what the churn is: how many VMs are being killed and recreated, and how often that happens, because maybe you can put a rate to it. I could kill one anytime I want, just delete it; how often is that happening?
B
I think it makes sense, yeah; I do think that makes sense. I'm curious, though: if it's "restarted", what we're really looking at here is... I don't know if restarts is the right word.
B
Yeah, I see what you're getting at; I'm just trying to think of how to accurately represent that. You're almost wanting to track the number of shutdowns and starts, possibly even independently of each other. Or maybe it is truly restarts, where a restart would be: a virtual machine shuts down and gets started again.
A
Yeah, and churn would be: the virtual machine got completely deleted and eventually replaced, yeah.
A
I think the reason I used "restart" was because the assumption is that I'm deleting a VM and then we're replacing it with essentially the same one; it gets replaced, it even has the same name. I'm almost restarting it: I'm just killing it and letting the thing start a new one in its place. For me it's the same one. It's not really a VM restart from a power standpoint, it's more like I've brought in a new one.
B
Yeah, so maybe "starts" is accurate enough: the number of boots that have occurred, because that's what we really care about. We don't really care about the shutdowns unless they result in another start.
A
Yeah, the number started, and then the rate, because you can get a bunch of things from the number and the rate. This could just be the same kind of metric as the ones we already have; we have a little bucket of them. A rate is really what I'm getting at; it just tells us how often this is happening.
B
Stuff like that, okay. So let's say we get a rate for the starts that occur, and we could also do shutdowns. That makes sense. Maybe speak to the next one: the detached pool.
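As a concrete illustration of the rate idea, purely a sketch since no such metric exists yet: if the pool controller exported per-pool start and shutdown counters, the churn rate being discussed would be a standard Prometheus query. The metric and label names below are placeholders, not agreed names.

```promql
# Hypothetical counters; the real names would be settled in the design doc.
sum by (pool) (rate(kubevirt_vmpool_vm_starts_total[1h]))
sum by (pool) (rate(kubevirt_vmpool_vm_shutdowns_total[1h]))
```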
A
Okay, I'd say it's kind of similar: it measures how often we're doing detachments. There's sort of an assumption that we're doing something with that detached VMI; it could be forensics, it could be that we want to take it under some sort of management. I kind of want to get an idea of how often those events are occurring.
A
You know, how often we need sort of special attention on these VMIs: is this occurring once a day, once a week? So it's another rate from which we could learn a bunch of information.
A
Yeah, so we have the place where we did all the phase transition times. What I was thinking is that maybe we can do a little bit with the labeling here, so we can see how a pool is performing as a unit. We could have a label for the pool or something, so that we know these VMIs are the ones for this pool and this is how they're performing, as opposed to just VMIs generally, or the VMIs for other pools.
B
Yeah, I think that makes a lot of sense. So if you had lots and lots of pools in your cluster, you'd be able to isolate these phase transition times by pool, and that would be really useful, I think, because it would tell you exactly which one is causing the trouble. I wonder how hard that would be. Do you think that would just be a label that we put on, or would we actually have a new... so it would be a new label on the metric.
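For example, assuming the existing histogram is something like kubevirt_vmi_phase_transition_time_seconds and the proposal adds a pool label to it (both the metric name and the label are assumptions here, not confirmed names), isolating a slow pool would look roughly like this:

```promql
# 95th percentile time to reach Running, broken out by the proposed pool label.
histogram_quantile(0.95,
  sum by (pool, le) (
    rate(kubevirt_vmi_phase_transition_time_seconds_bucket{phase="Running"}[10m])
  )
)
```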
A
Yeah, I think so, because my expectation is that someone wouldn't have thousands of pools with, like, one VM in each of them, which would be the equivalent of the thing we didn't really want to have, where we effectively get per-VM labeling. That's sort of the bad case; I mean, it could be possible here, but I wouldn't really expect it. And it gives us the opportunity to know the performance per pool, so as a way to sort the data I think it makes sense. That case where you're doing one-VM pools... I don't know why you'd be doing that anyway.
B
Yeah, that makes sense; I'd be fine with that. So of what you've listed, the two that make sense to me are adding the label to the phase transition times, and then the one we should investigate, perhaps after we've created the feature: how we'd want to track starts and stops and kind of understand churn. I don't think we have a great understanding of that quite yet, so I'd like to maybe not completely flesh that one out, unless you feel pretty confident. I don't feel confident about it yet, but I do agree that it would be useful; I'm just not sure how to represent it.
A
Okay, yeah. So my expectation, the way I'm thinking about this being used, is: I create a pool object, and I'd have some sort of controller that's going to say either "time's up" or "this VMI is finished with its work", or maybe the VMI itself terminates after it's done with its work. So I'm expecting that there are going to be a lot of VMIs that get cleaned up, and that it'll happen often. Having the rate of VMIs that are removed, per pool, gives me a concept of what's going on, because I hadn't really considered that each pool is kind of an identity when I was going off with the perf work; there's an image associated with the pool and so on. So I can tell, okay, how much work is finishing for these types of VMIs and how often it happens. It gives me a sense of what is happening.
A
For example: maybe I need more warm VMs available to be used, because there's just so much churn right now, so many people requesting them and deleting them. There are things like that; I can get a bunch of data from it, so that we can make informed decisions about what's to come next.
B
Oh, that's interesting. So in your use case, and I'm kind of familiar with it: somebody logs into one of these VMs, and then when they're done, does that virtual machine cycle completely?
B
Okay, yeah, it'd be removed and then it would be fresh. So basically the pool is like a bunch of open slots, and it's either a one or a zero, whether a slot is being used or not, and there's nothing else. Okay. I'm not sure this metric would help you there as much as some sort of custom metric specific to your use case, to understand essentially how many of these slots are full per pool; you'd have to associate the users connecting... well.
A
It's not so much about knowing how full it is; I'd say it's more about the rate. To make some sort of informed decision, I'd want to say: okay, there are a lot of VMs being started and a lot of VMs being deleted in this pool, there's a lot of activity from people using this kind of pool, so perhaps I should increase it, or maybe I should decrease it.
B
I get what you're saying, but wouldn't it be more accurate to say: I've got this VM pool, I've got this many idle, open slots, and if the percentage of open slots to used slots falls within a certain threshold, I need to scale up or down? That seems closer to what you want, because you don't want to run out of free slots, but you probably want to keep the margin narrow, because you don't want to unnecessarily consume resources that aren't needed.
A
Yeah, maybe I'm conflating two things, because by "informed decision"... I guess what I'm saying sounds as if a computer would do it, but these metrics are for humans, so it's not like I would use these metrics to have the controller scale up. I guess the idea is really the human side of this, which I think is the part to isolate.
A
That part is that I, as a user or as the administrator, would be able to know that there's a lot of activity here. I think that's really all it is, so forget the informed-decision part; it's really just that I've noticed there's a lot of activity with this kind of pool. I think that's really the only conclusion you draw from this, which I find useful, but I don't know. I understand what you're saying: it could be just for this use case. Maybe people don't do this; maybe they don't delete VMIs after they're done and the pool just stays full, and then that conclusion wouldn't be helpful for them. So yeah, I can see that argument too.
B
Okay, so if you're using it just informationally, you just want to look at a dashboard and you're not using it to make decisions, then I think what we're looking at is that we're trying to measure activity on the VMs. So I think measuring the rate of starts and stops is probably the most accurate way to represent activity, or maybe it's the number of scale-ins and scale-outs.
A
Yeah, I'm not sure. There's an assumption here, yeah, that I would measure activity, based on my use case, by the number of starts and stops, but someone else might not. Okay, so yeah, there is an assumption there. But I don't know; in other cases, in the general case, would you care if you had a bunch of starts and stops? I mean, it could be...
A
It could be useful if something was happening to your pool that you didn't expect. Maybe a lot of VMs were just being deleted for some reason, or maybe they were failing or something; it might be something that you could alert on, because it's not something you expect. So there could be other use cases there, if you don't expect a lot of restarts and you do see them.
B
Yeah, I can get that. I think it makes sense; the rate of starts and the rate of stops, I can get behind that. So here's what I would write in the document.
B
It will look a little strange, because when we first create VMs it's going to take a long time in the scheduling... or, no, because... okay, forget it. All right, so yeah: it makes sense to have the phase transitions, and I think the starts and stops make sense as well, and I can document both of those as desired metrics.
B
I don't know; that's probably going to be one of those follow-up things. So we'll document it in the design, and then once we get the base features and everything, we'll look at adding those, and it might get changed a little bit at actual implementation time. But I think we know enough to be able to document those three.
A
Okay. I don't see any other items on this; for me, overall, it looks good. I'll probably just read it one more time for my review. Okay, all right.
B
I will; let me clean it up. I'm going to focus on cleaning up this document, because I notice there are some comments that are just kind of typo-level things, and I'll document Roman's thing, and I'll document these.
A
This one was weird; I don't know if you've got any idea about this, David. This was something we had noticed: we brought in some of the phase transition metrics, and this actually occurred during the outage, the situation where we found the virt-controller panic issue, and one of the things that came out of it was this: the label on the VMI phase count metric changed, and then, as you can see, the value gets reported; everything's running, it's just that now its value changed.
B
Yeah, so here's what happens: when a virtual machine instance is first created, that phase label is empty, and until it gets reconciled for the first time it's not going to be set. So it's possible that what's happening is that lots of VMIs are created but not reconciled yet, and then they all get values afterwards. So what happened here with yours was...?
A
The virt-controller restarted; I think it was just the virt-controller, and it was the panic issue, and then this appeared: the labels...
B
Okay, it's just that, and it looks like this. I wonder if this is some sort of Prometheus interpretation of it.
B
You
said
vert
handler
restarted,
but
this
would
only
make
sense
in
the
case
of
controller
restarting
if
your
controller
restarts
there's
going
to
be
a
period
of
time
where
the
leader
election
lease
isn't
given
up.
So
we're
going
to
have
no
vert
handler
for
like
30
seconds,
I'm
sorry
for
controller
okay,
we're
talking
about
the
cluster
scope
again
for
control,
we're
not
going
to
have
it
for
controller
for
about
30
seconds
or
so
waiting
for
that
lease
to
expire
and
the
new
leader
to
come
online.
B
So if Prometheus tries to query for information related to virtual machines during this time, it's going to get nothing, so there's going to be a period where it just looks like there was a gap, I guess. And once the new virt-controller comes online, depending on when it gets queried, it may or may not reflect virtual machines existing yet, depending on whether it has caught up on its informers or not. I'm not sure exactly what's going to get reported during that time, but eventually it will all get caught up and the correct values will get reported by the new controller. So during this window, between a virt-controller crashing or being restarted and the new one taking control and syncing correctly, things could get kind of weird. I'm not sure exactly what would be reported and exactly how Prometheus, or even Grafana, would interpret those results.
A
Yeah, I think the thing we need to figure out here is what happens during this period. So I think the test is: we restart a virt-controller, and then we need to look at what's in Prometheus at that point. We need to follow up; we first need to verify that we see this, and then we need to get the data from Prometheus, because that could at least rule this out.
A
If
it's
bringing
a
circle
or
whatever.
That's
that's
the
metric
services
reporting
this
or
if
it's
at
least
something
else,
but
let's
see
so
that's
that's
kind
of
what
I
would
like
to
see,
because
what
you're
saying
is
curious
to
me,
because
I'm
wondering
if
like
if
we
cracked,
open,
grafana
or
cracked
open
prometheus,
we
saw
there
that
the
values
are
just
gone
or
something
because
it's
synced
and
then
it's
like
no
there's
no
values
for
this,
but
we
see
them
they're
there.
We
have
a
number.
A
Yeah, let me put it here. So we'll do this: let's try a test where we restart the virt-controller.
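One way to run that check, assuming the metric in question is the VMI phase count gauge (the name below is a guess, not a confirmed metric name): graph the raw series in the Prometheus UI across the restart window, with no rate() or dashboard aggregation, so a disappearing series, an empty phase label, or a duplicated series shows up directly.

```promql
# Raw per-series values around the virt-controller restart.
kubevirt_vmi_phase_count

# The aggregated view a dashboard would normally show, for comparison.
sum by (phase) (kubevirt_vmi_phase_count)
```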
B
So this is a PR I just put in the chat. I would like to see us begin tracking some of these perf-scale results, so until...
B
I forgot about it too; I was just looking at what I had open. Once this can merge, we'll start to get an artifact that shows us the expected transition times and everything, and the API calls and other stuff that occur during our density tests. Once we feel comfortable with what we see, we should start setting some thresholds and then start ensuring that we meet those thresholds.
A
Yeah, so this one, I'm not sure how we can make progress on it, at least. I think this is one of those areas where we do the profiling and maybe see if we can notice anything, because we see this as well on our hardware. I think we just need to commit to doing the profiling; I think that's really just it. Let me write it in here.
A
What is it, the virt-controller disruption budget?
B
So I guess this came out of the density test. I could investigate this by running the density test myself and just lowering the number of virtual machines to what my environment can actually run; theoretically, I should be able to recreate this, yeah.
A
All
right,
that's
one.
Okay,
let's
see
disruption
budget.
We
have
this.
One
that's
been
around
for
a
little.
While
this
is
the
key
performance.
This
was
kind
of
a
general
one.
We
have
a
bunch
of
these
like
metrics,
like
here's
work,
you
wait
and
see
which
everything
you
see
everything
seems
like
the
same
there
and
then
I
could
use
the
description
budget
and
the
unfinished
work
yeah
I
mean
I
this
one
also
see
a
lot
of
on
rn2.
Oh
yeah,
I
think,
did
I
post
this
yeah?
A
I
did
or
went
to
20
years
ago,
but
I
thought
I
posted
new
ones
unfinished
work.
I
did
not
post
new
ones.
I
did
see
so.
Oh
I
remember
this.
This
was
okay,
so
this
was
marcel.
Did
this
experiment
again
and
we
did
that's
what
we
wanted.
We
wanted
to
check
pps
and
see
if
it
made
a
difference
and
we
didn't
see
much
of
a
change,
but
we
did
see
this.
A
Unfinished work is still high from the handler. Unfortunately, this is exactly one of those things that's hard to quantify versus sort of imagine. I understand that unfinished work is a thread that's running long in the controller, but I don't understand this measurement; it's saying we have a 12-minute-running task in a controller, which just sounds crazy to me. It's possible, yeah, but it's interesting. Maybe this is another thing where we just need to do some profiling to learn some more info.
B
Okay. We've had errors in the past, for example, where we'd make an API call somewhere, it didn't set an appropriate deadline, and it just hangs, so it essentially consumes a thread for a long time. I think we've resolved the ones I'm aware of, but that kind of thing is possible.
A
Yeah, we see Tomas is looking at this.
A
Yeah, I think that's just going to tie into this thing; Tomas is already looking at making a change, I think. Did he post a PR for this? Oh well, you can see... I think he did.
A
Very nice. Okay, this is Tomas's change, David; I don't know if you're aware of it, but this was how he wants to reduce the memory usage.
B
Oh, okay, interesting. Yeah, so I just got back from PTO today, actually; I was gone Thursday right after our call. I left and have literally been gone for a week until this call, practically. So I will look at that; that's interesting to me.
A
I don't know how to assign myself... all right, I'll just keep it to that part. Okay, yeah, so this is, I think, what he's doing next for the profiling.
A
Okay, we're in the performance section, and then we've got modules using more CPUs than requested.
A
I think... do we just set limits on this? Like, is this something we...?
A
All
right,
we
have
marcelo.
A
All
right,
we
need
yeah,
we
need
to
scoop
that
one
okay
and
then
profiling
prior
to
fairness.
A
I actually saw this; we actually ran into this, and it's interesting. The other day we hit an issue with the current 1.21 default settings, which, I guess, group the priority of everything that runs as a service account.
A
It
runs
as
a
service
account
into
one
queue,
and
so
we
actually
ran
into
an
issue
because
cuber
was
creating
or
doing
a
lot
of
work
and
it
conflicted
with
obs
and
they
were
filling
up
the
queue
and
they
were
actually
getting
requests
rejected.
A
So
I
had
to
give
kubert
its
own
priority
queue
so
that
didn't
conflict
with
anything,
and
actually
there
were
the
the
rejections.
The
422s
go
away,
the
429's
go
away,
but
this
was
at
large
scale
under
a
lot
of
stress,
so
it
was
something
that
I
don't
think
many
people
will
really
see
at
the
moment,
but
it's
something
to
that.
We
can
consider.
I
don't
know
I
I
I
know
this
is
something
that
I
need
to
do,
or
at
least
to.
A
Yeah, all right, got it, exactly. This is essentially the issue I talked about before, yep; that's one of the things I wanted to get at. The larger topic is that you start defining the queue length and those other settings, which I don't have enough information on, but at the very minimum, having its own queue, I think, makes a lot of sense.
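For reference, the dedicated queue described here is expressed with the API Priority and Fairness objects. The sketch below only shows the shape of such a configuration; the object names, the service account being matched, and all of the sizing values are placeholders, not the settings that were actually used in that cluster.

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: PriorityLevelConfiguration
metadata:
  name: kubevirt            # placeholder name
spec:
  type: Limited
  limited:
    assuredConcurrencyShares: 10     # sizing is a guess; tune per cluster
    limitResponse:
      type: Queue
      queuing:
        queues: 10
        queueLengthLimit: 50
        handSize: 6
---
apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: FlowSchema
metadata:
  name: kubevirt            # placeholder name
spec:
  priorityLevelConfiguration:
    name: kubevirt
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: kubevirt-controller    # placeholder; list the real KubeVirt SAs
        namespace: kubevirt
    resourceRules:
    - apiGroups: ["*"]
      resources: ["*"]
      verbs: ["*"]
      clusterScope: true
      namespaces: ["*"]
```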
A
Okay, yeah, all right; we'll keep it open for now. It's something I'll get back to at some point, I just haven't had a chance yet. And then, oh, we did this one, right; and then there's "profile under high load", which is... okay. Oh hey, Tomas, you're here! Hey, did you want to talk about your change at all?
C
Yeah, I can talk briefly about it. Basically, I added pagination to the cluster profiler, which means that I'm making the list-pods requests in pages, and I return to the user a continuation token together with the profiler results for the pods from that page. That helps us control how much memory virt-api uses to store the cluster profiler results in memory. Additionally, I've added a label selector.
C
So basically you can select the pods, as usual, with a label selector, which we all know from the kubectl command; it's basically the same syntax, and it's parsed the same way. So we have these two mechanisms, pagination and filtering.
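For background, the paging Tomas describes mirrors the standard Kubernetes list pagination: a limit plus a continue token, optionally combined with a label selector. A minimal client-go sketch of that pattern is below; it is illustrative only, not the actual virt-api profiler code, and the namespace and label value are assumptions.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	continueToken := ""
	for {
		// Page through matching pods 20 at a time; the selector and namespace
		// are assumptions, not necessarily what the profiler endpoint uses.
		pods, err := client.CoreV1().Pods("kubevirt").List(context.TODO(), metav1.ListOptions{
			LabelSelector: "kubevirt.io=virt-controller",
			Limit:         20,
			Continue:      continueToken,
		})
		if err != nil {
			panic(err)
		}
		for _, pod := range pods.Items {
			fmt.Println(pod.Name) // the profiler gathers results per pod at this point
		}
		continueToken = pods.Continue
		if continueToken == "" {
			break
		}
	}
}
```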
C
Filtering with the label selector also reduces the memory usage, and both work; I've checked this on a large cluster. So, David, if you can have a look at it, I would appreciate it.
B
Sure, yeah, I'm looking at it right now. For your pagination stuff: when does that token expire? I guess if we tried to use another token, or what?
B
As long as the list... oh, interesting, okay. And the other option you had was the label selector, and that would let us choose, say, just all the virt-controllers or all the virt-apis, yeah.
B
In
practice,
what
what
have
you
used?
Have
you
used
both
of
these
or,
if
you
forget,.
C
So
so
I've
checked
both
of
these,
like,
for
instance,
for
weird
weird
api.
I
just
there's
a
cube
view:
dot
io
label
right,
so
I
use
that
it
works.
So
interesting
is,
interestingly,
the
page
size.
So
I
thought
that
like
20
is
like
we
should
handle
it
just
fine,
but
it
turns
out
that,
at
least
for
me
there's
some
problem
with
fetching
large
files
through
the
weird
client.
C
Maybe
that's
because
of
of
the
fact
that
I'm,
like
you,
know,
far
away
from
the
data
center
actually
which
I'm
in
europe
this
data
center
is
in
in
us,
and
I've
noticed
at
some
point
that
cube
ctl
copy
it
phase
as
well
for
me
with
larger
files.
So
maybe
that's
that's
just
because
the
cli
grants
client
is
not
reliable
in
copying
large
files,
I'm
not
sure
yeah,
but
smaller
page
sizes
they
work.
C
B
I'm curious whether, for your use case, the label is enough, because with the label you'd select controllers, and you could probably get just one or two virt-handlers. If you needed to, you'd have to target the exact handler, yeah.
C
Yeah,
so
for
me
it
works.
It's
okay,
it's
enough,
but
you
know
I
was
thinking
about
adding
field
selector
as
well,
which
I
guess
would
cover
like
every
use
case,
probably
right
that
you
can
select
by
the
I
know,
by
the
name
or
whatever,
whatever
you
want
right,
because
that's,
I
guess,
that's
two
filtering
methods
we,
which
cubectl
has
label
selector
and
and
field
selector.
So
I'm
okay
with
adding
that
as
well.
It's
like
should
be.
You
know
it's
not
it's
not
a
big
deal
to
add
this.
C
I don't need it right now, but maybe to have, you know, full coverage of use cases. Why not?
B
Yeah. And do you need the pagination right now? Or, I guess, let me restructure my question to make sure I'm asking it accurately: will the label and field selector be enough for you?
C
Yeah, I think I could get by with the label selector, but I'm thinking: why not support fetching all of the results from all of the components, including the virt-launchers, which we don't do right now, right? So I'm wondering why we wouldn't do it, or at least provide the user a way to do it.
B
My concern with the pagination is that these are ephemeral states, so what you query at one point may not even exist anymore minutes later; pods might change, and things like that. So you just wouldn't be able to get the result; I guess that request would fail. Definitely, so once...
A
So
would
that
mean
because
of
like,
like
going
back
to
the
original
problem,
which
was
the
memory
usage,
but
that
mean
that
if,
if
say
label
selector
was
the
only
option,
it
would
mean
that
it's
basically
a
requirement
to
make
it
usable
use
it
use
to
make
it
usable
at
all
in
a
large
cluster
situation,
is
that
we
would
have
to
have
a
bunch
of
labels
on
there
right.
A
C
Yeah, yeah, because, on the other hand, if we don't have pagination, then we would release a feature like this cluster profiler with, okay, the ability to select a subset of pods with a label selector, but then we would ship a feature which would be internally flawed: if the user chooses not to use the label selector, it crashes the whole virt-api pod. So for me...
C
It's
it's
really
strange
to
to
have
a
feature
which
basically
doesn't
work
when
we,
when
we
ship
it
with
with
default
arguments
right
because
default
for
labor
selector.
For
me,
it's
empty.
Why
would
someone
you
know
tell
you
the
default
is
something
different
and
with
default
default
arguments,
sometimes
you
would
just
for
some
cluster.
It
would
just
fail
with
you
know,
out
of
memory
without
of
memory
error.
So
that's
just
that
architecture
for
me
right.
We
should
not
like
have
it
like
this.
That's
that's
just
you
know.
One
point
of
view.
B
If it helps you all, I think I'm fine with it, because, again, this is not... I don't even know; maybe you all would use this in production. I wouldn't recommend it, definitely would not recommend that, actually, but if it's useful enough for you all, we should enable it.
C
Yeah
I
mean
we
could
always
do
like
default.
Page
size
is,
is
zero,
which
would
mean
like
everything
at
once.
C
Let's
say
right
so
then,
then
we
wouldn't
have
this
behavior,
this
female
behavior
by
default,
but
then
user
can
select
a
page
size
which
is
you
know
something
smaller
and
then,
when
user
explicitly
selects
the
patched
page
size,
then
he
basically
agrees
that
okay,
this
is
you
know
if
ephemeral-
and
you
know
it
can
break
for
some
in
some
circumstances,
break
meaning
the
request
will
not
succeed.
B
Okay, I'll look at it. I think this makes sense, what you have, the more I think about it. I just want to make sure that for the default developer use case, where somebody's running just on their laptop with a couple of nodes and not a lot of virtual machines and everything... well, I guess we don't collect the virt-launcher stuff yet. Anyway, make sure that we get all the information at once for the default case.
C
If you have anything in mind... I just thought that this might be a good number given the size of the profiles, right? Because of the profiles: if one pod's result is, let's say, nine megabytes of memory, then ten is something like a hundred megabytes of additional memory, so this seems to be fine. If you have anything in mind, just comment there, and yeah.
B
So the thing that I have in mind is: I just want the default to be able to handle a two-node cluster with all of the components involved, and I think ten might be exactly how many that is, because we have two instances each of operator, api, controller, and handler. So that's eight; maybe it only ends up being eight, but ten would definitely cover it.
B
Okay, I will review this, and I think I'm on board.
C
Thanks, thanks, yeah. I've got one more question regarding this change, about the virt-launchers: was there any reason why you decided not to include them in your change? Because I would like to profile the launchers and see how it behaves and what's there, so I'm just curious.
B
No, that's fine. I was less interested in the virt-launchers at the time, simply because there's not a lot going on in there, really; it's just the one workload each. I was thinking about this primarily from a CPU usage standpoint, but I could see it for memory and other things; yeah, it makes a lot of sense. So that was probably an oversight on my part. The difficulty is going to be how to get the information out of the virt-launcher.
B
I don't really know exactly how that will be done, because we don't control the network in the virt-launcher; the virtual machine guest controls the network. So somehow virt-handler is going to have to get that information in order to return all the dumped results and everything, and also handle starting and stopping and dumping the launcher results. It'll be a little tricky.
B
From
vert
handler
now
that
would
be
fine,
that's
what
I
would
expect
so
vert
handler
would
have
yeah.
It
would
well,
no
you,
wouldn't
you
have
to
do
that.
So
bert
handler
exposes
a
an
http
endpoint
when
you
have
this
debug
feature
flag,
enabled
and
everything.
This
is
totally
not
safe
for
production,
so
I'll,
throw
that
out
there
again,
because
you
could
anyone
can
hit
this
endpoint
and
get
information
they
would
hit
this.
B
You
would
hit
this
dump
endpoint
and
the
invert
handler
behind
the
scenes
would
just
go
in
and
grant
all
the
vmis
that
exist
on
that
node
and
retrieve
from
their
file
systems
and
return
them.
Now
you
could
do
that,
but
maybe
that
would
be
too
much
information.
Then
we
get
back
into
the
pagination
issue,
because
if
you
have
like
100
virtual
machines
onto
a
node,
then
that's
a
ton
of
data
being
returned.
I
don't
know
it
gets
a
little
tricky.
You
could
pod
exec
and
use
a
copy
sure.
Why
not?
B
That's the best I can say. And it's possible that, if you really want to profile a virt-launcher, the cluster aggregation of all that might be too tedious, and we could add some hooks where you can profile it internally by doing your own pod exec in there and getting the data; so we'd just bake in the ability to turn this profiling stuff on and off in virt-launcher, and then you do it manually.
A
Okay, all right. Well, we're at time. Any last thoughts?