From YouTube: SIG - Performance and scale 2023-09-21
Description
Meeting Notes:
https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.tybh
A: So, the first item — these items have actually been carried over for a little while, since this is our first meeting in, I think, three weeks. A little while ago I came across this PR, and essentially what it is: it has to do with adding a status on the VMI spec that will assist with live migration. I commented on here that there's a... I can't see it here.
A: Basically, it's a field that we'd update pretty frequently on the VMIs, and so there's been some discussion about this. I left a comment on it and talked with them about it, and they're aware that there are some challenges here: if we took this PR, it would really increase the number of update calls, or the number of patch calls, to the VMI.
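For context on why frequent status writes are expensive, here is a minimal, hypothetical sketch (not the PR's actual code) of what a periodic status patch to a VMI would look like with the dynamic client; the field name is made up for illustration. Every VMI doing this on a short interval multiplies the PATCH load on the API server:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	vmiGVR := schema.GroupVersionResource{
		Group: "kubevirt.io", Version: "v1", Resource: "virtualmachineinstances",
	}

	// Hypothetical field name; the real PR's field may differ.
	patch := []byte(`[{"op": "add", "path": "/status/migrationDirtyRate", "value": 42}]`)

	// Refreshing this for every VMI every few seconds means one PATCH per VMI per tick,
	// plus one watch event delivered to every controller watching VMIs.
	for range time.Tick(5 * time.Second) {
		_, err := dyn.Resource(vmiGVR).Namespace("default").Patch(
			context.TODO(), "my-vmi", types.JSONPatchType, patch,
			metav1.PatchOptions{}, "status")
		if err != nil {
			fmt.Println("patch failed:", err)
		}
	}
}
```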
A: So there is an alternate approach that Roman proposed. Instead of going this route, what we can do — the way it was described to me — is this: the way metrics are reported right now in Kubernetes, pretty much everything goes through Prometheus, but I guess there's some proposal out there that would allow the user to change how metrics are reported, so that they don't necessarily go through Prometheus and we could send them to some other places.
A: In other words, we can have some sort of signal handling in the form of metrics, which is essentially what this is, and have something listening off the end of it. So instead of just Prometheus — we could send this to Prometheus and read it back out of Prometheus, but that's not what we want. We'd rather have these metrics sent out and have something else that we write listening for these formatted metrics, and then...
A: ...we can have some tool that does something with it. That was basically the suggestion Roman had in here: instead of having a bunch of objects, or a bunch of fields, that we update on the spec. So I don't think this is going to be implemented the way the PR currently does it, but we'll have to keep an eye on it.
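A minimal sketch of the metrics-style alternative, assuming a hypothetical per-VMI gauge (the metric name and value here are made up): instead of patching the VMI status, the handler exposes the value as a metric, and anything — Prometheus or a custom listener — can scrape it without generating API-server writes or watch events.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Hypothetical gauge; the real proposal may use a different name and labels.
var migrationDirtyRate = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "kubevirt_vmi_migration_dirty_rate_bytes_per_second",
		Help: "Estimated rate at which the guest dirties memory (illustrative only).",
	},
	[]string{"namespace", "name"},
)

func main() {
	prometheus.MustRegister(migrationDirtyRate)

	// The process updates the in-memory gauge as often as it likes;
	// nothing hits the Kubernetes API server until a consumer scrapes /metrics.
	migrationDirtyRate.WithLabelValues("default", "my-vmi").Set(42e6)

	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":8080", nil)
}
```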
A: The use case here is this: live migration works in a lot of cases, but there are some cases where it doesn't work — or where it doesn't even make sense — and the reason is that there is a lot of activity on the VM.
A: It's writing a lot to disk, or to memory, or something, to the point that doing the live migration would be extremely challenging. Because if you think about it, we have to take the data across — the new VM and the old VM have to end up with the same state — and the migration has to pick up all the writes to disk and things like that and then start executing on the new VMI.
A: But the challenge is, if we're doing so much work on the old VMI — if it's writing a lot to disk or something — then it's really difficult to do the live migration. So the idea is that this is a metric to say: okay, this VMI is going to have a really hard time doing a live migration, because we've got a high dirty rate or whatever, while this other one will have an easy time, because it isn't actually doing a whole lot — it's not writing a lot to disk.
A: It won't have a hard time live migrating and moving over to the new VMI. So that's the background for this. The use case would be: if I wanted to migrate all of the VMs in my zone — I think it was mentioned this was for upgrades, for upgrading things — and I wanted to migrate them all, we'd want to increase the chance of success, and so with this metric...
A: ...we can figure out all the ones that we'll be successful with. Then I'm not sure how we'd handle the rest — I think maybe we'd need user intervention or something when we have VMs that are really hot and we don't want to migrate them right now. I'm not sure what we'd do there, but the idea is that we'd identify them.
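As a rough illustration of that selection step (the names and threshold are hypothetical, not from the PR), a bulk-migration tool could read the dirty-rate signal for each VMI and only schedule migrations for the ones below some threshold, deferring the "hot" ones for manual intervention:

```go
package migrate

// Candidate pairs a VMI with the dirty-rate signal discussed above.
// Where the number comes from (Prometheus query, custom agent, etc.) is
// intentionally left open; this only shows the selection logic.
type Candidate struct {
	Namespace, Name string
	DirtyRateBps    float64 // bytes/second the guest is dirtying memory
}

// Split separates VMIs that are likely to migrate quickly from "hot" VMIs
// that should be deferred or handed to the user. maxDirtyRateBps is a
// made-up knob, not an existing KubeVirt setting.
func Split(vmis []Candidate, maxDirtyRateBps float64) (migrateNow, deferLater []Candidate) {
	for _, c := range vmis {
		if c.DirtyRateBps <= maxDirtyRateBps {
			migrateNow = append(migrateNow, c)
		} else {
			deferLater = append(deferLater, c)
		}
	}
	return migrateNow, deferLater
}
```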
B: So I think the problem I see with this is that it can cause a watch storm. Because this is an update in the status field, any controller that is watching the VMI will always get an update event and trigger a reconcile for it. So yeah, I understand this is going to be very hard to scale.
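To make the watch-storm concern concrete: every status patch produces a new resourceVersion, so every informer watching VMIs receives an update event. A common mitigation in controller-runtime-style controllers — shown here purely as an illustration, since KubeVirt's own controllers are structured differently — is to drop events where only the status changed, for example by comparing generations:

```go
package controller

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	kubevirtv1 "kubevirt.io/api/core/v1"
)

// SetupWithManager wires a reconciler so that status-only updates (which do
// not bump metadata.generation) are filtered out instead of triggering a
// reconcile for every frequently refreshed status field.
func SetupWithManager(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&kubevirtv1.VirtualMachineInstance{},
			builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Complete(r)
}
```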
A: Yep, so hopefully the solution I mentioned is going to be the path forward, and not something on the status field. Unfortunately, I don't fully understand it, because I haven't read anything about it yet, but Roman's comment is somewhere in here — I don't know which one, but somewhere in here he made that recommendation.
A: I don't see it, but I think that's the path this is going to take: they're going to try to find some sort of other signaling outside of this — some sort of API outside of this, I guess you could say, but not something like a CRD. Since this is essentially a metric...
A: ...you use the metrics system — not Prometheus necessarily, but some other way of emitting a signal like a metric — and then have something scrape it, some sort of agent or something. So I don't know, but as far as I'm concerned, as long as we don't grow the spec, I don't think there will be any concerns for us on the scale side.
A: Okay, cool, all right. So nothing more on this one. I think we'll skip this item and do it later; let's go to Jed's question: he's currently trying to get a better understanding of the rate-limited work queues and the metrics exposed by the virt-controller controllers — like the VMI and migration queues — looking at Prometheus graphs of the KubeVirt workqueue depth, and expecting the values to go up when creating lots of VMs at the same time, but they rarely do. Anyway, does someone know? Okay, I think so.
A: I guess the workqueue depth is only going to go up when a reconcile takes too much time. I mean, in theory the number of queued items can go up if we run into a situation where we're doing a lot of work, but it's not only dependent on the amount of work we do — it's much bigger than that, because we're calling Kubernetes and we're calling lots of other things in KubeVirt.
A: It would take something in that chain of events to be slow, and then some retries to happen, and then for us to build up a work queue. So I think a bunch of things would need to happen, not just tons of concurrent VMs and migrations.
B: I think there is one more thing around this: the workqueue metrics are exposed by the underlying client-go as well, and those are not prefixed with kubevirt_, so it might be worth checking those too.
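For reference, these are the standard client-go workqueue metric names to check alongside the kubevirt_-prefixed ones; each series carries a name label identifying the specific queue (listed as Go constants purely for convenience):

```go
package metrics

// Standard client-go workqueue metrics (no kubevirt_ prefix); each series has
// a "name" label identifying the specific work queue.
const (
	WorkqueueDepth          = "workqueue_depth"                             // current number of items waiting in the queue
	WorkqueueAdds           = "workqueue_adds_total"                        // total items added to the queue
	WorkqueueQueueDuration  = "workqueue_queue_duration_seconds"            // time items spend waiting before being picked up
	WorkqueueWorkDuration   = "workqueue_work_duration_seconds"             // time spent processing an item
	WorkqueueUnfinishedWork = "workqueue_unfinished_work_seconds"           // in-progress work not yet observed by work_duration
	WorkqueueLongestRunning = "workqueue_longest_running_processor_seconds" // longest currently running processor
	WorkqueueRetries        = "workqueue_retries_total"                     // total retries handled by the queue
)
```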
A: Then, on testing bulk migrations: "Scheduled 100 VM migrations, waited for completion, and then scheduled another 100 migrations. The pace was slowly degrading with every bulk, starting at 20 seconds and reaching 1,570 seconds in the last bulks. In order to debug this I scheduled 800 VM migrations" — and so this is about tracking down the root cause.
A: If this showed seconds per VM and per bulk, it would be easier to, you know... Is there any sort of — I don't know if there's any sort of — we would see this, wouldn't we? I think so. So he's saying — whereas Jed was saying he didn't see it in the KubeVirt workqueue depth — but maybe it's not there. This is what you were saying: it might not be there, it might just be the workqueue depth from client-go that shows this.
A
Okay,
so
I
guess
there's
an
open
question.
I,
don't
also
understand
this
Behavior
like
five,
so
it
also
I
think
that's
another
thing.
It's
like
the
behavior
doesn't
seem
right
and
that,
like
we
should
be,
we
should
be
taking
things
whenever
there's
one
of
these
one
spot
free
in
the
parallel
migrations
per
cluster,
it
should
be
freed
up.
A
We
should
take
another
one,
I
guess.
Maybe
that's
not
the
that's
not!
The
point
is
that
maybe
what's
happening
is
so
parallel,
migrations
per
cluster.
So
it's
doing
five
migrations
and
at
the
same
time
it's
got
a.
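For reference, the limit being discussed lives in the KubeVirt CR's migration configuration (spec.configuration.migrations). A minimal Go sketch of the relevant knobs, assuming the field names in kubevirt.io/api/core/v1 (worth double-checking against the current API), with the commonly cited defaults of 5 migrations per cluster and 2 outbound per node:

```go
package config

import (
	kubevirtv1 "kubevirt.io/api/core/v1"
)

func uint32Ptr(v uint32) *uint32 { return &v }

// migrationTuning returns the migration limits discussed above. KubeVirt's
// defaults are believed to be 5 parallel migrations per cluster and 2
// outbound migrations per node; changing them changes how many of the
// queued migrations can be in flight at once.
func migrationTuning() kubevirtv1.MigrationConfiguration {
	return kubevirtv1.MigrationConfiguration{
		ParallelMigrationsPerCluster:      uint32Ptr(5),
		ParallelOutboundMigrationsPerNode: uint32Ptr(2),
	}
}
```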
B: Yeah, I think what I'm understanding is — and I don't know if this is possible — that there will be 95 of them sitting in the queue, and for some reason they are exhausting the queue with constant updates, so none of them gets processed. I don't know if that's what is happening, but while the five are being migrated, the 95 could, you know, continue to re-queue and cause watch-queue exhaustion. I don't know if that's happening here.
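A small sketch of the re-queue pattern being described here, using client-go's rate-limiting work queue (purely illustrative; the numbers and the "over the parallel-migration limit" condition are hypothetical): each deferred migration goes back onto the queue with backoff, and with a large enough backlog the queue can stay busy with items that cannot make progress yet.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Same kind of rate-limiting queue the controllers use.
	q := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	// Pretend 100 migrations were created in one bulk.
	for i := 0; i < 100; i++ {
		q.Add(fmt.Sprintf("migration-%d", i))
	}

	inFlight := 0
	const parallelMigrationsPerCluster = 5 // hypothetical stand-in for the cluster-wide limit

	for q.Len() > 0 {
		key, _ := q.Get()
		if inFlight >= parallelMigrationsPerCluster {
			// Over the limit: the item is not processed, it just goes back on the
			// queue with backoff. With 95 such items, this is what keeps the queue busy.
			q.AddRateLimited(key)
			q.Done(key)
			continue
		}
		inFlight++ // start this migration (actual processing omitted)
		q.Done(key)
	}
}
```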
A: So I guess another thing we're missing is: why is it that — I guess the thing I wonder is, what would happen if you increased that setting, say to a hundred or a thousand? I don't know what would happen.
B: I'll try to take a look at the work queue and see whether what we are doing is similar to client-go or not.
A: I guess that would be the question. Okay, I'll just message Jed on Slack so we can correspond with him that way, and maybe he's seeing something on the virt-controller side — I don't know. I'll tag you on the thread as well and we'll just go about it that way.
A: Okay, all right. Is there anything else? Let me check our post-v1 tracking. Is everything merged here — is everything all done, or did we... Okay, still some open items.
A: Oh, this is a PR review. Wait, where's the... this one.
A: Okay, these are all still open. Do we need to have someone like Daniel take a look at some of these?
B: He was helping out with one — the thing that we need next, the updating of the graphs — but it has fallen off the back burner. So, okay.
B: Look at the work-in-progress one.
B: Okay, here — 2931.
B: Yeah, so that's the PR. It has been a while and it's a small one, so I'll try to ping Lubo again to see if he gets time for this. If not, maybe I can help out and push changes to that PR myself.
A: Okay, yeah, all right. If you could keep thinking about that, let's see if we can get this further. Yeah, I thought we were close, because I thought we had...
B: Yeah, so I think Lubo ran into one issue, which was that the PR here creates a post-submit job. Our current automation scrapes the metrics and puts them in the ci-benchmarks repository; then this job gets triggered, because it's a post-submit job, generates the HTML, and pushes it to GitHub again. So the issue was that, because this was a post-submit job, it would always recursively invoke itself, which creates a problem.
B: It's okay to do a pre-submit, right — just the day after?
A: Yeah, that's fine. We don't need these to be same-day — yeah, that works fine. Okay, cool, all right. Well, you can start there with him; let's see if we can get this going again. It sounds like we've already thought a lot about this and we're pretty close — we just need a little bit more to get it over the finish line. Yeah, cool. Okay, all right, thanks. I think that's all.
A: Okay, okay, all right. Anything else? We'll move this one — we'll just track it for whenever you have time, and we'll talk about it then.