From YouTube: SIG - Performance and scale 2022-02-03
Description
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.yg3v8z8nkdcg
A: All right, welcome to SIG Scale, everybody. It's February 3rd. I'll link the meeting notes document in the chat; please add yourself as an attendee.
A: Let's get started with the agenda. A few items for today: I want to review the performance periodic job results, because we're starting to see a lot more information now.
A: All right, let's start with the first one here. Okay, so this is after the changes with the primer, after we added the primer VM. This looks really good; we're starting to get a much more accurate create-pods count.
A: 102, also really good. 108, yeah. Those all look really good, so this is awesome, and we're getting this info now. Another really important thing about this is that I think it gives me a lot more confidence in these verb counts.
A: What's that, the delete? No, Marcelo, it's because of how we're running the job now. As of this week, we run the job, then we wait, and the delete gets run after the test completes, so it's way after the metrics have been scraped.
D: Controller, handler, and operator. For the handlers it can make sense; the handlers do heartbeating on the node, so they get...
A: Yeah, I know. I think it's okay; I'm just trying to pick out some high numbers in here, like update endpoints. What about this one? What endpoint would we be updating here?
A: Yeah, that would be every time we changed its phase, right? So we should expect at least, what would it be, four or five? Four, with pending and scheduling.
E: It's tough to correlate it directly to all the different fields, because a lot of fields are updated at the same time. You might hit Running and have some conditions set at the same time, for example, and then you might have some conditions that get set independently of any of the phases, so it's really tough. Okay.
A: Okay, and then just the get endpoints, roughly... yeah, this one's not.
E: Yeah, how about thresholds? Has that been discussed yet?
A: All right, I haven't... oh, that's where I was going. So for this part we can look at a few of these. Let's see here: I just pulled up three random runs, so we've got 37 on the 50, 35 on a 50, 30 out of 50. Thirty-seven, thirty-five, thirty. Okay, that's roughly a ten percent range of variation.
E: P95 is something we could probably use. I would not use p99; that's like the one percent worst case. But p95... it seems like if we take the highest one we got, which was 58, that's basically 60 seconds; if we added like 50 percent to that, maybe 90 seconds as a threshold. If we ever go above 90 seconds, we did something terribly wrong.
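As a concrete illustration of the gate being proposed, here is a minimal Go sketch that computes a p95 from per-run latencies and fails past a 90-second threshold. The sample values and function names are hypothetical, not the actual job's code.

```go
package main

import (
	"fmt"
	"sort"
)

// p95 returns the 95th-percentile sample using the nearest-rank
// method, which is plenty for a coarse CI pass/fail gate.
func p95(samples []float64) float64 {
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	idx := int(float64(len(sorted))*0.95+0.5) - 1
	if idx < 0 {
		idx = 0
	}
	if idx >= len(sorted) {
		idx = len(sorted) - 1
	}
	return sorted[idx]
}

func main() {
	// Hypothetical per-run creation latencies in seconds.
	latencies := []float64{30, 35, 37, 41, 58}
	const threshold = 90.0 // seconds; the proposed "terribly wrong" line

	if v := p95(latencies); v > threshold {
		fmt.Printf("FAIL: p95=%.1fs exceeds %.0fs\n", v, threshold)
	} else {
		fmt.Printf("PASS: p95=%.1fs\n", v)
	}
}
```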
E: One thing about the ephemeral virtual machines here is that we should have already synced the image, right? I guess it depends on what image we're using. I'm trying to think whether there's any case where we're pulling the actual boot image to the nodes first, because if we're doing that, then...
A: No, no, this is from the periodic tab.
C: Yeah, I think it's using the image that gets uploaded when you do cluster-up.
A: How many nodes is this? Just one node, two nodes?
C: Yeah, so I have the other job running on the performance cluster, but it's actually not running right now. I created a PR because the image was missing Go, just something very simple, you know? I have a PR fixing that, but it hasn't been merged yet. Once it's merged, we can see the job run on the performance cluster.
A: I completely agree, because I think this is going to help us a lot with performance, and then it's going to impact scale, because the number of verbs we issue is going to affect Kubernetes, so we should be conscious of this. I don't know, what should we pick out of here? I would expect... this one looks pretty good, like patching virtual machines.
E: Pretty close. Because it's patching... what is it patching, the ready status, the ready condition thing, right, from the pod? Is that what it was? It's syncing something that's not guaranteed.
A: So maybe... what would be a case where it's not one-to-one? Well, for instance, this is an estimate; if it's not one-to-one, let's say it was 200, that means they all failed whatever the condition was.
E: Yeah, we're specifically looking for loops here: something where, if two controllers collide, they get into this competition, both trying to update an object at the same time. Maybe it eventually resolves itself, but it would result in double or maybe even triple the number of patches and updates.
D: Okay, I think we can even set it pretty close to what we see here and say for the patch it's two-to-one, and for the update it must be below nine-to-one or ten-to-one, as long as we can just change the number in a pull request when we see a legitimate reason for it to go up, right?
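To make the shape of that gate concrete, here is a minimal sketch of per-verb ratio thresholds living in code, so they can be bumped via pull request; the values and names are placeholders from this discussion, not an existing implementation.

```go
package main

import "fmt"

// verbThresholds is the maximum allowed requests-per-object ratio
// for each verb, per the numbers floated above. Living in source
// means a legitimate increase is a one-line pull request.
var verbThresholds = map[string]float64{
	"patch":  2.0,  // roughly two patches per object
	"update": 10.0, // must stay below ~10:1
}

// checkVerbRatio returns an error when a run exceeds its gate.
func checkVerbRatio(verb string, requests, objects float64) error {
	limit, ok := verbThresholds[verb]
	if !ok {
		return nil // no gate defined for this verb
	}
	if ratio := requests / objects; ratio > limit {
		return fmt.Errorf("%s ratio %.1f:1 exceeds %.1f:1", verb, ratio, limit)
	}
	return nil
}

func main() {
	// Example: 450 updates observed for 100 pods -> 4.5:1, passes.
	if err := checkVerbRatio("update", 450, 100); err != nil {
		fmt.Println(err)
	} else {
		fmt.Println("update ratio within threshold")
	}
}
```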
C: I think maybe also the number of gets and lists, maybe not as a threshold, but something to analyze here, because it will change: the number of gets and lists will change if the VM is running with PVCs or multiple NICs, things like that. For example, for get we only have get endpoints and get nodes.
E: That's for our heartbeat; the get nodes is for the virt-handler heartbeat. Wow, 600 already.
A: And this is interpolated over five minutes. That's insane! So a hundred a minute, five hundred in five minutes. Yeah, really weird, that last one.
C: Can you see the other runs?
E: ...accurate, because everything else is accurate.
C: So again, something I think I mentioned before: we also need the response code here, to understand which requests got an HTTP 200 response and which got 400s and 500s, because then we could see whether those get-nodes calls are 400 and 500 responses.
A: Yeah, I see here that we patched the nodes roughly at the rate you just said, Roman. It's fairly close; it's 12.
E: ...would expect. Yeah, I think it's our node labeller; we have a controller that...
A: I don't know what this list would be. The only thing I can think of here is that when we ran the job, we created a role or something, or we checked something on a service account. What would these be?
C: When we have many questions like that, we also don't know who is actually making the request, you know? Is it...
A: Okay, I don't even see these in the other runs. I only see that one, only the list service monitors, which I don't know what it is, and then there are no lists here. I'm guessing it's there and I'm just missing it. It's probably a small window to capture, because I bet it's literally happening once, probably right at the start of the test, and we have to grab the data at the right point in time to see it with the interpolation.
A: We possibly just missed it, but I think it's pretty insignificant. Maybe it's not something we care about. If we see more of these show up, then I might be a little more concerned, but I don't think we need to worry about it.
C: Because I was expecting some lists, you know, for example a list VMIs somewhere, and this would help find those kinds of things.
A: This looks like it's from virt-operator or something, yeah. Okay, I think that's pretty good, and these seem pretty good, so we can do thresholds around these right here; I'll get those written up and then investigate this one, which is a little unclear. Okay, cool, all right, that's good; we're getting some data from that. Okay, let's go to the second bullet point, and I'm glad you're here today.
A: I heard you mention this in the community call, and I think it's actually going to be important for our jobs to use the Serial tag, because we don't want any of them running in parallel or we're going to get noisy results.
D: Are you executing the tests through the functest script?
A: No, it's the perf-test script.
D: Okay, then it doesn't affect you; only the functests make sense with the Serial tag. If you're running outside of that, you're not affected, but it may still be good to indicate that they're supposed to be run serially. You should not be affected.
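For reference, here is a minimal sketch of what marking a spec as serial looks like in Ginkgo v2; the suite and spec names are hypothetical, and older suites express the same intent with a "[Serial]" label in the spec text.

```go
package perf_test

import (
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

// The Serial decorator tells Ginkgo v2 never to run these specs
// in parallel with any other specs, which is the behavior the
// perf jobs want even though they run from a separate entry point.
var _ = Describe("[sig-performance] density", Serial, func() {
	It("creates a batch of VMIs and checks API verb counts", func() {
		// Hypothetical body; a real spec would create VMIs here.
		Expect(true).To(BeTrue())
	})
})
```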
A: All right, cool, okay. Well, let's go to the third bullet point; we'll go to the PR. This is a follow-up to the PR from last time, where I hardcoded the five-minute range vector. Let me go to the PR from last time, because... this is the one.
A: Here we go; I think I had it at the bottom. The follow-up I wanted to do was to establish a relationship between the range vector and the Prometheus scrape interval, because they're absolutely related. I did a bunch of testing on this, and that relationship is how the interpolation is determined and how the values get set.
A: The other thing is that we need to monitor the range vector and make sure it's the right length based on the test duration. If it's too short, we're going to miss data and should extend it; if it's too long, we need to be cautious of that as well.
A: So there are a few things I wanted to capture, and that's what I went through with this. The relationship I established is 10x between the range vector and the Prometheus scrape interval. I set the Prometheus scrape interval to 30 seconds as a global variable there. I thought about trying to look up the actual Prometheus scrape interval, but...
A: ...I didn't think that was a good idea. I'll give the user the ability to set it if they want to, but it's hardcoded to 30 seconds, which is what our tests use. 10x that comes out to five minutes, which is what we're using, and that seems reasonable given what we were seeing: the five-minute range vector gave us reasonable interpolation metrics for increase().
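A minimal sketch of that 10x rule, assuming the 30-second default described above; the metric and query are illustrative, not the job's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// The scrape interval is a 30s global default that the user can
// override; the range vector for increase() is always derived
// from it with the 10x rule, which yields five minutes here.
var scrapeInterval = 30 * time.Second

func rangeVector() time.Duration {
	return 10 * scrapeInterval
}

func main() {
	// Illustrative PromQL: interpolated count of PATCH requests
	// against nodes over the derived range vector (prints "5m0s").
	q := fmt.Sprintf(
		`sum(increase(apiserver_request_total{verb="PATCH",resource="nodes"}[%s]))`,
		rangeVector(),
	)
	fmt.Println(q)
}
```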
A: So that's why I went with that. Then, like I said, the other piece was setting the range vector to be close to the test duration. I wish I had some actual data here, but basically... there's a graph here.
A: What I want to do, let's see if I can find a good one here, is have us run the audit right near the end of the test. Actually, you can see right here: this value is 22, and you can actually see it right at the end here.
A: It's actually coming down toward the end of this test, because the primer test, or whatever, is running in front. So what I did was increase the buffer: I added a buffer between tests of two Prometheus scrape intervals, so a minute by default, to give us some time between tests. And I want us to query one Prometheus interval back from the end, on that offset, so somewhere in this area.
A: They move a little bit, so I bring it in to where the results are fairly stable, like you can see here where they're fairly flat; those are the points where I want to grab them. So for a five-minute interpolation, right around four and a half minutes is where I want to grab it for the most accurate results.
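A small sketch of that sampling placement, continuing the assumed 30-second scrape interval from above; the names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

var scrapeInterval = 30 * time.Second

// bufferBetweenTests leaves two scrape intervals (one minute by
// default) between tests so the series can settle.
var bufferBetweenTests = 2 * scrapeInterval

// queryTime places the evaluation one scrape interval back from
// the end of the test, where the interpolated values are flat.
func queryTime(testEnd time.Time) time.Time {
	return testEnd.Add(-scrapeInterval)
}

func main() {
	end := time.Now()
	fmt.Println("buffer:", bufferBetweenTests) // 1m0s
	fmt.Println("query at:", queryTime(end).Format(time.RFC3339))
}
```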
E: Essentially, an updated KubeVirt caused all of this to be wrong, so I changed it to use the ControllerRevision, and I can compare the ControllerRevisions against each other. I'm using the exact same API version every time, and it works; it's just really tedious. That's all I changed.
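A minimal sketch of comparing ControllerRevisions for equality, which is the shape of the fix described; this is an assumed illustration using the stock apps/v1 type, not the actual patch.

```go
package main

import (
	"bytes"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// sameRevision reports whether two ControllerRevisions captured the
// same serialized state. Comparing the raw Data (recorded with the
// exact same API version each time) avoids the drift that a KubeVirt
// update introduced into direct object comparisons.
func sameRevision(a, b *appsv1.ControllerRevision) bool {
	return a.Revision == b.Revision && bytes.Equal(a.Data.Raw, b.Data.Raw)
}

func main() {
	a := &appsv1.ControllerRevision{Revision: 1}
	a.Data.Raw = []byte(`{"spec":{"running":false}}`)
	b := a.DeepCopy()
	fmt.Println(sameRevision(a, b)) // true
}
```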
A: Cool. Okay, all right, and the third one: this is the SLOs document that I've talked about previously. I haven't seen many comments on it except from Marcelo. Are people okay with it? Mainly what I wanted to do with it, like I said, is describe what we want to do with our testing and where we want to go, the way Kubernetes has its SLOs document, where they try to...
A: ...you know, they have tested and confirmed the SLOs they have for the platform.
A: We could go that way as well, and I think advertising them through our testing is the way we could do it. That's kind of what I'm doing here, just laying the groundwork, and then we need to implement the testing for it.
C: Yeah, it's merged, so kube-burner now has the extension to create KubeVirt objects, VMs, VMIs, and replica sets, wait for the ready state, and collect the detailed latency metrics. From the VMIs, in the end, it just builds a map and gets the timestamps for all the states as the VM and VMI change state, and also as the pod changes. So it gets, for example, the time the VM is created...
C: ...then when the VMI is created, because the VM creates the VMI; then when the pod is created, the pod is initialized, the containers are ready, the pod is running, the VMI is running, and the VM is ready. So it gets all of this. Yeah.
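A minimal sketch of that timestamp map, keyed by state; this is an assumed illustration of the approach, not kube-burner's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// phaseTimes records the first time each phase was observed for one
// object; watch events update the map, and latencies are simply the
// deltas between entries.
type phaseTimes map[string]time.Time

func (p phaseTimes) observe(phase string, t time.Time) {
	if _, seen := p[phase]; !seen {
		p[phase] = t // keep only the first transition into a phase
	}
}

func main() {
	vmi := phaseTimes{}
	start := time.Now()
	// Hypothetical transitions for one VMI.
	vmi.observe("Created", start)
	vmi.observe("Scheduled", start.Add(2*time.Second))
	vmi.observe("Running", start.Add(9*time.Second))
	vmi.observe("Ready", start.Add(11*time.Second))

	fmt.Println("time to Ready:", vmi["Ready"].Sub(vmi["Created"])) // 11s
}
```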
C: Yes, actually I ran several tests doing that, but it was on other nodes. I can send a picture of that later. I also did some tests in the CI, so I will.
A: All right, okay. I'll have to try it at some point on our internal clusters.
A: Okay, cool, all right, I think that's all we have. Did anyone get the answer to this? Are we using pre-synced images for this job? I think someone said they...
D: From where we saw the results just now, yes, but I'm not sure if it gets pre-synced. Could be.
A: Okay, so for follow-up on...
C: No, it's not; I'm not using it. In the performance cluster I'm not using this image, because it's a Kubernetes that's already running there and I'm not pushing the image, so the first image comes from Quay. You see some downloading time on the first VM that gets created.
A: All right, so to follow up on these: I can take the action to put together thresholds, and I'm going to do it for all of these, basically what we have here. Oh, this one's a duplicate. I'll do it based on these ratios, and I'll start adding a pass/fail based on what we see against those thresholds.
A: Okay, and then I'll create an issue for this one; I'll just open it and attach it here, and then we can do follow-ups on it and see what people find out about it.