From YouTube: SIG - Performance and scale 2021-10-14
Description
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.phpt2kytr3mt
A
Okay, welcome to SIG Scale, it's October 14th. The link to the document is in the chat; please add yourself as an attendee. Okay, so let's get started. The first thing is the item that I added. It's pretty large, there are some snippets in here. So if you want to add agenda items, just add them at the top here, just so I don't miss them, but we can start with this.
A
So I was looking a little bit at tracing, and there is already some tracing in the code. The problem I was actually looking at was trying to figure out what's going on in the transition between Scheduling and Scheduled. I see a lot of time get taken up in that area, and I wanted to do a little bit of tracing. So I found this library; the Kubernetes API server actually uses it.
A
So basically, the way it works is: you get traces that, if they go over a certain amount of time, print to the log the amount of time from when you started the trace to when it stopped. And it has some cool things: you can add steps, and it takes the difference between the steps and ends up printing them out. So what I did was I actually added it.
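[The library isn't named on the recording; from the description (log-if-long traces with named steps, used by kube-apiserver) it is presumably k8s.io/utils/trace. A minimal sketch of that API, under that assumption, with illustrative names and values:]

```go
package main

import (
	"time"

	utiltrace "k8s.io/utils/trace"
)

// reconcileSomething is a hypothetical worker function.
func reconcileSomething() {
	// Open a trace; the field key/value here are illustrative.
	trace := utiltrace.New("reconcile VMI", utiltrace.Field{Key: "key", Value: "default/testvmi"})
	// On return, print the whole trace (total time plus per-step deltas)
	// to the log, but only if it exceeded the threshold.
	defer trace.LogIfLong(100 * time.Millisecond)

	// ... do some work ...
	trace.Step("fetched VMI")

	// ... do more work ...
	trace.Step("updated status")
}

func main() { reconcileSomething() }
```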
A
I looked in the controller, since that's where I'd expect most of the time between Scheduling and Scheduled to be spent, and I found something that's kind of weird to me. I don't have the full picture yet, but to explain what I'm doing: I have this thing keyed on the queue, and I have a count.
A
I'll show you the code, that's probably easier. I added this right in the execute function. Can you see my terminal, by the way? Making sure I'm showing everything. Yep? Okay. So what I do is: I start a trace with a key in this execute function; if I requeue, I just record a step; and I stop after we do the Forget. And when I do the recording, every time the key gets seen on the queue again, I just increment a counter over and over and record the time between the last step and this step, which is the key with the count. So that's how this comes out. You can see q1, q2; there's a three in there too, it's just fast.
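[A minimal sketch of the instrumentation as described, assuming the standard client-go workqueue and a single worker (the maps are not synchronized); the names and the requeue path are hypothetical:]

```go
package tracedqueue

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
	utiltrace "k8s.io/utils/trace"
)

var (
	traces = map[string]*utiltrace.Trace{} // open trace per key
	counts = map[string]int{}              // passes through the queue per key
)

// runWorker mirrors the execute loop described above: pop a key, process
// it, then either requeue (leaving the trace open) or Forget (closing it).
func runWorker(queue workqueue.RateLimitingInterface, process func(key string) error) {
	for {
		item, shutdown := queue.Get()
		if shutdown {
			return
		}
		key := item.(string)

		// First time this key is seen: open a trace.
		if _, ok := traces[key]; !ok {
			traces[key] = utiltrace.New("execute", utiltrace.Field{Key: "key", Value: key})
		}
		// Record a step named q1, q2, ... carrying the pass count; the
		// trace keeps the delta since the previous step automatically.
		counts[key]++
		traces[key].Step(fmt.Sprintf("q%d", counts[key]))

		if err := process(key); err != nil {
			queue.AddRateLimited(key) // requeue; trace stays open
		} else {
			queue.Forget(key)
			traces[key].LogIfLong(time.Second) // log total and per-step times if slow
			delete(traces, key)
			delete(counts, key)
		}
		queue.Done(key)
	}
}
```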
A
I actually went and looked at the object, and it's pretty accurate in terms of the total time. But the thing that was weird, and I saw this on pretty much every VMI that I looked at: right around the ninth time, or the eighth time, somewhere around there, that this object goes through the queue, you can see this.
A
Oh sorry, was that a question? So this is in the virt-controller; it's in the watch, the VMI execute loop.
B
Yeah, okay, so you start the trace the first time you see the key, and you end the trace once it's been forgotten. And if there's an error, I guess that's what I'm interested in the most: if there's an error, then we just do a step. Do you know, for the 43-second or whatever gap trace, if there were errors that occurred during that?
A
If we call execute here, I should see them here; I should catch them here, and I don't see any. I don't see anything occur for the 43 seconds. The only thing I get, I guess you could see, is that between q8 and q9 there's an update-status step in there, but it's very quick. It ends almost instantly and we go right to q9.
A
Yeah, okay. This total time would be the time it takes from when we first saw the key to when we finished processing it. The time between the queue steps would be when, like you said, we're executing: we're doing work on a key, it's popped off the queue and we're in execute.
B
46 seconds to execute a key. So do the Prometheus metrics show that? I would expect... we had some sort of metric around workqueue duration or something like that. Maybe we don't, but I would expect that to show spikes like this as well, if this is occurring. This is a really unexpected 46 seconds in virt-controller.
A
If I go back... all I did was bring a big cluster up, so you can try this. I can give you the patch, if you want.
B
If you could put that patch or your branch in the notes or something, that'd be helpful.

A
Sure.

B
My first instinct when I'm hearing this is that there's something unexpected happening with your patch, and less likely with virt-controller. This sounds crazy.
A
Yeah, I don't know, it's weird. I'm still trying to figure it out, because there's just something bizarre versus what I expected. I'm going to keep investigating this, but I wanted to see if there are any thoughts around it. What's cool, though, is that this library is really easy to integrate. It might be something... I don't know if it makes sense, if you want to do it in...
A
If we could do this in logging or something... I find this to be valuable. We could set the threshold to anything, like one second, which would probably be more reasonable, and we could actually see all the steps it takes for anything at or over one second. It might be something easy we can do to improve tracing. I don't know how this would integrate with other tools, like Jaeger and stuff, but it seemed like a pretty serviceable, easy on-ramp to get some information.
A
Yeah, and like you said, Dave, we do see some of this, like with the longest-running... was it remaining work or something, the metrics for this? I don't know if I have them somewhere here, Marcelo's pictures, but we do see pretty often that there are ones that are fairly long. I think we've seen 10 minutes.
A
I think, if I remember correctly, we see some very long ones. Let me see if I can find his previous document somewhere in here.
B
Yeah, so the thing that's surprising to me about this is that virt-controller isn't doing anything that's blocking. I mean, it makes some API calls and things like that, but I think those have deadlines, and under normal operation we're talking milliseconds for those. So for this to be causing 45-second, or 43, whatever it is, delays during execute... I can't think of anything that we do that would come close. That's very strange.
B
That
sounds
like
the
amount
of
time
for
scheduling
or
something
like
that
to
be
reflected,
and
even
that's
high
for
for
some
clusters.
So
time.
A
This is about a minute, right? Yeah, but the scheduling step is pretty fast, Scheduling to Scheduled. Okay, so it's all the way to Running, but here it is a second, yeah.
C
So where was this? ... No, yeah, so Pending to Scheduling is one second, right?
C
From the trace's perspective it would look like the time is really spent in the controller loop, but here it looks like it's more the phase where they were waiting to get scheduled, or something. Could you check the pod itself? Yeah, there you can see when it got created and when it got ready. That's also interesting for us.
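[For reference, a small client-go sketch of that check, reading the pod's creation timestamp and Ready condition; the function name and client wiring are hypothetical:]

```go
package podcheck

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// printPodLatency prints the gap between pod creation and readiness,
// which is where scheduling and image-pull time would show up.
func printPodLatency(client kubernetes.Interface, ns, name string) error {
	pod, err := client.CoreV1().Pods(ns).Get(context.TODO(), name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
			fmt.Printf("created %s, ready %s (gap %s)\n",
				pod.CreationTimestamp.Time,
				cond.LastTransitionTime.Time,
				cond.LastTransitionTime.Sub(pod.CreationTimestamp.Time))
		}
	}
	return nil
}
```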
C
Okay, I agree, but this explains the long time, from the timestamps and when the VMI was captured, right? Yes. So, no wonder: there is a huge number in the trace. The trace is even bigger than, or as big as, what we see here, but as we can see, it's not reflected in the timestamps here that there would be an additional delay where nothing happens in between. That's what I...
A
...wanted to see. So this is a minute, yeah, okay, and then that's still within that 40-second window. So what would we say we're doing? We're just waiting for this right here; this has to finish.
C
So I mean, if it's already on the node and it's tagged, it should... if not, you should use IfNotPresent as the pull policy.
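[In pod-spec terms that policy looks like the following sketch; the container name and image here are illustrative, not the actual KubeVirt values:]

```go
package pullpolicy

import corev1 "k8s.io/api/core/v1"

// With PullIfNotPresent the kubelet only contacts the registry when the
// image is missing from the node, avoiding a pull-time gap like the one
// discussed above.
var computeContainer = corev1.Container{
	Name:            "compute",                                         // illustrative
	Image:           "registry.example/kubevirt/virt-launcher:v0.46.0", // illustrative
	ImagePullPolicy: corev1.PullIfNotPresent,
}
```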
C
Then it's pretty simple: normally it should. Oh no, no, just this one, not yours... oh yeah, you're just pulling it normally, so the first time you schedule the VM it gets pulled, but only the first time.
C
And if you go down, it already says the container image is already present on the machine below, so it took more than one minute to find out that everything is already there. Ah no, sorry: the first pull is from the init container; the second message is for the container later on, which uses it too.
A
Yeah, I mean, I see it on all of them. There's five, 55.
A
I might try this in a different environment, because I don't understand that either. Okay, I'll play around with this, it's kind of interesting. Anyway, I just kind of found that pattern and thought it was weird. I can share it with you guys; I'll show you the patch afterwards, and if you guys want to play around with it, we can share it, whatever. Okay, all right. We have no other topics.
A
Do people have any other things before we go down that rabbit hole?
A
Okay, hold on, can we review some of these first? Let's see, we talked a little bit about this last time; there are some updates from last time.
A
So, if you restart the virt-controller, you can actually see this. Here's a picture of the metrics in Prometheus: you can see the label gets dropped off.
A
Then it gets reattached; the label gets picked up again.
A
So I'm not sure which watch, or which metric, does the labeling. We lose it somewhere when the virt-controller restarts, and then when another event occurs, we find the VMI again and label it based on its current phase.
A
What I've seen is that it won't ever reattach if you just have a bunch of running VMIs sitting there; if you just let the VMIs sit, it doesn't ever reattach. But for some reason, when I did a delete, I noticed that a few of them did start to reattach. So my thought was that maybe an event causes it to relocate the object and reattach the label.
A
Okay, let's see: profiling under high load. There are two more here; I just wonder how this is going.
D
Yes, so there's an open PR. I've addressed comments today from Janusz and David, and I'm just waiting for a response; I hope to have it merged soon. Okay, yeah, I haven't done the profiling on the live cluster yet. I mean, I did a bit, but I don't have precise results. I hope to have it ready in the next couple of days, so maybe I can share it in the next couple of weeks. At first glance, it looks like we spend a lot of time marshalling and unmarshalling data.
D
Additionally, we're using the standard encoding/json library, and there are more efficient libraries for marshalling and unmarshalling which use less CPU, they just do fewer operations, and less memory. So I'll try to see how much of an improvement we can get by simply replacing the dependency. Okay.
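[No specific replacement library is named on the recording; as one example of the kind of drop-in swap being described, json-iterator exposes a config that is API-compatible with encoding/json:]

```go
package encoding

import jsoniter "github.com/json-iterator/go"

// Shadowing the stdlib package name keeps call sites unchanged while
// routing Marshal/Unmarshal through the faster implementation.
var json = jsoniter.ConfigCompatibleWithStandardLibrary

// vmiStatus is an illustrative payload type.
type vmiStatus struct {
	Phase string `json:"phase"`
}

func encode(s vmiStatus) ([]byte, error) {
	return json.Marshal(s) // same signature as encoding/json
}
```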
A
I don't think we have Marcelo here, but we still have some open items on the previous experiments that Marcelo did. Okay, so we said we're going to profile; that'll be the next step for this one, and then...
A
Okay, all right, those were the bugs. Are there any other features or PRs out there that we want to look at? I saw David's request merged, and then David, you also have the change for CI that went in. Do we have any data at this point from the CI gathering, towards any of the thresholds?
B
We should, yeah. I haven't looked at those periodics, but it's a file that is stored as an artifact, or it should be, so we can begin; I mean, we could use that today to come up with some thresholds.
B
So we would want to look at the performance or the density test Prow periodic.
B
And look at the artifacts there to find the perf audit results.
C
Yeah, the config is also stored as an artifact, if you look here in the file.
A
Okay, all right, maybe next time we can do a report for that. All right, I pushed my changes... I mean, well, actually, before we do that: do we have any other topics you want to discuss?
A
Okay, all right. Also, could people add themselves as attendees? I've heard a bunch of people talk, and I only see two people. It's just to show that people are here. All right, I did push these changes; if you want to look at them, let me see, I'll link them.
A
All right, well, there's a link to the branch; try it. I guess for this, let me play around some more and see if I can figure out what this is. Maybe there's a mistake in here or something, but I want to see what's going on. And then, either way, depending on what I find, I'm going to maybe come up with a proposal for how, or what's a reasonable way, we could add this.
A
I think this is pretty convenient: maybe just very simple tracing, and maybe start with the virt-controller or something. Maybe it doesn't even have to be configurable at first, just something simple we can post in the logs. I don't think it'd be too verbose if we pick some sort of reasonable time threshold for the logging, maybe something like a few seconds, like five seconds or something.
A
Okay, okay, all right. Well, I don't have any other issues; I think we can close early then, if we have no more discussions or anything. And then, yeah, check this out; I'll message you guys on Slack afterwards, David.