From YouTube: SIG - Performance and scale 2021-10-28
Description
Meeting Notes: https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.jol87qyjgei
A
All right, welcome to SIG Scale. It's October 28th; the link to the document is in the chat. If you have agenda items, add them, and add yourself as an attendee, please. Okay, we'll start with the first item, the periodic job threshold. Actually, before we start on the first item: last week I canceled. We didn't have too many items to discuss, and we didn't have a lot of attendance, so I figured it could very well be pushed to this week. So this item, the first item here, the periodic job threshold. I don't know if there's been an update on this since we originally talked about it, I think two weeks ago, but I wanted to ask: has there been any change with this?
A
If not, I just want to get some work items down, and kind of what can be done to get this working, and I can take it if not.
B
I don't think there has been, okay, yeah. I'm sorry, I feel like that's partly my responsibility; I have not gotten to it. So the path forward, if you want to write down action items: we have to build that perf audit tool, and we probably want to gather results with it. Once it's actually built, we should start getting results posted to an artifact, which could already exist today.
B
So we'll see the periodic run, hopefully successfully. We'll see... sorry, we won't see thresholds; we'll see profile results, and based on a collection of those, if we see it run for a few days or whatever, we could probably start setting thresholds, which would then alert us when things regress.
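
As a rough illustration of the threshold idea described here, a minimal sketch in Go: compare newly gathered profile results against baselines seeded from a few days of periodic runs. The Result type, metric name, and limit are hypothetical, not the audit tool's actual types.

```go
// Hypothetical threshold check: flag any gathered result whose p99 exceeds
// its baseline. None of these names come from the actual audit tool.
package main

import "fmt"

// Result is one measurement pulled from a periodic run's posted artifacts.
type Result struct {
	Metric string
	P99    float64 // seconds
}

// thresholds would be seeded from a few days of collected periodic results.
var thresholds = map[string]float64{
	"vmi_scheduling_to_scheduled_p99": 5.0, // placeholder baseline
}

// checkRegressions returns a message per result over its threshold; this is
// what would eventually fail the job or alert when things regress.
func checkRegressions(results []Result) []string {
	var failures []string
	for _, r := range results {
		if limit, ok := thresholds[r.Metric]; ok && r.P99 > limit {
			failures = append(failures,
				fmt.Sprintf("%s regressed: %.2fs > %.2fs", r.Metric, r.P99, limit))
		}
	}
	return failures
}

func main() {
	results := []Result{{Metric: "vmi_scheduling_to_scheduled_p99", P99: 7.3}}
	for _, f := range checkRegressions(results) {
		fmt.Println(f)
	}
}
```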
A
Okay. And then, talking about building the perf audit tool: does this just need to be added to the path in the makefile for the build or something, or is there a specific path it needs to go into?
B
I think it just needs to be built. I think it's going to end up in the expected path that the automation's looking for, if it just gets built.
B
I can show you here; I'll grab that PR for you real quick, because it's already been merged, and that'll give you an idea of exactly what code exists to execute the perf audit and why it doesn't work.
A
I thought... oh, okay, I thought it would show all the history. Okay, there we go, all right. So I was talking as if everyone could see the document in the chat; I thought no one could see this. Okay, sorry about that, everybody.
A
All right, I'm fine to look at what to do with this periodic job and then test it. Okay, that gives me enough. All right, perfect. So next week, I'm hoping, if I can solve this in time, that we can get started on getting these thresholds, and we can start making some decisions as to where we stand, and perhaps start looking at different ways we can gate around those thresholds.
A
Definitely, yeah. And then, once we get the... yeah, I'm hoping that as part of this I'm going to learn a lot about this periodic job, like where it runs; I don't know anything about it.
B
Okay, here's the chat, and I mean, I would look through what Marcelo has committed as far as pull requests go. Okay.
A
Directories,
I'll
I'll
look
through
it
I'll
see
what
I
can
figure
out
with
that'll.
Give
me
enough
to
go
on.
Okay,
all
right.
Thanks
david
all,
right
lum.
I
think
we're
good
on
this
topic.
Let's
go
to
the
second
bullet
point,
so
I
this
actually
the
segways
off
what
you're
saying
so
additional
audit
tool
measurements.
So
I
was
some
background
on
this.
I
was
looking
at.
I
was
looking
around.
A
It's
actually
doing
some
some
tracing
work
and
looking
at
an
issue-
and
I
found
a
bunch
of
interesting
things,
different
ways
that
we
can
actually
measure
some
of
the
the
times,
and
these
are
all
things
that
actually,
I
think,
would
fit
just
fine
in
the
auto
tool.
So
this
is
what
I
came
up
with
right
now
they
we
so
we
can.
We
can
see
the
scheduling
to
scheduled
transition
latency.
A
We
can
measure
that
and
we've
got
in
our
metrics,
but
there's
also
some
things
that
we
can
actually
get
off
the
objects
to
that
tesla.
Some
other
things
like
latency
between
when
the
pods
are
ready
and
the
vmi
object
transitions
to
scheduled.
A
We
can
actually
see
on
the
pod
when
the
containers
go
to
ready
it's
in
it's
actually
in
the
conditions
there
it's
in
the
status
and
then
we
can
also
see
when
the
vmi
object
transition
is
scheduled,
so
we
can
actually
start
putting
some
more
some
more
data
points
down,
there's
also
like
latency
between
the
vert
launcher,
pods
being
assigned
to
a
node
to
the
creation
timestamp.
A
This
is
on
the
pod
we
can
see
like
when
the
network's
assigned,
like
the
node
name,
is
actually
filled
in
there's
a
time
stamp.
That's
that's
put
there
and
we
obviously
have
a
trade,
the
creation
times,
amp,
there's
the
launcher,
pods
being
assigned
a
network
if
you
have
so
never
plug
in
when
those
get
laid
down.
That's
also
there
we
could.
If
we,
if
there
are
padded
pvcs,
we
could
actually
look
and
see
the
pvc
that
is
being
allocated.
A
There's all this stuff that has timestamps around who is updating what fields and when, and we can actually examine it to provide some more information here.
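
A minimal sketch of the retroactive measurement being proposed: compute the gap between the virt-launcher pod going Ready and the VMI reaching Scheduled, purely from timestamps already on the objects. The kubevirt.io/client-go import path and the PhaseTransitionTimestamps field reflect one reading of the v1 API of this era and should be verified against the vendored types.

```go
// Sketch: derive one of the latencies discussed above from object status,
// with no new metrics required.
package main

import (
	"fmt"

	k8sv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	v1 "kubevirt.io/client-go/api/v1"
)

// podReadyTime returns when the pod's Ready condition last became True.
func podReadyTime(pod *k8sv1.Pod) (metav1.Time, bool) {
	for _, c := range pod.Status.Conditions {
		if c.Type == k8sv1.PodReady && c.Status == k8sv1.ConditionTrue {
			return c.LastTransitionTime, true
		}
	}
	return metav1.Time{}, false
}

// vmiPhaseTime returns when the VMI recorded a transition into the phase.
func vmiPhaseTime(vmi *v1.VirtualMachineInstance, phase v1.VirtualMachineInstancePhase) (metav1.Time, bool) {
	for _, t := range vmi.Status.PhaseTransitionTimestamps {
		if t.Phase == phase {
			return t.PhaseTransitionTimestamp, true
		}
	}
	return metav1.Time{}, false
}

// reportPodReadyToScheduled prints one of the latencies proposed above.
func reportPodReadyToScheduled(pod *k8sv1.Pod, vmi *v1.VirtualMachineInstance) {
	ready, ok1 := podReadyTime(pod)
	sched, ok2 := vmiPhaseTime(vmi, v1.Scheduled)
	if ok1 && ok2 {
		fmt.Printf("pod-ready -> vmi-scheduled: %v\n", sched.Sub(ready.Time))
	}
}

func main() {} // wiring to informers or a YAML crawl is left out of the sketch
```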
A
I'm
thinking
it
goes
in
the
audit
tool,
because
what
I'm
thinking
is
that
we
can
actually
find
we
could
take
the
break
down
even
further
to
these
things,
which
I
would
be
really
interested
in
seeing
because
when
I
look
at
the
right
now,
when
I
look
at
scheduling
to
schedule,
I
can
see
the
time
and
and
it'd
actually
be
nice
to
see
like
even
more
like
what
went
into
you
know
the
scheduling
schedule,
because
it's
actually
not
hubert.
B
Yeah, everything you're saying makes sense to me. How are you measuring this latency? Is it exposed today in metrics?
A
This
latency,
the
the
four
I
don't
think
so.
No,
like
you
mean
like
in
like
in
either
in
cuba
or
in
humanities
in
some
way,
is
that.
B
Right
right
do
we
have,
I
guess,
a
way
of
detecting
this
today,
even
if
it's
complex
do
we
have
a
way
of
determining
this
has
occurred,
retroactively.
A
We're
retroactively
like
yeah,
so
it's
on
the
actual
objects
it's
on
the
animals
we
could
so
so
I'm
way
I'm
introducing
this
here
is
if
the
odd
tool
can
go
through
and
look
at
the
ammos,
the
vmi
animals
and
just
kind
of
look
through
a
bunch
of
them
crawl
through
a
bunch
of
them
and
and
dig
up
this
information
after
you
know
it's
after
the
vmi
is
running,
for
example,
but
we
could
do
metrics
on
this
too.
That's
that
that
might
be
possible,
because
these
are
also
events.
B
B
A
Necessarily
watch
like
I
wasn't
sorry,
I
wasn't
thinking
that
necessarily
it
would
watch
like
this
can
be
all
done
retroactively
just
to
clarify
like
this
can
be,
but,
like
you're
saying
it's,
they
aren't
metrics
today.
So
wouldn't
it
would
be
a
little
bit
different
than
what
two
is
currently
doing,
but
mainly.
B
How do you get around the problem that, if we're measuring time for shutting down a VMI or something like that, any time we delete the object, then we lose all that data?
A
Right
but
the
so
the
metrics,
let's
say
we
delete
the
vmis
and
then
we
parse
the
metrics
again.
Aren't
they
gone.
B
Metrics
stick
around
for
the
okay,
so
we
have
they
stick
around
forever.
I
mean
there's
a
there's
a
peer.
I
mean
that's
just
the
database
for
all.
B
Some
days
or
something
they
start
getting
purged
but
they're
going
to
be
around
certainly
longer
than
our
load
test.
A
Yeah
you're
right,
okay,
these
could
be
metrics,
I
mean
dude.
What
let's
I
mean,
I
kind
of.
May
we
go
down
that
path
like
does
do
we
s?
Would
we
ever
like
see
when
I
see
value,
but
we
have
do?
Would
it
make
sense
that
we
measure
this
kind
of
thing
inside
of
hubert,
that
we
I
mean,
I
think
we
can.
The
events
are
there
and
the
objects?
Are
there
they're
being
updated?
We
could
see
them
we're
already
watching
them.
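
If these latencies were measured inside KubeVirt instead, it could look roughly like this sketch with a Prometheus histogram; the metric name and buckets are hypothetical, and KubeVirt's real metrics registration differs in detail.

```go
// Sketch: expose the pod-ready-to-scheduled latency as a histogram metric
// rather than digging it out of YAMLs afterwards.
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var podReadyToScheduled = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "kubevirt_vmi_pod_ready_to_scheduled_seconds", // hypothetical name
	Help:    "Latency between the virt-launcher pod becoming Ready and the VMI entering Scheduled.",
	Buckets: prometheus.ExponentialBuckets(0.1, 2, 10), // 100ms up to ~51s
})

func init() {
	prometheus.MustRegister(podReadyToScheduled)
}

// ObservePodReadyToScheduled would be called from the controller at the
// moment the VMI transitions to Scheduled, using the objects' timestamps.
func ObservePodReadyToScheduled(podReady, vmiScheduled time.Time) {
	podReadyToScheduled.Observe(vmiScheduled.Sub(podReady).Seconds())
}
```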
B
This
is
giving
us
a
more
fine,
granular
understanding
of
what's
happening
between
vmi
transitions.
So
right
now
we
have
scheduling
our
schedule,
scheduling
to
schedule,
and
then
these
latencies
will
give
us
like
more
fidelity
between
those
and
same
thing
with
scheduling
to
running,
and
things
like
that.
I
think
the
question
to
me
is:
do
we
need
that
fidelity
yet
like?
B
If
we
we
find
that
we
need
a
very
specific
understanding
of
certain
latencies,
that's
more
than
the
like
the
fine
buck
or
the
the
kind
of
large
buckets
that
we
have
today,
then
that's
when
I
would
start
looking
at
at
these.
A
That's
kind
of
the
background
here
and
that
and
that's
kind
of
why
I
got
into
tracing,
to
figure
it
out
and
it
and
through
tracing
I
completely
eliminated
hubert's
work
queue.
It
was
not.
It
was
nothing
to
do
with
the
vertica
controller.
It
was
everything
to
do
with
was
actually
specifically
to
do
with
with
pvcs
and,
and
that
was
interesting
is
that
I
found
what
I
actually
found
is
that
keywords
work
you
is
is
executing
pretty
fast.
A
I
mean
it's
it's
almost
instantaneously,
and
but
I
wouldn't
when
I,
when
I
actually
look
at
this,
this
transition,
the
kubert's
work
queue
or
the
work
that
kubert's
doing
is
tiny
in
this
transition,
and
it's
getting
it's
kind
of
given
the.
Maybe
it's
given
the
wrong
impression.
If
you
don't
know
that
it's
giving
the
wrong
impression.
B
I
think
I
would
expect
that
so
scheduling
means
that
we've
posted
the
pod
and
scheduled
means
that
everything
between
scheduling
and
schedule.
My
expectation
is
that's
all
kubernetes,
because
that's
all
just
making
the
pod
run
somewhere
and
as
soon
as
as
we
see
it,
running
we're
just
sitting
at
the
schedule,
but
we
aren't
doing
anything
between
that
time
span.
A
Yeah
I
mean
it,
it
makes
it
makes
sense
to
me
like
it
like
right.
We
have
pods
going
and
pods
are
impending.
They
go
to
like
this.
The
pods
are
coming
up
during
that
time.
That's
the
majority
of
the
work,
but
it
was
helpful
because
I
actually
discovered
an
issue
in
the
process
that
I
found
that,
specifically
with
pvcs,
ended
up
being
the
case
and
there's
also
some
other
things.
I
found
like
the
the
amount
of
time,
even
network
assignment
node
assignments.
A
These
things
seem
to
be.
These
are
helpful
to
know
and
this,
and
that
was
the
other
thing
it's
like
there.
So
I
looked
at
other
metrics.
For
example,
scheduling
the
scheduler
has
a
an
end-to-end
pod
time,
which
was
helpful
and
that
it
could
like
it.
It
gave
roughly
a
gauge
of
of
what
to
expect
in
in
the
times,
but
it
also
didn't
talk
about
like
what
it
was
like.
What
what
is
going
into
this,
like,
specifically
to
me
like
I,
was
interested
in
the
pvc
time,
which
ended
up
being
really
slow.
A
What
ended
up
happening
so
it
basically
what
I
did
is.
I
went
through
the
scheduler
at
the
schedule
of
logs
and
noticed
that
the
the
time
it
took
for
pvc
to
be
to
be
allocated
and
a
node
assigned
was
was
very
long
and
then
found
out.
There
were
a
lot
of
pvcs
a
lot
more
than
expected
that
were
just
sitting
around
and
that
they
weren't
doing
anything
and
the
scheduler
kind
of
looking
in
the
code.
B
So
the
way
I
would
approach
this
is
we
want
what
we're
looking
for,
specifically
with
scheduling
the
scheduled
is
to
understand
how
long
the
kubernetes
part
is
taking
like
what
what's
happening
at
the
kubernetes
scheduling
layer.
So
I
would
investigate
if
there
are
any
metrics
related
to
the
kubernetes
scheduler,
that
we
can
start
introspecting
and
add
that
to
the
perfona
tool
to
give
us
more
understanding.
But
here's
also
the
thing
if
it's
outside
of
cube
vert.
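
A sketch of that suggested first step, assuming a reachable Prometheus endpoint: query the kube-scheduler's end-to-end scheduling latency so the audit tool can account for the Kubernetes part of Scheduling to Scheduled. The address is a placeholder, and scheduler_e2e_scheduling_duration_seconds matches schedulers of this era but has since been renamed, so check against the cluster's version.

```go
// Sketch: pull the scheduler's p99 end-to-end latency out of Prometheus.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	// p99 end-to-end pod scheduling latency over the last five minutes.
	query := `histogram_quantile(0.99, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket[5m])) by (le))`
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	result, warnings, err := promv1.NewAPI(client).Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println("p99 scheduler e2e latency:", result)
}
```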
A
Yeah, and I think the reason, my reasoning, is that we've talked about some of the thresholds here; these are the kinds of things where we're setting expectations for whatever we expect Kubernetes performance to be, and there's that pod scheduling time metric that I mentioned from the scheduler. I think that's a big factor that was pretty well correlated to this time. Yeah, and then any of these other ones could just be...
A
I don't know, things that we can additionally see value in. I think some of the others, maybe, you know, if PVCs aren't one... I think one of the ones that would at least be interesting is this one here: when the pods go ready to when the VMI object switches to Scheduled.
A
Since that's the last step that we do, we expect that's in our control, and that's one that we expect to be really fast. But there could be... I mean, sometimes it takes a second; I've seen in some cases it takes many seconds.
A
This would also be useful to know, and this is actually, I think, one of the original reasons behind the QPS change: this gap was a little bit larger. I know this one could be another one that we could use. But anyway, I think, like what you said, maybe start with the kube-scheduler metrics as a first step on this; that might be an easy on-ramp.
A
Okay, next topic: tracing. I had mentioned previously doing some tracing, and I'm kind of looking for some ideas and opinions from folks. There's some work in the community around this, and I don't know how far we want to go with it, but there's also a fairly simple library that is really handy for making tracing work.
A
Then we spit it out into the log as an easy first step, and then maybe this could be something that we can look at later. But I don't know what people think; this seems the easiest way to start. I'm not that familiar with tracing, though, or with what Kubernetes is doing. I don't know if anyone else is.
A
Yeah, I think we already include it, and yeah, it's vendored. I didn't have to add it; it was already there. It's just... wow, yeah. So this is... wait.
A
Yeah
the
functions
and
everything
and
the
distractions
are
there,
we
basically
just
we
basically
just
need
to
put
them
in
the
right
places,
which
is
also
another
question.
That's
that
I
want
to
get
some
opinions
on,
because
so
let's
say
that,
let's
say
that
this
makes
sense.
Where
would
where
would
we
want
to
add
tracing
because
I've
got?
A
Yeah, it was already there. So just look for trace, or start trace, or step trace.
B
Sir,
what's
a
tracer,
let's
say
here,
I
was
looking
for
trace
yeah.
A
It's
in
utils
cameras.
He
tells
trace.
A
All right, so what do people think? Where do people want traces? Because I can do a few of these. The work queue seems to make sense to me, in a different way than I was doing it before, because I think the mistake I was making before was that I was measuring time between keys, which was actually measuring the time it took for Kubernetes to do its work; it was measuring Kubernetes's work.
B
What I'm most interested in is the VMI and the VM work queues in virt-controller and virt-handler. So, for example, if I look at virt-handler, that one's really interesting to me: to understand that the VM has been queued, and to understand where we're spending the most time performing work on it, up to the point where we return.
B
So
that
would
tell
me
that,
for
example,
let's
say
we
are
performing
vert
handler
work,
q
and
a
vm,
and
we
get
to
the
point
where
we're
syncing
the
vm
with
vert
launcher,
and
I
can
tell
that
that
function
is
taking
like
almost
a
second
or
something
like
that.
Then
that
would
tell
me
that
something's
happening
and
vert
launcher
is
causing
this.
This
grpc
call
to
block
for
longer
than
I
expected
things
like.
That
would
be
like.
I
have
no
visibility
into
that
today,
but
tracing
would
allow
us
to
to
do
that.
B
A
Okay, yeah. So what I'll do is... yeah, right when it's popped off the queue, I'll do basically what I did here: start the trace and add a bunch of steps in there, and then we can have a threshold. For what, I don't know; I'll just play around with it and see. Maybe a second or something; I don't know what we expect, but I'll throw some amount in.
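
A minimal sketch of that pattern with the vendored k8s.io/utils/trace package: start a trace when the key is popped off the work queue, mark steps through the sync, and log only past a threshold. executeVMI and its helper steps are hypothetical stand-ins for the real virt-handler sync path.

```go
// Sketch: trace one work-queue execution with k8s.io/utils/trace.
package main

import (
	"time"

	utiltrace "k8s.io/utils/trace"
)

func executeVMI(key string) {
	// One trace per work-queue execution, keyed by the object being synced.
	t := utiltrace.New("virt-handler vmi sync", utiltrace.Field{Key: "key", Value: key})
	// Placeholder threshold; tune it once real numbers come in.
	defer t.LogIfLong(time.Second)

	fetchVMI(key)
	t.Step("fetched VMI from informer cache")

	syncWithVirtLauncher(key)
	t.Step("synced with virt-launcher over gRPC")

	updateStatus(key)
	t.Step("updated VMI status")
}

// Hypothetical stages of the sync, standing in for the real calls.
func fetchVMI(key string)             {}
func syncWithVirtLauncher(key string) {}
func updateStatus(key string)         {}

func main() { executeVMI("default/testvmi") }
```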
B
So yeah, did you post your PR? I mean, I saw your code for some of this. Did you post it as a PR, or is it... no?
A
I'm going to redo it, because, like I was saying before, the way I have it now isn't correct: it's actually measuring the time that we're waiting in the queue for events from Kubernetes. That was the mistake I was making; I thought we were doing work or something during that time. That was not the case; we're actually just waiting for our informers to get work.
B
Great, yeah. And just do a really simple PR: maybe just do one trace with some steps in it. Pick one; virt-handler is great, virt-controller is fine, and the VMI or VM controller. And then document how to use it and everything, and that gives you a precedent that we can add more to afterwards.
A
Okay, that's all I had for topics. Do people have any other items we want to talk about?
B
...give us developer insight into how to do the tracing and the performance profiling, like getting pprof back. Those are going to be our two primary tools, I think, for improving the results.
A
Yeah, okay. So I'll take this one, and I'll do the tracing as well. For this one in the middle, I'm going to create an issue... actually, I think I'll add to this existing issue in the test framework; it has some ideas that we can look into. I'll mention the scheduler metric in there, and this is something we can consider as areas we can expand.
A
Okay, cool, all right. If there are no other topics, I think we're done. All right, thank you, everybody. All right, bye! Thank you. Bye.