From YouTube: SIG - Performance and scale 2023-03-23
Description
Meeting Notes:
https://docs.google.com/document/d/1d_b2o05FfBG37VwlC2Z1ZArnT9-_AEJoQTe7iKaQZ6I/edit#heading=h.tybh
A
Okay, all right, welcome to SIG Scale, everybody. It's March 23rd. Please add yourselves as attendees, and please enter topics. Okay, we're going to start off with analyzing the performance job. Specifically, I think what we should do is: B, why don't you start by talking about some of the work we've been doing, some of the graphs, and kind of the new way we eventually want to analyze these things?
B
Sure, do you mind if I share my screen? Yeah.
B
Can you see my IDE? Yep. Okay, so today I would like to talk a little bit about some of the new tools that we are working on to analyze the scale results, that is, the results produced by the sig-performance jobs.
B
So there are two major buckets of jobs that we run. One is the pre-submits, which can optionally be run on every PR, and the other is the periodics, which run every day.

B
So instead of going and manually reviewing, we've been trying to work on a tool which can give us a nice graph. I have opened the tool here, and I just wanted to go through some of the phases of this tool and the results from it. The way I'm thinking about this tool is in three phases. Phase one will go look at each job, collect its results, and put them in this directory, so the output format will be the directory name, slash results, slash the job name. So this is here: output/history, then the periodic KubeVirt end-to-end sig-performance job, and then under each, this is the job name and this is the job ID that was run. For each job ID we will get the VMI results and the VM results. This is directly scraped from the build-log.txt. In the future, this phase might go away, because we might get the ability to dump the observed values into an artifacts directory.
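To make that phase-one layout concrete, here is a minimal Go sketch of the collection step, assuming results are still scraped from build-log.txt. The regex, the vmi-results.txt/vm-results.txt file names, and the collect helper are illustrative assumptions; only the output/history/<job-name>/<job-id>/ layout comes from the discussion above.

```go
// Package history sketches phase one: scrape perf values from a job's
// build-log.txt and write them under output/history/<job-name>/<job-id>/.
package history

import (
	"fmt"
	"os"
	"path/filepath"
	"regexp"
	"strings"
)

// Assumed log format: lines such as "VMI creationToRunningSecondsP95: 42.1".
// The real build-log.txt format may differ.
var perfLine = regexp.MustCompile(`(VMI|VM) (\S+): ([0-9.]+)`)

func collect(jobName, jobID string, buildLog []byte) error {
	dir := filepath.Join("output", "history", jobName, jobID)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	var vmi, vm []string
	for _, m := range perfLine.FindAllStringSubmatch(string(buildLog), -1) {
		line := fmt.Sprintf("%s: %s", m[2], m[3])
		if m[1] == "VMI" {
			vmi = append(vmi, line)
		} else {
			vm = append(vm, line)
		}
	}
	if err := os.WriteFile(filepath.Join(dir, "vmi-results.txt"),
		[]byte(strings.Join(vmi, "\n")), 0o644); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, "vm-results.txt"),
		[]byte(strings.Join(vm, "\n")), 0o644)
}
```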
B
This was suggested in one of the threads, so this phase might get a little bit easier: we wouldn't have to scrape the build-log.txt for results. Moving on from there, we can do lots of things with this data. One of the initial steps that I took a shot at is to aggregate this data into weekly averages and then plot them on a graph.
B
So phase two of this tool is another subcommand that will aggregate the results per resource, and it will give a summary of the average. The output goes into the output/history/weekly subdirectory, and in that, let's check the VMI.
B
One of the interesting metrics here is creation-to-running P95. If you look at this directory, the subcommand creates multiple directories, one for the starting Monday of each week, and within that we have the results. So the start date is here, the average is here, and within this data structure there are data points for each date, so you can go through this and figure out what the data points are and what their averages are. But really, this is pre-processing for the next phase, which is plotting this in a chart.
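A minimal Go sketch of that phase-two aggregation: group the per-job data points by the Monday that starts their week and average them. The Monday-keyed weekly grouping is from the talk; the dataPoint type and function names are assumptions.

```go
// Package weekly sketches the phase-two pre-processing described above.
package weekly

import "time"

// weekStart returns the Monday (UTC, midnight) of the week containing t,
// matching the per-week directory names mentioned above.
func weekStart(t time.Time) time.Time {
	t = t.UTC().Truncate(24 * time.Hour)
	daysSinceMonday := (int(t.Weekday()) + 6) % 7 // Monday == 0
	return t.AddDate(0, 0, -daysSinceMonday)
}

type dataPoint struct {
	Date  time.Time
	Value float64 // e.g. creation-to-running P95 in seconds
}

// weeklyAverages groups raw per-job data points by week and averages them.
func weeklyAverages(points []dataPoint) map[time.Time]float64 {
	sums := map[time.Time]float64{}
	counts := map[time.Time]int{}
	for _, p := range points {
		w := weekStart(p.Date)
		sums[w] += p.Value
		counts[w]++
	}
	avgs := make(map[time.Time]float64, len(sums))
	for w, s := range sums {
		avgs[w] = s / float64(counts[w])
	}
	return avgs
}
```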
B
CI Health already had a good tool that would put this data into a single plot: it would draw scatter points for these values and then a line chart for the averages across the weeks. So now I would like to show phase three, which is the result of this aggregation.
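For the static-graph variant of phase three, the plot is scatter points for individual observations plus a line for the weekly averages. Below is a hedged Go sketch using gonum/plot; the actual tool builds on the existing CI Health plotting code (and can also emit a Plotly HTML page), and the sample values here are made up.

```go
// Static-graph sketch: per-run scatter points plus a weekly-average line.
package main

import (
	"log"

	"gonum.org/v1/plot"
	"gonum.org/v1/plot/plotter"
	"gonum.org/v1/plot/vg"
)

func main() {
	// daily: one point per job run; weekly: one point per Monday (averages).
	daily := plotter.XYs{{X: 1, Y: 32.0}, {X: 2, Y: 39.5}, {X: 3, Y: 40.1}, {X: 4, Y: 33.2}}
	weekly := plotter.XYs{{X: 1, Y: 32.0}, {X: 4, Y: 37.6}}

	p := plot.New() // gonum/plot >= v0.9; older versions also return an error
	p.Title.Text = "VMI creation-to-running P95"
	p.X.Label.Text = "day"
	p.Y.Label.Text = "seconds"

	scatter, err := plotter.NewScatter(daily)
	if err != nil {
		log.Fatal(err)
	}
	line, err := plotter.NewLine(weekly)
	if err != nil {
		log.Fatal(err)
	}
	p.Add(scatter, line)

	if err := p.Save(8*vg.Inch, 4*vg.Inch, "creation-to-running-p95.png"); err != nil {
		log.Fatal(err)
	}
}
```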
B
You can see that right around this time, the weekly creation-to-running performance degraded a little bit, because there are some observations where it went to around 39-40 seconds, and then right around here, oops, right around here is when we got it back. There are actually two ways to plot this in this subcommand: one is to get a static graph, the other is to get a Plotly graph, which is much more dynamic.
B
It creates an HTML output, and here you can figure out where exactly things started to get bad and where they got better. One thing we have been trying to do over the past week: we know when things got bad, which is December 22nd-23rd, 2022, and we know when things got good, which is, sorry, January 22nd-23rd, 2023, so we're trying to vet each PR in that range and find the culprit. We do have a couple of PR pairs that we have a suspicion on, but we were not able to confirm one way or the other. So yeah, that's the VMI creation-to-running. I do have a couple of other interesting observations; this is the second chart.
B
This chart is the weekly pod patch count for the VMI. Right around this week, the patch count for the VMI doubled: initially we had one pod patch per VMI; now we are having two.

B
Similarly, sorry, wrong one. Similarly, if you look at the patch count on pods from the VM, it also follows the same trend. So this lets us see that something in the code base changed right around this time which increased the patch count per VMI by one. We were actually able to pinpoint what that change was, and we'll share more details in the sig-scale talk next week.
B
Another observation from this tool was the patch counts for virtual machines versus virtual machine instances. On the left, the chart is the patch count for virtual machine instances when the VMI was created via the VM controller. You can see that there was an increase in the patch calls in the first week of February and the second week of February; there were two spikes. But if you, I do not have the right chart here, let me get that.
B
So if you look at the same plot for the VMI, this plot is: if a user creates a VMI manually, the patch counts for the VMI remain stable, right? So there were two changes that went in, in the way the VM controller manages virtual machines, and because of those two changes this patch count increased.

B
Yeah, so if you have any questions or feedback on things to change or things we can do better, please share.
C
Hi, I mean, this is great, and thank you for taking the task on. I just have a question: do we actually track, or bake in, which PRs the pre-submits are for?
B
Yes, so the results that I have churned right now are for the periodics. There is some improvement to be done for running the same thing against PRs.

B
The pre-submits data is organized a little bit differently than the periodics in the GCS bucket, so I haven't gotten a chance yet to modify this tool to look at the pre-submits. But once I do, we should be able to get the PR number in here, and I think the better place to put it is in this interactive chart.
C
Perfect. And I think even what B is writing could maybe have the PRs in the periodics: for example, which PRs went in within the window we are plotting the graph for. That would also be interesting to see.
C
So if we are going to plot the graph once per week, say every Friday, we can use the GitHub client to actually query which PRs got in that week and then include it in the plot.
A
Yeah, you'll have to, yeah, okay, so you, oh, you could do this on the client side, I see. Okay, because I was going to say we could do this based on a time frame, or you could include it in the periodic, but I think what you just said...

A
It makes sense: the time frame where you query GitHub. If we're doing it weekly, we check what's been merged in a week, and if someone wants to do this every three days, that's their time frame, which I think is allowed in your tool the way you set it up; then it'd be the same thing, we would just grab what PRs were merged in those chunks of time, every three-day period or whatever. Yeah.
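The GitHub query being discussed could look roughly like this with the google/go-github client, using the search API's merged: date-range qualifier. The repository and the exact week shown are illustrative:

```go
// List the PRs merged into a repo within a given window, so they can be
// attached to the weekly plot.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/google/go-github/v50/github" // any recent major version works
)

func main() {
	ctx := context.Background()
	client := github.NewClient(nil) // unauthenticated; use a token-backed client for real use

	// GitHub search syntax: all PRs merged into kubevirt/kubevirt in one week.
	query := "repo:kubevirt/kubevirt is:pr is:merged merged:2022-12-19..2022-12-25"
	result, _, err := client.Search.Issues(ctx, query, &github.SearchOptions{
		ListOptions: github.ListOptions{PerPage: 100},
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, pr := range result.Issues {
		fmt.Printf("#%d %s\n", pr.GetNumber(), pr.GetTitle())
	}
}
```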
B
So I think that might be, so CI Health might already be doing this, right? We might have some code written to pull these PRs; I'm not sure, but I'm trying to think aloud, because I know CI Health does some processing on the PRs that went in. It might have a utility, and we can take it from there.
B
Okay, is there any other call, I mean, metric, that you would like to see? I've tried to look at the major ones, and these are the ones that popped out, but I'm not sure if I missed anything; there are lots of them here.
C
I actually wanted to ask if we do want to track all of it. Maybe, I don't know, maybe use a few which we know might affect the performance a lot, and the others would not be collected; maybe we could just take those on demand if we see that something spiked and it's not what we would usually expect.

C
We don't have garbage collection on the job, so technically we would not lose it.
C
And did you think about which graphs you want to publish, or do we want to publish all of them? I mean, maybe the most interesting ones would be the P90/P95 for creation-to-running, right? Yes, that's most interesting, and then I think the operations, at least patch.

B
I was actually thinking of adding some more metrics to this, for example, the CPU and memory utilization of virt-handler and virt-launcher.

B
Those are the two things that would be nice to have plotted over time, just so we can have an idea of how the memory consumption and CPU consumption of our components are evolving at scale. Those might be metrics we would have to figure out in the audit tool and then populate here.
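As a sketch of where such numbers could come from, the following queries Prometheus for the working-set memory of virt-handler and virt-launcher pods via the client_golang API client. The Prometheus address and the exact metric/label names are assumptions, not necessarily what the audit tool will end up using:

```go
// Query Prometheus for per-pod working-set memory of KubeVirt components.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://prometheus.example:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// cAdvisor working-set memory, summed per component pod.
	query := `sum by (pod) (container_memory_working_set_bytes{pod=~"virt-(handler|launcher).*"})`
	val, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		log.Println("warnings:", warnings)
	}
	fmt.Println(val)
}
```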
A
No, it's okay, you can move it up, yeah, there we go. Okay, what else do we think makes sense in here? I mean, limit ranges? I don't think so. Jobs? No, doesn't make sense. ConfigMaps and most of this stuff is...
A
We have to do some, I forget what phase it is, but whatever it is, when we commit it to git, I think what we do is we keep only the ones that we have designated here in the, can you share your URL? I'm highlighting something, and sure, I'll grab it from here.
A
There. I think this is kind of like our regexes, and we keep these, and then, if we see any other weird things by just analyzing the job, because, like was said, if we wanted to, we could go back and look at the actual job, look at the plain text, and see it there, we can always add more later if we find anything strange.
B
Those are nice, that's cool, yeah. My only thought would be to be careful with the data: if we want to get other calls in the future, even if we don't publish them, we should leave the door open for us to process them later. And I think it was mentioned that we don't garbage-collect at the bucket layer, so we should be fine.
A
All right, and yeah, like you said, we'll be talking some more about this: at the KubeVirt Summit we'll have some more on this and some graphs and stuff, or in some PRs that go along with the graphs, okay. What else do we want to have? That was the topic I know was at the top of the list. Do you want to talk about anything to do with the Windows VMs?
D
Yes. So I can discuss in general what we do today around KubeVirt. The last workload we have is called boot storm. We test the time, similar to your test, with 1 and 100 VMs, but we test in parallel the total memory allocation per VM and per node and the total CPU; we fetch it...
D
...from Prometheus for the specific run, and we run it twice a week to see that we get a stable result against the latest OpenShift cluster. We started to test the density at around 300 VMs, and our limitation in general is memory: we configured the request memory to 128 MB in order to run the maximum number of Fedora VMs.
D
So the minimum memory we... the memory is two gigabytes; yes, two gigabytes per Windows Server 2019 VM.
D
So if you launch more than that, you will get memory-insufficient errors, because you reach the limit.

D
In general, when we run it, we run it in balance according to the number of CPUs.
D
So each time we launch 20 VMs, and in order to see that they start at the same exact time, we run the YAML with running: true, so all the VMs will start at the same exact time. And we do it as a ramp-up; what I mean by ramp-up is that each time we add 20 more VMs until we reach the limit on each node, and then we continue with the next node.
D
So at the end we have only the already-running VMs. As I said, we collect the Prometheus data for each run, push the data into Elasticsearch, and share the results in Grafana, so we have a clear result for each nightly run.
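A minimal sketch of that batch launch, assuming the KubeVirt Go API types: each VirtualMachine is created with spec.running set to true so a whole batch of 20 boots at the same time. The namespace, VM names, and the omitted template are placeholders; creation would go through the KubeVirt client.

```go
// Build one ramp-up batch of VMs that all start as soon as they are created.
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kubevirtv1 "kubevirt.io/api/core/v1"
)

const batchSize = 20

func newVM(name string) *kubevirtv1.VirtualMachine {
	running := true
	return &kubevirtv1.VirtualMachine{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: "boot-storm"},
		Spec: kubevirtv1.VirtualMachineSpec{
			// running: true makes the VM start as soon as it is created,
			// so the whole batch boots at the same time.
			Running: &running,
			// Template (disks, memory requests, etc.) omitted for brevity;
			// a real VM needs it to pass validation.
		},
	}
}

func main() {
	for i := 0; i < batchSize; i++ {
		vm := newVM(fmt.Sprintf("windows-vm-%03d", i))
		// A real run would create vm through the KubeVirt client here.
		fmt.Println("would create", vm.Name)
	}
}
```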
A
I didn't understand: how many cores per node would you say it was?
D
That's in the place where we share the results, but it depends. So we create the distribution of the results, and in general the times look stable. Sometimes we need to investigate why there is intensive memory usage and all this stuff, but in general we install the nightly CNV build of OpenShift, okay, so we take the latest version, which I guess should be the latest version of KubeVirt, right?
D
I don't know exactly which version; I actually tag the CNV nightly version inside our Grafana dashboard.
A
Okay, so that's cool. I guess, like, so we're...
D
By the way, we are keeping all the logs; we store all the logs in S3 buckets. We discovered before that you don't have enough space to keep the logs, so we upload the targeted data to Elasticsearch for future analysis, and next to each entry we have a link to the S3 bucket. So with one click you get all the logs onto your local machine and you can do local analysis, without having to, you know, keep all the logs locally.
A
Cool, well, that's pretty awesome. I mean, I think that's cool, and yeah, as you get some results from this, you know...
A
Let us know, and yeah, like I said, we have a similar hardware topology on the dedicated cluster, and the tests we're doing aren't quite the same; they're not the same. We're just, you know, creating VMs in certain quantities and analyzing the results. So, I mean, what I'm actually curious about is, as you do some of this stuff, it would be cool to see...
A
We have the audit tool that basically scrapes the metrics and analyzes them in a format such that they can be consumed by the graphing work that B is doing. So eventually, when B has that published, you should be able to use the audit tool and then build graphs, and it would be cool to see, based on what you're doing here, what else we find. Because you're doing nightlies, we should be able to compare with what you're seeing, and maybe we can get some additional data points from it.
A
I mean, so, like with what B was showing earlier, this stuff with these metrics, like the HTTP request counts: you mentioned you're looking at the amount of memory and CPU on the node and how that gets affected by the scale that you're going to. What I was suggesting is that you could also incorporate, as part of your analysis, the HTTP requests and the VMI phase-transition times.
A
Both of those things will give you more insight into performance, and one of them, this one in particular, into scale on the Kubernetes side. And so, when you use the audit tool for this, and I can send you a link afterwards, it's something you can just run locally as part of your job; it's pretty easy to add in, yeah.
D
Do I need to run it against each VM to get the memory? No?
A
No, you just, so what you'll do is, the way it works is: after you do your test, and this is exactly what we do, say, we run our tests and then we run this audit tool.
A
So you do exactly what you're doing today; after you're done, you run this audit tool, and it'll capture a bunch of this stuff from Prometheus and organize it in such a way that it will be helpful for doing performance and scale analysis. And then, when B has this stuff at...
A
If you have an old cluster, one you've used in the past, it doesn't matter, but say you create one VM before you start, and then you do your whole scale-up here with like 300 Fedora VMs: you run your audit tool afterward, after you're done, over the time period that you want to analyze, and it will give you a bunch of data about the client-go HTTP requests and the VMI transition times.
A
You know, it organizes it in such a way that you can get the P90s, the P95, P99, the P50, and a bunch of that stuff. And then B is working on this tool that he presented earlier where you can graph this stuff; since it sounds like you're storing the data somewhere, we should also be able to get this tool pointed at your persistent storage.
D
It gives us the percentiles across all the VMs that are running on the specific cluster, right? Yeah.
A
Yeah, you could get this, and right, you can get it in Grafana if you want. The point, the reason I was mentioning it, is: if you wanted to do this programmatically, like if you wanted to write a bunch of code around the results, you'd run this tool and run a bunch of code around it. But you could obviously do this in Grafana if you want to review each job one by one.
D
Nice. So is it something embedded in the client?
A
It's not, it's a part of upstream KubeVirt. I can point you to it. You just use, you just use, let's see here...
A
The tool is right here, so what you do is you compile it yourself; I think there's a command for it here, and you can run it.
A
Oh no, okay, we compile it in that repo, then. Okay, so what you do is, in here, you can compile it and it'll get you a binary, and you can run it like I was saying; you can just run it. There's an example; I can point you to an example, I can put it in the doc or I can send it to you on Slack afterwards, of how this looks inside a test. It's pretty straightforward.
A
Yeah, so it ends up looking like this; I'll show you exactly.
D
You know, for loading: if the time to load takes 30 seconds, I want to know the distribution of those 30 seconds inside KubeVirt itself, how much is spent in each block of the code.
A
So no, we don't have that yet. This is actually where some of the enhancements we want to make come in. Just to give you an idea of where we are: here's what the output looks like, so you'll get a plain-text JSON dump to your local terminal.
A
You can put it into a file or whatever. This is based on the time period over which we ran it: here's our P50 for create-to-running time in seconds, our P95, and so on.
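If you wanted to consume that JSON dump programmatically rather than read it in the terminal, a hedged Go sketch might look like this. The JSON key names below are invented for illustration (the discussion only mentions P50/P95 of create-to-running in seconds); check the audit tool's actual output keys before relying on them.

```go
// Parse an audit-tool JSON dump that was redirected to a file.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

type auditResults struct {
	// Assumed keys; the real report may nest or name these differently.
	VMICreationToRunningP50Seconds float64 `json:"vmiCreationToRunningSecondsP50"`
	VMICreationToRunningP95Seconds float64 `json:"vmiCreationToRunningSecondsP95"`
}

func main() {
	raw, err := os.ReadFile("audit-results.json")
	if err != nil {
		log.Fatal(err)
	}
	var r auditResults
	if err := json.Unmarshal(raw, &r); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("p50=%.1fs p95=%.1fs\n",
		r.VMICreationToRunningP50Seconds, r.VMICreationToRunningP95Seconds)
}
```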
A
So, if you look at it, there are little nuances to this. We have the phase transition from scheduled to running; that's the period when we're booting the domain until it's showing ready. So that's a whole time period, and you could technically track that time period, right? Maybe in this case this is the whole thing, but maybe it took eight seconds or something; so for that time period, technically, we could write some metrics that would let us track, okay...
A
It took eight seconds: what was the breakdown of those eight seconds? Was it the domain being created? Was it the callbacks between the handler and the launcher? Was it something else, you know, whatever it is? That kind of stuff is what we can create. But my point is just to show you kind of what we have today, and eventually where we want to break this down, where we could break it down more, and how we can output it in a useful way.
B
So let me jump in a little bit here.
B
You can scroll up a little bit. So these values...
B
But that's it, that's it, okay. So the value of this tool, in my opinion, is that you get an aggregation, right: that creation-to-running P95 is 45 seconds or something. You can get that even without this tool, but in the KubeVirt stack a lot of time is being spent making these API calls.
B
These API calls are made by KubeVirt to the Kubernetes API server, and you can correlate whether the creation-to-running P95 has gone up, gone down, or stayed level based on the aggregation of these API calls. So you can say: okay, the P95 creation-to-running has gone up because we are making more LIST calls, which are expensive, in our reconcile loop, and that's why it has gone up. So this is the first level.
B
Well, no, so it's not really a total breakdown; it's more a way to understand the scaling behavior of KubeVirt itself, so you can correlate data. The next thing, which Ryan was mentioning, is to actually do a breakdown of that creation-to-running, right? So the...
B
We can get that, and it's something on the roadmap, but the value of this is that you can understand the scaling behavior of KubeVirt. So we can look at this data and say: okay, if we scale up to 1,000 VMIs, we are getting a 4,000 GET-endpoints count, and Kubernetes will not work well in our environment with 4,000 GET endpoints.
D
And if you see something more than expected, you know that something is going wrong in this section, right? Yes.
B
It helps us understand the scaling behavior and predict whether Kubernetes will be able to handle this, or whether the stack will be able to handle this. The performance behavior that you are talking about, the exact breakdown of each phase into smaller components, is something that we would have to work on. So...
B
Now, this aggregation is monitored in the KubeVirt client. So what happens is: any time you create a bunch of VMs, the KubeVirt client will increment each of the values when it makes the call. So when it makes the create-pod call, it will increment the Prometheus counter. At the end, the audit tool, like Ryan said, will look at all the incremented values and report the data back.
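The counting mechanism described here could be sketched like this with client_golang: a counter labeled by resource and verb, incremented on every API call, which the audit tool can later read and report. The metric and label names are illustrative, not KubeVirt's actual rest-client metrics:

```go
// Package client sketches per-call API counters of the kind described above.
package client

import "github.com/prometheus/client_golang/prometheus"

var apiCalls = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "rest_client_requests_total_example",
		Help: "API calls made against the Kubernetes API server.",
	},
	[]string{"resource", "verb"},
)

func init() {
	prometheus.MustRegister(apiCalls)
}

// createPod stands in for the client call; the real client increments the
// counter inside its request path.
func createPod() {
	apiCalls.WithLabelValues("pods", "create").Inc()
	// ... actual POST to the API server would happen here ...
}
```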
A
The total calls, right. The way I'd characterize this, like, our thesis, is that we're looking at performance and scale as related things, and that if you were in the traditional virtualization world, when you're launching your virtual machines you're very focused on the performance of how long it takes from when I create the domain to the point that it's available to me, right? That's critical. But what I'm saying is, part of our thesis...
A
Is that there's so much more happening here, and that's what this illustrates: we have to deal with the Kubernetes layer, the KubeVirt control plane, and then we have to deal with the actual virtual machine itself. And so all those things together are what ultimately affects our quote-unquote performance, because that's what we're dealing with on Kubernetes. And so what we're doing here is we're taking the Kubernetes and the KubeVirt control-plane portions and we're sort of factoring that into our analysis.
A
And then we're also doing the part which you're emphasizing: we're trying to get to the breakdown of just the domain and the virtual machine, how quickly the guest is ready, from when we actually, you know, go to define the domain to when it's actually available to use.
D
So I think that this is not embedded inside KubeVirt, so I thought maybe it should be like a configuration...
A
No, it's, this is on by default. These are metrics already provided by default when you install KubeVirt; all we're doing, like, so you can get this in your cluster; I know you can get this in your Prometheus, you can look at these values. What we're doing here is analyzing them in a useful way. And, by the way, getting this stuff is not easy.
A
Well, anyway, I just wanted to mention this as something you can add as an addendum, because there's a bunch of data that we can capture, and I just wanted to mention it because I think it'll be useful to see: when you bring your data, like you did here, to this meeting, it would be cool to also see this as a comparison to what we're seeing in some of the tests we...
We.
D
Have
for
sure,
because
we
can
have
more
details
around
across
all
the
blocks
code
blocks
that
calling.
D
When we use them, maybe it can, yes. When we run, say, the 100 VMs, depending on how many calls we have, I think the numbers will be the same; but in the end, we want to add it to our nightly run so we can see the behavior across runs. So if there is a one-run peak, we can find that there is an issue, maybe in this method or something like that. Okay, so if...
A
We have, I mean, that's what we have today. We just use the transition times to get the P90s, P95s, P99s, but, like I was saying, there are further breakdowns we can do here in a lot of different directions: just the domain, the launcher, the networking, there's lots of stuff we can do. We started with this because it gives us an easy on-ramp, but yeah, what you're talking about is things we want to add.
D
This week there is also the performance side and also the KubeVirt Summit, so there are a lot of things going on.
D
I think next month, the beginning of April, there will be more time to do the investigation around it. First, this tool can help us to know the distribution, and for sure I will try it on my side, and also, as we discussed last meeting, to enhance the upstream side: to try to enhance and understand more about the performance tests on the upstream side. Sure.
B
Yes, so this was one of the questions raised in the last discussion, and I took the time to figure out what's happening here. So, really quickly summarizing this: the kubelet controls a set of conditions on the node status.
B
The node controller only looks at the heartbeat time, and if the kubelet has not posted a heartbeat for some timeout period, then the node controller will add a taint to that node, and it will basically terminate the pods running on it via the NoExecute taint. So basically, what I'm trying to say is: the kubelet owns those status conditions; the node controller reads those status conditions and takes a bunch of actions.
B
The reason this does not interfere with the KWOK controller is that the KWOK controller becomes the fake kubelet and owns those status conditions instead, and because it runs a regular heartbeat on those status conditions, the node controller thinks that the node is healthy and does not do anything.
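A minimal sketch of that heartbeat mechanic, using the Kubernetes API types: whoever owns the node's status conditions (the real kubelet, or KWOK acting as a fake kubelet) keeps refreshing LastHeartbeatTime on the Ready condition, which is what keeps the node controller from tainting the node. The Reason string is made up:

```go
// Refresh the Ready condition's heartbeat on a node object.
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func heartbeat(node *corev1.Node) {
	now := metav1.Now()
	for i := range node.Status.Conditions {
		if node.Status.Conditions[i].Type == corev1.NodeReady {
			// Refreshing the heartbeat (without flipping Status) is enough
			// to keep the node controller from applying the NoExecute taint.
			node.Status.Conditions[i].LastHeartbeatTime = now
			node.Status.Conditions[i].Status = corev1.ConditionTrue
			return
		}
	}
	node.Status.Conditions = append(node.Status.Conditions, corev1.NodeCondition{
		Type:               corev1.NodeReady,
		Status:             corev1.ConditionTrue,
		LastHeartbeatTime:  now,
		LastTransitionTime: now,
		Reason:             "FakeKubeletReady", // illustrative
	})
}

func main() {
	n := &corev1.Node{}
	heartbeat(n)
	fmt.Println(n.Status.Conditions[0].Type, n.Status.Conditions[0].LastHeartbeatTime)
}
```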
B
Yeah
I
actually
did
not
get
time
to
prepare
a
demo
for
this.
Maybe
we
I
can
get
it
going.
The
next
next
time.