From YouTube: 2021-02-04 Kubernetes SIG Scalability Meeting
Description
Agenda and meeting notes - https://docs.google.com/document/d/1hEpf25qifVWztaeZPFmjNiJvPo-5JX1z0LSvvVY5G2g/edit?ts=5d1e2a5b
A
If it's not done, then we will just not have a recording for this meeting. Welcome everyone to another SIG Scalability meeting; today is the fourth of February. We already started discussing Abu's PR about cleaning up how the timeout is handled in the API server filter chain. The PR has merged, and we've already checked that we don't see any regression, issues or flakiness in the 5k-node test results; here's the link if anyone is interested.
B
Oh nice, so this should fix that issue with etcd connections being left hanging even after the call times out.
C
I have to double-check the whole etcd storage layer to be sure, but this PR basically makes sure that, as soon as the request is received, we have a deadline-bound context, and that context is attached to the request. So the hope is that the storage layer, from what I've seen so far, uses that same context.
C
But the follow-up task I have is to go through the etcd storage layer, the admission layer and the aggregation layer, and make sure that all of those layers are either using that context or, if they are creating a new context, creating it from the parent context. That's a follow-up task I have on my plate. I did a preliminary check on the etcd storage layer; it looks like the context is already wired through, so it should use the deadline-bound context.
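(A minimal sketch of the pattern being described here, assuming a plain net/http filter chain; `withRequestDeadline` and `storageGet` are hypothetical names for illustration, not the actual apiserver filter or storage code.)

```go
package main

import (
	"context"
	"net/http"
	"time"
)

// withRequestDeadline sketches the idea from the PR: as soon as a request is
// received, derive a deadline-bound context and attach it to the request, so
// every inner layer sees the same deadline.
func withRequestDeadline(next http.Handler, timeout time.Duration) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		ctx, cancel := context.WithTimeout(req.Context(), timeout)
		defer cancel()
		next.ServeHTTP(w, req.WithContext(ctx))
	})
}

// storageGet sketches an inner layer (e.g. the etcd storage layer): it must use
// the request context, or derive any new context from it, so the deadline is
// inherited rather than starting over from context.Background().
func storageGet(parent context.Context, key string) error {
	ctx, cancel := context.WithTimeout(parent, 2*time.Second) // child of the parent context
	defer cancel()
	_ = ctx // a real implementation would pass ctx to the etcd client here
	return nil
}

func main() {
	inner := http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		if err := storageGet(req.Context(), "/registry/pods/default/foo"); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	http.ListenAndServe(":8080", withRequestDeadline(inner, 60*time.Second))
}
```

The key property is that the inner layers never create contexts from scratch; they always derive from the request's deadline-bound context, so the timeout propagates all the way down.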
C
For the filter, I have unit tests where I basically simulate timeout conditions and make sure that we get the expected result. I'm also planning to add similar tests in the etcd storage layer. These conditions are hard to simulate in an integration test, but I'm planning to add more unit tests covering these edge conditions.
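(The actual test files aren't shown in the recording; the following is a hedged sketch of the kind of unit test being described, using the standard httptest package. The handler here is a stand-in, not the real apiserver filter.)

```go
package filters

import (
	"context"
	"net/http"
	"net/http/httptest"
	"testing"
	"time"
)

// TestHandlerObservesDeadline simulates a timeout condition: the request carries
// an already-expired context, and we assert the handler sees
// context.DeadlineExceeded and responds with 504.
func TestHandlerObservesDeadline(t *testing.T) {
	handler := http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		select {
		case <-req.Context().Done():
			if req.Context().Err() == context.DeadlineExceeded {
				w.WriteHeader(http.StatusGatewayTimeout)
				return
			}
		case <-time.After(time.Second):
		}
		w.WriteHeader(http.StatusOK)
	})

	ctx, cancel := context.WithTimeout(context.Background(), time.Nanosecond)
	defer cancel()
	time.Sleep(time.Millisecond) // ensure the deadline has already passed

	req := httptest.NewRequest(http.MethodGet, "/api/v1/pods", nil).WithContext(ctx)
	rec := httptest.NewRecorder()
	handler.ServeHTTP(rec, req)

	if rec.Code != http.StatusGatewayTimeout {
		t.Fatalf("expected 504 when the deadline is exceeded, got %d", rec.Code)
	}
}
```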
C
Yeah, and with regards to that, there's also a PR that has some unit tests for cases that are very hard to simulate in a real production cluster or in an end-to-end test. For example, a request times out and the inner handler still tries to write to the response writer: what happens then?
C
Or a request is waiting in the priority-and-fairness queue and then a new request arrives: what do you expect to happen to that request? I wrote a number of unit tests that simulate these extreme edge conditions and check the expected behavior. That PR is merged, but I can share it with you.
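(As a hedged illustration of the "write after timeout" edge case just mentioned, and not the apiserver's actual timeout filter, the standard library's http.TimeoutHandler shows the same class of behavior: once the timeout fires, late writes from the inner handler fail with http.ErrHandlerTimeout.)

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// slow pretends to be an inner handler that only finishes after the timeout
	// has already fired and then tries to write to the ResponseWriter.
	slow := http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		time.Sleep(50 * time.Millisecond)
		_, err := w.Write([]byte("too late"))
		// With the stdlib timeout handler this write fails with http.ErrHandlerTimeout;
		// the apiserver's own timeout filter has to define analogous behavior.
		fmt.Println("write after timeout:", err)
	})

	h := http.TimeoutHandler(slow, 10*time.Millisecond, "request timed out")

	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/slow", nil))
	fmt.Println("client sees status:", rec.Code) // 503 from the timeout handler

	time.Sleep(100 * time.Millisecond) // let the inner goroutine attempt its late write
}
```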
A
It would be hard to test in some end-to-end manner whether the timeout is propagated, for example, but I think it shouldn't be that difficult. So the question is: do you have access to some larger cluster, for example one with a large number of objects such as pods? Then, for example, listing all the pods from etcd would take on the order of hundreds of milliseconds, or even seconds.
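(A quick, hedged way to check the order of magnitude being discussed: time a full pod LIST with client-go against whatever cluster you have access to. This assumes a kubeconfig in the default location; it is not part of any test mentioned in the meeting.)

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// Times a full pod LIST across all namespaces, to get a feel for whether the
// call lands in the hundreds-of-milliseconds (or seconds) range on a cluster.
func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	start := time.Now()
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("listed %d pods in %v\n", len(pods.Items), time.Since(start))
}
```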
A
That's actually just the migration, but we introduced modules in ClusterLoader. This is in response to feedback from many places that our load test config is really hard for a human to parse. Okay, it's not this example here, but the thing is that the load test config has grown to over 800 lines, so it's super hard to understand what is going on there, and the test is super hard to maintain: for example, to implement something new, or to fix something when we occasionally find bugs there. So the idea is that we shouldn't have to keep a test inside a single huge file; we should have modules instead. That is basically what this PR does, and here are some examples.
A
Yeah, so basically with modules we can replace some of the parts with a much shorter definition, and the nice thing is that we can create a single module and reuse it. Usually in ClusterLoader you have one part for creating something and then one for deleting it, or even creating, then modifying, then deleting; you can put all of that into a single module and just call it from the test config. My estimation was that once we have migrated everything to these modules... And we already have an external contributor, Peter, who is helping us with that, so hopefully in the next few weeks we'll have all the load tests migrated. All right, what else do we have here?
A
So the API server is very slow at that point, meaning requests can take hundreds of milliseconds, and then this post-start hook can literally take minutes, because one thing is that we are doing everything serially, and the other is that if anything fails we start over from the beginning. So we opened this issue and we also have a contributor who is working on that; first of all, we've created them...
B
Hey, Matt, a quick question about this: you said it's one get plus update call for each of the hundreds of roles and role bindings, right? Yeah, so, but when the API server starts...
A
I think for the role bindings, yeah, that's a fair point. I know that we started with adding a benchmark, so the second PR is still open, I believe. Actually, that's a good point: we should double-check whether, instead of having a separate get for each of these roles or role bindings, we can do one list. Yeah, that's probably a good idea.
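(The actual bootstrap code lives in the apiserver's RBAC post-start hook; the following is only a hedged sketch of the idea with client-go: one LIST up front instead of a GET per object, with updates reserved for entries that actually need reconciling. All names here are illustrative.)

```go
package main

import (
	"context"
	"fmt"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// Instead of issuing a GET for every default ClusterRoleBinding, fetch them all
// in a single LIST and build a lookup map; only objects that differ from the
// desired state would then need an UPDATE.
func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	list, err := client.RbacV1().ClusterRoleBindings().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	existing := make(map[string]*rbacv1.ClusterRoleBinding, len(list.Items))
	for i := range list.Items {
		existing[list.Items[i].Name] = &list.Items[i]
	}
	fmt.Printf("fetched %d ClusterRoleBindings with a single LIST\n", len(existing))
	// Reconciliation against the desired defaults would go here, issuing updates
	// only for entries that are missing or out of date.
}
```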
A
On the other hand, it might be tricky if there is another actor modifying these roles or role bindings, because then it might actually be hard to resolve the conflict; but yeah, anyway, that's something for API Machinery to figure out, whether that's something we could do or not. Interesting idea. So that's one thing I wanted to discuss more.
A
What's this? Oh yeah, we also have Shem, who started helping us in scalability a few weeks ago. He is doing an amazing job; he already has a lot of meaningful contributions. One of them is that he fixed the image preload feature, which stopped working once we migrated from Docker to containerd.
A
The image preload feature is a feature of ClusterLoader where you can pre-load some images onto the nodes before the test starts running. This is to better simulate production environments: we usually run our scale tests on fresh clusters with basically no images on the nodes, and if you compare an empty cluster with almost no images on the nodes to a production cluster where a lot of workloads have already run, you will see there is a huge difference in Node object size, because every time an image is downloaded onto a node, the hash of that image is stored in the Node object.
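(As a hedged side note for these minutes: the part of the Node object that grows this way is node.Status.Images. A small client-go sketch, assuming a default kubeconfig, can show how much of each Node object the recorded image list takes up.)

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// Prints, per node, how many container images the kubelet has reported in
// node.Status.Images and a rough byte count of the recorded image names.
func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	nodes, err := client.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		bytes := 0
		for _, img := range node.Status.Images {
			for _, name := range img.Names {
				bytes += len(name)
			}
		}
		fmt.Printf("%s: %d images, ~%d bytes of image names in the Node object\n",
			node.Name, len(node.Status.Images), bytes)
	}
}
```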
A
So this was actually causing a discrepancy in the results of the scale tests. To give you some numbers: in an empty cluster the Node object is maybe one or two kilobytes, while in production clusters it's easily 20-30 kilobytes. So these are huge objects, and they're really relevant to the performance of etcd, especially for operations like compaction of Node objects and things like that. So, to make the scale test more realistic...
A
...we have this feature: before we start the test, we pre-load some images to make the Node objects more realistic. It stopped working once we migrated to containerd, but Shem fixed that, and he also has other contributions, so a big thank you to Shem for doing that. Some examples of the things he's currently working on:
A
He's almost ready with fixing the load test to support clusters smaller than 100 nodes, and this is a very useful thing for new contributors who want to start their journey with performance testing but don't necessarily have access to large clusters. With Shem's PR, which I hope will merge tomorrow, you should be able to run our load test even on a one-node cluster.
A
So that's also good for anyone who wants to help with ClusterLoader features, or to add something to ClusterLoader, or to pick up a good first issue, because you will have a way to test it. He is also working on making ClusterLoader work with kind.
A
It's the same story with kind: anyone can set up a kind cluster, so that should help a lot, especially new contributors. He's also working on adding better validation of ClusterLoader configs, so before a config is actually executed we'll have a validate step. This is really great, because it has hit us a few times that ClusterLoader had no real validation of the test config.
A
Imagine a large-scale test that can take 12 hours: very often we were wasting runs because, in the middle of the test, after a few hours, it turned out something was wrong with the config and the test failed. So yeah, this is a much-anticipated feature. So again, I wanted to thank Shem, because he's doing an amazing job. And one last thing...
A
Yeah, I actually opened an issue, and a PR has already fixed it. We noticed some issues with how the liveness probe of etcd is configured, that is, the way we set the timeout seconds and the period at which the liveness probe is checked, or rather the way we check the health endpoint of etcd; it didn't really make sense, especially for larger clusters.
A
Oh yeah, the PR should be somewhere around here, because that's the issue. In particular, we had a lot of false positives, with etcd becoming unhealthy for a really short time, and what was happening was that the kubelet was killing it, and that was actually doing more harm than good: if etcd had been left alone it would have completely recovered, but when the kubelet killed it, it caused a lot of issues, because it took more time for etcd to come up again, and then...
A
...we got this thundering herd of issues afterwards, because the API server became unready for some moments, some clients disconnected and reconnected, and so on. The change is actually pretty simple, but it had really nice results in our scale tests.
A
So basically it was tweaking some arguments and also changing the way we check etcd healthiness.
A
We are still checking it; actually, we changed something, because we are no longer using an HTTP GET to check the health endpoint; we use etcdctl instead. But I believe it's more or less the same. Let me check your description... yeah, I think it summarizes the difference between calling the health endpoint and using etcdctl, so feel free to read through that.
A
I don't remember exactly; I'm not sure this was the main reason we did it, but I know that the part about changing the parameters is crucial, because it makes the liveness probe much less sensitive to single failures of the etcd health check; the kubelet will basically give etcd more time.
A
So basically now you need to have five failures in a row, whereas before it was just three. And I also think the etcdctl check is better in the sense that, if there is a lot going on on the master VM in terms of network throughput, then etcdctl probably works better than just calling the HTTP GET handler; but I'm not 100% sure about that.
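(The actual manifest lives in the etcd static pod definition; the following is only a hedged sketch of the shape of the change as described in the meeting: an exec-based etcdctl health check instead of an HTTP GET, and a higher failure threshold so the kubelet gives etcd more time. The endpoint, command flags and timing values here are illustrative, not the real configuration.)

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/yaml"
)

// Sketch of a liveness probe along the lines described: exec etcdctl against the
// local endpoint instead of an HTTP GET on /health, and require five consecutive
// failures (instead of three) before the kubelet restarts etcd.
func main() {
	probe := corev1.Probe{
		InitialDelaySeconds: 15,
		PeriodSeconds:       10,
		TimeoutSeconds:      15,
		FailureThreshold:    5, // previously 3, which made short health blips fatal
	}
	probe.Exec = &corev1.ExecAction{
		Command: []string{
			"/bin/sh", "-ec",
			"ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 endpoint health",
		},
	}

	out, _ := yaml.Marshal(probe)
	fmt.Println(string(out)) // prints the probe as it would appear in a manifest
}
```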
B
Hey, I actually had a quick question, sorry, I forgot earlier. For ClusterLoader today, are we capturing any network-latency-related metrics, or are we still not?
A
Yes. So let me see, where is it... it should be here. It was a long time ago; actually, it's a shame that we haven't done more in this area in the last two years, but basically we defined these network latency and network programming latency scalability SLIs.
A
And we started implementing them; actually, both of them are implemented in ClusterLoader, but the network programming latency one requires changes, because after Kubernetes migrated to endpoint slices the code stopped working, and we haven't had time to look into it. Previously there were Endpoints, and it worked for Endpoints, but I believe for EndpointSlices...
A
Actually, yeah, both. We also have the network latency one implemented, and it actually works, but we only just started measuring it in a very simple way, and we haven't really been looking into the results. Let me show you where the code for that is.
A
All right, so for the network latency we introduced this concept of probes, and it lives in the util images probes package; ping is basically the probe we implemented. A probe is basically a container, a pod running in the cluster, and ClusterLoader creates these probe pods. So, to measure network latency...
A
...there is a simple client and a simple server: the client pings the server, records the latency and exports it as a Prometheus metric, and then ClusterLoader has this Prometheus stack, so basically in Prometheus, if the probes are enabled, we should have... or maybe we do it differently, let me check.
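(The real probe implementation lives in the perf-tests repository; as a hedged sketch of the pattern just described, here is a minimal client that periodically pings a server, records the latency in a Prometheus histogram, and exposes it for scraping. The metric name and the in-cluster service name are made up for illustration.)

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// in_cluster_network_latency_seconds is an illustrative metric name, not the one
// the real probe exports.
var pingLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "in_cluster_network_latency_seconds",
	Help:    "Latency of pings from the probe client to the probe server.",
	Buckets: prometheus.ExponentialBuckets(0.0005, 2, 16),
})

func main() {
	prometheus.MustRegister(pingLatency)

	// Ping loop: hit the server pod and record how long the round trip took.
	go func() {
		target := "http://ping-server:8080/ping" // hypothetical in-cluster service name
		for range time.Tick(time.Second) {
			start := time.Now()
			resp, err := http.Get(target)
			if err != nil {
				continue
			}
			resp.Body.Close()
			pingLatency.Observe(time.Since(start).Seconds())
		}
	}()

	// Expose the histogram so the Prometheus stack brought up by ClusterLoader
	// (via a ServiceMonitor) can scrape it.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```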
A
I think it's like that; now let's take a look at the probes code. It should be in measurement, common, I believe, probes... oh, here are the manifests. Okay, all right, so there's a deployment for both the server and the client, and there is also a ServiceMonitor, which is a Prometheus Operator API, a custom resource for defining that the ping server should be scraped, with this interval, using the metrics port defined in the deployment.
B
So in the test, when Prometheus actually scrapes these metrics, what happens? How is it collected as a summary at the end? Are you just taking...?
A
So I believe that part is not implemented yet, but you can check it in Grafana.
A
You can basically provide a disk snapshot name, and after the test ClusterLoader will dump the Prometheus database onto that disk; later we have scripts for creating Grafana instances from this data. So basically you can check the Prometheus data for a given test, all the metrics that were collected during the test.
B
So you're running your own instance of Grafana, on your computer or somewhere else, not a long-...
A
A long-running instance, yes. So yeah, okay, anyway, you can check that in Grafana, and I can show you an example later. I believe we haven't implemented any automatic checking or defined any thresholds for it yet, but in general that's something we do elsewhere; that's the way the API call latency measurement works.
A
If you take a look at the API responsiveness SLO measurement, the Prometheus-based one, which is the only one we use right now: basically Prometheus scrapes the API server every five seconds, and we have this measurement that uses Prometheus; it connects to Prometheus and executes a query, and that's how we check whether the SLO is satisfied or not. So the idea was to do exactly the same for the network latency, and we have almost everything there.
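(The real measurement code is in ClusterLoader; this hedged sketch only shows the mechanism being described: connect to Prometheus and run a quantile query over the API server latency histogram, then compare the result to a threshold. The Prometheus address and the exact query are illustrative.)

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Address of the Prometheus instance ClusterLoader brings up; illustrative.
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	// Illustrative query in the spirit of the API responsiveness check:
	// 99th percentile of API server request latency, per verb and resource.
	query := `histogram_quantile(0.99,
	  sum(rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m])) by (verb, resource, le))`

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	// A real measurement would walk the result vector and fail the test if any
	// sample exceeds the SLO threshold.
	fmt.Println(result)
}
```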
A
Similarly for network programming latency: we also have it implemented, and it works in a similar way, Prometheus scrapes the kube-proxies, because kube-proxy has a metric for the network programming latency, and then we have a query here to check it. So take a look at the code. Okay.
A
Yeah, all right, so we are over time. Thank you everyone for joining; hope to see you in two weeks. Bye.
Thank you, bye.