From YouTube: 2023-03-02 Kubernetes SIG Scalability Meeting
Agenda and meeting notes - https://docs.google.com/document/d/1hEpf25qifVWztaeZPFmjNiJvPo-5JX1z0LSvvVY5G2g/edit?usp=sharing

A: This is the meeting of the second of March 2023, and, as Wojtek said, we have one topic to discuss, regarding cutting the costs of scalability tests.
C: Sure, yeah. My topic is quite fast, so I think let's start with mine, because I think it's pretty urgent, and we should definitely still have time for your questions.
C: Basically, as a project (I mean Kubernetes as a whole) we are over budget for our infrastructure costs. So we are looking for savings everywhere we can, and scalability tests are maybe not the most obvious, but the second most obvious place to look for them.
C: The first and biggest cost is the official release artifacts and things related to that, and that has been taken care of by others. But scalability tests are the second biggest cost, and we are looking for things we can do to cut it. One thing has already happened here: we already disabled the 100-node presubmits; they are now only optional.
C: The reasoning behind that was that they were generating roughly half of all the costs from scalability-related tests, and we really had trouble finding any case in the last year or so where they actually uncovered something and prevented a merge that would have been causing failures. So given that the ROI here is fairly low and we really need to cut the costs, we decided (and I already did that: I already tagged this PR, and I think it merged earlier today) that the 100-node presubmits are only optional now, so more or less they won't be running. We are looking for more; I did some other minor things too, but I think we also need to adjust the 5k-node tests a little, and adjusting here unfortunately means decreasing.
C: We are only running those once a day. To give you some numbers (I can't remember if I remember them exactly): all the scalability-related things are costing the project as a whole roughly six hundred thousand dollars per year-ish. The 5k-node tests are almost half of that, the 100-node presubmits are another almost half, and the rest are peanuts.
C: Basically, one half we pretty much cut today, the second half we need to reduce significantly, and the rest doesn't matter that much, really.
C: We believe that even if something bad merges, we will be able to relatively quickly determine what it was and hopefully revert it, or fix it, or whatever.
D: Okay, and what is the frequency for 5k? Are you reducing it to once per day, or is once per day what we already have today?
C: Once per day is already what we have, and we need to reduce it. There are some discussions happening there; I'm trying to push for...
A: So maybe, based on that, I'm wondering if we are planning any kind of investment towards kubemark? As you know, this was always kind of the approach to save money, and maybe kubemark would be something we could invest in, in order to have more frequent runs for 5k, for example.
C: It definitely can be useful, but yeah, I don't think I will have capacity for anything like that.
C: Some kubemarks are still running. I disabled the 5k-node kubemark because it was kind of duplicated with the real 5k-node tests. We still have the 500-node kubemark running pretty much all the time, so something is running, but it's definitely not uncovering all the issues that real clusters are uncovering.
B: Sorry to interrupt. I am a contributor at kOps, and today I was part of one of these discussions, and my understanding is that the budget that is actually over is the Google GCE part. Any chance you could use AWS for some of these tests? I know that we have quite a lot of credits there too.
D: Yeah, I think it's definitely a valid question, and recently I guess there is some budget; I'm not sure about all the numbers and details, but for the CI testing part I think there is an account we have. I guess we would need to set up these 5k-node tests on kOps. Today, though, the way we set up all the scripts and that infrastructure, I believe it's specific to GCP.
D: I believe we also tweak quite a few things when creating these clusters. Yeah, let me take this as an action item back to my team and see if we can begin with, for example, the 100-node or the 500-node tests, and maybe we can do something with a reduced frequency: alternating runs on GCP and AWS, or something like that, to take some pressure off these 5k GCP ones.
D: Yeah, so let me start a thread, maybe on the sig-scalability channel, and try to poke a few folks.
B: Arnaud was saying that it's kind of getting to zero for this period, at least, so they were trying to figure out how to reduce costs in k8s.gcr.io, which people are still using and which is consuming quite a lot of credits. So it's really...
C: To add to that: I think the current numbers are that we still have a lot of budget for this year, but with the speed we are spending it now, we will get over budget probably around the end of October or something like that. So we would have zero money for running anything in the last two months.
D: Yeah, so Nathan is here today from the EKS team, and I think he wants to discuss a specific issue: this bug fix with watches, I guess. Hey Nathan, do you want to...? Actually, let me first link the issue I'm mentioning.
E: Yeah, I think there's a fix that was merged yesterday. I just wanted to talk about whether we could backport it to older versions. When I was working on reproducing it, it looks like it starts in 1.22.
E: Yeah, I think the first case we saw was in November, and since then I think we've had close to 10 escalations on it, with most of those being sev-2s, or like getting paged, just because the cluster can get into an unrecoverable state. And even if the customer deletes the CRD, a lot of the time they don't have control to trigger a restart, I guess.
C: Yeah, so I think the additional context about the user escalations is useful here, so it would help. Currently it's causing some test flakes, so I think we need to address that first; but assuming that gets fixed at some point, it would be good, with the additional context, to write it down somewhere, in some issue or whatever. I think backporting it might make more sense than I thought, yeah.
E: Yeah, I think there's a Harbinger notice open on our side, so we can share something externally on those, yeah. And at least EKS still supports 1.22, so we'll just carry the patch internally for that.
A: Okay, so do we have anything else?
A: Go ahead.

B: Okay, I have a use case with agents running on each node (a DaemonSet) collecting network information and enriching it with Kubernetes information.
B: At the moment, all the tools that I could get started on exploring are doing the Kubernetes enrichment on each agent by listing all the pods and then continuing to watch. Generally, I think those tools were tested at smaller scale, but in our case I think the scale would be more towards thousands of nodes.
B: Starting those agents, like 2,000 agents, and asking the API server for the list of pods (which is not so small either) would not really work very well. Now, I saw that Calico, let's say, went with Typha as a proxy, and Cilium created their own smaller objects with only the information they need. But I wanted to know, as a general guideline: what do you think about this kind of use case? What's the recommended approach?
C: Yeah, so certainly watching all pods from every node is not a good idea; it's not going to scale. And independently from that, even if we ignore the initial list, it's a bad idea anyway.
C
So
so
that
that's
like
the
the
first
point
regarding
regarding
the
what
we
can
do,
I
think
I
I,
don't
know
the
use
case.
So
it's
hard
to
hard
to
say
but
like
even
what
kalika
is
doing,
is
also
highly
problematic.
So
the
creating
the
object
itself
like
the
psyllium
endpoint,
is.
C
C
Batch
multiple
single,
multiple,
smaller
objects
into
single
change.
Radio
also
include
also
at
the
same
time
reducing
further
they
they
they
their
size
and
I.
Guess
martial.
You
can
probably
talk
more
about
it,
but
and
at
the
same
time
like
introducing
something
so
that,
like
not
every
potentially
not
every
single
change
is
being
sent,
and
even
that
is
causing
problems,
so
that
is
that
helps
a
lot,
but
even
that
is
problematic
in
in
some
scenarios,
especially
in
scenarios
where
we
want
to
have
clusters
with
like
higher
posture.
C
So
there
are
some
more
things
that
still
involves
are
are
thinking
about
which
I'm
not
100
up
to
date
with,
but
I,
don't
think
we
have
like
a
good
solution
at
this
point
like
some
solutions,
obviously
scalier,
horizontal
scalier
controlled
in
horizontally
further
and
instead
of
learning
like
five
or
three
or
five
API
server
and
25
or
100
or
whatever,
and
at
some
point
that
will
work.
But
that's
not
super
satisfying
solution,
so
I
think
I'm.
Sorry,
I,
I
think
that.
C: I think we should basically look into the use case and see what you can do to avoid having all the information in every single agent, because in many cases, including many sub-cases of the Cilium one, that isn't strictly needed.
C
And
it's
just
like
it
often
boils
down
to
to
the
fact
that
it's
just
simpler
to
do.
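
[Editor's note: one common way to avoid the "every agent lists and watches every pod" pattern discussed above is to scope each agent's watch to its own node with a field selector, the way the kubelet does. A minimal client-go sketch, assuming a DaemonSet that injects the node name via the downward API; the NODE_NAME variable is illustrative, not something mentioned in the meeting:]

```go
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Assumed to be injected via the downward API (fieldRef: spec.nodeName).
	nodeName := os.Getenv("NODE_NAME")

	// The field selector is evaluated server-side, so each agent only
	// receives events for pods scheduled on its own node instead of the
	// whole cluster.
	w, err := client.CoreV1().Pods(metav1.NamespaceAll).Watch(context.Background(),
		metav1.ListOptions{FieldSelector: "spec.nodeName=" + nodeName})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		fmt.Printf("%s %T\n", ev.Type, ev.Object)
	}
}
```

[A real agent would normally put this behind an informer with reconnect handling; the point of the sketch is only that a per-node field selector moves the filtering to the API server, so neither the initial list nor the watch stream grows with cluster size.]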
B: Well, the use case is more or less, let's say, a debugging tool that looks at what pods are doing: connections, maybe errors between various pods. In theory one could do this enrichment at a later time, so separate it completely: just gather the raw data from the pods using eBPF or something, and then enrich it later.
B: That's, I would say, the simplest approach that I could think of. It's just that generally it was preferred to do it in each agent, so that you can send the enriched data to the data store complete, and you don't have to join tables to get the right info.
B: Would it help if, let's say (I don't know), the controllers, the Deployments or ReplicaSets, were listed and watched instead? Or would even those, at this scale, be too much?
C: Probably. I'm trying to remember what we are storing in the Deployment status; I think we are also storing the number of ready pods, which is obviously visibly better than watching all the pods. I can still imagine use cases where it might be problematic, but it's definitely some mitigation, for sure.
C: Yeah, and it highly depends on the pod churn itself. If the cluster is fairly static, then it may work even with pods; but if you create a lot of pods at the same time, or schedule, or delete, or whatever, then the more you do of that, the more problematic it will be.
A: Also, maybe one more comment, on how you deploy this DaemonSet. As you said, there is the first step, which is listing, let's say, all the Deployments and then watching them. So you should also be careful with the rollout of your DaemonSet, because if you create those pods too fast, then they will all start hitting the API server at the same time as well.
A: Well, I would not fully agree, because what can happen is that you are doing two calls, right: a list and then a watch, and the watch might never happen if you are overwhelming the APF.
B: Okay. I saw that there are some KEPs about streaming, like, instead of list-then-watch, starting to stream some sort of events. Is this planned for 1.27, or...?
C: Yeah, the core server-side implementation of that has just merged today. Oh yes, this is definitely planned for 1.27. The client-side part is still not merged, but I hope it will be; it still has like two weeks or so to get merged, and it's not that far off, so I hope it will get merged.
A: So, actually, from the client-side point of view, do you have to do something in order to enable this feature? I'm guessing yes, right? Because it's a totally different API.
C: So it's not a totally different API, but you need to handle it slightly differently, right? Again, it's the watch; it's just a watch with a different parameter, but you need to handle it slightly differently. So yes, you need to do that. The client-side part isn't huge and isn't super complicated; I can probably link the work-in-progress PR here.
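
[Editor's note: the feature being discussed is the streaming "watch list" mechanism (KEP-3157), where the initial state is delivered over the watch itself instead of a separate paginated LIST. A sketch of what the request looks like at the client-go level, assuming the ListOptions shape that eventually shipped with the 1.27 alpha (WatchList feature gate); since the client-side PR was still open at the time of the meeting, treat this as the later, merged form rather than whatever was linked in the call:]

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// "Just a watch with a different parameter": sendInitialEvents asks the
	// server to stream the current state as ADDED events first, followed by
	// a BOOKMARK that delimits the end of the initial state.
	sendInitialEvents := true
	w, err := client.CoreV1().Pods("default").Watch(context.Background(), metav1.ListOptions{
		SendInitialEvents:    &sendInitialEvents,
		AllowWatchBookmarks:  true, // required for this mode
		ResourceVersionMatch: metav1.ResourceVersionMatchNotOlderThan,
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		if ev.Type == watch.Bookmark {
			fmt.Println("initial state received; regular watch events follow")
			continue
		}
		fmt.Printf("%s %T\n", ev.Type, ev.Object)
	}
}
```

[This is presumably what C means by handling it "slightly differently": the client treats the first stretch of ADDED events as the list and uses the bookmark as the cut-over point, which the reflector/informer machinery is intended to handle transparently once the feature is enabled.]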
D: So while the streaming is happening for the objects, is that occupying an APF seat, or how...? I'm thinking, because today this happens with a list followed by a watch: initially, for most reflectors, it is a list followed by a watch, right? And for lists, if a lot of clients are making these list calls, we're able to use APF to kind of throttle some of them, especially on some clusters where there are a lot of listers. For some customers where we had to do this, we for example introduced some APF rules to throttle lists.
D: So if we are going to do this, do we have to do something equivalent for watches, or is that going to have any side effects?
C: So the way APF works here is that only the initial-events part is actually handled by APF, which boils down exactly to what we want: it's the part that plays the role of the list, actually. So it may require a little bit of tuning, also on the APF side itself, and I guess it may not happen for alpha; it's something we should probably play with a little before enabling it by default. But conceptually it fits the model; that was the consideration. We may need to do a little bit of work to make it work exactly as we want, and that may require tuning some APF rules as well, but it should be doable, at least without significant work.
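
[Editor's note: the "APF rules to throttle lists" D mentions are API Priority and Fairness objects: a FlowSchema that matches the offending traffic and points at a PriorityLevelConfiguration with few concurrency shares. A hedged sketch using the Go API types; the names, namespace, and service account here are illustrative, not taken from the meeting:]

```go
package main

import (
	"fmt"

	flowcontrol "k8s.io/api/flowcontrol/v1beta3"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// A FlowSchema that routes LIST calls on pods from a hypothetical node-agent
// service account into a dedicated priority level, so a thundering herd of
// listers queues there instead of starving everything else.
var throttleAgentLists = flowcontrol.FlowSchema{
	ObjectMeta: metav1.ObjectMeta{Name: "throttle-agent-lists"},
	Spec: flowcontrol.FlowSchemaSpec{
		// References a PriorityLevelConfiguration (not shown) configured
		// with only a small number of concurrency shares.
		PriorityLevelConfiguration: flowcontrol.PriorityLevelConfigurationReference{
			Name: "agent-lists",
		},
		MatchingPrecedence: 1000,
		Rules: []flowcontrol.PolicyRulesWithSubjects{{
			Subjects: []flowcontrol.Subject{{
				Kind: flowcontrol.SubjectKindServiceAccount,
				ServiceAccount: &flowcontrol.ServiceAccountSubject{
					Namespace: "monitoring", // illustrative
					Name:      "node-agent", // illustrative
				},
			}},
			ResourceRules: []flowcontrol.ResourcePolicyRule{{
				Verbs:      []string{"list"},
				APIGroups:  []string{""},
				Resources:  []string{"pods"},
				Namespaces: []string{flowcontrol.NamespaceEvery},
			}},
		}},
	},
}

func main() {
	fmt.Println(throttleAgentLists.Name)
}
```

[Per C's point above, with the streaming watch list the initial-events phase is the part APF accounts for, so a rule like this (with "watch" added to the verbs) would presumably remain the knob for throttling those initial syncs as well.]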
B: Thank you very much for all your ideas and for your help.