From YouTube: 2023-05-25 Kubernetes SIG Scalability Meeting
Description
Agenda and meeting notes - https://docs.google.com/document/d/1hEpf25qifVWztaeZPFmjNiJvPo-5JX1z0LSvvVY5G2g/edit?usp=sharing
A
Hey everyone, this is the SIG Scalability meeting for the 25th of May 2023, and today we have an LTTng tracing demo from Benjamin, and I give it to you... yeah, you can just start. Okay.
B
So allow me to provide some background information for those unfamiliar with the concept of tracing. You might have heard of OpenTracing or even eBPF. Essentially, what a tracer does is record events that represent what a process was doing at a certain point in time.
B
As an example, here we have a client making a request to the back end, and the back end doing a few subtasks. So the events would be the client starting a request and then ending the request, and then the web app starting the processing and then ending the processing. And with all of these events, you can create a sequence diagram that represents the critical path of your request.
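(To make the event model described here concrete, below is a minimal sketch in Go; the types and names are illustrative, not LTTng's actual format. It pairs begin/end events to recover per-task durations, which is the raw material for the sequence diagram just mentioned.)

```go
package main

import (
	"fmt"
	"time"
)

// TraceEvent is an illustrative record of the kind a tracer emits:
// which process did what, and when.
type TraceEvent struct {
	Timestamp time.Time
	Process   string // e.g. "client", "webapp"
	Task      string // e.g. "request", "task1"
	Phase     string // "begin" or "end"
}

// spanDurations pairs begin/end events to recover how long each
// (process, task) span took.
func spanDurations(events []TraceEvent) map[string]time.Duration {
	begins := make(map[string]time.Time)
	durations := make(map[string]time.Duration)
	for _, ev := range events {
		key := ev.Process + "/" + ev.Task
		switch ev.Phase {
		case "begin":
			begins[key] = ev.Timestamp
		case "end":
			if start, ok := begins[key]; ok {
				durations[key] = ev.Timestamp.Sub(start)
			}
		}
	}
	return durations
}

func main() {
	t0 := time.Now()
	events := []TraceEvent{
		{t0, "client", "request", "begin"},
		{t0.Add(2 * time.Millisecond), "webapp", "task1", "begin"},
		{t0.Add(9 * time.Millisecond), "webapp", "task1", "end"},
		{t0.Add(10 * time.Millisecond), "client", "request", "end"},
	}
	for key, d := range spanDurations(events) {
		fmt.Printf("%s took %v\n", key, d)
	}
}
```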
B
But the issue is that from one trace to another, maybe you can see here that task 1 is taking more time compared to trace A, and there's no way to know exactly why it was slower.
B
It's kind of a black box, and so kernel tracing allows you to better understand why the task was blocked. So in the second example, you can see here that, using kernel tracing, you could see that task 1 was blocked by the task 1 from trace A, because it was waiting on a futex. Basically, a kernel tracer collects system calls, so that way you can know more about your system, and using a tool like Trace Compass, you can better understand the critical path of each process.
B
You can see the CPU usage, the memory usage, the disk usage, and a lot more. So the kernel tracer that I'm using is LTTng.
B
As an example, I've traced a simple deployment of a single pod on Kubernetes, using my user-space instrumentation. You can see all the events in the life cycle of a deployment creation: you can see the Deployment being created, you can see the ReplicaSet being created, and you can see the Pod being created. So you can see all the phases to create the pod: you can see it being pulled, you can see it being started, and you can see it being killed.
B
In this pod I've created, I put a really low quota on purpose. So when you zoom in on the pod being started, you can see the small spike here that indicates that it was being throttled, which is a common case. And when you look at the control flow of your system, you can see that all my processes were preempted, which basically shows that my pod was being throttled. Also, in this case, I can look at...
B
The image pulling was really slow: it took 13 seconds. And basically, to know why it was slow, I can look at the critical path of containerd, since containerd was the process that was pulling the image.
B
In the more specific view, you can see all the times it was preempted, so it's a bit cluttered, but there's an overview.
B
Here is an overview of the critical path, and you can see the reason why it was slow. So in this case, you can see that it was preempted a lot, for example by other containerd processes; it itself was preempted a lot; it also waited a lot on the timer and a bit on the network. So yeah, essentially, Trace Compass and kernel tracing allow you to get more information on what was actually happening in the kernel.
B
You can also see the disk activity, you can see which processes were running on which CPUs at each point in time, and you can see the complete control flow of the processes. So that's essentially the information we can collect and process using LTTng and Trace Compass.
A
Yeah, this looks pretty cool. I was actually wondering, because on the lowest one you also have scaling the Deployment and the ReplicaSet: how do you get this information? Because my understanding before was that LTTng is kind of a kernel thing, but here...
B
So basically, LTTng can collect every system call being made and record it, but there's also a library called liblttng-ust that allows you to record events from user space. So what I've done is add a few calls to that library inside of Kubernetes: I've compiled my own Kubernetes, and from Kubernetes I've called the tracer to add lifecycle events like being pulled, started, killed, etc.
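(Benjamin's actual patches are not shown in the meeting; the sketch below illustrates the general approach of calling liblttng-ust from Go through cgo. It uses lttng-ust's tracef() convenience API, whereas real instrumentation would more likely define a full tracepoint provider with TRACEPOINT_EVENT to get structured fields. The package and function names are made up.)

```go
// Package podtrace sketches emitting LTTng user-space events from
// Kubernetes component code via cgo. Hypothetical; names are made up.
package podtrace

/*
#cgo LDFLAGS: -llttng-ust
#include <stdlib.h>
#include <lttng/tracef.h>

// cgo cannot call variadic C functions, so wrap tracef() in a
// fixed-arity helper.
static void emit_event(const char *msg) {
	tracef("%s", msg);
}
*/
import "C"

import "unsafe"

// EmitPodLifecycle records a pod lifecycle event (e.g. "pulling",
// "started", "killed") as an LTTng user-space trace event.
func EmitPodLifecycle(pod, phase string) {
	msg := C.CString(pod + ": " + phase)
	defer C.free(unsafe.Pointer(msg))
	C.emit_event(msg)
}
```

Events emitted this way show up under the lttng_ust_tracef provider once a user-space session is enabled, e.g. with `lttng enable-event --userspace 'lttng_ust_tracef:*'`, and can be correlated with kernel events enabled via `lttng enable-event --kernel --syscall --all`.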
A
Yeah, so this is pretty cool, because I think last time at the SIG meeting we were discussing kind of a similar thing, where we were interested in tracing KubeVirt latency of creating VMs, and something like this would also be really great, I think. What do you think, Shyam and Wojtek?
D
Yeah, I fully agree. I'm not closely involved in the tracing-related efforts that are driven by SIG Instrumentation, but I think they are integrating with OpenTracing, if I remember correctly. I think it would be good to present this at some SIG Instrumentation meeting, get their feedback, and understand why they chose OpenTracing and whether we may potentially want to revisit that. I'm not super familiar with any of those.
D
So it's hard for me to tell, but they would be good people to talk to about that too.
E
So Benjamin, you said there was a way for you to plug in these kinds of custom events, the user-space events that you're plugging in using this library, and you also said that to be able to do this, you had to go and make some code changes to the components to emit certain events and such. So I think it would be interesting to see if the same kind of, or a similar, method is possible with the OpenTracing work that Wojtek is talking about with SIG Instrumentation. For the things that you have changed here: can they also be plugged into the other solution that the community is thinking about today?
E
If not, then it's a good place to bring it up, right? Hey, I'm able to do this with LTTng, and it's the same thing with...
A
I think one more interesting thing from our perspective would be to see it for the control plane nodes, and not the workers. From a user perspective, they probably care about why the Pod was starting slowly, but from a scalability perspective I think it would also be cool to see it for the control plane.
A
A question, yeah. So my question would be: how much effort did you actually have to put in to integrate it with the node, basically, or with the kubelet and containerd?
B
So essentially, the kernel-space instrumentation you get for free: you just have to install a kernel module and you'll be able to collect every trace event that you want from the kernel. For the user-space instrumentation, in this case, the issue is that the library is in C, so I had to use cgo to call it, but otherwise it's pretty easy; it's just like a system call.
E
I know lots of users, at least a class of customers today, that do at least the kernel side of tracing using eBPF probes. I think you also mentioned that here: you can use uprobes and kernel probes, for instance, right? And I don't know how much of this... for example, Cilium, I believe, does a bunch of eBPF.
E
Yeah, yeah. So I guess, from what I sense, where the community is heading, and I might be totally wrong, is: for these kernel operations, or even user-space operations, like, let's say, changing iptables and stuff like that, or, let's say, networking configuration and such, using eBPF to gather...
E
That is something people already do today, and the OpenTracing proposal that is floating around in SIG Instrumentation... I need to go check whether that is trending towards... maybe Wojtek, you know more about it, but I suppose that's purely Kubernetes space.
E
Okay, and what you're proposing here, Benjamin, is one single solution that can do all of this. So again, I think I'm going back to the same point as previously: it would be good to check with folks the direction we've taken and why we've taken that direction.
B
So that's one issue with combining these two tracers: LTTng is able to combine both, but also eBPF adds more overhead than LTTng does.
E
The tracing you're collecting, if it's not using eBPF, is it a completely new thing, or...
B
No, so that's the thing: eBPF is computing things, but LTTng only copies things in memory; it doesn't do any computation at all, it just records events. So that's why it's really fast and efficient.
E
Okay, okay. Overall, I think this is super cool to watch as well. The other part I think we haven't talked much about is how you are visualizing this. This tool for visualizing, is it also a part of LTTng, or can it work, in general, with the open tracing standards, like, can it ingest traces?
E
Maybe we can... I guess it's a little bit down the line, but I think it would be cool for us to add this sort of visualization for the tests that we run today.
B
Trace Compass, yes. So Trace Compass is just the front end; you don't have to use Trace Compass to analyze LTTng traces. There's also a library called Babeltrace that allows you to essentially read your trace files and create your own custom analysis for them, and I'm sure you could use another front end.
B
But what's nice with Trace Compass is that there's a lot of analyses that have been built on top of it. It's a tool developed by a lab at Polytechnique Montréal, together with Ericsson.
E
Thank you. Yeah, I guess, I think maybe once you check back with SIG Instrumentation, feel free to let us know what the thinking is about it. You can use this channel to bring up any follow-ups, because, for example, we can say what sort of things we usually look for.
B
Okay, but I've actually talked a few months ago with SIG Scalability about our use cases, like before, like...
A
We had a meeting, okay.
B
I was just looking more for use cases inside of Kubernetes, like people that would appreciate, that would want to see certain things using kernel and user-space tracing.
A
Also, I'm wondering about KubeVirt: two weeks ago we had exactly a similar conversation about the issue of debugging why it takes so much time for a VM to spin up in KubeVirt, and they were actually interested in tracing. I'm not sure if they need kernel tracing, but it might be helpful.
A
So that could maybe be one use case. But from our perspective, I think we saw before some issues with pod startup latency, for example, that you presented here, right? And I think we saw it on the node level as well, so it might be useful for us, I think.
A
Yeah, like with Docker before, we had some contention on the node level, I think.
D
Yeah, I think in general we still don't have a super good understanding of where we are spending most of the time, or how the time is split inside the node, for starting or even deleting a pod.
E
Yeah, I guess where this conversation is heading, I believe, is maybe: okay, Benjamin, if this is something you want to take a stab at, and you're interested in actually getting some of these changes in, you might want to see if there are at least some parts of it that you can try integrating with our tests. I guess the changes where you have to go and add these events in the Kubernetes code, those will be...
E
All right, so I had another topic for today, if...
E
So it's about this change; I just pasted the link here.
E
I think, whoever made this change... well, I think you reviewed it. So it seems like this one is making LIST calls take multiple seats in APF, and there are, I think, a couple of customer issues we ran into recently. And I see Pratik from the AWS side has joined.
E
I'll let him speak more about it, but I think the main thing which kind of came out of it for me was: if we had tested that change using our tests today, the load and density tests that we run today, I think they wouldn't have actually exercised this, because our tests don't really create any LIST load.
D
We were testing it a little bit out of tree, internally in Google. But can you explain what the actual issues were? Because, certainly, there are things where it can be improved, but I want to understand where. Because we've also seen a bunch of cases where it actually helps a lot, where the cluster would fall down without this and survived with it.
C
So yeah, thanks. The particular edge case that we are seeing is: if a customer, or a user, is making a large number of LIST calls, and now they are upgrading from a version which didn't have this functionality, which is 1.22 (this was added in 1.23), what they are seeing is that their LIST calls are now taking a large number of seats.
C
So some of the other calls that fall in the same bucket are basically getting 429s, because there are no more seats for those calls, which was not the case in 1.22. And this started happening right after they upgraded from 1.22 to 1.23. In the metrics, we can see that the 429s and the rejected calls also increased as a result of the upgrade, but from their side there was no other increase in the load or in the LIST call pattern that we checked.
D
Yeah, I think that, in general... we've seen that too, and the reason for that is primarily that we are not tuned well enough, especially in terms of defining the capacity of the API server. So, basically, defining or increasing the number of in-flight seats in the API server, adjusting this value, is how we believe it should be solved.
E
Put more seats. I feel that probably the miss with that PR is that, when doing this, it is also sharing the bucket with other verbs, which means mutating requests, which will then be affected. Maybe we should have done that along with separating the bucket for LIST calls altogether, because in the end we actually ended up doing that.
E
For these customers, we created another bucket, which diverts the LIST traffic, and you can separately throttle that without affecting the rest. So maybe... but I also see how it may be hard to come up with such a generic bucket, with respect to how many concurrency shares and so on it should have. But do you think it makes sense that this change should have gone along with that sort of a change?
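(For reference, a dedicated bucket of the kind described here would be expressed as a PriorityLevelConfiguration plus a FlowSchema routing LIST traffic to it. The sketch below is hypothetical: the object names, share counts, and the matched service account are invented for illustration, using the flowcontrol.apiserver.k8s.io/v1beta2 types of the 1.23 era.)

```go
// Hypothetical sketch of the mitigation described above: a dedicated
// APF priority level ("bucket") for heavy LIST traffic, so such calls
// cannot exhaust the seats shared with mutating requests.
package main

import (
	"fmt"

	flowcontrol "k8s.io/api/flowcontrol/v1beta2"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func listBucket() (*flowcontrol.PriorityLevelConfiguration, *flowcontrol.FlowSchema) {
	// A separate, limited priority level with its own concurrency shares.
	pl := &flowcontrol.PriorityLevelConfiguration{
		ObjectMeta: metav1.ObjectMeta{Name: "heavy-lists"},
		Spec: flowcontrol.PriorityLevelConfigurationSpec{
			Type: flowcontrol.PriorityLevelEnablementLimited,
			Limited: &flowcontrol.LimitedPriorityLevelConfiguration{
				AssuredConcurrencyShares: 10,
				LimitResponse: flowcontrol.LimitResponse{
					Type: flowcontrol.LimitResponseTypeQueue,
					Queuing: &flowcontrol.QueuingConfiguration{
						Queues: 16, HandSize: 4, QueueLengthLimit: 50,
					},
				},
			},
		},
	}
	// Route LIST requests from the offending client into that level.
	fs := &flowcontrol.FlowSchema{
		ObjectMeta: metav1.ObjectMeta{Name: "heavy-lists"},
		Spec: flowcontrol.FlowSchemaSpec{
			PriorityLevelConfiguration: flowcontrol.PriorityLevelConfigurationReference{Name: "heavy-lists"},
			MatchingPrecedence:         500,
			Rules: []flowcontrol.PolicyRulesWithSubjects{{
				Subjects: []flowcontrol.Subject{{
					Kind: flowcontrol.SubjectKindServiceAccount,
					ServiceAccount: &flowcontrol.ServiceAccountSubject{
						Namespace: "batch", Name: "list-heavy-client",
					},
				}},
				ResourceRules: []flowcontrol.ResourcePolicyRule{{
					Verbs:      []string{"list"},
					APIGroups:  []string{"*"},
					Resources:  []string{"*"},
					Namespaces: []string{"*"},
				}},
			}},
		},
	}
	return pl, fs
}

func main() {
	pl, fs := listBucket()
	fmt.Printf("PriorityLevelConfiguration %q, FlowSchema %q\n", pl.Name, fs.Name)
}
```

Applying these two objects throttles the matched LIST traffic separately, which is the "separate bucket" mitigation being discussed.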
D
Yes, so I guess it's certainly a reasonable mitigation to that problem too. I wouldn't do that by default for everyone, because in a typical use case, where the user is not actually overloading the API server in some sense, I think they generally want to share the priority level between different types of calls, I mean both mutating and reads. So I think what we have are mitigations and so on, but I'm not sure we actually should have, by default, separated that.
C
Coming back to your original comment around testing: you also mentioned that the estimation is still not perfect. So let's say in APF we go and improve the estimation algorithm, and tomorrow the maximum seats, for whatever reason, is bumped up to like 20 or 50 seats; then again we might run into this issue, right? Unless we have this kind of testing which can include this case, from either ClusterLoader or some kind of perf test.
D
Yeah, I mean, the fact that we are missing tests is something I certainly agree with. So if anyone has capacity to extend our tests or build new ones to exercise these scenarios, I'm all for it.
D
You can also talk to Abu from Red Hat; I will paste his handle here in the chat, because he was actually also looking into that and performing a bunch of tests internally at Red Hat. I think I've seen some demo or presentation or summary from him at some point, like two years ago.
D
But I can't remember exactly what those were. Certainly he was also doing those, and he may either have something still around, or may be able to help with running some of those tests, or with designing or reviewing them.
D
I will post his nickname... I will not type it right now: "polynomial", but with some zeros somewhere. Okay.
C
Or if you could point me towards the tests that I can reuse, for enhancing these tests.
G
We are working on experiments where you need to find out at what point the API server breaks under concurrent LISTs. Also, I was just curious, and it's related to the discussion: if anyone has pointers, I'd be curious to read them.
E
One thing, though: yeah, I guess we missed the release note for this particular change, but I think you mentioned this already. That is also my other call-out. I think what we needed was a release note saying this is changing in this way; you know, that can make a difference.
D
Yeah, I guess it's too late now to add it, because no one will read this release note now, given that 1.22 is already out of the support window and 1.23 will soon be out of it too. But I think updating the documentation in general is something that we still can do, and that is something that people are looking at.
E
Yeah, so, okay, sounds good. All right: so there are the testing gaps, and this idea about whether to split LIST calls by default into a separate bucket is open, we need to discuss this. But in general, actually, the fact that this change went out and, besides two customers, not many people saw this, is a good sign that it is actually the right change, and these particular cases are just corner cases.
A
Okay then, thank you, and see you in two weeks.