From YouTube: SIG Instrumentation 20221013
Description
SIG Instrumentation Bi-Weekly Meeting Oct 13th 2022
Welcome, everyone, to the SIG Instrumentation bi-weekly; today is October 13th. We have a few items on the agenda today. Up first, I don't remember the name behind the username, but someone is asking about the release of metrics-server.
A: Yeah, he would know. We can move this on to the next meeting, and I will ping [the maintainer] and ask him to attend. Cool.
E: Guess I'm up. So we've been working, my team's been working, on metrics, specifically related to usage, for a bit now, and we thought it might be a good fit for the community as a subproject. Katrina and I have especially been talking about metrics related to, you know, utilization, and some of these things that help tune workloads, and she was also interested in it.
E: We thought that the Kubernetes project might be a good place for us to collaborate on these, and I have a doc link that outlines why not just, you know, kube-state-metrics, or why not any of the various other solutions that already exist, which I can kind of walk through here. But the intent is to ask the SIG whether this is something there is any interest in having developed within the community.
B: If there are folks that are new to the meeting, while we're waiting for Phil to come back, I just wanted to let you know that if you feel comfortable, feel free to turn on your camera and say hi. We love having new contributors in SIG Instrumentation, and it's great to see all your faces. Paul, you have a hand up.
C: I would just like to say hello. Hi, everybody; bye, everybody. I did also want to say that I have a pull request to add some basic reconciliation metrics to KCM, which I simply need to reopen and rebase, having been starved for time with numerous distractions, and I am hoping to do that soon. Anyway, nice speaking to you all.
F: ...the stuff he's about to show, so hopefully I can answer some questions. And hey, Katrina, good to see you.

C: Good to see you too. I can introduce myself as well.
C: I worked with Phil and Eva, and I now work at Shopify. I've still been talking to Phil about some of the stuff he's about to present, and Shopify is pretty intrigued by it as well, so I'm here to support the proposal. I mentioned it as well to some of my colleagues who work on platform efficiency and observability, and I see a few of them came here today as well, if they'd like to introduce themselves.
C: Yeah, so my name is Pedro. I'm also currently working at Shopify, on the observability team, and yeah, I'm here because I'm intrigued by this proposal and quite interested in this SIG in general. [inaudible]
C: I am Tai, from Shopify as well, on the platform efficiency team. I don't think you'll be seeing me as a regular here, but I'm also interested in the proposal.
C: And I'll introduce myself to you: I'm Danny. I work with Paul, Eva, and Del, and I worked with Katrina for a bit as well. Hi, Katrina.
B: It's all good. Risha, were you about to introduce yourself?

Yes...
E: Makes sense. All right, I'm Phil Wittrock. I work with, as you know, Danny and Paul, and I've spent time in the community, oftentimes in SIG CLI, also some time in SIG Release. So it's great coming over to this territory and getting to see the folks over here.
E: Okay, so the TL;DR, right, is that we now have a solution that we think solves a problem; it certainly solves a problem for us, and we think it would solve a problem for others. After several rewrites, it's designed from the ground up to be generic and extensible, to really avoid hard-coding any of our specific use cases or metadata, and to have all of that side-loaded through configuration instead.
E: Here's a TL;DR example of something that is relatively difficult to do today and that we're trying to make easy, right: for all the containers in a workload, take a one-second sample interval, and then show me results per container. Or rather, if a pod has multiple containers, say you have your app container and then your logging container, for each one of those...
E: So for the app containers, show me the 95th-percentile CPU utilization over the last five minutes, over all those one-second samples, across all the pods, and then the same thing for the logging container. And the use case for this, right, is: are my requests and my limits set to good values? How might I want to change them?
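To make that concrete, here is a minimal Go sketch of the kind of aggregation being described: a nearest-rank p95 over one-second CPU samples, grouped by container name across all the pods of a workload. The types, names, and numbers are illustrative, not taken from the project under discussion.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is one one-second CPU reading from a container (illustrative type).
type Sample struct {
	Container string  // e.g. "app" or "logging"
	CPU       float64 // cores used during the one-second interval
}

// p95ByContainer groups samples by container name and returns the
// 95th-percentile CPU usage for each group.
func p95ByContainer(samples []Sample) map[string]float64 {
	groups := map[string][]float64{}
	for _, s := range samples {
		groups[s.Container] = append(groups[s.Container], s.CPU)
	}
	out := map[string]float64{}
	for name, vals := range groups {
		sort.Float64s(vals)
		// Nearest-rank percentile: index ceil(0.95*n) - 1.
		idx := (len(vals)*95+99)/100 - 1
		out[name] = vals[idx]
	}
	return out
}

func main() {
	samples := []Sample{
		{"app", 0.42}, {"app", 0.51}, {"app", 1.90}, // the app container spikes
		{"logging", 0.05}, {"logging", 0.07}, {"logging", 0.06},
	}
	fmt.Println(p95ByContainer(samples))
}
```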
E: Getting this at a one-second sampling interval is a much finer grain than, say, the 15- or 30-second sampling interval you might otherwise get. So there are a couple of things here that this provides that are challenging to get today. One is that finer-grained sampling interval; another is resolving those containers up into their workloads, rather than having to join against...
E: ...you know, what is the owner, and then what the owner of that is, from ReplicaSet to Deployment and these sorts of things, all through PromQL queries, which is actually really quite challenging to do, and actually very expensive. If you have a sufficiently large number of containers, you don't want to be doing these massive joins. And the sampling interval is not one second, right; Prometheus is not the right storage mechanism, in my opinion, for storing time series at a one-second granularity.
E: That would definitely blow up Prometheus. So instead, have it scraped beforehand at a one-second interval, and then export to Prometheus, over, say, a five-minute window, what the 95th percentile is, or the max, or the mean, or the median, these sorts of things. That's kind of how we describe it, again.
E: This gets exported to Prometheus, so it's kind of a preprocessor for Prometheus, to improve performance and offer a number of things. The configuration says: okay, grab me CPU and memory, I want container utilization, and here's what I'm aggregating against, so keep the container. These are the labels I want to keep; there are a bunch of labels attached to this, like the node, for instance, and the priority class, and all these sorts of things. Drop all of those, keep these labels instead, and then give me the p95 for everything that has the same set of labels. So that's the quick TL;DR example of how you would express, in our configuration, how to get this metric, and then that metric gets exported to Prometheus.
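The actual configuration format isn't shown in the meeting, but going by this description, its shape might look roughly like the following Go struct. Every field name here is a guess, not the project's real schema.

```go
package main

import "fmt"

// MetricConfig mirrors the configuration described above: which resources
// to collect, at what level, which labels to keep (everything else, such
// as node or priority class, is dropped), and which aggregations to
// compute for every distinct set of kept labels. All names hypothetical.
type MetricConfig struct {
	Resources    []string // e.g. "cpu", "memory"
	Level        string   // e.g. "container"
	KeepLabels   []string // labels retained for grouping
	Aggregations []string // e.g. "p95", "max", "mean", "median"
}

func main() {
	cfg := MetricConfig{
		Resources:    []string{"cpu", "memory"},
		Level:        "container",
		KeepLabels:   []string{"workload_name", "workload_kind", "container"},
		Aggregations: []string{"p95"},
	}
	fmt.Printf("%+v\n", cfg)
}
```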
E: And if you want to do additional PromQL on that, that's very possible, like an average over time over a window of a day or a week, or something like that.
E: And that is a configuration option. The way this is implemented is that there's a scraper on each node that scrapes every second, stores the samples in a ring buffer, and then pushes them to a central service over an interval. The scrape interval is a configuration option: is it every second, every tenth of a second, every two seconds? The size of the ring buffer is a configuration option.
E: Do you want to store ten minutes' worth of samples, five minutes' worth, one minute's worth in that ring buffer before you start expiring them? And then how frequently you push them to the main collector is a third configuration option. So you could, for instance, store five minutes' worth of samples but push them every minute, so that while the collector has a five-minute window, it also has the most up-to-date set of metrics when it does get scraped; they're not five minutes old, for instance.
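A minimal sketch of that per-node loop, under the stated assumptions: a one-second scrape, a five-minute (300-sample) ring buffer, and a one-minute push interval. readCPUSample and send are hypothetical stand-ins for the node-local read and the push to the central service.

```go
package main

import (
	"fmt"
	"time"
)

// ring is a fixed-size buffer; once full, the oldest sample is
// overwritten, which is how old samples "expire".
type ring struct {
	buf  []float64
	next int
	full bool
}

func newRing(n int) *ring { return &ring{buf: make([]float64, n)} }

func (r *ring) push(v float64) {
	r.buf[r.next] = v
	r.next = (r.next + 1) % len(r.buf)
	if r.next == 0 {
		r.full = true
	}
}

// window returns the currently buffered samples, oldest first.
func (r *ring) window() []float64 {
	if !r.full {
		return append([]float64(nil), r.buf[:r.next]...)
	}
	return append(append([]float64(nil), r.buf[r.next:]...), r.buf[:r.next]...)
}

func main() {
	const (
		scrapeEvery = time.Second      // configurable scrape interval
		keep        = 300              // five minutes of one-second samples
		pushEvery   = 60 * time.Second // flush to the collector every minute
	)
	r := newRing(keep)
	scrape := time.NewTicker(scrapeEvery)
	push := time.NewTicker(pushEvery)
	defer scrape.Stop()
	defer push.Stop()
	for {
		select {
		case <-scrape.C:
			r.push(readCPUSample()) // hypothetical node-local read
		case <-push.C:
			send(r.window()) // hypothetical push to the central service
		}
	}
}

func readCPUSample() float64 { return 0 } // stub
func send(w []float64)       { fmt.Printf("pushed %d samples\n", len(w)) }
```

Because the buffered window (five minutes) is larger than the push interval (one minute), the collector keeps full coverage while still seeing fresh samples, which is the trade-off described above.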
E: So all of those are configuration options. We chose these particular numbers based off, well, the one-second scrape I think is from the Google Autopilot paper.
E: I believe that was the interval chosen there, and so we said, well, we don't have a better interval that we have a strong feeling for, so we chose that. I think the five minutes was chosen because it matched the scrape interval we were using for Prometheus at the time; having the window match made sure that each scrape had all of the samples in it. That's how we chose those values, but they're not fixed.
E: I mean, no, I don't see it as being part of either one of those. It's possible, but I think it has a pretty different set of goals. kube-state-metrics publishes a metric for everything, and it allows you to do a lot of really cool stuff.
E
But
to
do
simple
like
to
do
this,
for
instance,
you
would
need
to
do
a
lot
of
joins
and
still
can't
get
at
the
second
interval,
and
so
it'd
be
a
pretty
big
I
think
architecture
shift
for
either
one
of
those
and
introduce
maybe
a
lot
of
unnecessary
complexity
and
those
called
other
problems
this
doesn't
solve.
This
is
really
just
focused
on
capacity
and
usage
metrics,
and
so
introducing
that
complexity
and
Coop
State
metrics
I.
Don't
think
necessarily
provides
a
lot
of
benefit.
E: It's just not in this target, right. You can use kube-state-metrics to get not this, but something else: you can get it over a longer interval, without the workload metadata attached. I think maybe that's where... and that's actually how we started out. Initially we started out with kube-state-metrics, and we spent some time doing the PromQL joins to try and make that work, before eventually settling on this solution.
B: Yeah, he also said metrics-server, which makes sense. There's another question in the chat: where is the data coming from? The kubelet, cAdvisor, containerd, or directly from the process?
E: Fantastic question. So today... this is not, I don't think, something that's fixed in time, and we'd love to get feedback on better ways of doing it. We explored a number of different options. I think you're talking about the utilization data and the one-second scrapes specifically, is that right?
E: We tried different options, and it turns out that walking the cgroup file system isn't particularly complex, and using some of those other tools is not particularly simple. This worked out very well: it was a small amount of code to do, and it doesn't need a lot of permissions. Mounting, say, the containerd socket would give us access to things we don't necessarily want to be able to do, whereas mounting the cgroup file system read-only really limits our capacity to do malicious things.
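For illustration, here is roughly what a read-only cgroup walk can look like on a cgroup v2 host: read the cumulative usage_usec counter from a cgroup's cpu.stat twice, one second apart, and the delta gives cores used. The cgroup path is an example and varies by node setup; this is a sketch, not the project's actual collector.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// usageUsec returns the cumulative usage_usec counter from a cgroup v2
// cpu.stat file (total CPU time consumed, in microseconds).
func usageUsec(cgroupPath string) (uint64, error) {
	f, err := os.Open(cgroupPath + "/cpu.stat")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 && fields[0] == "usage_usec" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("usage_usec not found in %s/cpu.stat", cgroupPath)
}

func main() {
	const cg = "/sys/fs/cgroup/kubepods.slice" // example path; varies by distro
	before, err := usageUsec(cg)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	time.Sleep(time.Second)
	after, err := usageUsec(cg)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	// Microseconds of CPU consumed over a one-second wall interval equals
	// cores used during that second.
	cores := float64(after-before) / 1e6
	fmt.Printf("%s used %.3f cores over the last second\n", cg, cores)
}
```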
E
So
that's
that's
what
we
ended
up
doing
that,
but
we're
not
we're
not
stuck
on
this,
and
it
doesn't
work
well
for
some
things.
Like
micro
VMS,
we,
you
know
you
can't
get
at
the
container
metrics
and
micro,
VM
runtime,
so
I
I
think
we're
going
to
continue
it
to
evolve.
How
do
we
get
these
metrics
and
maybe
have
different
options
available
where
you
select
in
the
config?
Where
do
you
want
to
get
the
metrics
from.
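That "select the source in the config" idea could be modeled as a small interface with swappable backends. This is purely a sketch of the design choice; none of these types come from the project, and the backend bodies are stubs.

```go
package main

import "context"

// Sample is one utilization reading for a cgroup or container.
type Sample struct {
	Target string  // cgroup path or container ID
	CPU    float64 // cores used during the sample interval
}

// Source abstracts where samples come from, so a cgroupfs walker, a CRI
// client, or a microVM-specific backend can be swapped via configuration.
type Source interface {
	Sample(ctx context.Context) ([]Sample, error)
}

type cgroupfsSource struct{ root string }  // walks /sys/fs/cgroup read-only
type criSource struct{ endpoint string }   // queries the CRI stats API

func (cgroupfsSource) Sample(ctx context.Context) ([]Sample, error) { return nil, nil }
func (criSource) Sample(ctx context.Context) ([]Sample, error)      { return nil, nil }

// newSource picks a backend from configuration.
func newSource(kind string) Source {
	if kind == "cri" {
		return criSource{endpoint: "/run/containerd/containerd.sock"}
	}
	return cgroupfsSource{root: "/sys/fs/cgroup"}
}

func main() { _ = newSource("cgroupfs") }
```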
F
Yeah
I'd
be
very
interested
in
it
and
also
like
getting
in
touch
we've
been
kind
of
exploring
in
GK
metrics.
If
we
can,
we
get
these
metrics
in
high
resolution
and
where
we'd
get
them
from,
and
that
was
also
what
we
looked
at
like
skipping
coupon
State
advisor
directly
looking
at
the
file
system
and
watching
that
as
well
to
get
what's
the
most
resolution
possible.
E
One
nice
thing
too,
is
that
this
example
is
just
for
containers,
but
we
also
have
the
C
group.
We
have.
We
scrape
the
c
groups
on
at
a
node
level.
So
there's
two
Dimensions
there's
like
the
workload
Dimension,
which
is
spread
across
a
bunch
of
notes,
right
and
then
there's
another
dimension,
which
is
like
the
the
no
Dimension
and
So
within
the
node
Dimension
you
make
Halo.
How
much
is
the
system
Slice
versus
the
coup,
pods
C
group,
using
right
within
the
system?
Slice
do
I.
E
Have
some
process
like
that
kicks
up,
you
know,
is
it
really
the
kublet?
That's
the
most
or
do
I
have
other
processes
in
there
and
what
do
they
Spike
at
and
we've
seen
some
interest,
so
we
scrape
those.
We
also
export
those
p95s
Maxes,
those
sorts
of
things,
and
we
have
seen
interesting
results
where
you
know.
There's
certain
processes
running.
E
You
know
in
the
system
reserved
that
on
average
don't
take
a
lot.
But
if
you
look
at
a
one
second
granularity,
they
they
Spike,
like
they
actually
take
a
ton
at
that
very
small
interval.
There's
others
that
you
know
just
on
average,
take
a
significant
amount,
but
never
Spike
and
having
an
understanding
of
that
I
think
is
is
definitely
interesting.
F
One
other
thing
you
might
be
that
might
be
interesting
to
look
at
that
we've
been
at
least
discussing
if
it
would
be
possible,
is
looking
at
it's
some
way.
Looking
at
om
killer
events,
because
that's
like
the
final
point
you
get
like,
how
much
did
you
use
because
before
it
got
killed
and
that's
what
we
usually
Miss
100.
E
Umkiller
CPU
throttling
for
C
group
C2
various
like
memory
pressure
events,
all
that
stuff
I
think
we
want
to
look
at
and
some
of
that
stuff
is
like
partially
wired
in,
but
not
just
not
complete,
but
but
we're
certainly
working
towards
getting
all
that
sort
of
stuff.
And
then
you
can
build
like
you're
saying
a
collective
picture
of
like
you
know,
maybe
maybe
my
P95
usage
was
well
below
my
request,
but
I'm
still
getting
throttled
right
and
possible
thing.
A: We have five minutes left. So you're proposing, basically, to turn this into a SIG project?
A: Okay. Does anyone have any objections to this being a part of the SIG?
D: I do have an objection, to be honest, because essentially I feel like it's trying to replace the existing ways we have to collect usage, which just aren't well optimized yet. We've seen that it doesn't scale well to get the metrics from the kubelet directly, because cAdvisor is not optimized, and I feel like this would just be a shortcut to not optimizing cAdvisor.
C: There is a related SIG Node enhancement that they're working on, to get all the cAdvisor metrics through the CRI.
B: The CRI stats one. I think this is complementary, because this is only going to be, I think, a subset of those stats, and at a much more scalable level. With the CRI stats... well, first of all, the CRI stats aren't everything from cAdvisor either, that's also a subset, but I think this is looking at much more granular data than the CRI stats proposal currently is.
B: Yeah, I did see that the KEP had been updated, but it wasn't clear to me whether they were planning on replacing all of them or not.
E: I want to be clear: the value here isn't that we have another way of getting the metrics from the node. That is not the value; that's just the "how." The proposal is: how do we get one-second-granularity metrics for containers within a workload, for instance, or aggregated by cgroup? I want to see a histogram of the cgroups within a node, across a cluster, and I'm not...
A: ...so this wouldn't go into KSM in the near future, realistically?
D: And even then, I don't think that would fit the purpose of KSM. My point, personally, is that we already have this kind of granularity through the resource metrics endpoints: we already have a way to scrape pod usage and container usage more often. The only limit today is on the kubelet side; it was shown that it's impossible to get down to one second because cAdvisor is not optimized.
D: But from the other side, I'm also thinking that, even with a one-second scrape interval, in my opinion, Prometheus should be more optimized than whatever solution we come up with, because even if the scrapes are done more often, the data is only written to disk after a certain number of scrapes, and Prometheus is optimized for searching the data and then applying the actual mathematics.
F: The Summary API is still not really a proper public API; there are no clients for it. People don't have an API for that which we actually tell them to use. We just use it ourselves, to my understanding, for the resource metrics.
E: I have a question. Let's say we switched to whatever preferred way of getting the metrics on the node, and I have no opinion about what that is: the container runtime interface, or a cAdvisor we resurrected from the dead, whatever it is. If we just switch to that, does that address the "now we have two ways of doing this thing" concern?
A: I think the question is: does it make sense to roll some of this into the kubelet directly?
E: It's going to be hard to roll in the aggregation aspect; I didn't really cover all the use cases, but there's the aspect of collecting all the samples from all the nodes and then sorting them. So let's say you have 300 samples per container, and you have 100 pod replicas for a deployment, spread across, say, two ReplicaSets because it's mid-rollout. By collecting those 300 samples per container times 100 pods, you've got 30,000 samples there.
A: Oh, by the way, we're having a face-to-face at KubeCon on Monday, so...
A: I think 10:15 or 10:30 in the morning, on the Monday of the week of KubeCon. And yeah, we should definitely talk about this there, I think.
A: I am personally okay with rolling this into the SIG, I think.
B: We clearly have a lot of interest, and I think a lot of people who are looking to contribute, and as an out-of-tree thing I don't really have any concerns about putting it in the SIG's projects and seeing what happens. We would just need a list of people for approvers, reviewers, repo admins, that kind of thing, and we can POC it. And if it turns out that this isn't the best fit, then we can deal with that once we can actually, you know, see and work with the code, yeah.
E: Yeah, why don't we do that. My suggestion would be, I'd be fine with, say, a three-month evaluation period; it's harder to evaluate when you can't see the code or you can't run it.
E: And being here, you can Slack me, pwittrock on the Kubernetes Slack, if you're interested, or add to the agenda, and then I'll do my best to make sure we reach out to everyone that has interest. And yeah, I'll get you a list of names. Great.
C: I think that's it for the agenda. So thanks, everyone, for joining, sorry we ran long, and see everyone, or at least some of you, at KubeCon.