From YouTube: SIG Instrumentation 20230302
Description
SIG Instrumentation Bi-Weekly Meeting - March 2nd 2023
A
Okay, welcome everyone to SIG Instrumentation bi-weekly. Today is March 2nd, 2023. We have a couple of things on the agenda. Hold on, let me share my screen and make sure I share the right thing, because otherwise I can get in trouble.
A
Okay, so welcome everyone. We have two items on the agenda. The first is Tim's, on categorized request metrics. Do you want to talk about it a little bit?
B
Yeah, sure. So this is — we can maybe think of it as a pre-KEP. I'm just trying to gauge interest in this idea.
B
If
it's
worth
exploring
this
further
and
maybe
getting
a
cap
out
in
128
or
129,
it's
not
kind
of
an
urgent
issue,
but
just
kind
of
the
the
problem
statement
I
with
some
frequency
find
myself
wishing
that
I
could
break
down
our
request:
metrics
by
namespace
or
user,
requesting
user
or
yeah
I,
guess
those
are
the
the
main
two
ones,
maybe
occasionally
like
resource
name,
but
that's
more
less
common,
but
we
can't
actually
have
metrics
with
labels
for
those
fields
because
of
cardinality
issues.
B
So
I
was
thinking
about
like
what
a
solution
to
this
might
look
like.
I
came
up
with
this
idea
for
having
a
static,
although
we
could
probably
discuss
a
dynamic
version
as
well
definition
of
different
categories
of
requests.
B
So, for instance, say I want to classify these requests as system requests: they're for things in kube-system, or they're made by system components, and maybe there are a couple of cluster-scoped resources that I explicitly want to include. Any request that matches any of the patterns in the include list in the example below essentially gets labeled, or categorized, as system, and then we can record — emit — a metric. What did I call it? apiserver_categorized_request_total.
B
Increment that count. And the idea would be: what every cluster operator or cloud provider considers to be system requests is going to be a little different. It depends on which plugins you've installed, and on what you consider to be workloads versus part of running a cluster. So let the cluster operator define "this is the group of requests that I care about measuring explicitly" — sort of a build-your-own-metric around it. So.
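[To make the idea concrete, here is a minimal Go sketch of what a static categorization rule and matcher could look like. All the type names and the example rule are hypothetical — this is pre-KEP, and only the metric name apiserver_categorized_request_total comes from the discussion above.]

```go
// Hypothetical sketch of static request categorization; not from any KEP.
package main

import "fmt"

// RequestInfo holds the request attributes a rule can match on.
type RequestInfo struct {
	Namespace string
	User      string
	Resource  string
}

// CategoryRule labels any request matching one of its include patterns.
type CategoryRule struct {
	Name     string
	Includes []func(RequestInfo) bool
}

// categorize returns the first matching category, or "other".
func categorize(rules []CategoryRule, req RequestInfo) string {
	for _, rule := range rules {
		for _, match := range rule.Includes {
			if match(req) {
				return rule.Name
			}
		}
	}
	return "other"
}

func main() {
	system := CategoryRule{
		Name: "system",
		Includes: []func(RequestInfo) bool{
			func(r RequestInfo) bool { return r.Namespace == "kube-system" },
			func(r RequestInfo) bool { return r.User == "system:kube-controller-manager" },
		},
	}
	req := RequestInfo{Namespace: "kube-system", User: "system:node:foo"}
	// In the proposal, this category would feed something like
	// apiserver_categorized_request_total{category="system"}.
	fmt.Println(categorize([]CategoryRule{system}, req)) // prints "system"
}
```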
A
[inaudible]
B
Yeah, that's a good argument against the dynamic version. I think with the static version the assumption is that every cluster that I care about would have the same definition of the metrics categorization.
A
So I have — like, I've thought about this problem a lot since you brought it up, actually quite a while ago, in the SIG Instrumentation Slack channel. So I've thought about it a lot, and I think I would be okay with this if it was not dynamic, and if it were exposed on a different scraping endpoint which could be enabled or disabled by a command-line flag on the API server — so, basically, enable-high-cardinality-metrics or something, and then you can enable them.
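[As a rough illustration of that suggestion — assuming plain prometheus/client_golang rather than the API server's actual metrics machinery, and with a made-up flag name and metric — the expensive series could live in their own registry behind an opt-in endpoint:]

```go
// Sketch only: high-cardinality metrics in a separate registry, served
// on a separate path, gated by a flag. Flag and metric names are illustrative.
package main

import (
	"flag"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var enableHighCardinality = flag.Bool("enable-high-cardinality-metrics", false,
	"serve the expensive metrics endpoint")

func main() {
	flag.Parse()

	// Cheap metrics stay on the default registry at /metrics.
	http.Handle("/metrics", promhttp.Handler())

	if *enableHighCardinality {
		highCard := prometheus.NewRegistry()
		requests := prometheus.NewCounterVec(
			prometheus.CounterOpts{
				Name: "apiserver_request_by_namespace_total",
				Help: "Requests broken down by namespace (expensive).",
			},
			[]string{"namespace", "code"},
		)
		highCard.MustRegister(requests)
		// Expensive metrics are only scrapeable when explicitly enabled.
		http.Handle("/metrics/high-cardinality",
			promhttp.HandlerFor(highCard, promhttp.HandlerOpts{}))
	}

	http.ListenAndServe(":8080", nil)
}
```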
B
So it doesn't have namespace — the only extra label dimension being added here is category, and category is a statically defined string. So in the example here there's a category called system, and that includes the kube-system namespace, but the namespace itself isn't being included as a label. A request that's in the kube-system namespace just gets labeled system, and it's just that system piece that—
A
Yeah, yeah, and I'm saying even this is higher cardinality, because if you include these three or four types of categories, that's basically 3x-ing the cardinality of the apiserver request metrics, which are already basically a third of our total metric volume. Yeah, it's basically like—
C
But I think that, as long as we don't have an infinite number of values for this label, it should be fine, because most of the problems we've run into so far have been due to cardinality explosion — the actual footprint hasn't been that important. People have been reporting that the latency metric is really heavy, that it actually consumes a lot of memory on their backend, but we were always able to tell them that, well, this—
C
And yeah — even then, it's such an important metric that you need to spend that much memory to keep it, because it will tell you the health of your cluster, essentially. So yeah, I think it's reasonable, and I have many use cases for that kind of usage — for example, here we are creating groups with namespace and user agent, or user group.
C
There are other use cases where it could be good, like adding new labels for an infinite number of namespaces — say you group namespaces together and put a value on them.
A
Exactly — that's exactly what I'm saying, that's exactly what I'm saying! So, if it's on a separate endpoint... like, we have this dynamic cardinality-limiter thing, basically because people have exploded metrics so many times: we've enabled this flag which allows you to bound a label to a certain known set of values, and then to cluster those and put everything else into another group.
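[The bounding logic being described might look like the sketch below. If memory serves, kube-apiserver exposes a mechanism along these lines via its --allow-metric-labels flag; the helper here only illustrates the idea and is not that implementation.]

```go
// Sketch: bound a label to a known allow-list and fold everything else
// into a single catch-all bucket, so the series count stays constant.
package main

import "fmt"

// boundLabel returns the value unchanged if it is allowed, otherwise a
// single catch-all value, fixing the label's cardinality up front.
func boundLabel(value string, allowed map[string]struct{}) string {
	if _, ok := allowed[value]; ok {
		return value
	}
	return "unexpected"
}

func main() {
	allowed := map[string]struct{}{
		"kube-system": {},
		"default":     {},
	}
	fmt.Println(boundLabel("kube-system", allowed)) // kube-system
	fmt.Println(boundLabel("tenant-4711", allowed)) // unexpected
}
```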
A
That
way
the
memory
ends
up
getting
constrained
which,
in
conjunction
with
this,
gives
you
something
very
similar
to
this
right
without
actually
having
to
implement
much,
except
maybe
duplicating
this
adding
a
namespace
label
and
putting
it
on
a
separate,
endpoint
and,
and
so
so,
basically
I
I'm,
suggesting
something
simpler
than
this
right,
which
is
you
basically
want
to
know.
The
namespace
and
you're
worried
about
the
namespace,
Cardinale
and
I'm?
Actually,
an
user.
A
I mean, look, I'm okay with even completely unbounded stuff in an external, toggleable metrics endpoint, because you're only enabling this for short durations of time, right? So your concerns about, like, OOMing and crash-looping the API server go out the window, because you're doing this on purpose.
B
Well, I mean, let me go into a little more depth on the specific use case that I'm looking at here. I'd like to be able to come up with an SLO across our fleet of clusters for: what is the error rate on system requests?
B
I think it would be useful to have information like which resource was failing, which verb. But I suppose, if we just had the bare-minimum metric of category and code, then we could pull that information out of logs.
A
And you could also look at the other metric, apiserver_request_total, to look at the distribution of requests, and then basically do an inference, right? For your use case I would almost just go with the simpler metric and bypass all this stuff, because this is too complicated. It sounds like you can achieve the thing that you want with the simpler metric.
C
We have a regex for— on the thing that we have to bound the label values: does it support regex today? No?
A
But yeah, so I can get you everything except the user, and the user would be a problem. Even if we did something more complicated, it would still be a problem, right, because you're going to have card— no, users are unbounded: there's no practical cap on users. For namespaces there is, because of etcd and the size limits and whatever, and, like, discovery falling over if you have too many namespaces — namespaces are practically bounded.
A
I mean, you can do that with a regular metric, though. You can have a metric called— okay, what do you call this thing that you're trying to measure? Like, what—
B
[inaudible]

A
To measure system error rate — system error rate, okay. System error rate: you have a metric, and then basically in that metric you pass in the user, you pass in the namespace, and you pass in the error code, right, if there is an error. And then, in your function, you take the user and you put it into a group. That requires none of this, right — it requires no request-metrics categorization manifest. You don't need that; you can just write that in code, in your metric.
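[A sketch of that "write it in code" approach, with a hypothetical metric name and grouping rules: the unbounded user string never becomes a label value, only the small fixed set of groups does.]

```go
// Sketch: map the raw user to a bounded group inline, at recording time.
package main

import (
	"strings"

	"github.com/prometheus/client_golang/prometheus"
)

var systemErrors = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "system_error_rate_total",
		Help: "Errors grouped by a small, fixed set of requester groups.",
	},
	[]string{"group", "code"},
)

// groupForUser collapses the unbounded user space into a few buckets.
func groupForUser(user string) string {
	if strings.HasPrefix(user, "system:") {
		return "system"
	}
	return "workload"
}

// recordError is the "body" being described: callers pass the raw user
// and code, and only the bounded group ever becomes a label value.
func recordError(user, code string) {
	systemErrors.WithLabelValues(groupForUser(user), code).Inc()
}

func main() {
	prometheus.MustRegister(systemErrors)
	recordError("system:kube-scheduler", "500")
	recordError("alice@example.com", "429")
}
```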
A
P&F — P&F is a valid thing, right? There are the default P&F settings upstream, so I would imagine this would be completely configurable, right? Like, you could read your P&F settings and then throw the stuff into the— you could call your P&F settings and then put stuff in the right groups. Right? No? I mean—
B
Well, funny you should mention that — that's what we're thinking about using as a proxy for this right now. I don't think we have a priority-and-fairness metric that has the error code attached to it, so we might be able to get away with just doing that. But what we're looking at right now is just aggregating a metric out of the log— the HTTP logs from the—
A
I mean, there would have to be a really compelling use case for it, and I would support building it if there was a really compelling use case where we could not solve the problem otherwise, right? But it sounds like we can probably solve this otherwise, given our conversation. Yeah — does that sort of resolve the thing for you, or—
A
Yeah, and if you want, I can show you what I mean. So in the body where you're recording the metric — like, "record this metric", you know, you pass in these parameters and then you have this call to record the metric — before that body, right, you can just make a call to the P&F settings, get all of the user groups which are in workload status, and then do, like I said, inclusion, and put it into that workload status.
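[A rough sketch of what "make a call to the P&F settings" could look like with client-go. It is deliberately partial — real APF matching also covers groups, service accounts, and rule predicates, and the flowcontrol API version you use depends on your cluster.]

```go
// Sketch: look up which FlowSchema names a user and use its priority
// level (e.g. "workload-low") as the bounded group label for a metric.
package main

import (
	"context"
	"fmt"

	flowcontrolv1beta3 "k8s.io/api/flowcontrol/v1beta3"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// priorityLevelFor returns the priority level of the first FlowSchema
// whose subjects name this user directly.
func priorityLevelFor(ctx context.Context, cs kubernetes.Interface, user string) (string, error) {
	schemas, err := cs.FlowcontrolV1beta3().FlowSchemas().List(ctx, metav1.ListOptions{})
	if err != nil {
		return "", err
	}
	for _, fs := range schemas.Items {
		for _, rule := range fs.Spec.Rules {
			for _, subj := range rule.Subjects {
				if subj.Kind == flowcontrolv1beta3.SubjectKindUser &&
					subj.User != nil && subj.User.Name == user {
					return fs.Spec.PriorityLevelConfiguration.Name, nil
				}
			}
		}
	}
	return "catch-all", nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	level, _ := priorityLevelFor(context.Background(), cs, "system:kube-scheduler")
	fmt.Println(level) // use `level` as the metric's bounded group label
}
```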
A
Yeah, I mean, I don't think we can do this with API requests — we definitely can't add namespace to it, because the cardinality is so high and unstable anyway, yeah.
A
The scalability tests keep creating and deleting namespaces. So basically, what happens when you include a namespace label is that the scalability tests tend to blow up the API server because of the namespace churn. So we might not actually ever be able to use namespace as a label value.
A
Okay, cool, thanks — thank you, Tim! Yeah, yeah, thanks, Tim. Okay, yeah, the next item is the mentorship plan. Okay, yeah, sure, let's open up the mentorship plan. So we have a bunch of volunteers for mentorship.
A
And we have several people interested, so how should we do assignments?
A
How about this, since we have six minutes left: if you guys are on— and you guys have—
A
If the people who want a mentor would like to, please fill out your desired mentor; if there is a slot, then you can take it. So, I guess it'll be first come, first served, and that way we can distribute them.
A
David — David should be out, because David is— hold on. How do I do—
A
Okay, there we go. And I feel like Elena might only be able to handle one, but I don't know — she wrote this; she's out today too. But yeah, so if you guys are interested in a mentor, please fill this out, and then over the course of the next week—
A
Well, that's basically it for today, and we have about three minutes left. Is there anything else anyone wants to talk about?