Description
Andrew shows Bob how we automatically generate recording rules from high-cardinality metrics, and how to include a new feature_category label in that.
A: So, okay, just to explain the background for the video: we're looking to add Bob's new feature_category attribution onto our error budgets page, which is here. At the moment it only gives us the attribution for Sidekiq queues, and what would be really nice is if we could extend this so that it covered the web as well. So we were discussing ways to do this, and the most obvious way is to just use the metrics that have got these new...
A: That's interesting. Oh, not everything has it, right? Presumably not.
B: If I remember correctly, I haven't merged anything yet, but I saw Sean added a bunch. Only the merge request controller has attribution right now, if I remember correctly.
A: Okay, so this is what we've got at the moment, right? We've got this feature category over here: source code management. That's awesome, but the problem is that if we had to take this metric and put it into here, especially since this has got a seven-day range by default, it'll just time out. Basically anything with these metrics times out at the moment, and so we have to use a recording rule.
A: Now, there probably are some recording rules that already have these labels in them, kind of by accident, because when you do aggregation in Prometheus you can either say "sum by" and give it a bunch of labels, or you can say "sum without" and then it drops the listed labels.
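A minimal sketch of the two aggregation styles, with made-up metric, label, and rule names:

```jsonnet
// Two illustrative recording rules (hypothetical names throughout).
// 'sum by' keeps only the listed labels; 'sum without' keeps everything
// except the listed labels.
{
  rules: [
    {
      record: 'component:http_requests:rate_5m',
      expr: 'sum by (environment, type, route) (rate(http_requests_total[5m]))',
    },
    {
      record: 'component:http_requests_keep_rest:rate_5m',
      expr: 'sum without (fqdn, instance, pod) (rate(http_requests_total[5m]))',
    },
  ],
}
```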
A: But what I was suggesting to Bob was that we take the significant labels and use those as a way of doing this. So, just to explain what significant labels are: on all of the service overview dashboards (like this one, the service overview dashboard for the web) you can see each of the SLIs here.
A: We've got the load balancer, Puma, Workhorse, all the kinds of things that we're monitoring for the web servers, and then we have these collapsed rows, one for each component. If we open one up, you'll see that inside we have even more detail than we have outside, and for each row...
A: Basically, we break the data down by some label. In this case this is the non-aggregated version, and then we've got per fully qualified domain name and per method, so we're breaking these metrics down by these different labels. And if you go into Workhorse you'll see we've got different labels (or maybe they're the same), so yeah, for Workhorse we've got per fully qualified domain name and per route, and the way that we do that is in...
A: ...You probably don't care about job; you kind of know what the job is. So we say the labels that we're most interested in, as operators or people running the system, are fully qualified domain name, which we'll slowly transition away from as we move to Kubernetes, because we don't have FQDNs anymore, and then the other one, which is super useful, is route: Workhorse has got about ten different ways that it handles things, and that's on the route label. So that over there is what drives these breakdowns here.
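As a rough sketch (the field names follow the metrics catalog's conventions, but this exact definition and the rateMetric stub are illustrative, not the real code), significant labels sit on the SLI definition like this:

```jsonnet
// Stub standing in for the catalog's rateMetric helper (assumed shape).
local rateMetric(counter, selector) = { kind: 'rate', counter: counter, selector: selector };

{
  workhorse: {
    requestRate: rateMetric(
      counter='gitlab_workhorse_http_requests_total',
      selector={ type: 'web' }
    ),
    // The labels operators care about most; these drive the per-label
    // detail rows and, later, the generated recording rules.
    significantLabels: ['fqdn', 'route'],
  },
}
```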
A: So you basically get the metrics aggregated by each of those labels inside this detail row. By default you don't see that, but if you're in an incident you might open this up and say: oh, all the errors are on this node, or all the errors are on this route. It's a way to speed things up. So you're probably still wondering...
A: If you have any questions, just shout. So, I don't know if you've seen this... okay, you'd seen some of it before, okay. Basically, what we used to do is we had the metrics catalog, and some of the metrics would rely on recording rules, but those recording rules were manually maintained.
A
All
that
add
another
label,
and
there
were
all
these
different
things,
and
so
I
was
kind
of
getting
quite
frustrated
and
also
the
other
thing
that
might
happen
is
that
we
would
need
another
label
and
it
wouldn't
be
on
these
metrics.
And
so
we
built
this
small
thing
called
recording
rule
metrics
and
we
we
use
it
for
sidekick,
because
sidekick
also
has
very
high
cardinality.
A
So
what
it
says
is
in
the
metrics
catalog
whenever
you
are
dealing
with
these
metrics
metrics
on
on
on
these
names,
try
and
use
a
recording
rule
instead
and
then,
when
we're
generating
the
the
service
level
metrics.
A
So,
to
kind
of
give
you
an
example
like
if
I
go
to
this
sidekick
jobs
fail
total
over
here
and
then
I
add
for
the
selector
I
add
like
whatever
wombats
yes
right,
kind
of
in
the
old
world.
That
would
break
because
this
recording
rule
well,
the
recording
rule
that
we'll
be
using
would
also
need
to
have
that
on
it
and
and
like
everything
you
know,
you'd
have
to
manually
go
and
update
something
else,
and
that
was
error-prone
and
now
what
the
metrics
catalog
does
is.
A: ...when it generates the recording rule for sidekiq_jobs_failed_total, one of the things it filters on is this wombat label, and it will automatically add that to the recording rule. Likewise, if I remove it from here, it automatically removes it from the recording rule that gets generated. So if you go and look at the auto-generated rules...
A: ...you can see here, this is the recording rule that gets generated. It's looked through the entire metrics catalog and it's seen the labels that we use for that metric: environment, feature_category (that's useful), and le, you know, less-than-or-equal, for the bucket. This recording rule is totally automatically generated from that definition up here. And we also use these in... I'm pretty sure, if you go into...
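The generated rule has roughly this shape; the record name, selector and grouping here are illustrative rather than the exact output:

```jsonnet
// Hypothetical shape of an auto-generated (intermediate) recording rule for
// a Sidekiq histogram; 'le' is kept so histogram_quantile() still works.
{
  record: 'sli_aggregations:sidekiq_jobs_completion_seconds_bucket:rate_5m',
  expr: |||
    sum by (env, environment, feature_category, queue, le) (
      rate(sidekiq_jobs_completion_seconds_bucket{env="gprd"}[5m])
    )
  |||,
}
```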
A: ...it's part of it. Well, not the JSON, but the recording rules; I'll show you in a second. But here you can see I've used that recording rule. So when I build that dashboard, I don't use this thing; I actually just say, you know, Sidekiq jobs completion total, and then there's kind of a pipeline, and in that pipeline it says...
A: ...well, the labels that were requested were x, y and z, and the aggregations were these, and therefore it matches the recording rule, so I'll use the recording rule. So it's done the substitution for the recording rule. Oh, and the other thing that's really important is that this has got a rate on it, right?
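Sketching that substitution with hypothetical names: the dashboard asks for a query over the raw metric, and the pipeline emits one that reads from the intermediate rule instead, with the rate() already applied:

```jsonnet
// Before: what the dashboard definition asks for (hypothetical names).
local requested = 'sum by (environment, queue) (rate(sidekiq_jobs_failed_total{type="sidekiq"}[5m]))';
// After: what gets emitted once a matching rule is found; the rate() and
// range selector are gone, because the rule has already applied them.
local emitted = 'sum by (environment, queue) (sli_aggregations:sidekiq_jobs_failed_total:rate_5m{type="sidekiq"})';

{ requested: requested, emitted: emitted }
```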
A: Yeah, compared to the original query. And so (I'm still adding to the stack, but it'll start unwinding quite soon) the reason why this is important is that when it's deciding what labels to use for a recording rule, one of the things it looks at is the significant labels. Any label that gets added to your significant labels gets included in the recording rule automatically.
B: So this is how it would get added to the details, to that part on the service dashboard?
A: You open up that detail page, and what you see is that all the errors are coming from a certain feature category, and then you know who to speak to. And going back to the Sidekiq example, where we've already sort of done this (let's just go back here), you'll see that we do do that with these component details, per feature category. But these are... hey, here we go: there's a bug which causes these latency ones not to work.
A
It's
quite
a
complicated
bug
to
resolve
as
well.
So
it
just
needs
a
lot
of
time,
but
there
you
can
see
you
know
the
spikes
are
on
a
certain
error
category,
a
feature
category
I
don't
know.
What's
happened
to
my
grafana,
looks
like
it's
crashed,
hello.
A: Well, I don't know what's going on there, but hopefully it'll come back. So what I was going to say is: if we go back to the web over here and we take a look at Puma, for the error rate we're using... does that have the label on it? I don't think so. I don't know where... okay, well, then that's...
A: Yeah, so if we go to here, and we go to Puma, and we just add here... what's the status of moving these...
A: ...the status of moving off the histogram and onto the... not the histogram, onto the...
B: While you're running that, also show us where you're aggregating all the information to generate a recording rule.
A: It's also one of those things where people say, you know, the code behind the metrics catalog is really complicated, but it does do stuff like this, which I'm really happy about, and it ultimately leads to less manual work and less maintaining three different things, which I hate doing. So we have this thing called the recording rule registry, a name I've been moving away from, because "recording rules" is a super overused term and it's quite complicated to understand what we're talking about.
A: So I've been moving away from calling them recording rules to calling them intermediate recording rules, because what we have is a metric, and then we have an intermediate recording rule, which is not the end game (it's kind of a halfway house), and then from that we generate the SLIs and everything else. And so I might call this something... and I think I've got that name somewhere. No, I don't.
A
There
are
certain
places
where
I
call
it
the
intermediate
recording
rules,
but
that's
kind
of
beside
the
point.
They're
kind
of
temporary
right,
they're
kind
of
like
a
a
pre-processing
step
before
the
next
step,
and
so
what
we
do
here
is
it's
got
on
its
public
methods.
It's
got.
This
thing
called
resolve,
recording
rule
four,
and
so
you
give
it
the
type
of
aggregation,
you're
doing
the
labels
that
you're
aggregating
over
and
the
the
function
that
you're
using
is
so
kind
of
all.
A: ...all the queries have to be of a similar form, which is basically a function applied to a range vector and then aggregated over a bunch of labels, which is like 90% of what we do, right? Obviously there are some clever things, but a lot of it is that, and so this only deals with those kinds of functions, and I think that's a reasonable thing to do. But basically it asks: what's the aggregation function, what are the labels, what's...
A: ...what I call a range vector function, and what's the interval (because, like I said, we have to do each of those), and then what's the selector, what's the query that you're running.
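Put together, a lookup might look something like this; the method name is the one shown on screen, but the argument names and the stub registry are guesses, not the real API:

```jsonnet
// Stub registry so the sketch is self-contained; the real one lives in the
// runbooks repo and holds far more state.
local registry = {
  resolveRecordingRuleFor(aggregationFunction, aggregationLabels, rangeVectorFunction, rangeInterval, metricName, selector)::
    'sli_aggregations:%s:%s_%s' % [metricName, rangeVectorFunction, rangeInterval],
};

registry.resolveRecordingRuleFor(
  aggregationFunction='sum',
  aggregationLabels=['environment', 'feature_category', 'queue'],
  rangeVectorFunction='rate',
  rangeInterval='5m',
  metricName='sidekiq_jobs_failed_total',
  selector={ env: 'gprd' }
)
```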
A: I don't like the one-minute ones; they're just noisy and they don't have enough data in them, so I think those are going to go. We pretty much don't have much left on that. Anyway, what this does is it'll go and look things up in an internal registry, which is basically a big hash...
A: ...mostly keyed by the metric name. But then it also validates that the labels you're using in the selector and the aggregation are a subset of the labels it's using on the recording rule (that they're contained within them), and that, obviously, the range vector matches and the aggregation function matches.
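A minimal sketch of that matching check, with assumed field names rather than the actual implementation:

```jsonnet
// True when every element of 'needed' appears in 'available'.
local isSubset(needed, available) =
  std.setDiff(std.set(needed), std.set(available)) == [];

{
  // A query can be served by a rule when the functions and interval line up
  // and every label the query needs was kept by the rule.
  matches(rule, query)::
    rule.metricName == query.metricName
    && rule.rangeVectorFunction == query.rangeVectorFunction
    && rule.rangeInterval == query.rangeInterval
    && rule.aggregationFunction == query.aggregationFunction
    && isSubset(query.aggregationLabels + std.objectFields(query.selector), rule.labels),
}
```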
A: Service metric rates, right. In the metrics catalog we have this thing called a rate metric, and that's how we model these for every service. Yeah, exactly. And we intercept it quite low down: when, for a definition of a rate metric, you basically say "give me the PromQL for this definition", at that point it goes to the registry.
A: The thing in this code that I don't like, that I want to fix, is this: this is a kind of library that I want to extract from the runbooks and have as a standalone library, but it has a dependency on the registry, which is part of the runbooks repo.
A
So
it's
kind
of
like
an
inverse
dependency
that
I
need
to
get
rid
of
because
yeah,
it's
the
the
registry
of
where
of
all
those
things
is
part
of
the
of
the
implementation
where
this
and
this
library
depends
on
it,
so
there's
kind
of
a
circular.
Well,
there
is
a
circular
dependency
there,
which
is
horrible,
but
I
can
fix
it,
but
it's
just
more
work
and
then
the
other
part
of
it,
which
is
kind
of
required
reading,
is
back
in
the
metrics
catalog.
We
do
this
thing
where
we
collect
all
the.
A: ...sorry, go through all the definitions. I step through everything as a pre-step, before I start generating anything, and I build those up. So I start here, I say collect metrics and labels, and then I go through each service in the metrics catalog, and then through each function on that, and I say: hey, give me the labels, the aggregations... yeah, yeah. And then obviously, for some of them...
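A toy version of that collection pre-step, with the shapes assumed from the description rather than taken from the real code:

```jsonnet
// Toy catalog: two services, each SLI names a metric and the labels it wants.
local services = [
  { name: 'web', slis: [{ metric: 'http_requests_total', labels: ['type', 'route'] }] },
  { name: 'sidekiq', slis: [{ metric: 'sidekiq_jobs_failed_total', labels: ['queue', 'feature_category'] }] },
];

// Fold every SLI into a map of metric name -> labels requested anywhere.
// (Labels may repeat if several SLIs share a metric; std.set would dedupe.)
local collectMetricsAndLabels(svcs) =
  std.foldl(
    function(acc, sli) acc + { [sli.metric]+: sli.labels },
    std.flattenArrays([s.slis for s in svcs]),
    {}
  );

collectMetricsAndLabels(services)
```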
A: Yeah. One thing to note about this is that we only use selectors that are hashes, because I didn't want to go through the bother of writing a parser for PromQL selectors. So if you look at these, the way that I do selectors now, it's much easier for me to parse out the keys.
A: I mean, there are lots of reasons why I prefer it: you can union them together, and you can manipulate them much more easily. So if you want to use this, the components need to use hash selectors rather than raw PromQL. If we didn't want that, we could write a PromQL parser in Jsonnet, but let's just not do that. So that was that. Oh yeah, let's go take a look at what that code looks like; I'm kind of interested to see if it worked. Yeah, see.
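For example, a hypothetical selector hash; the { re: ... } matcher convention here is an assumption about the shape, not the exact library API:

```jsonnet
// A selector as a Jsonnet hash: keys are label names, values are matchers.
// Rendered to PromQL this might read: {env="gprd", type=~"web|api"}
local baseSelector = { env: 'gprd' };

// Hashes union and override with plain object merge, which is far easier
// than splicing PromQL strings together.
local webSelector = baseSelector { type: { re: 'web|api' } };

webSelector
```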
B: If you add it now, it will all just be empty, and that's fine, and then, as soon as you start emitting those labels on those metrics...
A: Yeah, it'll just pop up. The one thing... oh, you know, one reason why I don't want to do that, and would rather wait until it's there, is that we can actually cause more cardinality explosion. With these recording rules, you want to take something that's got really big cardinality and bring it down to something much smaller, where it's manageable, and then what's nice is that all the other recording rules are using that.
A: So we take a lot of load off the Prometheus server. Instead of having to go through 100,000 or 250,000 series every 15 seconds, or every minute, it's going through a much smaller set. So what I like to do if I ever change those is go and check the definitions that have been created...
A: ...you know, like these ones, and run them, and then just see how they look and how much cardinality they have. And if we... oh, what? This wasn't it.
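One quick way to eyeball that, assuming the generated rule name from earlier, is to compare series counts before and after:

```jsonnet
// Hypothetical sanity-check queries: compare the series count of the raw
// metric against the series count of the generated rule.
{
  raw_series: 'count(sidekiq_jobs_completion_seconds_bucket)',
  rule_series: 'count(sli_aggregations:sidekiq_jobs_completion_seconds_bucket:rate_5m)',
}
```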
A: Right now it would be that. Yes, it would, it would. But that's probably not going to work then, because it's going to add several hundred... how many feature categories are there? Yeah, something like that.
A: ...itself, yeah. So that might be the best way for us to do it, and then, you know, there are all the added benefits: we automatically get it in the pull-downs...
A: ...and everything else. We're also just saying, more generally, that this is a useful label, and maybe in future we can auto-generate queries for SREs, you know, when there's an incident, or even just auto-generate documentation.
B: Whatever... like, it would be cool. Right now, when we need someone, we go through all engineers; when we bring in on-call, we go to all of them. And then maybe with the feature category we'd have somebody specific. Or maybe not.
A: Yeah, and that's where we've already got some of those alerting rules that will go to the pages channel or the Gitaly channel. At the moment it's only driven by service, so you can only do it for services that are very obviously correlated with a team. But there's actually an issue about that, about routing the alerts on the feature category, which is awesome.
A: This? Yeah, it's not hard, but the problem is it's going to break a bunch of stuff when we do it, so it requires caution.
B: Yeah, I will get to that today. And then, what do we need to do? How do you imagine it, if we had feature_category as a significant label inside the Puma service?

A: Yeah, we'd have a new row there that shows the rates per feature category.
A: We can do that. So I think, as a first step, the first iteration, we don't aggregate up to a single feature category score. We kind of have: here's your information for Sidekiq, and here's your information for web, or not necessarily web, but HTTP-requesty stuff, and then we just kind of duplicate what we've got on the error budgets. I can share my screen if it makes it easier. But we just kind of have the same... I mean, that's not really used.
A
I
think
ultimately,
where
we
want
to
go,
is
we
want
to
kind
of
just
have
an
aggregated?
You
know
the
everything
rolled
up
to
the
level
of
a
feature
category
and
it's
kind
of
like
your
overall
score.
B: It should be possible: if I was in the Create: Source Code group, I would go there and see, like, yeah, we're doing well. And if we're not doing well, I want to be able to expand and see where we're losing.
A: This is... now I'm just spinning off a little bit, but I'll just mention it: in the past I also thought it might be useful to generate a dashboard per feature category, one that's got each feature category's, or maybe each team's, information on it, which would be almost like their own dashboard in Grafana. And we can totally do that, because we've got all the mappings: we've got the stages YAML, we've now got the feature categories, and we have a map, yeah.
A: So that's all we need, really, and then we could start coming up with a thing where... I'll give you an example of why I thought about this. If you look at the stuff Dylan and his team are doing with search, they're adding a lot of stuff to the web dashboard, I think it is, and I'm really happy that they're doing that; it should be encouraged. But really, they're the only ones that are ever going to use that stuff. So having a dashboard where there's a whole bunch of auto-generated stuff, like here's your feature category stuff, and then they can add their own stuff into that dashboard further down...
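A toy sketch of that idea; the category list and the dashboardFor helper are entirely made up:

```jsonnet
// Generate one dashboard per feature category, leaving room for teams to
// append their own panels after the auto-generated ones.
local featureCategories = ['source_code_management', 'global_search'];

local dashboardFor(category) = {
  title: 'Feature category: %s' % category,
  panels: [],  // auto-generated panels would go here; team panels follow
};

{
  ['feature-category-%s.json' % category]: dashboardFor(category)
  for category in featureCategories
}
```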
A: I suspect that the numbers from the request ones are going to be much, much higher in most cases, right? And so, if you're aggregating, you've got to do it in a careful way, because a plain average is probably not that good, but then with a weighted average based on number of requests you'll totally drown out everything from Sidekiq, because maybe you get 100 times more requests on the web than you do in Sidekiq, so your weighting will be vastly skewed in that direction.
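To put illustrative numbers on that: with 100 times the web volume, a request-weighted score barely notices a bad Sidekiq day.

```latex
\text{score} = \frac{w_{\mathrm{web}}\, a_{\mathrm{web}} + w_{\mathrm{sidekiq}}\, a_{\mathrm{sidekiq}}}{w_{\mathrm{web}} + w_{\mathrm{sidekiq}}}
             = \frac{100 \cdot 0.999 + 1 \cdot 0.90}{101} \approx 0.998
```

So even with Sidekiq at 90% for the day, the combined score stays near 99.8%.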
A: That's why I don't really have a story for that, so I think it's better just to keep them separate until we can figure out how to, yeah, how to...
A: Yeah, because giving them 50/50 weighting also feels wrong, but then just wiping out the low-volume stuff is probably wrong as well. Yeah.