From YouTube: Andrew explains general metrics to the Delivery team
Description
Andrew N. talks about general metrics with Alessio C., Mayra C. and Marin J.
A: This week has been a total... it's recording, isn't it? So, we did great. This week has been quite something, with the incident that we've got ongoing, and I'm tired. And I mean, I haven't even looked at this talk. I'm really sorry; the last time I looked at it was about a month ago, and I was hoping to get some practice in today, but we've been fighting fires for a solid week and I haven't had a chance.
Ah, there we go... hey, it's back. Cool. So, this actually started off as something else: I dropped it, and it ended up as a talk for Monitorama, and it's been sitting around since then. I was talking about it the other day and we were discussing how we do metrics on GitLab.com, and I was talking about the general metrics that we've got. They've been around for about six months. I use them a lot, but generally in the company they're not really used very much.
I think the main problem is that a lot of people don't really understand them. But certainly in the last week they've been the kind of primary signal that things are really slow, and a lot of the other metrics that we have haven't been alerting. We don't see a lot of alerts saying "GitLab is really slow at the moment", whereas these ones have been alerting. So Marin suggested that I give you guys a practice run of the talk that I did.
That talk was for a different audience, so I'll probably just show you the slides. This is obviously the architecture as it was about a year ago, and it's changed a bit since then, but you guys probably have a better understanding than the original target audience. Excuse the terrible graphic, it never quite got finished, but this is what our monitoring infrastructure looks like. And I've just remembered that you can't really use an Apple mouse with Google Slides; it jumps around like crazy.
Yeah, me neither. That was done afterwards, so this is actually from before that, and I'm just trying to find some time to look at what they did. Yeah.
So what we've really got is two Grafanas: we've got the internal dashboards GitLab instance, and there's dashboards.gitlab.com, and obviously what we don't want is for the public-facing Grafana to crash our infrastructure if there's, say, a Hacker News article or something like that. So we have this little thing called Trickster, and Trickster is quite a clever little proxy that sits between the Prometheus infrastructure and Grafana. It does a whole bunch of caching, but it also does something else that's really clever.
It aligns the query time boundaries, so we all end up getting the same time series, and that's much more cacheable than everyone having a slightly different skew. I actually think we should probably put this in front of our private instance as well, particularly because it's a bit slow at the moment. You know, I'm never really that worried about the last 15 seconds; it's really only up to the last minute or so that we care about, so we might as well align on those boundaries. So anyway, Trickster is there, and that goes into a thing called Thanos, and that is kind of a view onto lots of... well, Thanos does lots of different things.
But in this context, Thanos Query is the view onto lots of underlying Prometheus servers. So, you know, we have Prometheus, Prometheus-app, Prometheus-db and so on, and they're all collecting different metrics, and sometimes it's much more useful to just have a single view of those. When you make a query to Thanos, it fans it out, queries all the underlying Prometheus instances, gets back the results and presents you with a single view. So it's a much better way of accessing Prometheus.
It's especially useful if you want to have the same dashboard and just be able to pull the environment down as prod or staging; you don't have to change the data source for that, and that's really, really useful. So we've got that. Then we've also got lots of different Prometheus servers and... I've just lost my headphones because they've run out of battery.
We've got lots and lots and lots of metrics, but that on its own isn't going to solve anything. You can have lots of metrics, but if you're not using them properly, then they're not very helpful, and that's a kind of irony which still exists.
Obviously this slide was polished up for Monitorama, where it's always "hey, everything's great", but I think this is still a major problem. And the other problem is that not only do we have a lot of metrics, we also have a lot of alerts, and the alerts that we've got, especially in the production channel... if you go look in there, and in PagerDuty, they're just firing all the time and nobody seems to be bothered about them.
But if you look at the Apdex scores over the last week and the performance slowdowns, we're not getting alerts on that. We're getting alerts on very specific production issues, which may or may not be critical, but we're not getting alerts on what users are actually being impacted by. The way people talk about this is that our alerting is based on causal alerts: such-and-such is the cause of a problem, and here's an alert for it. So the CPU on, say, gitaly-05 is at 98%; that's a cause, but...
Yeah, and really what this is about is that instead of using causal alerts, we're going to move to symptom-based alerts. So it's: what is the user experiencing? We're going to monitor that, instead of saying, hey, this specific thing is broken, go and fix it. So yeah, we still have all these alerts.
And every time something goes wrong, we'll quickly go and add another alert to Prometheus, and we'll say: well, when this goes above this threshold, then we'll send out another alert. And then, of course, if you look at the way GitLab works, everything's growing and growing and growing. So, for example, one of the alerts says: if we have 60 errors in Rails in a one-minute period, then we need to send out an alert.
Now, if you look back at when that alert was introduced, it made sense, because 60 was a reasonable number. But against the three or four thousand requests per second we do now, 60 is nothing, right? So we just have this whack-a-mole, where we stick an alert on something and that's supposed to fix it for next time, and so we have all of these alerts that are firing.
They don't mean anything, and we kind of ignore them, because they're just a little bit irritating, and we keep playing whack-a-mole: you hit this one and then another one pops up, then you hit that one. And so, even with all of these alerts, they're still not really forewarning us of problems. Often the first time we really know that things are going wrong is when someone goes into the production channel and says: hey guys, have you seen the Tanuki on GitLab.com?
And actually, I was talking to someone this morning, and the way we seem to treat things is: no Tanuki, no problem. If someone comes in and goes, hey, there's a Tanuki on the page, then okay, everyone scrambles. But if you've got five seconds of latency on every single web request, the sort of attitude is:
well, you know, we don't see a Tanuki yet, so everything's fine. And that's the wrong way of looking at things. So I was trying to model what our metrics look like at the moment. We have all these metrics coming in, then we create a bunch of recording rules, those recording rules get consumed by dashboards, and some of the recording rules get consumed by alerts.
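[For anyone who hasn't written one: a Prometheus recording rule just precomputes an expression on a schedule and stores the result as a new series. A minimal sketch, with illustrative names rather than the actual rules from the runbooks project:]

```yaml
groups:
  - name: example-recording-rules
    rules:
      # Precompute the per-job request rate so dashboards and alerts can
      # reuse this series instead of re-evaluating the raw expression.
      - record: job:http_requests:rate1m
        expr: sum by (job) (rate(http_requests_total[1m]))
```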
But the problem is, often what happens is that the dashboards stop working, or maybe we change the configuration of the dashboard, and so a lot of the recording rules don't actually seem to be there for any purpose other than as an optimization for a dashboard that existed five years ago and that nobody uses anymore. And it's the same with the alerts: we might remove the alert, but we don't remove the recording rules.
So we have this whole rabbit warren, this rat's nest, of things that are wired together, and it's pretty chaotic. I don't know if you've ever tried to audit the alerting rules that we have in the runbooks project; it's a little bit crazy. And what's actually even worse is that if you go look through our dashboards, most of them don't work. I would say more than 50% of our dashboards are just empty.
And when did they stop working? Nobody knows. What's actually scarier is that the same is true for alerts, right? Prometheus has a very clever and very sophisticated alerting system compared to many other alerting systems out there: you give it an expression, and if that expression resolves to data, then you have an alert and it fires. Which works really well.
But if an expression quietly stops resolving to anything, do we get alerted about that? There's no way to actually tell. Well, there is a way: you have to manually go and look at the rule, decompose it, find the underlying series and make sure they still exist. But nobody does that, and there's no CI for it. And so we have this big problem where we have a lot of alerts that don't mean anything, and a lot of dashboards that don't work.
general.
A
Let
the
general
metrics
was
to
try
to
like
sort
of
reset
the
board
a
little
bit
and
kind
of
like
kind
of
come
up
with
something
a
little
bit
more
saying,
and
so
I
came
up
with
this
idea
of
building
like
a
pipeline
where
we
have
alerts
coming
in
and
then
we
apply
the
same
steps
to
all
of
the
alerts
and
and
then
what
we
can
say
is
like
we
sort
of
normalize
all
the
lids
dance.
So
for
every
service
that
we
have
from
gitlab.
For every service that we have on GitLab.com, we say we want a single error-rate metric for that service, a single Apdex for that service, a single availability, and a single operation rate: how many operations per second are going through. Those are the four we've picked, and I'm very close to adding a fifth one, which is saturation. What saturation means depends on the service. But think about the saturation of Postgres:
one of the saturation parameters for Postgres would be that we can have a hundred concurrent connections into the Postgres server at any one time, and we're currently sitting at 70, so the saturation level would be 70%. PgBouncer is the same kind of thing. And for Unicorn, the saturation would be: we have a thousand Unicorn workers, and at the moment we're using about 700 of them.
So saturation is different for each kind of service, and each service would have multiple saturation parameters. Another one that's common across all services is the CPU on the service's nodes. And the aggregate saturation of a service isn't the average of, say, the CPU and the connection-pool usage; it's the max. Whatever the most saturated underlying metric is, that's your saturation. Sorry, I've gone off on a bit of a tangent on saturation.
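[A rough sketch of that idea: each saturation dimension is expressed as a 0-1 ratio, and the service's overall saturation is the worst of them rather than the average. The metric and rule names here are illustrative:]

```yaml
groups:
  - name: example-saturation
    rules:
      # Connection-pool saturation: connections in use vs. the configured limit.
      - record: service:db_connections:saturation
        expr: sum by (type) (db_connections_in_use) / sum by (type) (db_connections_max)
      # CPU saturation: average CPU utilisation across the service's fleet.
      - record: service:cpu:saturation
        expr: avg by (type) (instance_cpu_utilisation_ratio)
      # The aggregate saturation is the max of the component ratios,
      # i.e. whichever resource is closest to its limit.
      - record: service:saturation
        expr: 'max by (type) ({__name__=~"service:.+:saturation"})'
```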
But what we're trying to do is come up with a simple set of key metrics: we just want a few ways to tell the health of a service, instead of a thousand ways that we can't keep track of. And with this, the whole plan is that we have a common pipeline and we can tell when things are broken.
And the results are promising, because this stuff has really been working and I feel very confident in it. So obviously, the goals of the project were to reduce the incidents, the problems that we're seeing at the moment, to get early warning onto issues, and I really wanted it to be system-wide. This isn't about individual teams losing their dashboards.
We will still always have a PgBouncer dashboard and a Redis dashboard that have all sorts of minutiae about the way Redis works and its particular metrics. But this is about having an overall view of the system and being able to say: Redis is healthy right now, Gitaly is healthy right now, the web is healthy, but the API is not.
So we need to zoom in on the API, and maybe at that point you go down to the more detailed dashboards. It's a high-level view. As the application gets more complicated, we can't expect every single ops person to be an expert in every single component.
So this is just: we have the same set of health metrics for each service, and then you know at a high level whether or not they're working. This is the main dashboard for a service, and these are the four things I was talking about. We've got the Apdex, and Apdex, I sometimes think, is a little bit complicated for people to understand, because it's not a latency.
It's the percentage of the requests that come into that service that are satisfied within an acceptable amount of time. What's really nice about that, compared to latency, is that you can give me an Apdex for any service and I don't need to know anything about the service to tell whether it's good. Whereas if you give me a latency, if you say Redis is responding to requests in a hundred milliseconds,
you need to know whether that's good or bad in terms of Redis, and not everyone knows that. I know Redis very well, and I know that if Redis is taking 100 milliseconds to respond, your site is basically going to be down flat. But not everyone knows that. The nice thing about Apdex is that all you have to say is that the Apdex score for Redis is 99 percent, and that's fine; maybe one set of requests is a little bit slow. But then take Gitaly:
with Gitaly, it's like, oh well, a second is probably okay for some Gitaly calls. So they'll have totally different latencies, but the Apdex scores will be on the same scale. Because we convert from a time in milliseconds to a ratio, we can compare the Apdexes of all of our services and come up with a single value, and we can then start talking about an SLO for the Apdex.
We can say: for Gitaly, we want 99.9% of requests to complete within a satisfactory period. The other thing about Apdex that's really powerful is that for some requests, like a garbage collection in Gitaly, we literally don't care how long they take; a garbage collection could take half an hour, and we don't want those requests to mess up the whole score. So the Apdex for Gitaly specifically excludes garbage collection, because we just ignore it.
We throw it out, and we don't count it as part of the Apdex score, but obviously for other request types we do. Another example of that is on the API: we have long-polling requests coming in that nearly always take 50 seconds. So if you look at the p99 or the p95 value for our API requests in Workhorse, it's a flat line.
It's 50 seconds, and if you look at that you'd think, oh, there's something wrong with our metrics. But actually, because we have these long-polling requests that always take 50 seconds, it makes perfect sense. So the Apdex score for the API specifically excludes the long-polling requests, because they don't tell you anything useful. And that's another reason why using Apdex rather than a p95 is a much more powerful measure.
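[A sketch of how an Apdex ratio like that can be derived from an ordinary latency histogram, counting requests under the "satisfied" threshold in full and requests under the "tolerable" threshold as half. The metric name and the 1s/4s thresholds are illustrative, not the actual GitLab rules:]

```yaml
groups:
  - name: example-apdex
    rules:
      # Apdex = (satisfied + tolerating/2) / total, using the cumulative
      # histogram buckets: le="1" is satisfied, le="4" is satisfied + tolerating.
      - record: service:apdex:ratio
        expr: >
          (
            sum by (type) (rate(http_request_duration_seconds_bucket{le="1"}[1m]))
            +
            sum by (type) (rate(http_request_duration_seconds_bucket{le="4"}[1m]))
          )
          / 2
          /
          sum by (type) (rate(http_request_duration_seconds_count[1m]))
```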
Then the error ratio is basically just the number of errors that you see in a minute divided by the number of requests in that minute. The reason we use a ratio is exactly the same as before: talking about sixty errors in a minute doesn't make sense on its own. Sixty errors in a minute means something very different if there were a hundred requests than if there were a few hundred thousand.
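[As a sketch, with illustrative metric names, the error ratio can be recorded as its own series from the error rate and the request rate:]

```yaml
groups:
  - name: example-error-ratio
    rules:
      # Errors per second over the last minute, per service type.
      - record: service:errors:rate1m
        expr: 'sum by (type) (rate(http_requests_total{status=~"5.."}[1m]))'
      # All requests per second over the last minute.
      - record: service:requests:rate1m
        expr: sum by (type) (rate(http_requests_total[1m]))
      # The ratio: 60 errors a minute only means something relative to
      # how many requests there were.
      - record: service:error:ratio
        expr: service:errors:rate1m / service:requests:rate1m
```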
Availability is again a ratio, but this time it's a ratio of how many processes that are supposed to be serving that service are actually running, versus the number that should be. So if you've got lots of server restarts, or hangs, or something like that, it will dip down; during a deployment it'll go down and come back up. The way we calculate it is super easy, because Prometheus gives it to us built in: Prometheus has a metric called "up", and every time it tries to scrape a server, it sets "up" to 1 or 0 depending on whether it could get hold of the server. So all you have to do is figure out the ratio of the number of servers serving, say, the web or Gitaly, and you can say, well, ninety percent of them are up at the moment.
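[A sketch of that availability ratio using Prometheus's built-in up metric; the job and type labels are illustrative:]

```yaml
groups:
  - name: example-availability
    rules:
      # up is 1 when Prometheus could scrape the target and 0 when it
      # couldn't, so running / expected is simply the mean of up across
      # the fleet for that service.
      - record: service:availability:ratio
        expr: 'sum(up{job="gitlab-web"}) / count(up{job="gitlab-web"})'
        labels:
          type: web
```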
You know, at the moment, if a single web node hangs, we have lots of alerting going on. To me, in an ideal world, that machine should just be destroyed and a new one created automatically. Who cares? Everything is going to fail eventually, and having failing nodes is not a problem; it's only when you get close to critical redundancy that it's a problem. Now, obviously, for single points of failure like Gitaly, that bar would be much higher.
I'll show that to you, because I'm really happy with it. I presented it at Monitorama and lots of people said it was a good idea, so I'll show you that. Okay, so I've explained all of these now, and I described Apdex on the last slide as well. So, the pipeline that we use on GitLab.com with these metrics: the first thing we do is collect the metrics, and then for each service we adapt the metrics, put them into the pipeline, and aggregate them in specific ways. Let me check the next slide... yeah, okay, I'll explain this more. There are basically four steps, and this is just the overview.
The first step is that we collect the metrics, and that's exactly as we do at the moment, so there's no difference there; we just collect them into Prometheus. The second step is where we normalise the metrics: we turn them into Apdex scores, error ratios and availability ratios, and we aggregate them, and that's super important. When a metric comes into the system, it has an arbitrary number of labels on it. It might have the name of the host,
the name of the gRPC method, the status of the gRPC call, all the HTTP codes: a 401 for this, a 403 for that. And the problem is that with all of those, you have an effectively unbounded set of dimensions on your data, and it's very difficult to automate things unless you aggregate it down to a limited set. That's one of the things we do, and it's really important in this whole approach.
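[A sketch of that aggregation step: the raw series carries high-cardinality labels (host, gRPC method, status code), and the recording rule sums by a small fixed set of dimensions, throwing the rest away. The label and rule names roughly follow what was described but are illustrative:]

```yaml
groups:
  - name: example-aggregation
    rules:
      # The raw metric has labels like fqdn, grpc_method and grpc_code, an
      # effectively unbounded set. Summing by a fixed list of labels leaves
      # every service with series keyed only on environment / type / stage.
      - record: gitlab_service_ops:rate
        expr: >
          sum by (environment, type, stage) (
            rate(grpc_server_handled_total[1m])
          )
```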
Then, once we've got that, we can start building up statistics, and we can do anomaly detection, because we have a finite set of dimensions we can look over. And finally, with that, we can build alerts and dashboards that reuse the same metrics, and I'll show you that as well.
So this is... sorry, don't use an Apple mouse with Google Sheets and Google Slides. This is just another explanation of what I was talking about on the last page. We have all of the normal exporters coming in, and one of the things I really like about this approach is that we don't need to rebuild our application metrics. We don't need to say, okay, we're going to fork the HAProxy exporter and make it support Apdex, because ain't nobody got time for that.
What we'd rather do is take the existing exporters and adapt them to our system, and we have a whole bunch of recording rules that we use to take the existing metrics and turn them into what we need. So the recording rules do the Apdex, the error rate, the requests per second.
I've left the availability off this diagram, but we do that as well. Then, using the error rate and the request rate, we figure out the error ratio, which is obviously just one divided by the other, and from that we can start building up alerts. So we build an alert for the SLO on the error ratio, and we also had a warning for an Apdex-ratio anomaly, although actually we don't do that one any more.
Cool. So this is what we have coming in for one of our services: we just have this raw metric, and then what we do with the recording rules, in the middle here, is convert it into this metric. As you can see, the raw one has a whole bunch of arbitrary dimensions on it, which makes it really hard to work with, whereas what we have here is a metric with a fixed set of dimensions.
Well, the thing is, if you look at requests per second, it's a classic example of a normally distributed metric, and so you'll find that about 68% of your values lie within one standard deviation either side of your mean, and 95% lie within two standard deviations of the mean.
So if you take a value and ask how far it is from the mean, measured in standard deviations, that's called a z-score. And within three standard deviations you get about ninety-nine point seven percent of values. That's what we use to do anomaly detection. I've got a whole talk on this, so I won't go into too much detail, but that's how we can spot these outliers, and I built alerting on it.
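[A sketch of that z-score in PromQL, assuming the operation rate has already been aggregated into a single series per service as above; the window sizes and names are illustrative:]

```yaml
groups:
  - name: example-zscore
    rules:
      # Trailing mean and standard deviation of the ops rate.
      - record: gitlab_service_ops:rate:avg_over_time_1d
        expr: avg_over_time(gitlab_service_ops:rate[1d])
      - record: gitlab_service_ops:rate:stddev_over_time_1d
        expr: stddev_over_time(gitlab_service_ops:rate[1d])
      # z-score: how many standard deviations the current value is from the
      # mean; for a normally distributed metric, |z| > 3 is outside the
      # ~99.7% band and worth flagging as an outlier.
      - record: gitlab_service_ops:rate:zscore
        expr: >
          (gitlab_service_ops:rate - gitlab_service_ops:rate:avg_over_time_1d)
          /
          gitlab_service_ops:rate:stddev_over_time_1d
```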
It turns out that some of our metrics are not actually normally distributed, and that's why I've stopped using this for the error ratio and the Apdex score. But our requests-per-second rates are normally distributed, so I've carried on using it there. I don't have time to go into this properly, but this is how we do the anomaly detection on requests per second: basically, we use the seasonality in the data.
So what this means is that, with the data we collect for each service, we can plot it on a single set of dashboards, and this is the triage view that I use. For one of our services, we plot our Apdexes alongside one another here, and we plot our error ratios, requests per second and availability.
The point is that we've taken lots and lots of different data and normalised it down so that it's all on the same footing, and so we can compare like with like. When something goes really wrong, it's really obvious, because all the values are on the same scale.
The only one we can't do that with is the requests per second for each service, because that's not really possible. But what we can do with requests per second is plot how unusual that value is for this time of day. So we can say: well, it's a Thursday afternoon, so this request rate should be very high, but actually it's twice as high as it should be. And that's the z-score again.
Yeah, it'd be maybe plus 0.1, minus 0.1, with Redis as well.
So if we zoom in on this, you'll see... you can see here, this is the graph comparing the requests-per-second rate for all of our services. You can't really see much detail in it (sorry about the birds), but when you compare the standard deviations instead, you can see: hey, wait a minute,
this is a very unusual value; Redis was spiking up there. When you're just looking at the raw numbers here, it's not really clear. And when these values go up into the red, that's when we start sending out alerts on them, and we've got built-in alerting for all of this. I feel like I've spoken a lot at you guys, so if you've got questions, let me know. I do feel like this has been super rushed.
On the service alerts: there's a whole bunch of different alerts that go into that channel, and some of them are more important than others, but the ones that go in there are all at a per-service level. The really important ones... I'm just going to zoom in on a bit of this graph over here. The first one, which is the one that's been firing non-stop for the last week, is the alert for the Apdex score being below its SLO.
For each of our services we have a service level objective, and what we say is: we would like 99% of our web requests to complete within the allotted threshold for web requests. That dotted line over there is the target represented on the graph, and the yellow line is the actual value. When the actual value is underneath the target for five minutes, that's when you get an alert.
The same goes for the error ratio; these are the SLO targets. Here you can see our actual error rate is 0.01%, but our target error rate is 0.5%, so we haven't been getting alerts on that. If you zoom out to the last seven days, you can see, during this period over here... now, this turned out to be kind of an odd one.
What happened was that we enabled a bunch of machines and then turned them off, but we didn't turn them off fully, so they were failing their health checks and erroring and were in a bad state. So here we got a very clear signal that something was wrong and we could fix it. One thing that's quite nice about the SLOs is that each service has a different level. So this one on the right is 0.5 percent, but the SLO for Gitaly is trimmed much closer.
For Gitaly, the SLO is 0.1 percent, so it's a much lower SLO, and what I'd like to see over time is us trimming these down. The SLO is effectively a recording rule: there's a recording rule that says "SLO for the Gitaly error rate", and it's just a fixed value, zero point five here for instance. Most recording rules are a Prometheus expression, whereas these recording rules are just a fixed value.
So if you want to adjust it up or down, you just make a merge request and that value goes up or down, and then automatically any service that exceeds its SLO will fire an alert. It's not that we've wired together one alert for Gitaly and one alert for the web and one for the API; just the fact that the API is generating that metric means it gets this alerting.
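[A sketch of that pattern: the SLO itself is a recording rule with a constant value per service, and a single generic alert compares every service's measured ratio against its own SLO. Names and thresholds are illustrative:]

```yaml
groups:
  - name: example-slos
    rules:
      # The SLO is just a constant; adjusting it is a merge request that
      # changes this number.
      - record: slo:max:service:error:ratio
        expr: "0.005"
        labels:
          type: web
      - record: slo:max:service:error:ratio
        expr: "0.001"
        labels:
          type: gitaly
  - name: example-slo-alerts
    rules:
      # One generic alert: any service whose error ratio stays above its own
      # SLO for five minutes fires, with no per-service alert wiring needed.
      - alert: ServiceErrorRatioSLOViolation
        expr: >
          service:error:ratio
          > on (type) group_left
          slo:max:service:error:ratio
        for: 5m
        labels:
          severity: critical
```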
So those are the two SLO alerts; when you see SLO alerts, those are the ones that I'm listening to the most. One thing that's important to realise is that those SLO alerts are broken down by stage now: we have the canary stage and the main stage, and a lot of the SLO alerts that are firing at the moment are for the canary stage.
And we don't have canary for Gitaly, obviously. With canary, what we find is that it's a little bit more jerky at the moment, so I don't think we'll route those to PagerDuty yet. But when I see SLO alerts in that channel for a service's main stage, then I know things have gone wrong, and that's when I start reacting, and that's why I really feel those ones need to go to PagerDuty.
We had a meeting on Tuesday and we agreed that those should actually fire and page, and we'll keep an eye on them a bit, because generally when they go off it means there's a real problem. For availability, I haven't got the SLO represented on here, but at the moment, for all services, it's fixed at 75%. Oh, these are probably easier to understand if I turn off the deploy annotations.
So these are all fixed at basically this level here, and I think the next step is to adjust them in the same way, so that just as each service has an Apdex SLO and an error-ratio SLO, it also has an availability SLO. And the last type of alert that we get in that channel... actually there are several more types, but one of the last ones is this.
This is our prediction for what this metric should be: the requests per second for this service at this time of day, over the week. What we find is that the requests-per-second data is very, very seasonal. We see exactly the same pattern: if you look at our data, we'll have a small hump, which is European lunchtime,
then we see a little hump as the Americans come online, and on Fridays we see the same slowdown, week in, week out, and every week we just have the same gradual growth. This green line actually takes that growth trend into account and then looks at about three weeks' worth of the data; the whole anomaly-detection talk I was mentioning is about that. The problem at the moment is that we've chopped our Prometheus retention down to one week.
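[A rough sketch of that kind of seasonal prediction: take the same hour of the week from previous weeks and average them, with a growth adjustment layered on top if you want. With only one week of retention, only the first term would have data. Names are illustrative:]

```yaml
groups:
  - name: example-prediction
    rules:
      # Predict the current ops rate from the same hour in the previous
      # three weeks; comparing the live value against this (plus or minus
      # a tolerance band) gives the "outside the green boundary" signal.
      - record: gitlab_service_ops:rate:prediction
        expr: >
          (
              avg_over_time(gitlab_service_ops:rate[1h] offset 1w)
            + avg_over_time(gitlab_service_ops:rate[1h] offset 2w)
            + avg_over_time(gitlab_service_ops:rate[1h] offset 3w)
          ) / 3
```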
I've only got one week's worth of data to estimate this from, so it's not as good as it was a month ago, and I'm pushing to get the Prometheus data back, but that's another conversation. So the next time this line jumps outside of the green boundary, we can fire alerts on that. But I don't think that will ever go to PagerDuty, because what's better is that we build services that can handle traffic spikes.
If we take a look at the spike over here in Pages: you can see that's what we were expecting the range to be, and then there's this. Now, that on its own shouldn't get someone out of bed. But over here you can see what it did to their SLO, because Pages just sort of faints at the drop of a hat when it gets extra traffic, so that led to a massive spike in errors, and because the error rate went over the SLO for Pages, someone would have got paged for that alone.
So they would have woken up, and in an ideal world they would have come to this page and seen: look, the reason is this anomaly over here in the request rate. We have an alert for that which goes to Slack, but I don't think that one will ever go to PagerDuty. Another thing worth pointing out is that for Pages we don't have an Apdex score, and that's because at the moment we don't have the data.
What's quite nice about this approach is that we can put things in optionally. Because the recording rules can differ per service, we actually used HAProxy data to fill the gap: for the registry, the error rate that we use actually comes from HAProxy, and we adapt the HAProxy data to make it the error rate for the registry service.
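[A sketch of that kind of adaptation: deriving the registry's standard error-rate series from the HAProxy exporter's backend response counters instead of from the service itself. The backend name is illustrative:]

```yaml
groups:
  - name: example-registry-errors
    rules:
      # The registry doesn't emit its own error counter here, so adapt the
      # HAProxy 5xx responses for its backend into the standard series.
      - record: service:errors:rate1m
        expr: 'sum(rate(haproxy_backend_http_responses_total{code="5xx", backend="registry"}[1m]))'
        labels:
          type: registry
```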
And so if you go look at the registry, and I think it's the same for Pages, we're still missing something, because the service itself doesn't emit an error-rate value, but because of the flexibility of the recording rules, we've used this instead. Sorry, I'm just talking at you again. Did that answer your question?
B: Yeah, this makes a lot of sense. I really like this; I think this recording should be part of the onboarding for engineering, because it's really helpful. The other thing is that I really love that you say it as it is, and you've finally explained to me why, when I go through Grafana, most of the dashboards are empty, and I'm always asking myself: is something really bad happening, or am I just looking at the wrong thing?
I think this is a very good starting point for understanding what's going on, because with all the other dashboards, unless you already know what to expect from them, you can't use them. They're too complex, or they go too deep into the details. You have to be an expert not only in the thing that you're looking at, but also in how the metrics were built. So, yeah.
A: So one of the things that I didn't touch on is that I also have alerts that tell me when I'm no longer getting data in. Because we have a fixed expectation of what the metric data should be, we can alert on it not being there. And in a deployment recently,
we lost our Sidekiq data, and... oh, it's back. Okay, so that's kind of weird.
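[A sketch of that "the data went away" alert, using PromQL's absent() function; the metric name and label are illustrative:]

```yaml
groups:
  - name: example-missing-metrics
    rules:
      # Because every service is expected to emit these normalised series,
      # their absence is itself a signal worth alerting on.
      - alert: ServiceOpsRateMissing
        expr: 'absent(gitlab_service_ops:rate{type="sidekiq"})'
        for: 10m
        labels:
          severity: warning
```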
C: I need to drop off. I just wanted to give a bit of a highlight to Alessio and Mayra before I leave, and you can continue if you want to. I don't know if you realise it yet, but the reason why I asked you to be present and try to understand this is: you're not SREs, you're not holding a pager, and it's not your job to necessarily jump on all of these alerts.
What is your job, though, is to help the SREs out, so that they can finally get to understand what's happening and what they're looking at. I want you to be the ones who also understand, in case something like this happens, how to consume this data. And also, if at any point you get a bit of free time, to take it on yourselves: hey, I'm noticing that this is a problem,
and I'm going to improve the product itself so that it can actually help the SREs understand what's happening a bit better. So that's the angle of why the two of you are in this. Andrew, thank you so much for doing this; I really, really appreciate it, and I learned a lot from it. I'm going to put this on YouTube if you don't mind; if you do mind, send me a message and I won't. I guess I need to drop off, so I'll catch you all later.
B: You're muted, Andrew.
A: There we go. I'll just quickly show one more thing, if you want. The other thing I've been doing is that I've stopped using Grafana itself to edit dashboards, because it's a horrible way of working: everyone's overriding one another, and nobody knows why a change was made or what stopped working. So now, all the dashboards that I'm building live in the runbooks repository, and that uses this thing called Jsonnet.
Jsonnet is a JSON templating language that Google built, and they use it for a lot of stuff. Imagine JSON with functions: instead of writing out a hash, you can say, get the value of this hash from this function. And then the folks on the Grafana team have built something called Grafonnet, which is basically a Grafana library for Jsonnet, and what that means is that all my dashboards are in here, and it's not just horrible Grafana JSON.
It's a little bit more ordered. So I've got all my notes in here, and I can do all of that. The thing I love about this approach (maybe you noticed, or maybe you didn't, that the dashboards all have the same colours and everything) is that if I want to change a colour, I go to the one function or value that defines it and I change it.
Say the colour for critical things is X: I change that, commit my change, it automatically deploys to Grafana, and all the dashboards get the new colour. I do that with a bunch of stuff, and it's a much better way of doing this. So if you want to contribute to that, or if you want to add new dashboards, this is how I'm trying to do it. You'll actually see that if you try to edit these dashboards, they're not editable; I mean, you can override that, but... Yeah, cool. I think that's it for me, unless you've got any more questions.
Yeah, awesome. There's one other thing, with the error budgets, that I'd like to show you, but I can show it to you another time. Basically, this over here is a recording of how much time we're spending underneath the SLO for each thing, and so we can start attributing that to teams and saying, you know, the Gitaly service isn't meeting its error budget.