Description
This talk covers GitLab's adoption of SLO monitoring: from our previous causal alerting strategy, which had outgrown its purpose as complexity and traffic volumes grew; through our early attempts at building and maintaining the configuration by hand, and the problems that brought about; to our current, declarative approach. The talk also covers the challenges of getting buy-in from engineering, operations and product stakeholders, the benefits of having a common language of availability across the organisation, and our future plans. This is a deep-dive, practical talk; all the code and configuration for GitLab.com's monitoring infrastructure is open-source, and the talk will include links to these resources.
My name is Andrew Newdigate. I'm an engineer at GitLab, where I work on the infrastructure team, helping to build GitLab.com. In this talk, I'm going to give you a very brief walkthrough of how we implemented SLOs for GitLab.com.
We don't have a lot of time, so I'm going to be moving pretty quickly. As a starting point, it's worth describing where we started this journey in 2018. It definitely felt like we had outgrown our monitoring strategy.
Our monitoring was ad hoc and our alerting was piecemeal. We had a cycle: when we had an incident, we would create a bunch of new alerts in an attempt to detect the cause of that incident, but these alerts would very often be flappy while simultaneously not alerting us to the next incident. So when that happened, we'd create some new alerts for that incident, and the cycle would continue.
Another problem was that we maintained our dashboards manually, so often during an incident we would discover that a really important dashboard was broken and we had to spend time trying to fix it instead of focusing on the incident itself. Around this time, I read "My Philosophy on Alerting" by Rob Ewaschuk, and so we began our journey to symptom-based alerting and eventually to SLO monitoring.
Our first step in that journey was to move from cause-based alerts to symptom-based alerts. But how would we alert on these signals? For the first iteration, we decided to use the simplest approach: we would fire an alert if, over a five-minute period, we generated errors at a greater rate than our SLO allowed. So if our SLO was 99%, and we had more than one percent errors in a five-minute period, we'd trigger an alert.
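To make that concrete, a single-window alert of this kind can be written directly as a Prometheus alerting rule. This is a minimal sketch rather than our actual configuration; the metric name, labels and selector below are hypothetical:

    groups:
      - name: slo-alerts-single-window
        rules:
          - alert: WebServiceErrorSLOViolation
            # Fire when the error ratio over the last 5 minutes exceeds the
            # 1% error budget implied by a 99% SLO.
            expr: |
              sum(rate(http_requests_total{job="web", status=~"5.."}[5m]))
                /
              sum(rate(http_requests_total{job="web"}[5m]))
              > 0.01
            for: 1m
            labels:
              severity: warning
            annotations:
              summary: "Web service error rate has exceeded its SLO over the last 5 minutes"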
The reason we selected this approach was twofold. First, the approach was simple to implement: it uses a single time window, so we could continue to manually define all our recording rules and alerts. But, what's more, at the time we didn't know any better; it was the only approach that we had considered. The problem was that the alerts were no less flappy than before, and we experienced a lot of false positives.
Luckily, around this time Google published the SRE Workbook, and we learned about multi-window, multi-burn-rate alerting, as described in the chapter "Alerting on SLOs". We wanted to move over to this approach, but it was unfeasible to do so while maintaining the recording rules manually. The solution we arrived at was to develop a simple DSL for describing our SLOs and our SLO monitoring.
We were already modeling our application as a series of services. The next step was to break each service down into a series of components. Each component has three signals: a latency SLI, for which we use an Apdex score; an error SLI; and a measurement of the number of requests per second that the component is processing.
We then go on to define a number of components. In this example, I'll walk through the shared runners component, which is our hosted runner fleet used by GitLab.com customers. For each component, we include some ownership information and a description; this information will be included in alerts and dashboards.
We then define the Apdex SLI. We have several ways of defining an Apdex, but the most common is based on Prometheus histograms, as shown here. We use an abstract description, giving the histogram metric name, the selectors we will need to filter the metric, and the latency threshold within which the SLI should complete. For the request rate, we most commonly use a rate over a counter metric, as shown here; again, we define the metric name and the selectors.
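To give a feel for the shape of these definitions, here is a simplified, hypothetical sketch of what a component entry in such a DSL might look like, rendered as YAML. The field names, metric names and thresholds are illustrative only, not GitLab's exact schema; the real definitions are in our open-source monitoring configuration:

    # Hypothetical, simplified component definition in a YAML-style DSL.
    components:
      shared_runners:
        team: runner
        description: >
          Hosted shared runner fleet used by GitLab.com customers to run CI jobs.
        apdex:
          # Latency SLI derived from a Prometheus histogram: the fraction of
          # operations completing within the satisfactory threshold.
          histogram: job_queue_duration_seconds_bucket
          selector: 'shared_runner="true"'
          satisfiedThreshold: 60  # seconds
        errorRate:
          # Error SLI: failed operations as a ratio of all operations.
          counter: ci_runner_job_failures_total
          selector: 'shared_runner="true"'
        requestRate:
          # Request-rate signal: operations per second handled by the component.
          counter: ci_runner_jobs_total
          selector: 'shared_runner="true"'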
From this definition, we generate our Prometheus recording rule configuration. We generate one recording rule per SLI per time window, and the recording rules also include appropriate aggregations. Here you can see that our histogram Apdex has generated recording rules for each time window, including 5 minutes, 30 minutes, 1 hour and so on.
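The generated output looks roughly like the following: one recording rule per SLI per burn-rate window, recorded at the component level. Again, this is a hypothetical sketch that reuses the illustrative metric names from the DSL example above, not our exact rules:

    groups:
      - name: component-apdex-recording-rules
        rules:
          # 5-minute Apdex ratio for the shared_runners component.
          - record: gitlab_component_apdex:ratio_5m
            labels:
              component: shared_runners
            expr: |
              sum(rate(job_queue_duration_seconds_bucket{shared_runner="true", le="60"}[5m]))
                /
              sum(rate(job_queue_duration_seconds_bucket{shared_runner="true", le="+Inf"}[5m]))
          # The same SLI recorded over a longer burn-rate window.
          - record: gitlab_component_apdex:ratio_30m
            labels:
              component: shared_runners
            expr: |
              sum(rate(job_queue_duration_seconds_bucket{shared_runner="true", le="60"}[30m]))
                /
              sum(rate(job_queue_duration_seconds_bucket{shared_runner="true", le="+Inf"}[30m]))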
We then also generate our SLO alerts from the DSL. Each SLI has several alerts, including, of course, SLO violations. The alerts consume the recording rules that we generated in the previous step and use the multi-window, multi-burn-rate alert expressions described in the SRE Workbook, using, of course, the SLO thresholds that we defined in the DSL.
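As a sketch of what one of these generated alerts might look like, here is the fast-burn case from the SRE Workbook pattern, written against the hypothetical recorded series above and an illustrative 99.95% Apdex SLO:

    groups:
      - name: component-slo-alerts
        rules:
          - alert: SharedRunnersApdexSLOViolation
            # Multi-window, multi-burn-rate: page only when the Apdex SLI is
            # burning error budget at more than 14.4x the sustainable rate over
            # both a long (1h) and a short (5m) window.
            expr: |
              (
                gitlab_component_apdex:ratio_1h{component="shared_runners"} < (1 - 14.4 * (1 - 0.9995))
              )
              and
              (
                gitlab_component_apdex:ratio_5m{component="shared_runners"} < (1 - 14.4 * (1 - 0.9995))
              )
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "shared_runners Apdex is violating its SLO (fast burn)"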
We also generate our dashboards from the same definitions. We use a five-minute time window for the main series and include two percent and five percent SLO thresholds on the dashboard. Another useful feature is the color-coded status panel, which will quickly draw your eye to any SLO violations on a busy dashboard. These panels are designed to display aggregated metrics, but sometimes you need to dive further into the detail; for that, we also include extra detail in collapsed rows, one for each component.
One of the biggest benefits has been a common language of availability across the organisation. This started off with better communication between infrastructure teams and product engineering teams, but has since expanded to include our product organization, and because of this we're now able to start introducing error budgeting with full buy-in from our product teams. Another advantage is that our dashboards are really consistent and reliable.
This consistency allows operators to navigate quickly during an incident to isolate the cause of a problem. The third advantage of using a DSL is that we have significantly reduced the barrier for engineering teams to maintain their own SLIs. And, of course, another advantage of SLO monitoring is that our alerts are precise and consistently specified across the application.