Description
At GitLab, we’ve built an extensive framework for defining service level indicators (SLIs) for our different services. This allows us to take a simple definition and turn it into dashboards and alerts. There are different owners involved: Infrastructure and stage groups. The SLIs we use to monitor GitLab.com are attributed to the groups building the features we run. Everyone is held to the same 99.95% SLO, and everyone can contribute to our observability.
Join this talk to learn about the challenges with SLOs and error budgets. Hear how we are aggregating our infrastructure SLIs by feature and by group, and how we are involving groups in improving our SLI definitions.
Transcript

A: The main stakeholders were SREs, and SREs also owned the definitions of the SLIs, the SLOs and the alerts. When we get an alert, it's handled by the SRE on call, and this approach has improved the coverage, the accuracy and the recall of alerts a great deal. I did a talk at SLOconf last year on this topic, and you can find it online if you look for it, or there's a link in the presentation. Just a quick explanation about how we structure our SLIs.
A: So we apply a consistent structure to all the SLIs that we monitor on GitLab.com. This structure has some quirks, but it works well for our purposes; it has evolved over time and has some features that predate our SLO monitoring stack. At the highest level, on the left, we have an environment, and this is composed of multiple services.
A: Moving to the middle, each service is composed of multiple components, and then, finally, on the right, each component has up to two SLIs, both of which are optional: a latency SLI, which we refer to as Apdex, and an error SLI. Ideally, a component has both SLIs, but it's not always possible to measure both, which is why we make them optional.
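As a rough illustration of that left-to-right structure, here is a minimal Python sketch; the class names and the example environment, service and component names are hypothetical, not our actual catalog format:

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Component:
        name: str
        # Up to two SLIs per component; both are optional, ideally both are present.
        apdex_sli: Optional[str] = None   # latency SLI, referred to as Apdex
        error_sli: Optional[str] = None

    @dataclass
    class Service:
        name: str
        components: list[Component] = field(default_factory=list)

    @dataclass
    class Environment:
        name: str
        services: list[Service] = field(default_factory=list)

    # Hypothetical example: an environment composed of services,
    # each composed of components, each with up to two SLIs.
    production = Environment(
        name="production",
        services=[
            Service(
                name="web",
                components=[
                    Component("rails_requests", apdex_sli="request_apdex", error_sli="request_errors"),
                    Component("imports", apdex_sli="import_apdex"),  # error SLI not measured here
                ],
            ),
        ],
    )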
A: Now, the main reason we separate latency and errors in our SLIs is historic, but we keep it because it is often a good signal for the cause of an alert. The types of issues that may lead to Apdex SLO violations are often different from those that lead to error SLO violations, and knowing this at alert time can give us a jump start in finding the cause of the problem. From an error budgeting perspective, we aggregate these SLIs together.
B: One of the difficulties we encountered when trying to identify ownership of features and metrics is the fact that GitLab is mostly a Rails monolith. We've got some separate services that are built by a single team, like the runner service or Gitaly, but most people contributing to GitLab do it inside our Rails code base.
B: So that's where most of our features live. Besides being easy to contribute to, it also makes installing and running GitLab easier for the most common cases. GitLab.com runs as standard as possible, just on a lot more hardware, so the monolith isn't going to go away during phase one of our SLO alerting rollout.
B: Not all requests are equal, and using this small API, developers and product managers can specify what makes up acceptable performance.
B: When we add a new metric to the catalog, we use the same DSL that Andrew described last year. As a result, we have SLIs defined by developers, with input from product managers, that will alert us SREs when features aren't performing well according to what product managers have defined. For phase two, we wanted to start aggregating these metrics using the feature label we added, so we could apply an SLO to teams in the future. We hope to be able to alert teams directly when an SLO isn't met.
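A rough sketch of the idea follows. This is hypothetical Python rather than GitLab's actual Ruby DSL, and the names record_apdex and rails_request_apdex are made up for illustration: the developer declares what counts as a good request, and the metric carries the feature label so it can later be aggregated per group.

    import time
    from collections import defaultdict

    # Hypothetical in-memory registry keyed by (sli_name, feature);
    # in reality these would be counters carrying a feature label.
    _counters = defaultdict(lambda: {"good": 0, "total": 0})

    def record_apdex(sli: str, feature: str, duration_s: float, urgency_s: float) -> None:
        """A request is 'good' if it completes within the urgency (acceptable
        duration) that developers and product managers picked for the endpoint."""
        counter = _counters[(sli, feature)]
        counter["total"] += 1
        if duration_s <= urgency_s:
            counter["good"] += 1

    # Usage in a request handler: the product manager decided one second is
    # acceptable for this endpoint, so that becomes its Apdex threshold.
    start = time.monotonic()
    # ... handle the request ...
    record_apdex("rails_request_apdex", feature="code_review",
                 duration_s=time.monotonic() - start, urgency_s=1.0)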
A: This is made much easier by the fact that all of our SLIs are occurrence-based, so we don't have to deal with the complexity of mixing time-based and occurrence-based SLIs. Internally, we store each SLI for each period with two counters: good and total. It's only at display time that we turn this into a percentage. The same is true for our aggregated SLIs: they're composed of two counters. This makes aggregating very easy: the good count is the sum of all the aggregated good counters, and the total is the same.
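A minimal sketch of that aggregation, with hypothetical names rather than our actual implementation: each SLI is just a good and a total counter, aggregating is summing, and the ratio is only derived at display time.

    from dataclasses import dataclass

    @dataclass
    class SliCounters:
        """Occurrence-based SLI for one component over one period."""
        good: int = 0
        total: int = 0

        def ratio(self) -> float:
            # Only computed at display time; the stored data stays as raw counters.
            return self.good / self.total if self.total else 1.0

    def aggregate(slis: list["SliCounters"]) -> "SliCounters":
        # Summing counters, never averaging percentages, keeps the math exact.
        return SliCounters(good=sum(s.good for s in slis),
                           total=sum(s.total for s in slis))

    # Example: two components rolled up into one aggregated SLI.
    web = SliCounters(good=99_950, total=100_000)
    api = SliCounters(good=49_990, total=50_000)
    print(f"aggregated: {aggregate([web, api]).ratio():.4%}")  # 99.9600%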
B: Because we've reused the framework for SLO alerting from phase one, we get dashboards for free. This is a dashboard for a single group that shows the same SLIs, but only for the features that they are working on. The dashboards are generated from the same code as the service dashboards, so they have the same panels: error ratio, Apdex success ratio and request rates. Because we also include this information in the logs, we can generate links that only show requests that did not meet the SLO.
A: One of the most dramatic changes to our error budgeting process came when our CTO decreed that our error budgets must have teeth. This was the moment at which our error budgeting process changed from being just another input into our planning process, purely used as informational guidance, to being something much more contractual, and it gave teams the ability to carve out time to work on technical debt that they needed to address. This slide shows our error budget report. It's reviewed weekly in our engineering allocation meeting, which incorporates elements of the SRE book's production meeting.
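For intuition, here is a small worked example with made-up numbers (the real report is driven by the aggregated counters): a 99.95% SLO turns a group's request volume into a concrete budget of requests that may fail or be slow in the reporting window.

    SLO = 0.9995                 # everyone is held to the same 99.95% SLO
    total_requests = 10_000_000  # requests attributed to one group this window (made up)
    bad_requests = 6_200         # errors plus requests that missed their Apdex target (made up)

    budget = (1 - SLO) * total_requests   # 5,000 requests may be bad
    remaining = budget - bad_requests
    print(f"budget={budget:.0f} spent={bad_requests} remaining={remaining:.0f}")
    # budget=5000 spent=6200 remaining=-1200 -> over budget; time to carve out reliability work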
B: So, to summarize, we went from SLOs being the business of SREs only towards involving everyone at GitLab. Developers and product managers have been included by putting the SLI definitions inside the application, where they can specify what is good and bad. These numbers can be aggregated by feature and by group, and used to inform management, so it's clear when we need to spend some more time working on performance and reliability.