Description
A presentation I prepared at the request of the Self-managed Scalability Working Group.
https://docs.google.com/presentation/d/1xx8sOoWsRvw8_wHqBKbujrWQwmK-meDQrSs4zGNY53I/edit?usp=sharing
Issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7217
I'm going to start off by explaining why I think this is useful with a true story: June was a bad month for GitLab.com. We were hit by a string of separate but related issues which had a severe impact on the performance of the site. During the month of June we only managed to meet our service level objectives about 89% of the time. Some of the issues were related to Postgres, some to our Redis cache instance, and some to our Redis persistence instance, and I've included links to some of those incidents here.
Broadly speaking, most of the incidents we experience on GitLab.com can be broken down into one of three categories. From most prevalent to least, they are: application changes. This is when a change is deployed to the application that works well on a developer's machine and passes the QA process, but that we are unable to scale sufficiently to handle GitLab.com traffic. This is by far the most prevalent class of degradation that we see on GitLab.com.
The third class is infrastructure changes. This is when a change is made to the underlying infrastructure and it doesn't scale. So we might make a change to Postgres or Redis and then find out that, due to some unforeseen circumstance, it doesn't work as we expected. And, of course, there's a long tail of other issues, including cloud-provider-related issues and so on, but these are the three main categories of issue that we've had to deal with up to now on GitLab.com.
In order to monitor incidents, the two key metrics that we observe for each service are, firstly, error ratios, which indicate the percentage of requests that result in an error. So an error ratio of 1% indicates that one request out of every 100 made to the service will fail with some sort of server-side error. The second key metric is the apdex score. This is a measure of the percentage of requests that complete within a satisfactory amount of time.
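As a rough illustration of these two metrics, here is a minimal Python sketch, with invented function names and an assumed one-second "satisfactory" latency threshold (this is not the actual recording-rule implementation), that computes an error ratio and an apdex-style score from a list of requests:

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_seconds: float  # how long the request took
    failed: bool             # True if it ended in a server-side error

def error_ratio(requests):
    """Fraction of requests that resulted in an error."""
    return sum(r.failed for r in requests) / len(requests)

def apdex(requests, satisfactory_seconds=1.0):
    """Fraction of requests that completed within a satisfactory time."""
    return sum(r.duration_seconds <= satisfactory_seconds for r in requests) / len(requests)

# An error ratio of 0.01 would mean 1 request in every 100 fails.
sample = [Request(0.2, False), Request(3.5, False), Request(0.4, True)]
print(error_ratio(sample))  # ~0.33
print(apdex(sample))        # ~0.67
```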
The June incidents I mentioned previously didn't fit into these categories. They didn't appear to be related to a single change, or to a small set of changes, in the application. No infrastructure changes had taken place and, for the most part, user action did not appear to be part of the problem in each of these instances.
What had happened was that we had reached the capacity limit on a particular resource. Once these capacity limits were reached, the system started performing badly. These resources included PgBouncer connection pools, Redis CPU and Unicorn workers. When we're utilizing a resource at or above capacity, we call this saturation. There are many ways to reach saturation, but you only need to reach saturation in one area of the application in order to see degradation.
Now the problem is that the two key metrics that we rely on for determining the health of our services, the apdex and error ratios, didn't gradually get worse as we approached saturation. This is the apdex measurement for GitLab's web service in the second week of June; the dotted red line is our threshold SLO.
So how do we avoid this happening again in future? By introducing a third key metric for each service, which we call saturation. Saturation is pretty simple to understand: for any given resource, what percentage of the maximum utilization for that resource are we currently consuming? For each service in the application we can start tracking a set of appropriate saturation metrics. Some of the saturation metrics we track include workers, disk utilization, several different CPU metrics, database connection pools and others, and many more are to follow.
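The calculation behind each saturation metric is just current utilization divided by the resource's maximum capacity. A minimal sketch of that idea, using made-up numbers rather than our real limits:

```python
def saturation(current_utilization, maximum_capacity):
    """What fraction of the resource's maximum are we currently consuming?"""
    return current_utilization / maximum_capacity

# Hypothetical examples: a database connection pool and a worker fleet.
print(saturation(current_utilization=95, maximum_capacity=100))   # 0.95 -> 95% saturated
print(saturation(current_utilization=240, maximum_capacity=600))  # 0.40 -> plenty of headroom
```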
Now that we have these metrics, we can start alerting on any resource exceeding a value of, say, 90% for a period of, say, 10 minutes. This is already a big improvement over the previous alerting rules that we were using, which tended to be quite piecemeal. So while we had some CPU alerts, we didn't, for example, have any alerts covering single-threaded CPU utilization on Redis, or single-node CPU utilization on Sidekiq.
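Whatever form the rule takes in the monitoring stack, the logic amounts to something like the sketch below: fire only if every sample over the last ten minutes is above the 90% threshold (the names and the one-minute sample interval are assumptions for illustration).

```python
def should_alert(samples, threshold=0.90, duration_minutes=10, interval_minutes=1):
    """samples: saturation values ordered oldest to newest, one per interval.

    Returns True if the resource has stayed above `threshold` for the
    whole of the last `duration_minutes`.
    """
    needed = duration_minutes // interval_minutes
    if len(samples) < needed:
        return False
    return all(s > threshold for s in samples[-needed:])

# A brief spike does not page anyone; sustained saturation does.
print(should_alert([0.50, 0.95, 0.60] + [0.70] * 10))  # False
print(should_alert([0.92] * 12))                       # True
```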
With these alerts, we now cover those and many other cases that weren't previously covered. So once we've got alerting sorted, so that we know about immediate problems, the next step is to build out a long-term forecasting model. What we realized during the troubles in June was that once we hit capacity, it was already too late. Many of the fixes for these resource utilization problems required application changes to inefficient, unscalable features, and those changes might take, at a minimum, several days to isolate, fix and test. So it's crucial that we're able to forecast saturation problems well in advance of the saturation event, so that we have time to improve the application or scale our infrastructure. So how do we do this? A naive approach would be to linearly interpolate our resource utilization to predict when it's going to reach saturation.
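The naive forecast is just a straight-line fit through recent utilization samples, extrapolated forward. A rough sketch using a least-squares fit (the sample data is invented):

```python
import numpy as np

def forecast(times_days, values, days_ahead=30):
    """Fit a straight line to (time, value) samples and read it off in the future."""
    slope, intercept = np.polyfit(times_days, values, deg=1)
    return slope * (times_days[-1] + days_ahead) + intercept

# A hypothetical week of daily average CPU saturation readings.
days = np.arange(7)
cpu = np.array([0.60, 0.61, 0.63, 0.62, 0.64, 0.65, 0.66])
print(forecast(days, cpu))  # predicted saturation one month out
```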
Unfortunately, most resources are too spiky for that. Take, for example, the single-core utilization on our Redis fleet. This chart is taken from a period of the June troubles, so CPU was routinely spiking up to 100%, which is a very bad thing and was one of the reasons we were seeing dismal web performance. The dotted line, however, shows the rolling weekly average for this metric. This graph is almost flat: if we were to linearly interpolate on it, it would indicate that in one month's time average CPU would be at about 76%, which doesn't sound that bad.
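To see why interpolating on an average is misleading here, consider a toy series (invented numbers, not the real Redis data) that regularly spikes to 100% while its rolling weekly mean barely moves:

```python
import numpy as np

# Fourteen days of daily peak CPU saturation: frequent spikes to 100%
# over a baseline of roughly 60% (illustrative numbers only).
daily = np.array([0.60, 1.00, 0.55, 0.62, 1.00, 0.58, 0.61,
                  0.63, 1.00, 0.60, 0.64, 1.00, 0.59, 0.62])

# The rolling weekly average stays nearly flat even though the service
# is hitting its ceiling every few days.
weekly_means = [daily[i:i + 7].mean() for i in range(len(daily) - 6)]
print([round(m, 2) for m in weekly_means])
```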
So instead of interpolating directly on the value of the metric, a better approach might be to divide the resource utilization up into three zones: green, warning and saturated. This is based on the concept of apdex scores, but applied to saturation, which is why I sometimes refer to it as saturation apdex, or a satdex score. When a saturation metric is below, say, 80%, we say it's in the green zone; above 80% it enters the warning zone; and when it exceeds 90% it enters the saturation zone.
We then score the metric for the amount of time it spends in each zone: in the saturation zone it scores zero, in the warning zone 50%, and in the green zone it gets a score of 100%. Using these scores we can calculate the average score for the resource over time. This is an advantage over direct linear interpolation, in that the saturation score will highlight dangerous spikes in resource utilization even if the average value is barely affected.
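A minimal sketch of that scoring scheme, using the 80% and 90% boundaries described above (everything else here is illustrative):

```python
def satdex_sample_score(saturation, warning=0.80, saturated=0.90):
    """Score one saturation sample: green = 1.0, warning = 0.5, saturated = 0.0."""
    if saturation >= saturated:
        return 0.0
    if saturation >= warning:
        return 0.5
    return 1.0

def satdex(samples):
    """Average the per-sample scores over a window of time."""
    return sum(satdex_sample_score(s) for s in samples) / len(samples)

# A resource that is mostly healthy but spikes into saturation:
# the mean utilization looks moderate, but the satdex score drops sharply.
spiky = [0.55, 0.58, 1.00, 0.57, 1.00, 0.56]
print(satdex(spiky))             # ~0.67 -- the spikes pull the score down
print(sum(spiky) / len(spiky))   # ~0.71 -- the average alone looks unremarkable
```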
So going back to the previous metric for Redis CPU, let's look at how its satdex score performs over the same period during the troubles in June. The top graph shows the single-core CPU utilization, and the bottom graph shows the satdex score for Redis single-core CPU. As you can see, the satdex metric is trending down as CPU spikes keep occurring, and by using linear interpolation on the satdex value, interpolating one month into the future, we can forecast potential resource saturation issues.
So this is the platform triage page. It shows the health of each service in the application: the key metrics for each service, as well as the saturation. At the bottom there is a report which gives us the different services and their current state, and the first graph gives us the resources that are currently at risk of saturation.
So here you can see it's telling us that the disk space resource on the Patroni service is nearing its capacity limits, and likewise on the Postgres archive and Postgres delayed services the same disk space resource is nearing its limits. Interpolating out one month into the future, we get the same report on the same resources. This is actually something that is currently being addressed: the disks on these machines will be increased soon, so it is being worked on.
Here we see the same report again but, instead of seeing it for all services together, we only see it for this individual service and, as you can see here, saturation is at 90%. If we open up the component metrics, we get to see why the saturation metric is at 90%. What we do with saturation is aggregate it on the maximum, because if we used, say, the average, you might have three resources at very low utilization and one that's near the top, but you only need one resource to be saturated for the service to be saturated.
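A quick sketch of that aggregation choice, comparing max against mean for a hypothetical set of per-resource saturation values:

```python
def service_saturation(resource_saturation):
    """A service is only as healthy as its most saturated resource,
    so the per-service value is aggregated with max rather than mean."""
    return max(resource_saturation.values())

# Hypothetical readings: three resources nearly idle, one nearly full.
resources = {"cpu": 0.10, "memory": 0.20, "connection_pool": 0.15, "disk_space": 0.92}
print(service_saturation(resources))              # 0.92 -> clearly at risk
print(sum(resources.values()) / len(resources))   # ~0.34 -> averaging would hide the problem
```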
So what this gives us is the current saturation score for each resource in the Patroni service. Here you can see this service has a disk space resource, a single-node CPU resource, memory and overall CPU, and what it's telling us is that in one month's time the disk space will still be near saturation, but everything else will be fine. Then, down below that, we get a long-term saturation trend, which shows us what each resource on the Patroni service has been doing over the last three weeks. We can see that CPU is gradually going up but is still very low, and that disk space has exceeded 90% but is growing quite slowly. Then we get the long-term satdex for this particular service, which isn't very exciting: most of the resources are at 100% on the satdex score, except for the disk utilization, which is at 50%. But if we switch over to the Redis service, we get a more interesting graph, so let's go there.
What we have here, which is quite interesting, is that this graph shows us the satdex score for the Redis single-threaded CPU. You can see that when we started recording this metric it was very low, near 50%, which we would consider to be a bad score. Then, as we managed to make application changes that took some of the load off Redis, we saw the score trend up, and recently we actually split our Redis cluster into two, and you can see that the score is now trending up towards a hundred percent, which is telling us that Redis is actually doing very well now.
Last week, when I was looking at this, what it told us was that the single-threaded CPU was at capacity, and the one-month saturation forecast also told us that Redis CPU was going to remain at capacity. Because of the changes that we made this weekend, this has now improved. So hopefully, next time, these graphs will tell us about the problem before we see a saturation event.