Description
Andrew & Marin pair on adding a new saturation metric to the GitLab.com monitoring suite. This video resulted in this merge request: https://gitlab.com/gitlab-com/runbooks/merge_requests/1679.
This change was a corrective action following Production Incident https://gitlab.com/gitlab-com/gl-infra/production/issues/1419, which recurred in https://gitlab.com/gitlab-com/gl-infra/production/issues/1437 exactly one week later.
A: Those jobs already take about 60 to 70 percent of all compute CPU in our API fleet, so when they each started taking five times longer, what ended up happening was that the API fleet got very saturated, and that started causing lots of latency and queuing. That saturation cascaded throughout the application, so we got a bunch of alerts. The first we got was for API latency issues. What would have been really good is if we could have had some forewarning of the NFS latency issues.
A: We've got a capacity planning framework that we use, and it tells us how things grow over time.
B: [inaudible]
A: So yes, we have a resource monitoring framework, and it gives us long-term trends in saturation. What we've done is abstract the idea of saturation, so it could be CPU utilization, it could be the number of Sidekiq workers that we have. It doesn't really matter: it's just a saturation metric. Then we look at it over time, we see how much variance there is in the metric and what direction it's growing, and we use that to predict when we'll hit problems. Obviously the other thing we get out of it is that when any saturation metric reaches a threshold where it's too high, we can alert on it.
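The abstraction being described, where every resource is normalized to a 0-1 ratio, could be sketched as a Prometheus recording rule. This is a minimal illustration, not the actual rules in the runbooks repository; the metric and label names are assumptions:

```yaml
# Hypothetical recording rule: express any resource as a 0-1 saturation
# ratio, where 1.0 means fully saturated. CPU is used here as one example
# of a resource; the same shape works for any other saturation metric.
groups:
  - name: saturation.rules
    rules:
      - record: gitlab_component_saturation:ratio
        labels:
          component: cpu
        expr: >
          avg by (environment, type) (
            1 - rate(node_cpu_seconds_total{mode="idle"}[5m])
          )
```

Because every resource reports through the same normalized series, a single set of threshold alerts and trend queries can cover all of them.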
A: We didn't get any alerts for any of the NFS servers. So the first thing is, we don't really know what it was on the NFS servers that was slowing down. Obviously something was saturated, but it could have been the disks on those machines, it could have been the CPU, and it might even have been the network, although GCP claim that their network is effectively unlimited.
A: So I guess the first thing I'm going to look at is host stats, which is a dashboard that we've got. It's a parameterized dashboard, so you can give it any machine in the fleet and it will give you a bunch of statistics about that machine.
A: So looking at CPU, it never really got beyond 20 to 25 percent, and mostly in user mode; there was nothing extra there. Interestingly, the incident did seem to noticeably affect one particular CPU, and I don't really understand why. Someone could probably give me some interesting details as to why that is, but I'm not sure. The second thing we have is writes.
A: Like this, you can see it's spiking, but it's not tipping over, so reads are probably not saturated. That's kind of interesting, and writes could be the thing that we need to put saturation monitoring on. Then on network we see the same pattern, but obviously, once you hit a bottleneck in one thing, you're going to see that pattern reflected in all the downstream elements, and so on receive we could only receive at a certain rate.
A: Well, you see, this is the thing that I really love about building out the saturation framework: we don't actually know where we are at the moment. We could be skirting along just a few IOPS below the threshold, and this just gently nudged us over to the other side. With saturation limits it's always chaos when you exceed them: we had CI logs failing, we had the API failing, all because of a single project.
B: [inaudible]
A: But here we can see this gauge is at about three hundred and twenty megabytes per second. If we look here, it says 400, but I think what this shows us is that actually the limit is a little bit below that; maybe there's overhead. Obviously, the other thing we should take a look at is adding another metric.
C: [inaudible]
A: It's critical that we don't saturate write throughput, and if we do, we have to move projects off the machine. What I ended up doing for Gitaly, and this is why I know those values off by heart, is that we hard-coded those values into our saturation metrics. For all of our saturation metrics, we say the value is between 0 and 1, where 1 is saturated. So for write throughput, we know what 1, completely full, means: we use the stated number that GCP publishes, which is 400 megabytes per second. Then we'll put a saturation threshold at, I don't know, 70% or so; we'll work out what the right number is. So when we get to where we were today, we get alerts. Also, not that I expect this to happen, because I don't expect us to be using NFS for much longer, but if this number grows over time, we'll get alerting on that trend.
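That approach, hard-coding the documented 400 MB/s per-disk limit as the saturation point and alerting at a percentage of it, could look something like this. A sketch only; the rule and alert names are hypothetical, and the 70% threshold is the provisional value discussed above:

```yaml
# Hypothetical: sustained disk write throughput as a fraction of GCP's
# stated 400 MB/s limit (so 1.0 = fully saturated), with an alert at
# a 70% saturation threshold.
groups:
  - name: disk-write-saturation.rules
    rules:
      - record: disk_sustained_write_throughput:saturation_ratio
        expr: >
          rate(node_disk_written_bytes_total[5m]) / (400 * 1024 * 1024)
      - alert: DiskWriteThroughputSaturated
        expr: disk_sustained_write_throughput:saturation_ratio > 0.7
        for: 5m
```

Expressing the limit as a ratio keeps the alert expression identical across resources; only the denominator encodes the resource-specific capacity.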
C: [inaudible]
A: And this is what I was doing yesterday before I decided to join the call, when the meeting interrupted my work. So there's a file that defines saturation per service, and, just because I'm lazy, I started from the existing values, which were 60,000 for Gitaly.
A: So, let's start off with these two. Interestingly, with Gitaly we never come close to these 60,000 and 30,000 numbers, but again on writes, in the same way that we saw here, if somebody's doing abusive things on a Gitaly server, we actually do see write-side saturation on single machines every so often. So that's interesting. Actually, I was wrong: it is 400 megabytes per second for Gitaly as well.
C: [inaudible]
A: So, and I was already wondering about this, we have alerting on CPU load, but we only do that for machines that have a type. We have a whole bunch of alerting, but there's a bunch of machines that don't have types, and we should fix them all. Basically, every server in our fleet should fall into a specific service and not be like this, which is just some random machine.
A: So what I'm going to do is kind of hack this, because I need a way of recognizing these machines, and I'm probably going to use the fully qualified domain name to select them. But then, once we add types onto these machines, we can fix it in a better way.
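The FQDN-based selection being described might look like the following PromQL selector. The hostname pattern and domain are hypothetical stand-ins, not the real fleet names:

```promql
# Hypothetical: select untyped storage machines by fully qualified domain
# name until Chef applies a proper `type` label to them.
rate(node_disk_written_bytes_total{fqdn=~"share-.*\\.example\\.gitlab\\.net"}[5m])
```

Once a `type` label exists, the regex on `fqdn` can be replaced by an exact `type="..."` matcher, which is both faster and less fragile.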
A: So my Chef knowledge is not amazing, but to me it looks as though the share box doesn't have a role in Chef. Oh, I don't know, it's fine: its role is like a base storage machine, and the base storage machine role is also shared by other hosts. So I think I can do this. As I said, I will create a merge request for them, and then we need a name.
A: What is the difference between that and that? Maybe I take that back. OK, so I was thrown; this is an Azure box. I'm pretty sure that represents the old naming schema that we used in Azure. But this one looks much better, because there we have this label.
A: So the only box that's got it is share. Although, you know what, if I'm thinking about a service name, what is the name of the service? It's super obvious: it's the NFS service for GitLab. So yeah, I'm going to just make a call, and that is "store". So thanks for pushing me to do that, because people ask why you can do this so quickly when it normally takes weeks; it's so easy to do.
B: [inaudible]
A: There's actually another thing that I was working on with Henry where, unfortunately, at the moment these values are hard-coded, and then we need to replicate that hard-coded value in other places, which is horrible, and that's because this file is YAML. One of the other things I'm doing is moving all of this across to jsonnet, so at least for all of the others they live in one place in jsonnet, and eventually this file will be generated from jsonnet as well. But for the moment, you've got to do this.
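The single-source-of-truth idea being described could be sketched in jsonnet like this. The field names are illustrative assumptions, not the actual schema of the runbooks repository:

```jsonnet
// Hypothetical: define the disk limit once and derive every consumer from it,
// instead of repeating the hard-coded 400 MB/s value across YAML files.
local diskLimits = {
  maxWriteThroughputBytesPerSecond: 400 * 1024 * 1024,  // GCP's stated limit
};

{
  // A saturation threshold derived from the single definition above.
  writeThroughputAlertThreshold:
    0.7 * diskLimits.maxWriteThroughputBytesPerSecond,
}
```

Generating the Prometheus rule YAML from a definition like this means a limit change is made in exactly one place.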
A: One of the things is that we've got the sustained disk write throughput and the sustained disk read throughput, and if you have a saturation metric at the moment, the way we do it is there's only one set of thresholds. Well, there are two SLOs, but whatever reports those metrics gets the same thresholds. It's probably easier to explain this with an example.
A: So at the moment we define this as 80% and 90% as the two values, and because we're reporting the same sustained disk write throughput from both the Gitaly and NFS services, both of them will get the 80% and 90% thresholds. What we noticed, though, was that if you take 80% of the 400, well, we know that is 320.
A: What I suspect is that we actually want to know about both of these, because there's no proof that Gitaly will be any different. So what we can do is, for Gitaly, go and look at the graphs and see what we see, because I don't want anyone getting alerts when I don't want them. So let's look at the dashboards.
A: Interesting. What's that? No, that's just disk usage. We can go down and use this new panel called saturation details, which is super useful. It breaks down the saturation and generates charts for each of the resources that you're monitoring in the service. So this is the one, and looking at write throughput over the last two days, you can see that we do occasionally spike up close to the limit.
A: The reason why I'm doing that is that these spikes will often be very, very steep, and if you've got a one-in-three resolution, basically the way you can imagine it is that each data point represents three pixels. When the resolution is one, each data point represents one pixel, so effectively you're loading a data point per pixel. Sometimes, if you've got very spiky data, the lower resolution will smooth it out enough that you actually lose the spikes. But you can see here that this hasn't made a difference: no spikes are lost.
A: So that's fine! So let's bring this down to 70% on read throughput, and then I'm going to do the same on write throughput as well. Let's say that this will be 70 and this will be 80, because we might be being a little bit optimistic about the throughput we actually get on GCP.
C: [inaudible]
A: So at the moment, that's telling us that it's at 8%, so let's go take a look over two days. That's like a really good signal, right? That would have alerted us to this problem; we would have got a saturation alert on that disk. You can see that there are these spikes, but I suspect that if you zoom in, those are brief. Well, maybe we should just zoom in, because I don't want to get a bunch of angry messages about noisy alerts.
A: Yeah, I mean, this is really good. So we know that we wouldn't have got a spurious alert, but I'm actually curious what happened on Tuesday. I wouldn't be surprised if what happened then was actually a precursor to what we saw yesterday.
B: [inaudible]
A: I'd really like to be able to get this directly from an exporter. At the moment, if we could get this directly from node_exporter or some sort of TCP exporter, we could do it literally across entire fleets, so we wouldn't have to have these specific cases. We could just say: when any disk on GitLab.com is saturated, give us an alert. But because at the moment there's this expense of maintaining all of these things, we've just put them in the most critical places.
A: All right, OK, so here's the saturation alert, and basically this is saying a single node's CPU has exceeded its capacity. What that means is that there's a single machine, out of the forty-odd Gitaly machines, that's basically running at like 95% CPU, and we should figure out why, because it's not good. So when I click on this... sorry, for how long? In that case, the trigger is five minutes.
B: [inaudible]
A: It's always useful to look at, because you can kind of see how long it lasted. In this case, when you click through from here, it gives you this graph. It's very important to notice that the time frame runs from six hours before the alert until when it notified, so this isn't now. Some people look at this and think it's now; it's not. So let's just change this to a custom time frame.
A: There you can see when it was firing. But now the problem is that this is what we use for alerting, and it's just saying that something in the Gitaly fleet is not happy; that doesn't really help an operator that much. At the bottom here you can see it says: for further details, select the saturation detail dashboard from the links menu at the top of the dashboard, then select the relevant resource.
A: This graph will give us the machine that's misbehaving, so this instantly cuts down the amount of work that the operator needs to do, because instead of going and looking through the fleet trying to figure it out, they just go to this graph. And here it is, big surprise.
B: [inaudible]
A: The load that we saw on that machine: that dotted line is the number of cores. Obviously, if the load exceeds the number of cores, that pages me. But what I'm looking for is gRPC method invocations, so there it is, and it's ListCommitsByOid. This is something we've looked into before: ListCommitsByOid is a problem.
A: I think it's the merge request controller; I'm pretty certain that's what's doing it, and I suspect that it's the merge request widget. The worst part is that we actually lose some of the observability, because if it's polling and it's cached, we're not seeing those log messages at the moment. So this is probably a lot of people in the company looking at the same merge request: maybe somebody pushed something, all of the widgets started updating, and because of race conditions...
A: ...they all kind of start fetching past the cache again. I would guess something like that; there's a whole area there that we need to look into. The other thing is that the GitLab repository itself gets like ten times the traffic of any other repository for that ListCommitsByOid call, so there's something about how we use it. It's not a general thing, but it's kind of interesting anyway.
A: So anyway, going back to here: we have the detail metrics, and then we have the definitions that we used to create the saturation detail panels here. This is the general Gitaly dashboard: we've got the key metrics for the service up at the top here, and we also have those in more detail.
A: But then the saturation detail really often helps in figuring out what's going on, and so one of the things we need to check is that this graph is going to be OK. Actually, that's going to be fine, because we're using the same underlying measure. What I do want to do there is take a look at where this is being used, to make sure.
A: I'm going to have to figure this out. The problem is the saturation detail for a node, because we have to recast the metrics that we're using with different aggregations, so that we can see individual nodes without aggregating them up, and we divide by the magic number. The problem is the previously hard-coded value. All right, I know how to fix this. It's a bit of a horrible hack, but I can fix it. So I can say:
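A sketch of the per-node recast being described, where each node's throughput is divided by the limit for its service type rather than one global magic number. The metric and recording-rule names are assumptions for illustration:

```promql
# Hypothetical: per-node write saturation. Each node's write rate is divided
# by the limit recorded for its service type (gitaly and nfs shown here),
# so individual misbehaving nodes stay visible instead of being averaged away.
max by (fqdn, type) (
  rate(node_disk_written_bytes_total{type=~"gitaly|nfs"}[5m])
)
/ on (type) group_left
  disk_write_throughput:limit_bytes_per_second
```

The `on (type) group_left` matching pairs many nodes with the single per-type limit series, which is what lets one expression cover both services while only ever showing the nodes matched by the dashboard's type selector.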
A: Yes, so this will give us, for the Gitaly nodes, a division by the Gitaly number, and likewise for the NFS nodes, but it'll only ever show one of them, because this selector over here will be where type equals gitaly. So this should work. Let's go through all of these and do the same, then.
C: [inaudible]
A: Chef hasn't kicked in and actually changed the boxes yet, but it'll happen. Once we have that, I will build a graph on the NFS dashboard after this, but it's kind of difficult to do if you don't have any data, so I'll wait for that to roll out.
A: I'm going to use my Git GUI, because I happen to like GUIs. Oh, it's so much better: I can see everything and I can click around, and as a review tool it's just fantastic for me. I don't know, in my mind that's how I like to see a change, so that there are no surprises in it.
C: [inaudible]
A: Once we get this coming through here, we will get these metrics showing up in capacity planning as well, because they're part of the framework now. So when a resource is looking like it's going to be saturated in the next 14 days, even if it isn't yet, we'll get an alert.
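That kind of 14-day forecast is commonly expressed with PromQL's `predict_linear`. A sketch, assuming a normalized saturation series like the hypothetical `gitlab_component_saturation:ratio` named earlier:

```promql
# Hypothetical: fire when a linear extrapolation of the last week's trend
# predicts the saturation ratio will exceed 1.0 (fully saturated)
# within the next 14 days.
predict_linear(gitlab_component_saturation:ratio[1w], 14 * 24 * 3600) > 1.0
```

Because every resource in the framework reports the same 0-1 ratio, one forecast expression covers all of them.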
C: [inaudible]
A: We'll have all of this stuff come through, and it all fits together quite nicely. Look at that! Wow, do you see this? It works. I think we should get Ingrid to see this. I'll just switch this over too: from the saturation view I'll go to the disk space saturation view, and then from this graph we can see exactly where it is, where on the other graph there's...