Description
Hordur and Andrew discuss how Auto DevOps can be better monitored using the key metrics framework used for monitoring the components of GitLab.com.
This follows an outage in the feature: https://gitlab.com/gitlab-org/configure/general/issues/9
B: Yes, so let me just open the ticket again, yeah. So what happened recently was that Auto DevOps had kind of a full outage of the deployment pipelines, and we didn't notice until, like, 24 hours after; it had been going on for 24 hours, and it was only a customer that escalated it to us, and yeah.
A: That's basically a map of all the services in the application, and everything that we monitor should map back to something in the Service Catalog. Sorry, go on. We need to create a new thing in the Service Catalog; I've got a pretty good idea where, I think, but there's nothing for this in there yet. So we've basically got in here, we've got a bunch of teams, which we don't really use as much, it's a bit out of date, but then the really important part is the services.
A: Yeah, yeah, okay. So I mean we could always just come up with an arbitrary sort of dashboard, but it'll maybe become apparent why I like categorizing it this way. What we've been trying to build up over the last few months is, for each of these services that we've got, a set of key metrics, and those metrics are the number of requests per second, um...
A: ...the number of errors per second, and then what percentage of the requests complete within a satisfactory threshold, like, amount of time, right? And the reason why we do it that way, and not say "what is the p95 latency of this endpoint", is because then, you know, if I tell you that 99% of the requests are coming in within a given amount of time, you know that...
A: ...that's probably okay; if I say it's 50%, you know it's probably bad. But if I tell you that the p95 is three seconds, you don't know, unless you have more context, whether that's good or bad. And so, you know, with a ratio (what's also cool is that it's normalized, yeah) with an apdex score we can be talking about Redis, where we can be talking about microseconds or milliseconds, or Gitaly, where you can be talking in, you know, seconds, and we're on the same scale.
A: It's basically a percentage, like what percentage of requests are completing, and so we have an apdex score, we have operation rates, and we have the error rate, and what that gives you for each service is a dashboard with values like this. So these three values here are the key metrics for the web service.
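For illustration, the three key metrics described here map onto Prometheus queries roughly like the following; the metric names (`http_request_duration_seconds`, `http_requests_total`) are placeholders rather than the exact series recorded for GitLab.com.

```promql
# Operation rate: requests per second over the last 5 minutes
sum(rate(http_request_duration_seconds_count{job="web"}[5m]))

# Error rate: 5xx responses per second
sum(rate(http_requests_total{job="web", status=~"5.."}[5m]))

# Apdex-style ratio: share of requests completing within a 1s threshold
sum(rate(http_request_duration_seconds_bucket{job="web", le="1"}[5m]))
  /
sum(rate(http_request_duration_seconds_count{job="web"}[5m]))
```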
A: Today, I've noticed... let's wait for it to come up here... you can see that yellow line over there is the apdex score. So, you know, there are these spikes; that's what I've been investigating all day. Something's causing us to drop down to only 75% of requests completing within the acceptable amount of time, so that means 25%...
A: It was actually really, really poor, and we got a kind of directive from senior management that 95% of CI jobs should start within one minute of being scheduled. And so what we could say is that our apdex threshold is one minute, and then we say what percentage of the requests complete within one minute, and as long as that's above 95%, then we're meeting the goal.
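Expressed as a query, that goal might look something like this; `job_queue_duration_seconds` is a hypothetical histogram of the time between a job being scheduled and being picked up, assumed to have a bucket at 60 seconds.

```promql
# Share of CI jobs starting within 60s of being scheduled;
# the directive described here is for this to stay above 0.95.
sum(rate(job_queue_duration_seconds_bucket{le="60"}[5m]))
  /
sum(rate(job_queue_duration_seconds_count[5m]))
```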
A: What's actually really interesting is that when we started doing this, the value was so poor that we actually had to set it to 50% just to get it to stop alerting, and just the fact that we've got that, that we've been monitoring it and actually keeping an eye on it and it's become a thing, we've been able to improve it really, really quickly, and so I actually intend to increase this.
A
This
dotted
line
is
the
SLO,
that's
like
the
rates
of
which
me
that
that's
acceptable
rate
and
I've
been
meaning
to
bring
this
up
to
about
80
percent.
We're
still
having
a
bit
of
a
few
drops
every
now
and
again,
actually
4
a.m.
every
morning
and
there's
a
reason
for
that.
But
you
see
the
spike,
and
so
we
keys
are
gonna.
Do
the
same
thing,
sorry,
yeah
yeah,.
A: We started off by doing this manually, and so Prometheus has a thing called a recording rule, and what we'd say is: create a recording rule that records the apdex score for this service. I wanted to change it slightly, to kind of add a little bit more complication, but also to explain it a bit better.
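A minimal sketch of such a recording rule, reusing the same hypothetical request-duration histogram as above; the recorded metric name is illustrative, not the exact one used on GitLab.com.

```yaml
groups:
  - name: service_apdex.rules
    rules:
      # Precompute the web service apdex so dashboards and alerts
      # can reuse the ratio without re-evaluating the raw histogram.
      - record: gitlab_service_apdex:ratio
        labels:
          type: web
        expr: |
          sum(rate(http_request_duration_seconds_bucket{job="web", le="1"}[1m]))
          /
          sum(rate(http_request_duration_seconds_count{job="web"}[1m]))
```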
A: So you can see here, this is a definition of a service. It's the web service, it's in the sv tier, which is just a breakdown that we have in Prometheus, and then we say we consider that 95 percent of requests should meet their apdex threshold; if it goes below that, we'll alert, and the same with the error rates.
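The "alert when it drops below 95%" behaviour could then be a plain alerting rule on that recorded series, along these lines (names and thresholds are illustrative):

```yaml
groups:
  - name: service_slo.rules
    rules:
      - alert: WebServiceApdexSLOViolation
        # Fire when fewer than 95% of web requests meet the apdex threshold.
        expr: gitlab_service_apdex:ratio{type="web"} < 0.95
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Web service apdex has dropped below its 95% SLO
```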
A: This is kind of a detail of apdex, but with the apdex score we say a satisfactory score is any request that completes within one second, and a tolerable score is any request that completes within 10 seconds, which is pretty slow, but you'll tolerate it. So anyway, this is how we define the apdex for Workhorse. This is the request rate: again, we just give it a metric, like, that's a Prometheus metric, and some filters on that metric, in this case.
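With a satisfactory threshold of one second and a tolerable threshold of ten seconds, this is essentially the classic Apdex formula. Following the pattern shown in the Prometheus histogram documentation, it would look roughly like this (histogram name and bucket boundaries assumed):

```promql
# Apdex = (satisfied + tolerating/2) / total.
# Buckets are cumulative, so the le="10" bucket already includes the
# le="1" bucket, which is why the two are summed and halved.
(
  sum(rate(http_request_duration_seconds_bucket{job="web", le="1"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{job="web", le="10"}[5m]))
) / 2
  /
sum(rate(http_request_duration_seconds_count{job="web"}[5m]))
```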
A: We've got this kind of band of what we consider the normal range; it uses statistics to work out plus or minus three sigma. But what actually happened was that exactly a week ago we changed the ingredients of this requests-per-second rate, and so that's why we have this change here.
A: From that we get the rate and the errors, but we don't really have a latency score for that, because those are long-polling requests. They all take pretty much fifty seconds, so it doesn't make sense, in the case of polling, to be saying we want the request completed within this amount of time, because it'll complete when it gets a job, and so it's not within, you know, our control how long that will take.
A: It depends on workloads, but we do have a request rate and an error rate for the polling service. And then for the shared runner queues, that was the one where we got the definition from senior management, where they said that, you know, we want jobs to start running within 60 seconds, and so we said the satisfactory threshold is 60 seconds.
A: There's no tolerable threshold; we just say there's only one threshold, that's 60 seconds, and then, you know, we also get the request rates and the error rates, and that generates everything else that we need. This will also generate alerts, and it'll also generate alerts if the metrics stop flowing.
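The "alert if the metrics stop flowing" part is commonly done with `absent()`; a sketch, again with a hypothetical metric name:

```yaml
groups:
  - name: missing_metrics.rules
    rules:
      - alert: SharedRunnerQueueMetricsMissing
        # Fire if the queue-duration series disappears entirely, e.g.
        # because the application or an exporter stopped emitting it.
        expr: absent(job_queue_duration_seconds_count)
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: Shared runner queue metrics have stopped flowing
```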
A: Like, if somebody breaks the application, we still get alerts for that. So all of this stuff kind of flows out of this file; everything kind of comes from this. And so what I would imagine we need is at least a request rate and an error rate for Auto DevOps, right? I don't know if there's anything that you would consider to be a latency score, like a...
A: Obviously, if you can't, you can't, but it is problematic. I'll give you an example: with the registry, the Docker registry that we have, there's a bunch of stuff that a user can do that creates 500 errors, and that really plays havoc with all of our alerting, because, you know, we say for 4xx errors, that's the user; we're not going to wake someone up at 3 a.m. to look at this, because it's a user's problem. But 5xx...
B: That one I am not sure about; like, I can't think of any way that we can really distinguish it, except we could at least maybe flatten it, or try to report it by... so, do something; we could probably do something to have each project not have an outsized impact. So if one user is breaking the whole thing, that would still only show up as one error in, you know, the time frame it happened. Does that make sense? I'm not sure Prometheus offers that, though; we might.
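As noted, it is not clear Prometheus supports this directly, but one rough approximation, assuming the error counter carried a per-project label (which may itself be too high-cardinality in practice), would be to cap each project's contribution before summing:

```promql
# Cap each project's 5xx error rate at 1/s before summing, so a single
# noisy project cannot dominate the service-wide error rate.
# registry_http_errors_total and its "project" label are hypothetical.
sum(
  clamp_max(
    sum by (project) (rate(registry_http_errors_total{code=~"5.."}[5m])),
    1
  )
)
```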
A: ...that threshold a bit higher and then get alerted. The other thing I didn't really explain is one of the things that's really advantageous about having the RPS, right, the requests-per-second rate: because of that, if the rate goes outside of this boundary, right, we will automatically get an alert on it. So if the rate is what you kind of expect... and this rate, by the way, is...
A: ...it's kind of based on a seven-day cycle, so we have, you know, Monday through Friday we always get a certain type of traffic, and then weekends are generally quieter. I'm just loading seven days now; it might take a while because...
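A rough PromQL version of that "normal range" is a weekly mean plus or minus three standard deviations around the recorded request rate; `gitlab_service_ops:rate` stands in here for whatever recorded rate series is actually used. Evaluating over seven days of samples is also why loading such a panel can take a while.

```promql
# Upper edge of the expected band: weekly mean plus three sigma.
# (The lower edge is the same expression with "- 3 *" instead.)
avg_over_time(gitlab_service_ops:rate{type="web"}[7d])
  + 3 * stddev_over_time(gitlab_service_ops:rate{type="web"}[7d])
```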
A: ...you know, through the floor, and so had that been something we were paying attention to at the time, we would have seen that. You know, we obviously had bigger problems than that, but we would have got an alert about, you know, where the web RPS was in this case. And so that could be something you get for free. So presumably one of the things that happened during that incident was that the rate of Auto DevOps jobs dropped to zero, or close to zero, and you would have got alerted on that.
B: Yes, yes, so we do. We do review apps.
A: A little bit. This is something that I would really want to explore a bit further, because we could be posting the metric directly, you know, as the pipeline is doing its thing, but then you're stuck with exactly the data that's in that particular pipeline, whereas we could also be doing post-aggregation of data from the pipelines table, where you could say, per project, you know, we had this many failures in a minute. Yeah.
A: Cardinality, but there's a... you know, put that in structured logging and that would be great. But I mean, where you record that value, you know, whether that's recorded in the runner... I know very little about the architecture of your system, but it kind of feels to me like it would be better to have that in the main application and then kind of exposed from there, right? Yeah.
B: So it probably should be in the Rails application, for this particular thing, and you say structured logging is... so that's actually one of my questions: how much should be going into Prometheus, how much should be going into structured logging, and in both cases, what is our budget, like, for collecting this data? Because obviously this data is not free. No, yeah, so.
A: Obviously you have routes, and we have a few more of those, but because of those everything else has to be smaller, because we have so many routes and you're dividing everything by that. And then on the structured logging side, I would say, if this is something that you would find, like...
A: For me, structured logging is about user events, or events, and whether it's going to be useful for analysis at a later stage. Like, are people using Auto DevOps, is their success with Auto DevOps going up or down, or, you know, another might be: if the first time someone runs Auto DevOps it fails, what's the likelihood of them using Auto DevOps in a month's time? And then you can kind of divide the data that way. Those are all things, like, those are all based on events.
A: So if it's something that you're not going to care about from a business point of view, or a growth point of view, or a diagnostics point of view... like, I've been spending the last two days looking at all of our log data and trying to weed it out. Sometimes it's difficult to gauge that, but generally, if it's something that's been driven by a user doing something, and that thing has started or finished, then I would record it as a structured log; the other stuff, less so.
A: So, you know, we've got things in our logs like every time something goes and checks whether the database connection is still valid, and that's fine for a certain type of log, but getting recorded in structured logs that might be retained for six or nine months, it doesn't matter. What I'm interested in is things that have been caused by users doing something, you know, events. I don't know if that's a nice answer, but that's sort of how I feel about it.
A: So Bob's working on something at the moment which is called application context, and it will basically give you this stuff, right? So, the context in which you're running: who's the user, what's the project, maybe what's the IP address, one or two other things, and then those plus an event. Ideally, what I'd like to do is have all of that kind of just going out to structured logging, so you could say, you know, this person did one invocation of Auto DevOps and it was successful.
A: Yeah, and really for me here, if the logs are not structured I'm kind of not going to use them, because, you know, we just get too much data to deal with in unstructured logging, and obviously for something like Gitaly we pretty much have one log line per Gitaly call, and then we can aggregate those up in Kibana.
B: So back to my question about cost. What would help me a lot, because I would obviously want to just collect everything, because, you know, that makes my life so much easier. But do we have any rough idea: if I'm going to be collecting a metric, let's say a single metric with a single label, do we know roughly how much that costs, like, what is reasonable for a project that is a small fraction of our pipelines?
A: A lot of this thing... the problem is that we've got a lot of servers, and each server has a label, right, and then we have a lot of routes, and then we have a lot of controllers, and then, you know, for the histograms we have a lot of histogram buckets, and so when you add all of those up it just kind of grows out of control. But if you look on here, there's nothing in here that's really sort of terrible, right?
A: Like, yeah, you know, we've got action controller, and so, you know, just action controller and fully qualified domain name multiplied out kind of takes you to pretty much fifty thousand. So, you know, in your case I guess you probably wouldn't need that many values; it's kind of like deploy versus build, success versus failure, mm-hmm. And then sort of what you should try to do, the pattern is: if you keep a histogram for how long something runs for, don't include all the other labels on it.
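A sketch of that pattern: keep the duration histogram label-light, and put the higher-cardinality breakdown (success versus failure, deploy versus build) on a cheap counter instead. The metric names here are made up for illustration.

```promql
# Histogram kept label-light: only the bucket label, so its cardinality
# is roughly just the number of buckets.
histogram_quantile(
  0.95,
  sum by (le) (rate(auto_devops_pipeline_duration_seconds_bucket[5m]))
)

# The outcome breakdown lives on a separate, cheap counter.
sum by (status) (rate(auto_devops_pipelines_total[5m]))
```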
A: No, so one of the problems is that the binding between the alerts and the dashboards and the application metric names is obviously soft, right? There's no compiler checking that everything's the same, and so generally when people change those values there can be a great deal of pain, especially if you lose...
A: ...all your metrics and you don't get the alerts, and things go awry, and so I prefer to keep things as they are. This name, grpc_server_handled_total, is kind of a standard name. We use it on our Gitaly routes, but Thanos actually uses it internally as well.
A: You know, you often also see http_requests_total, which is another name, but in some cases you'll also see gitlab_workhorse_http_requests_total. I mean, there are some naming conventions in Prometheus, and it's well documented, you can just look it up on the website, but mostly, if it's a counter it should end with _total, yeah. And then, you know, this part is kind of the protocol, sort of the subsystem, and then the name itself, but there's not really kind of...
A
We
know
we
don't
have
any
plans
to
kind
of
like
reunite
this,
because
it
would.
It
would
be
a
very
painful
thing
to
do.
What
I
would
rather
do
is
kind
of
focus
on
on
this
metrics,
catalog
and
kind
of
like
these
are
like
and
have
this
as
a
way
of
saying
these
are
all
the
really
important
events
and-
and
this
is
a
map-
and
then
you
know,
maybe
in
future
we
could
even
slip
this
in
in
the
CI
test
and
make
sure
that
the
application
is
actually
publishing
these
things.
Catch
breaks
there
anywhere
sure.
A: ...suffixes that with _bucket. And take a look at the metric naming: one of the things that people sometimes do when they come from other metric systems is that they concatenate what should be label values into the metric name, and that's a big no-no in the Prometheus world. So, you know, people might say... like, even successes and failures, to me, should be something that's in the, sorry, in the label, and not in the metric name.
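In other words, something like the following; the names are illustrative, but they follow the conventions mentioned (outcome as a label value, counters ending in `_total`):

```promql
# Anti-pattern: outcome baked into the metric name
#   rate(auto_devops_deploy_success[5m])
#   rate(auto_devops_deploy_failure[5m])

# Preferred: one counter with a _total suffix and the outcome as a label
sum by (status) (rate(auto_devops_deploys_total[5m]))
```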
A: So the next step of that is a bit paused, because we've got some problems with Sidekiq Prometheus at the moment, it's not working very well, but effectively everything's attributed in that way. The next thing that we're going to do is start enforcing the same attribution on alerts and on services as well, to some degree, so that we can set it at least on the components, so that we can say, like, whose is this? It's still going to be very difficult; there's a lot.
B: So one thing I noticed with the Auto DevOps incident was... so it was an incident for us, and I handled it, but I felt I was lacking a process, right? Because it completely fell out of the normal incident handling process, and I've followed our GitLab incidents before, and the way we do that is quite nice, but I didn't...
A: ...when you have ongoing incidents. But I think what I'd really like to see, and this isn't your fault, I don't think it's well documented, is something like: you automatically go to the production issue tracker and you create an issue there, and we've got scoped labels, and the labels are scoped by the service. It could be tricky, because I guess it would be the web servers, but maybe what we need is feature-category scoped labels as well, so we can say this incident was related to this feature category.
A
You
know
in
obviously
not
in
all
cases.
Can
you
do
that,
but
in
some
cases
you
can
and
then
and
then
actually,
even
if
it
was
an
incident
that
the
the
production
on
call
wasn't
involved
in
you
know
it
was
it
was.
You
know
it
becomes
part
of
the
record
of
of
of
incidents
yep
you
know
and
having
it
having
it
in
there
cuz.