Description
A quick demo of https://gitlab.com/gitlab-com/runbooks/-/merge_requests/2684, which lets SREs navigate quickly between different observability systems, such as Kibana, BigQuery, Stackdriver and Sentry. The aim is to reduce the MTTD for incidents, helping to drive up the availability of GitLab.com.
So I'm going to choose the web service overview dashboard. As you know, the top line is the key service metrics for the service, aggregated up to the service level. Below that, each service is broken down into components. In the case of the web servers, the components are the Puma component and the Workhorse component, and these two work together to service requests for the web service.
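To make the "aggregated up to the service level" idea concrete, here is a minimal sketch of rolling component metrics up into one service-level figure. It assumes each component reports a request rate, an error rate and an Apdex score, and that the service-level values are a request-weighted combination; the names `ComponentMetrics`, `ServiceMetrics` and `aggregate_service` are hypothetical and not taken from the runbooks repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ComponentMetrics:
    """Hypothetical key metrics for one component over one time window."""
    name: str
    request_rate: float  # requests per second
    error_rate: float    # errors per second
    apdex: float         # fraction of requests meeting the latency target, 0..1


@dataclass
class ServiceMetrics:
    request_rate: float
    error_ratio: float
    apdex: float


def aggregate_service(components: List[ComponentMetrics]) -> ServiceMetrics:
    """Roll component metrics up to a single service-level figure.

    Assumption: rates are summed, and Apdex is weighted by each component's
    request rate so that busier components count for more.
    """
    total_rps = sum(c.request_rate for c in components)
    if total_rps == 0:
        # No traffic: report no errors and a perfect Apdex rather than divide by zero.
        return ServiceMetrics(request_rate=0.0, error_ratio=0.0, apdex=1.0)
    total_errors = sum(c.error_rate for c in components)
    weighted_apdex = sum(c.apdex * c.request_rate for c in components) / total_rps
    return ServiceMetrics(
        request_rate=total_rps,
        error_ratio=total_errors / total_rps,
        apdex=weighted_apdex,
    )


web = aggregate_service([
    ComponentMetrics("puma", request_rate=100.0, error_rate=1.0, apdex=0.99),
    ComponentMetrics("workhorse", request_rate=100.0, error_rate=0.0, apdex=0.97),
])
```

Because Workhorse proxies requests through to Puma, a real roll-up may need to avoid counting the same request twice; this sketch deliberately ignores that.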
For these components we again have Apdex, error rates and request rates, so we have component-level metrics for each component. This is the new panel that we've added: each component gets one of these panels. The first row is all about the Puma component and the second row is all about the Workhorse component, and you can see that, because they are different services, they each have their own set of links.
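One way to picture the per-component panel is as a catalogue that maps each (service, component) pair to its own set of tool links, scoped to the stage being viewed. The sketch below assumes exactly that; the `CATALOGUE` contents, the `example.com` base URLs and the query parameter names are placeholders, not the real GitLab.com endpoints or the merge request's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple
from urllib.parse import urlencode


@dataclass
class ToolLink:
    """One observability link shown on a component's panel (hypothetical)."""
    title: str
    base_url: str
    params: Dict[str, str]


# Hypothetical catalogue: URLs and parameter names are illustrative only.
CATALOGUE: Dict[Tuple[str, str], List[ToolLink]] = {
    ("web", "puma"): [
        ToolLink("Sentry", "https://sentry.example.com/issues/", {"project": "gitlabcom"}),
        ToolLink("Kibana: Rails logs", "https://kibana.example.com/app/discover", {"index": "rails"}),
    ],
    ("web", "workhorse"): [
        ToolLink("Kibana: Workhorse logs", "https://kibana.example.com/app/discover", {"index": "workhorse"}),
        ToolLink("Stackdriver profiler", "https://profiler.example.com/", {"service": "workhorse"}),
    ],
}


def panel_links(service: str, component: str, stage: str) -> List[Tuple[str, str]]:
    """Return (title, url) pairs for one component's panel, scoped to a stage."""
    links = []
    for tool in CATALOGUE.get((service, component), []):
        # Merge the tool's fixed parameters with the stage the dashboard is showing.
        query = urlencode({**tool.params, "stage": stage})
        links.append((tool.title, f"{tool.base_url}?{query}"))
    return links
```

Keeping the links as data rather than hard-coding them into each dashboard means a new component only needs a catalogue entry to get its own row of links.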
So let me give you a demo. Most services have a Sentry tool associated with them. If I clicked on that, it would take me to the right place in Sentry to show me the exceptions for the service. Probably the more interesting one, though, is Kibana. We have various logs: slow logs, failed requests, and just the normal logs, but those aren't as exciting as the visualizations. During an incident, one of the things that takes up a lot of time is trying to build a visualization.
A visualization is what lets us dig into high-cardinality data like IP address or username, and what this feature does is automatically generate pre-canned visualizations in Kibana: when you click on the link, it takes you straight there. So, the first one: remember, this is the web service that we're looking at, and we're looking at the Puma component on the main stage. If I click on here now...
Hopefully this will give me a graph of all the requests coming to the web service on the main stage, taken from the Rails log, which is the log associated with the Puma component. That has already saved us a whole bunch of time. Then, if we want to dig into further detail, we can break this down very quickly by splitting the series, say, by status code.
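Splitting the series by status code amounts to counting requests per time bucket, with one series per status value; in Kibana this is roughly a date histogram with a split series on a terms aggregation. The sketch below reproduces that shape over in-memory records. `LogEntry` is a hypothetical, heavily simplified stand-in for a Rails log record (Unix timestamp plus HTTP status), not the real log schema.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class LogEntry(NamedTuple):
    """Hypothetical minimal log record: Unix time in seconds and HTTP status."""
    timestamp: int
    status: int


def split_by_status(entries: Iterable[LogEntry], bucket_seconds: int = 60) -> Dict[int, Counter]:
    """Count requests per time bucket, producing one series per status code.

    Returns {status: Counter({bucket_start: count, ...}), ...}.
    """
    series: Dict[int, Counter] = {}
    for entry in entries:
        # Align each timestamp to the start of its bucket.
        bucket = entry.timestamp - entry.timestamp % bucket_seconds
        series.setdefault(entry.status, Counter())[bucket] += 1
    return series


entries = [LogEntry(0, 200), LogEntry(30, 200), LogEntry(61, 500), LogEntry(90, 200)]
series = split_by_status(entries)
```

Each key of `series` becomes one line on the graph, which is why a spike of 5xx responses stands out immediately once the series is split.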
We also have other tools like Google Stackdriver, which is probably more useful for development teams. If you come in here, you can see that Workhorse has continuous profiling, so all you need to do is click on this link and it will take you to the right place to investigate the continuous profiles for that service.
We also have links for other services. Take the front-end service, which is where we run HAProxy. HAProxy is unusual in that its log volume is too high to import into Elasticsearch, so instead of sending the logs there, we send them directly to BigQuery.
So if we want to investigate what's going on here, all I have to do again is click on that link. This opens up BigQuery, and what's great is that it comes with a pre-canned query. If I ran it, I would get a whole bunch of information about what HAProxy was doing at this moment in time. I'm not going to run it, because the results will contain personally identifiable information, and I don't want to include that in the video.
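A pre-canned query like this is essentially a SQL template scoped to a time window around the moment under investigation. The sketch below builds one as a string. The table name `example-project.haproxy_logs.requests` and the columns `timestamp`, `backend_name` and `status_code` are hypothetical, since the real HAProxy log schema isn't shown. As a design choice that differs from the query in the demo, it aggregates by backend and status instead of selecting raw rows, so client IPs and other PII never appear in the result.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical table; the real HAProxy log table and schema may differ.
TABLE = "example-project.haproxy_logs.requests"


def _ts(moment: datetime) -> str:
    """Format an aware datetime as a BigQuery TIMESTAMP literal in UTC."""
    utc = moment.astimezone(timezone.utc)
    return f"TIMESTAMP '{utc.strftime('%Y-%m-%d %H:%M:%S')}+00:00'"


def precanned_query(at: datetime, window: timedelta = timedelta(minutes=10)) -> str:
    """Build an aggregate query centred on the moment being investigated."""
    if at.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    start, end = at - window / 2, at + window / 2
    # Group rather than SELECT *, so no per-client fields reach the result set.
    return (
        "SELECT backend_name, status_code, COUNT(*) AS requests\n"
        f"FROM `{TABLE}`\n"
        f"WHERE timestamp BETWEEN {_ts(start)} AND {_ts(end)}\n"
        "GROUP BY backend_name, status_code\n"
        "ORDER BY requests DESC\n"
        "LIMIT 100"
    )


query = precanned_query(datetime(2021, 1, 1, 12, 0, tzinfo=timezone.utc))
```

Centring the window on the incident time means the same link works for any moment the dashboard is showing; only `at` changes.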