From YouTube: Dive into Grafana and Prometheus at GitLab
Description
GitLab Infrastructure SRE and Director of Engineering discuss internal usage of Grafana, Prometheus, Thanos, runbooks, and production observability.
B: This is the monitoring system and time-series data store. With Prometheus we're not just collecting data for visualization; we're also collecting data for alerting, so all of our production alerting and monitoring is driven by Prometheus data. That also means we can be very data-driven in our approach to finding the actual root cause of a problem: we can get that from our data.
B: The Prometheus monitoring system has a number of components. At the core is the Prometheus server, which is basically a data collector, a time-series database, and an API, and the API provides a query language to get at the data you've collected. So it collects all the data, stores it, and then you can query it from that single source of data, and each Prometheus server runs independently.
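To make the "data collector, time-series database, and query API" description concrete, here is a minimal Python sketch of querying a Prometheus server's HTTP API with a PromQL expression. The server URL and the metric queried are assumptions for illustration, not GitLab's configuration.

```python
# Minimal sketch: run a PromQL instant query against the Prometheus HTTP API.
# The URL and metric name are placeholder assumptions.
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumed local Prometheus server

def instant_query(expr: str) -> list:
    """Run a PromQL instant query and return the result vector."""
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": expr},
        timeout=10,
    )
    resp.raise_for_status()
    payload = resp.json()
    if payload["status"] != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]

# Example: per-instance CPU usage rate over the last 5 minutes.
for series in instant_query("rate(process_cpu_seconds_total[5m])"):
    print(series["metric"].get("instance"), series["value"][1])
```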
B: It's designed to be modularized down to having one Prometheus per app: one Prometheus and one app, one Prometheus and one app. So it's designed to be hyper-sharded, essentially one-to-one with an application, and you'd have to have a single application so big that it requires horizontal sharding before you even need to start thinking about that.

A: Gotcha.
B: The configuration of Prometheus for alerting is all done through what are called rules, and rules are basically continuously executed queries. So, say you've got a bunch of data on memory usage: you can ask the cluster, what's the average memory usage for this process across my fleet?
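Conceptually, an alerting rule behaves like the loop below: a query that is evaluated over and over, firing when a condition holds. This is only an analogy, since real rules are declared in Prometheus rule files and evaluated by the server itself; the metric, job label, and threshold here are made up for illustration.

```python
# Analogy for a Prometheus alerting rule: a continuously executed query plus a
# condition. The URL, metric, job label, and threshold are illustrative only.
import time
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumed
EXPR = 'avg(process_resident_memory_bytes{job="my-app"})'  # fleet-wide average
THRESHOLD_BYTES = 2 * 1024 ** 3  # arbitrary example threshold (~2 GiB)

while True:
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": EXPR}, timeout=10)
    result = resp.json()["data"]["result"]
    if result:
        avg_bytes = float(result[0]["value"][1])
        if avg_bytes > THRESHOLD_BYTES:
            print(f"ALERT: average resident memory {avg_bytes / 1e9:.2f} GB exceeds threshold")
    time.sleep(60)  # Prometheus evaluates rule groups on a fixed interval like this
```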
B: Sure. In general, the GitLab SRE team follows what are called the USE method and the RED method, which means we typically don't look at things like the memory utilization of our servers. We look at the latency of our application, we look at the error rate of our application, and we look at the saturation of our application; we're looking at these golden signals, as they're called.
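As a hypothetical illustration of those golden signals expressed as PromQL, the snippet below sketches one query each for error rate, latency, and saturation. The metric names follow common exporter conventions and are not GitLab's actual series names.

```python
# Illustrative PromQL for the golden signals; metric names are assumptions.
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumed

GOLDEN_SIGNALS = {
    # Requests per second that returned a 5xx status.
    "error_rate": 'sum(rate(http_requests_total{status=~"5.."}[5m]))',
    # 95th percentile request latency, computed from a histogram.
    "latency_p95": 'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
    # Saturation example: how full a worker pool is.
    "saturation": 'sum(worker_pool_active) / sum(worker_pool_size)',
}

for name, expr in GOLDEN_SIGNALS.items():
    data = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": expr}, timeout=10).json()
    print(name, data["data"]["result"])
```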
B: We have these signals standardized across all of our services, and that's all checked in. From a simple logistical perspective, all of our rules are checked in to a git repo, and those rules are automatically synchronized up to our Prometheus servers, so Prometheus automatically picks them up and acts on them.

A: Gotcha.
B: We have this idea that everything that has to do with paging and alerting and on-call should all live in the same place. Say we have a new service that we're turning up: we add some new alerting rules for that service, but we also want to add the documentation for what to do for that service, and that can all be done in one merge request. Or say you're going to change a threshold: we change the threshold, we document why, and we document any changes to the runbooks, to the troubleshooting playbooks for our production, and we update any dashboards that need to be updated when we add that new service. So the dashboards are also in the same repo; the runbooks, the troubleshooting guides, and the "how to fix production" docs all live in this one big runbooks repo.
B: So this is a data templating language. It was inspired by the need to configure jobs running in Borg, which is basically the same thing as Kubernetes, so there are some people using this same templating language to configure their Kubernetes jobs instead of using something like Helm, because Helm is not very powerful in terms of language expressions. This has a much more powerful expression language that lets you do functional programming and other kinds of input and output.
B: This is inheriting from multiple template layers, like a general graph panel: this is a new graph panel, it contains the data source, and all of these things are templated. It just pulls in one template after another, and so this whole, entire, complicated dashboard with all of these graphs, a huge thing, is all coming from that.
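The templating language being demonstrated here is most likely jsonnet, which GitLab's runbooks repository uses to generate Grafana dashboards, though the name is not spoken in this part of the transcript. As a rough Python analogy of template-driven dashboard generation, the sketch below composes a dashboard out of reusable panel templates; the helper functions and field names only loosely mirror Grafana's dashboard JSON model and are illustrative, not a real API.

```python
# Rough analogy for template-driven dashboard generation. Real GitLab
# dashboards are generated with jsonnet; these helpers are simplified stand-ins.
import json

def graph_panel(title: str, expr: str, datasource: str = "prometheus") -> dict:
    """A reusable 'graph panel' template: only the parts that vary are passed in."""
    return {
        "type": "graph",
        "title": title,
        "datasource": datasource,
        "targets": [{"expr": expr}],
    }

def service_dashboard(service: str) -> dict:
    """Compose a whole dashboard for a service out of panel templates."""
    return {
        "title": f"{service} overview",
        "panels": [
            graph_panel(f"{service} request rate",
                        f'sum(rate(http_requests_total{{job="{service}"}}[5m]))'),
            graph_panel(f"{service} error rate",
                        f'sum(rate(http_requests_total{{job="{service}", status=~"5.."}}[5m]))'),
            graph_panel(f"{service} p95 latency",
                        f'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{{job="{service}"}}[5m])) by (le))'),
        ],
    }

print(json.dumps(service_dashboard("web"), indent=2))
```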
B: There are a couple of blog posts and some other documentation on what's called practical anomaly detection with Prometheus; if you search for those keywords you'll find this stuff. This is all generated through these service recording rules. So if I look I can find... let me see if I can grep for this in my source code right here. Looks like I'm in the right directory.
B: "Service operation"... so what we're doing is, for each service, we have these recording rules, and we create these component operation rates. We get this from, say, Redis commands processed; that's a recording rule defined elsewhere in our rules, and we take that and we aggregate it by environment, tier, and type.
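A hypothetical version of one of those component operation rate rules, expressed as an ad hoc PromQL query, might look like the following; the metric and label names (redis_commands_processed_total, environment, tier, type) are assumptions standing in for GitLab's actual recording rule definitions.

```python
# Illustrative 'component operations rate': take a raw counter and aggregate
# its per-second rate by environment, tier, and type. Names are assumptions.
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumed

EXPR = (
    "sum by (environment, tier, type) ("
    "  rate(redis_commands_processed_total[1m])"
    ")"
)

result = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query", params={"query": EXPR}, timeout=10
).json()["data"]["result"]

for series in result:
    labels = series["metric"]
    print(labels.get("environment"), labels.get("tier"), labels.get("type"), series["value"][1])
```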
B: And if we scroll down here, you can see this component operations per second, averaged over one week. What we do is generate an average over one week and a standard deviation over one week. So we take a whole week's worth of data in Prometheus, figure out what the standard deviation is over that, and record it as a separate piece of data.

A: Mm-hmm.
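A sketch of that weekly average and standard deviation signal, and the z-score that can be built from them, is shown below. In production these values are precomputed by recording rules; here the same expressions are simply run ad hoc, which, as discussed later, would be far too slow to do live on a dashboard. The series name is a stand-in.

```python
# Weekly average / standard deviation anomaly signal, run ad hoc for
# illustration. In production these are recording rules; names are assumptions.
import requests

PROMETHEUS_URL = "http://localhost:9090"  # assumed

RATE = "sum(rate(redis_commands_processed_total[5m]))"
AVG_1W = f"avg_over_time(({RATE})[1w:5m])"        # one-week rolling average (PromQL subquery)
STDDEV_1W = f"stddev_over_time(({RATE})[1w:5m])"  # one-week rolling standard deviation

# z-score: how many standard deviations the current rate is from its weekly norm.
Z_SCORE = f"(({RATE}) - ({AVG_1W})) / ({STDDEV_1W})"

result = requests.get(
    f"{PROMETHEUS_URL}/api/v1/query", params={"query": Z_SCORE}, timeout=30
).json()["data"]["result"]
print(result)
```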
A: Gotcha, interesting. Well, you mentioned performance in relation to anomaly detection. Have you hit issues with that, where there were queries that had performance issues, or maybe weren't advisable to run because of that kind of concern?
B: Well, yeah. We have a lot of different dashboards, and so we've had to go and create these recording rules to summarize out the one-week values, because I can ask Prometheus for this right now and it will go and calculate it, but if you want to graph a whole one-week standard deviation, that's going to be stupidly slow, because it's going to be pulling millions and millions of samples into memory and trying to do a live calculation on them.
B: For every component, the standard thing in Prometheus is this thing called a metrics endpoint. The metrics endpoint is just a simple HTTP interface: when you do a GET on /metrics, it dumps the current state of all the data, and this is a standard interface that every Prometheus target uses. So if we look at targets in our production environment, you can see that... well, those are probes, and we don't care about probes. Why are we still looking at probes?
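For reference, exposing a /metrics endpoint like the one described takes only a few lines with the official prometheus_client library; the port and the metrics themselves are arbitrary choices for the example.

```python
# Minimal /metrics endpoint using the official prometheus_client library.
# The port and metric names are arbitrary example choices.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

REQUESTS = Counter("demo_requests_total", "Total demo requests handled")
QUEUE_DEPTH = Gauge("demo_queue_depth", "Current demo queue depth")

if __name__ == "__main__":
    start_http_server(8000)  # serves GET /metrics on port 8000
    while True:
        REQUESTS.inc()
        QUEUE_DEPTH.set(random.randint(0, 100))
        time.sleep(1)
```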
B: Let's look at something more interesting. So here we go: here is 9101, that's cAdvisor, so here's data coming from cAdvisor. We can't actually click this because it's on our internal network and we can't get to it, but you can see that we'll reach in and get this data out of our internal 10.x network, and it will print out the metrics. And if we go to something like prometheus.demo.do.prometheus.io...
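Scraping one of those /metrics endpoints by hand, which is essentially what Prometheus does on every scrape, can be sketched like this; the target URL is an assumption, and any exporter's endpoint would work the same way.

```python
# Fetch and parse a /metrics endpoint in the Prometheus text exposition format.
# The target URL is an assumed placeholder.
import requests
from prometheus_client.parser import text_string_to_metric_families

TARGET = "http://localhost:8000/metrics"  # assumed exporter endpoint

body = requests.get(TARGET, timeout=10).text
for family in text_string_to_metric_families(body):
    for sample in family.samples:
        print(sample.name, sample.labels, sample.value)
```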
B: Let's do that. It returned 780 metrics, which is a little noisy; we wanted to know about this one in particular. So we have this rule group, and if we go up here and filter for this rule group string, and say let's only look at one Prometheus server, since we have... wait.
B: One of the things we were talking about is the fact that you might have hundreds of Prometheus servers. GitLab has not hundreds but dozens; we've got about 20 or 30 or so, I think, and that number is growing as we expand. And in order to view all the data all at once... so we were looking at this one.
B: ...one Prometheus server. In front of Prometheus we've actually got a proxy system called Thanos, and if you go to Thanos and look at the stores, you can see all the backend Prometheus servers, and also what are called... there are different backend types. There's the sidecar, and those are attached to the individual Prometheus servers, so we've got, say, a Prometheus server running in GKE on Kubernetes.
B: We've got these other ones that are just running under regular Chef management, and then we've got these rule evaluation servers, which we can get to later. But then we also have these things called store servers. One of the things we do is we have this thing in Thanos that takes... so Prometheus is constantly scraping data and storing it in its time-series database, and then what happens is...
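Because Thanos Query exposes the same HTTP query API as Prometheus, the same kind of client code shown earlier works against it and fans out across all the backend Prometheus servers it proxies. A sketch, with an assumed endpoint:

```python
# Query Thanos Query, which proxies the backend Prometheus servers and can
# deduplicate results from HA pairs. URL is an assumed placeholder.
import requests

THANOS_QUERY_URL = "http://localhost:10902"  # assumed Thanos Query endpoint

resp = requests.get(
    f"{THANOS_QUERY_URL}/api/v1/query",
    params={"query": "up", "dedup": "true"},
    timeout=30,
)
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"][1])
```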
B: We have a subgroup of infrastructure that is focused on observability, and so we're working on expanding and improving Thanos performance, we're working on expanding our Elasticsearch infrastructure, and we're trying to come back and look at observability from a high-level perspective.
B: Okay, let me give you some fun numbers. From the ops environment we monitor all of our Prometheus servers in production; the number is 22 right now. You can see how many of what are called head series each one has, which is how many current metrics we're tracking. If we just take a sum of the head series, we get a high-level number of 29 million: 29 million individual metrics.

A: Mm-hmm.
B: But that's deduplicated across all of our high-availability instances. In Prometheus we have high availability where there are two Prometheus servers both doing the same thing: they're configured identically, they have slightly different data in the time-series database, but they're operating independently. And if we execute that without the deduplication, we have 58 million metrics.
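The head series numbers being summed here come from Prometheus's own self-monitoring metric, prometheus_tsdb_head_series. A sketch of reproducing the two figures, with the Thanos endpoint and the dedup toggle as assumptions about the setup being demonstrated:

```python
# Sum the current head series (active metrics) across all monitored Prometheus
# servers. prometheus_tsdb_head_series is Prometheus's self-monitoring metric;
# the URL and dedup toggle are assumptions about this particular setup.
import requests

THANOS_QUERY_URL = "http://localhost:10902"  # assumed

def total_head_series(dedup: bool) -> float:
    resp = requests.get(
        f"{THANOS_QUERY_URL}/api/v1/query",
        params={"query": "sum(prometheus_tsdb_head_series)", "dedup": str(dedup).lower()},
        timeout=30,
    )
    return float(resp.json()["data"]["result"][0]["value"][1])

print("deduplicated:", total_head_series(dedup=True))                  # roughly the 29 million figure
print("raw, HA pairs counted twice:", total_head_series(dedup=False))  # roughly 58 million
```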
B: Bytes over time... I was just doing some auditing of our production database for all of this. All the production-side GCS data, which covers about the last year of raw metric data plus downsampled data for the last five years, comes to about 26 terabytes of metric data, and that's in a heavily compressed format. Yeah, it's quite a bit.