From YouTube: Delivery: Intro to Monitoring at GitLab.com
Description
Follow along as members of the Delivery team discuss the various components involved in monitoring GitLab.com.

This is a re-recording. So, monitoring: there are quite a lot of components to it, so here I list all the components that we actually install somewhere, plus a bunch of stuff that we use externally. Some of these are part of the Prometheus suite: Alertmanager, Prometheus, and the Pushgateway are all components of Prometheus itself.

Grafana is our fancy dashboard that we all know and love. mtail is a component that we use to monitor our logging; it generates metrics based on very fancy regexes, which is amazing, and it's produced by Google, I think. We've got a component called Thanos, which I'm not 100% familiar with and which is what drove the creation of this issue, and Trickster, which I learned about and is new to me.

There are many, many exporters for various services, some of which we create ourselves; our applications export metrics, and Prometheus is geared towards scraping all of them. And then we still have InfluxDB and a relay sitting in our environment. So I'm just going to kind of roll through what I've discovered, and if you have questions, feel free to stop me and try to figure out what in the world I'm talking about.

As far as Alertmanager goes, this is responsible for sending alerts of all shapes, sizes, and forms everywhere.

I decided not to dig into precisely how rules are set up, because we have so many of them, but Alertmanager is configured to send alerts to various places. PagerDuty and Slack are the primary destinations, and the type of rule determines the severity that PagerDuty or Slack reacts upon.

Alertmanager is set up in a redundant fashion: we've got two nodes sitting in our production environment, two nodes sitting in ops, and, because we still have some infrastructure in Azure, we have an Alertmanager sitting in the Azure environment as well.

What I thought was kind of interesting is how Prometheus actually operates with Alertmanager. It's the job of the Prometheus service to reach out to an Alertmanager; Prometheus simply sends the alert data to the Alertmanager, and it's up to the Alertmanager to send the alert out. So in the production environment, for example, Prometheus knows of four total Alertmanager servers; it picks one, and the alert goes out in some way based on how the Alertmanager service is configured. Staging, for whatever reason, will only use the production Alertmanagers; the disaster recovery environment uses the production Alertmanagers; the pre environment uses both the production and ops Alertmanagers; testbed uses both as well; and ops will only use the production Alertmanagers.

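To make that handoff concrete: Prometheus just delivers alert payloads to one of its configured Alertmanagers over HTTP, and Alertmanager does the routing. The sketch below pushes a test alert the same way, using Alertmanager's v2 API; the hostname, labels, and annotation are made up for illustration.

```python
import requests
from datetime import datetime, timezone

# Hypothetical Alertmanager address; the real nodes live in the production
# and ops environments and each Prometheus server picks one of them.
ALERTMANAGER = "http://alertmanager.example.internal:9093"

alerts = [{
    "labels": {
        "alertname": "ExampleServiceDown",   # made-up alert
        "severity": "s1",                    # severity drives PagerDuty vs. Slack routing
        "environment": "gprd",
    },
    "annotations": {"summary": "Test alert sent from a script"},
    "startsAt": datetime.now(timezone.utc).isoformat(),
}]

# Alertmanager receives alert payloads on this API and routes them according
# to its own configuration, exactly as described above.
resp = requests.post(f"{ALERTMANAGER}/api/v2/alerts", json=alerts, timeout=5)
resp.raise_for_status()
```
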
A little odd, I know. At one point in time we had a desire to move as much monitoring as possible into the ops infrastructure, and I wonder if we're just in the middle of that, but I never saw an issue discussing it, so I can't validate whether that's true or not, or whether it's something I made up while I was dreaming. Welcome, jarv. So yeah, that's Alertmanager.

So, Grafana: we've got two dashboard instances. You probably know this, but they are dashboards.gitlab.com and dashboards.gitlab.net. Dashboards.gitlab.com is the public-facing one: anyone can see any metrics that that specific instance has access to, but people supposedly cannot modify or create dashboards there, or rather, their dashboards would not be saved if they tried to create them. Dashboards.gitlab.net is our internal one: all the developers can log into it and create dashboards.

Kind of nifty: we create dashboards automagically from the runbooks. We have a job inside of our GitLab runbooks project, a small shell script that looks through a directory we've got sitting inside of there. Excuse me, where is that directory? Ah, dashboards. It's just a bunch of libsonnet files that generate a lot of dashboards.

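The mechanics of a job like that are roughly as follows. This is a hypothetical sketch, not the actual runbooks script (which is a shell script driving jsonnet): it assumes the dashboards have already been rendered to JSON and pushes them to Grafana's dashboard API.

```python
import json
import os
import requests

GRAFANA_URL = "https://dashboards.example.com"   # placeholder Grafana instance
GRAFANA_TOKEN = os.environ["GRAFANA_API_TOKEN"]  # API token with editor rights

def upload_dashboard(path):
    """Read one rendered dashboard JSON file and push it to Grafana."""
    with open(path) as f:
        dashboard = json.load(f)
    resp = requests.post(
        f"{GRAFANA_URL}/api/dashboards/db",
        headers={"Authorization": f"Bearer {GRAFANA_TOKEN}"},
        json={"dashboard": dashboard, "overwrite": True},
        timeout=10,
    )
    resp.raise_for_status()

# Assume the .libsonnet sources were rendered to JSON earlier in the pipeline.
for name in os.listdir("dashboards/rendered"):
    if name.endswith(".json"):
        upload_dashboard(os.path.join("dashboards/rendered", name))
```
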
Dashboards.gitlab.com only has access to application metrics for staging and production; infrastructure-wise, it has metrics for staging, production, and the ops environments. All of the data sources are created manually. This is a known thing, and I'm pretty sure Ben has created an issue to address it in the future.

For the internal instance I'm pretty sure the same is true, but we have metrics coming from many, many places. Thanos Query is one location; I'll get to Thanos Query and what that is later. We've got an Elasticsearch data source for our production environment as well as some of our abuse cluster stuff; I don't entirely know what all that entails. We've got InfluxDBs for staging and production.

There's something for the webpage speed tests or whatever. We have infrastructure data sources for the staging, production, disaster recovery, ops, pre, and testbed environments, and we have application metrics for all of those except testbed, which I thought was kind of interesting, but that's probably just a small oversight. I'm pretty sure all these data sources are entered into that instance manually, so any time the instance needs to get rebuilt, someone needs to go in there and recreate them, which I think kind of sucks, because that happened recently to the dashboards.gitlab.com instance.

InfluxDB is another component: we've got four nodes, and they all serve as database servers, two of them in staging and two of them sitting in production. As far as I know, just the web, git, API, and Sidekiq fleets actually talk to the InfluxDB relay, which is installed on all those servers, and the relay pushes the metrics to InfluxDB.

mtail is another component I mentioned earlier. This looks at our log data, and there are, I used the word amazing, regexes. There's a cookbook literally called gitlab-mtail, and this is where we store all of our configurations for the amazing regexes that generate our metric data. These are pretty complicated; I don't know how we test them, but somehow we do, and the result makes it out there and becomes a scrapeable endpoint for Prometheus.

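Conceptually, mtail tails a log file, applies regexes to each line, and exposes counters that Prometheus then scrapes. The sketch below shows that idea in Python purely for illustration; it is not mtail's own program syntax, and the log path, regex, and metric name are invented.

```python
import re
import time
from prometheus_client import Counter, start_http_server

# Invented log format and metric; the real programs live in the gitlab-mtail
# cookbook and are considerably more involved.
STATUS_RE = re.compile(r'HTTP/\d\.\d" (?P<status>\d{3}) ')
http_responses = Counter("example_http_responses_total",
                         "Responses seen in the log, by status code", ["status"])

def process(line):
    match = STATUS_RE.search(line)
    if match:
        http_responses.labels(status=match.group("status")).inc()

if __name__ == "__main__":
    start_http_server(9999)          # /metrics becomes a Prometheus scrape target
    with open("/var/log/nginx/access.log") as log:
        log.seek(0, 2)               # start at the end of the file, like `tail -f`
        while True:
            line = log.readline()
            if line:
                process(line)
            else:
                time.sleep(0.5)
```
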
Prometheus is responsible for scraping metrics and for sending alerts to Alertmanager, so it knows how to alert on things. We have many, many jobs across a few different types of Prometheus instances; we have three Prometheus types: one simply called prometheus, plus prometheus-app and prometheus-db.

All three exist in production and staging; the other environments, like ops, disaster recovery, and pre, only have the prometheus and prometheus-app instances. From what I could tell, all of our jobs are in Chef, so there are roles that define all these scraping jobs. I did find some inconsistent configurations: for example, we have prometheus-db, but we have all this stuff related to monitoring our database inside of the prometheus-app instance, which I thought was kind of awkward.

But Prometheus does a lot. Its configuration is pretty convoluted, but it's easy to determine what it does, because everything is in Chef, and from what I could tell it's very well configured; there are just some inconsistent configurations in there. I have an open issue to figure out how we got to where we are, with the app instance pulling some database information.

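One quick way to see which scrape jobs a given Prometheus instance actually ended up with, and to spot inconsistencies like database jobs living on the app instance, is to ask its HTTP API for its active targets. A minimal sketch, with a placeholder hostname:

```python
from collections import Counter
import requests

# Placeholder host; point this at the prometheus, prometheus-app, or
# prometheus-db instance you are curious about.
PROMETHEUS = "http://prometheus.example.internal:9090"

resp = requests.get(f"{PROMETHEUS}/api/v1/targets", timeout=10)
resp.raise_for_status()
targets = resp.json()["data"]["activeTargets"]

# Count targets per scrape job so the instance's responsibilities are visible.
jobs = Counter(t["labels"]["job"] for t in targets)
for job, count in jobs.most_common():
    print(f"{job}: {count} targets")
```
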
So, exporters: we have a ton of them. Alertmanager has its own endpoint for scraping metrics, so Prometheus has a way to figure out whether or not we are actually able to alert. We have an alert to tell us whether or not we're able to alert, which is pretty funny.

Blackbox exporter is a Prometheus component. It's primarily used for probing things that we need to check from an external endpoint, kind of like what we do with Pingdom, but with the blackbox exporter. It's just a service that we run.

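For reference, the blackbox exporter exposes a /probe endpoint that Prometheus scrapes, with the target passed as a query parameter, and you can exercise it by hand the same way. In the sketch below the hostname and module name are assumptions, not our actual configuration.

```python
import requests

# Placeholder blackbox exporter host; 9115 is the exporter's default port.
BLACKBOX = "http://blackbox.example.internal:9115"

resp = requests.get(
    f"{BLACKBOX}/probe",
    params={"module": "http_2xx", "target": "https://gitlab.com"},
    timeout=15,
)
resp.raise_for_status()

# The response is plain Prometheus exposition text, e.g. `probe_success 1`.
for line in resp.text.splitlines():
    if line.startswith(("probe_success", "probe_duration_seconds")):
        print(line)
```
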
Fluentd is kind of an interesting one. Instead of running a dedicated service as an exporter, it's actually a plugin for fluentd written in Ruby, and Prometheus will scrape any data that the fluentd service provides us. cAdvisor, for some odd reason, we have only on our Gitaly nodes, so I don't exactly know what metrics we're getting out of cAdvisor that we don't get out of, say, the node exporter.

We also have something called gitlab-gitaly; I don't know what it actually is, but it's installed alongside cAdvisor in the same Chef recipe. I forgot to do further investigation on this, so I feel uneducated about it, my apologies, but it's something else that gets installed on all the Gitaly nodes. I completely forgot to circle back around to that, which sucks; I need to do that at some point.

We have one exporter dedicated to PgBouncer; again, we're using a community-provided one. And then we have a project called gitlab-monitor. I learned about this, it's new to me, but it monitors a few items for Prometheus, database stuff that I guess neither the Postgres nor the PgBouncer exporters provide: database bloat, database mirroring, row counts. And I guess we have something specific to CI builds and remote mirrors.

All of that is built into our gitlab-monitor project, and apparently gitlab-monitor is something that we also ship to our customers in the omnibus package, so that was kind of interesting to learn about. gitlab-monitor also provides us with some process metrics, but I didn't look into what those process metrics are, or why we don't have that covered by something like the node exporter or the process exporter.

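Whatever the exporter, the contract is the same: it serves Prometheus text exposition format over HTTP and Prometheus scrapes it. A small generic sketch of fetching and parsing such an endpoint; the URL below is a placeholder, not gitlab-monitor's actual address.

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

# Placeholder endpoint; every exporter discussed here serves this same format.
EXPORTER_URL = "http://exporter.example.internal:9168/metrics"

resp = requests.get(EXPORTER_URL, timeout=10)
resp.raise_for_status()

# Walk the metric families and print each sample, just to show the structure.
for family in text_string_to_metric_families(resp.text):
    for sample in family.samples:
        print(sample.name, sample.labels, sample.value)
```
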
Sorry, what I think it does is look at processes that you configure and gather various metrics about them. HAProxy, for instance: we want to make sure that we're only running a specific number of those instances, and we want to monitor the CPU usage of the HAProxy processes.

This came about because of the problems with HAProxy being single-threaded. We were running the single-threaded version, and it would just max out the CPU on a single core and then we would have performance problems, so we added the process monitoring so we could watch it a bit more closely. We have the multi-threaded version now.

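To give a feel for what per-process metrics like that look like, here is a toy sketch using psutil. It is illustrative only: the real collection is done by the exporter just described, and the process name is only an example.

```python
import psutil

def haproxy_process_stats():
    """Count haproxy processes and sum their CPU usage, roughly what a
    process exporter gathers before exposing the numbers as metrics."""
    count = 0
    cpu_total = 0.0
    for proc in psutil.process_iter(["name", "cpu_percent"]):
        if proc.info["name"] == "haproxy":
            count += 1
            cpu_total += proc.info["cpu_percent"] or 0.0
    return {"haproxy_process_count": count, "haproxy_cpu_percent_total": cpu_total}

if __name__ == "__main__":
    print(haproxy_process_stats())
```
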
That's just because we didn't want to create a separate node for something that's not going to be used very heavily. I can't recall off the top of my head how many services actually use this; it's not many. I'm pretty sure Andrew set up a few services that send metrics to it, but it's not heavily utilized.

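For context on services that send metrics rather than being scraped: the push-based component in the Prometheus suite is the Pushgateway mentioned at the top, and the client side of a push looks roughly like this. The gateway address and metric name below are placeholders.

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# Placeholder gateway address and metric name.
registry = CollectorRegistry()
last_success = Gauge(
    "example_job_last_success_unixtime",
    "Last time the example batch job finished successfully",
    registry=registry,
)
last_success.set_to_current_time()

# One HTTP push groups these metrics under job="example-batch" on the gateway,
# where Prometheus then scrapes them like any other target.
push_to_gateway("pushgateway.example.internal:9091", job="example-batch",
                registry=registry)
```
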
The Stackdriver exporter: we want to make sure we capture all the metrics that GCP is providing us, so we've got a dedicated node set up to monitor all that stuff. GitLab itself, as we know, has its own ability to generate metrics, so there's a list of all the port numbers and services that generate metrics for us.

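As a reminder of what it means for the application to export its own metrics, here is a minimal, generic sketch using the Python Prometheus client. GitLab's actual instrumentation is not this code; the port, metric names, and endpoint label are invented.

```python
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

# Invented example instruments; a real service registers many of these.
REQUESTS = Counter("app_requests_total", "Requests handled", ["endpoint"])
LATENCY = Histogram("app_request_duration_seconds", "Request latency in seconds")

def handle_request(endpoint):
    with LATENCY.time():                        # observe how long the work took
        time.sleep(random.uniform(0.01, 0.1))   # stand-in for real work
    REQUESTS.labels(endpoint=endpoint).inc()

if __name__ == "__main__":
    start_http_server(8000)  # /metrics on this port becomes a scrape target
    while True:
        handle_request("/api/v4/projects")
```
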
The one that I want to highlight out of here is the Unicorn one. When I was doing my investigation of this, as well as when I was hopping on one of our recent incidents, Unicorn would constantly time this endpoint out; it would constantly take over ten seconds, and it makes me think we're losing some metric data because of that. But Andrew made it seem like this is a very well-known problem and that there's an issue somewhere to address it. So I thought that was interesting.

Okay, the HAProxy exporter: this is dedicated to gathering HAProxy metrics out of the socket that the HAProxy admin interface provides. I already talked about mtail. Prometheus exports metrics itself, so it scrapes itself to determine whether Prometheus is running or not, which is pretty fun. For the Redis service we use the community-provided Redis exporter, and the registry has its own metrics interface when you enable it, on the debug port 5001.

Thanos sends our data to the cloud for long-term storage and allows us to query that long-term storage quickly without needing to store or cache the data locally for long periods of time, so we keep our disk space usage under control. It also allows us to actually modify what's in long-term storage, so that we are not blowing up our cloud storage bills. It has a few components, and Thanos Query is one of them; this is what allows us to connect to the various Thanos Store instances.

B
The
query
is
the
end
point
where
we
would
send
queries
to
so
Gravano
would
use
Thanos
query
as
the
data
source
for
itself.
We
only
have
one
of
these.
It
sits
in
the
ops
instance
and
it
is
configured
to
look
at
everything
icebox
that
we
have
in
our
environment
and
we've
got
the
network
peering
in
GCP,
set
up
to
cross
various
projects.
To
allow
this
to
work,
the
Thanos
query
is
going
to
connect
to
the
Thanos
store
instance.
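Thanos Query speaks the same HTTP query API as Prometheus, which is why Grafana can point at it unchanged. A minimal sketch of issuing a PromQL query against it, with a placeholder hostname and an invented job label:

```python
import requests

# Placeholder Thanos Query endpoint; 10902 is its default HTTP port.
THANOS_QUERY = "http://thanos-query.example.internal:10902"

resp = requests.get(
    f"{THANOS_QUERY}/api/v1/query",
    params={"query": 'up{job="example-job"}'},  # invented PromQL selector
    timeout=30,
)
resp.raise_for_status()

# Print each matching series and its current value.
for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"])
```
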
When the Thanos Store receives a query, I guess it's somehow got magic programmed in, so it knows how to interpret that query and where to find the data. If you send it a query for old data, it somehow knows to look at object storage; if it knows the data is supposed to be from five seconds ago, it knows to go look at the Thanos Sidecar for the data. I'm going to skip the compactor for a moment. The sidecar is what runs next to Prometheus.

So for every Prometheus box we have a Thanos Sidecar service that runs. It looks at the data that Prometheus has dropped on disk, reads that data and serves it up as a way of querying it, and it also has a secondary process for shipping that data to object storage for long-term storage.

The last one I wanted to mention was Thanos Compact. This is primarily there to save us on billing costs, so it maintains a set of lifecycle rules. I don't think I linked directly to that, so maybe I won't be able to find it quickly, but I do think it's important for us to know that information.

Perfect, okay. So yes, Thanos Compact will actually reach out to long-term storage, so object storage, and it modifies the resolution of our data based on these rules. For the past one year we should be keeping every single metric available to us that we ever captured. If it's older than a year, we change the resolution of those metrics to five minutes; I'd be curious what kind of algorithm it uses so that we're not severely hampering ourselves, but I don't know. That covers up to five years' worth of data; anything older than five years we keep indefinitely, but the resolution changes to an hour.

Thanos Compact also has an awkwardness in that it needs to be the only thing running against a given bucket, so it's a singleton on purpose, and it needs to remain that way, because otherwise two compact instances might run, bump into each other, and corrupt our data, which is really fun to hear about.

We have a helper file simply called generate-inventory-file. It uses a Chef search to figure out all the Thanos instances and determine which port they're running on, and we feed some extra data into it to determine whether or not that host is public. I don't know why we need this, and I don't know why we need that third item, but that's our Chef search in particular, just to look at it really quickly.

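The shape of that helper, as described, is roughly the following. This is a hypothetical sketch rather than the real script: the node attributes, the stand-in for the Chef search, and the output format are all assumptions.

```python
import json

def fake_chef_search():
    """Stand-in for the Chef search that finds the Thanos-running nodes;
    the attribute names here are invented."""
    return [
        {"fqdn": "prometheus-01.example.internal", "thanos_port": 10901, "public": False},
        {"fqdn": "prometheus-app-01.example.internal", "thanos_port": 10901, "public": False},
    ]

def generate_inventory(path):
    """Write a simple inventory of Thanos endpoints for Thanos Query to consume."""
    endpoints = [
        {"address": "{}:{}".format(node["fqdn"], node["thanos_port"]),
         "public": node["public"]}
        for node in fake_chef_search()
    ]
    with open(path, "w") as f:
        json.dump(endpoints, f, indent=2)

generate_inventory("thanos-inventory.json")
```
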
All right, so Trickster. This one is new to me. Trickster is a reverse proxy for Prometheus, so our dashboards.gitlab.com instance uses Trickster to communicate with Prometheus; it doesn't talk to Prometheus directly. We get kind of a secondary win with this: because Trickster is only configured to look at the specific application or infrastructure Prometheus instances for those environments, dashboards.gitlab.com will only have access to those instances.

Any questions about Trickster? Okay, and the last thing I want to talk about is the third-party services. We use Pingdom to provide us external monitoring via a service that we pay for. We have a mixture of automated configurations that are generated and stored inside of our runbooks repository, and we also have a couple of checks that are created by hand.

We have a CI job that publishes any checks we've got automated. Status.io is primarily just for advertising our state over the course of time; our CMOCs are responsible for setting that status, and there's a link about that here. PagerDuty is simply responsible for paging us, so Alertmanager will send alerts there if they're of a high enough priority.

We have the ability to send pages to people if we need to gather their attention, but only, you know, if they've got an account in PagerDuty. And then Slack is where a bunch of alerts get sent to. Again, this is all configured inside of our runbooks repository, and we have a lot of channels, probably, I think, 20 or so alert channels. I highlighted the ones that I feel are the most important. Alerts is the bucket where, like, all the alerts just go.

The general channel appears to be specific to a lot of the operational things that the SREs are starting to pay attention to; a lot of this work was done by Andrew. It also uses some sort of plugin, so we get fancy charts and little buttons that allow us to do fancy things inside of it.

I think in general that's what we want to migrate to in the future, but until then, because it's such a large thing to accomplish, the alerts channel still exists. There's an alerts channel specific to staging; CI/CD have their own channel, which I'm actually not a member of; and then our abuse team have their own Slack channel as well.

This is the major comment you made, but I know that you made a couple of other comments around this, so please organize it in such a way, per topic might be okay, basically, where you put this again in the epic description and then start collecting the issues that we have, linking them to this epic. You also have a couple of open questions there; I want the open questions to also link to issues.

Runbooks: collect the runbooks that we have, collect the handbook items that we have related to this, and put it all in one place, and create issues that are going to cover groups of things. You don't have to go very granular on this, but, for example, I don't know, describe Trickster, why, what, how; describe Thanos, what, why, and how; and so on. This is specifically for documentation, so for documentation you're going to apply a documentation label on each issue, and you can apply, I don't know, find the label.

If we don't have one already, create a new one related to monitoring and alerting, maybe the two together, or yeah, two labels, monitoring and alerting, and put them on those issues. And this is where we are going to stop, we meaning Delivery: uploading the recording, giving the epic to whoever is going to be looking into this. If that is us at some point, then okay, but for now just have it written down somewhere, and then we continue with the work you started.

Since you originally started looking into this, it should take you about four hours total to complete, in my humble opinion: linking all these things together, uploading to YouTube, and all of those things, so basically the rest of today. You should take that to clean this up, and then you're going to communicate this to the rest of the organization.

For now, what you can also do, for curious people, is cross-link it to development as well, because I know that we have some people wanting to know more about this. And finally, we do have our monitoring group, so they need to know about this as well, so go to wherever the monitoring group is in Slack and also cross-link it to them.

Then again, thinking about it, I think that's not necessarily our task, so yeah. Those are the couple of things I want to see as follow-ups, and then obviously, once you complete all of those, in this issue link to the recording, link the epic, link to, well, you don't have to link to it, just say that you placed a certain message: copy/paste the message inside the issue, and then comment and close, and that's it.