From YouTube: Monitoring the monitor - David Leadbeater, G-Research
Description
Don’t miss out! Join us at our upcoming event: KubeCon + CloudNativeCon North America 2021 in Los Angeles, CA from October 12-15. Learn more at https://kubecon.io The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.
Monitoring the monitor - David Leadbeater, G-Research
There are various ways to make sure Prometheus is working. We’ll cover these from cloud based services to Prometheus instances monitoring each other. Then we’ll explain why we developed a component to help with this.
So that's quite simple, but that's not enough to actually monitor Prometheus itself, because, well, it's not a cartoon: it's not that simple, and you can't monitor yourself with yourself. So, going back to basics again, the architecture of a normal Prometheus setup is something like this: we have Prometheus talking to an Alertmanager, which sends alerts to some kind of alert receiver.
Obviously, it's a bit difficult to connect a phone to a server in the cloud, so the way people often deal with this is, rather than alerting when something is down, to have a particular alert that exists as a heartbeat and is always expected to fire. That alert is always sent to the receiver, and the receiver, somewhere on the internet, knows that it should expect to receive it; if it doesn't, then it raises an alert. It essentially inverts alerting: if the alert isn't there, start raising the alarm.
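In Prometheus terms, the heartbeat is usually just an alert rule whose expression is always true. A minimal sketch of such a rule follows; the rule name and annotation text are illustrative, not taken from the talk:

```yaml
# "Dead man's switch" style heartbeat rule (illustrative sketch).
# vector(1) is always true, so this alert fires forever; silence from it
# means Prometheus, or the path from Prometheus to the receiver, is broken.
groups:
  - name: heartbeat
    rules:
      - alert: Heartbeat
        expr: vector(1)
        labels:
          severity: heartbeat
        annotations:
          summary: "Always-firing heartbeat used to monitor the monitoring"
```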
So there are many ways of doing this. Healthchecks.io provides a service that does this; it's written in Python, so you can run it yourself, or there's a cloud-hosted version of it. Dead Man's Snitch integrates with PagerDuty and is cloud hosted. Karma, which is a web-based UI for Alertmanager, can also display an alert when a particular alert isn't present; that obviously doesn't page anyone, but it can show on a screen or somewhere that there's a problem, which, if you have a NOC or something, could potentially be useful.
So let's look at how we actually set up Alertmanager to talk to a heartbeat receiver. In the Alertmanager config we have a route that matches a label of severity: heartbeat and then sends that to a particular heartbeat receiver. You'll see in this example that the URL has an ID in it, which would be team specific, or specific to each Prometheus instance that is monitored by the receiver at the other end. Unfortunately, that means this Alertmanager file needs to contain an ID for every single thing that is monitored.
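A minimal sketch of what such an Alertmanager configuration could look like; the receiver name, URL and ID placeholder are illustrative assumptions, not the exact example from the talk:

```yaml
route:
  receiver: default
  routes:
    # Heartbeat alerts are split off from normal routing and pushed to the
    # external heartbeat service.
    - receiver: heartbeat
      matchers:
        - severity = heartbeat

receivers:
  - name: default
    # ... the normal paging integrations live here ...
  - name: heartbeat
    webhook_configs:
      # The ID in the URL identifies which Prometheus instance this
      # heartbeat belongs to, so one entry per monitored instance has to
      # be kept in sync here.
      - url: https://heartbeat.example.com/ping/<team-or-instance-id>
```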
Obviously, that's not too difficult: it can be templated, or handled with various other approaches, but it still means this is yet another thing to configure, and that configuration needs to be managed and so on. It's yet another moving part, essentially. So instead, with prommsd, we have the same alert that we had before.
The activation is the activation time, there are some labels to override, and then the Alertmanagers to send the alert to. That last part is, unfortunately, the one thing that prommsd compromises on: it can't support dynamic Alertmanager discovery, because the Alertmanagers have to be actually specified in the alert itself, although potentially we could fix that with some changes elsewhere in the future.
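Putting that together, the heartbeat rule carries its own prommsd settings as annotations. This is only a sketch: the annotation keys and example values below are assumptions for illustration, and the real keys come from the prommsd example configs.

```yaml
groups:
  - name: heartbeat
    rules:
      - alert: ExpectedAlertHeartbeat
        expr: vector(1)
        for: 30s               # don't count a flapping Prometheus as healthy
        labels:
          severity: heartbeat
        annotations:
          # Illustrative annotation keys only; the idea is that everything
          # prommsd needs travels inside the alert itself:
          activation: 1m                    # when prommsd should expect the next heartbeat
          labels: 'severity="critical"'     # labels to override on the alert prommsd raises
          alertmanagers: http://alertmanager.example.com:9093   # where prommsd sends that alert
```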
But this does mean that all the configuration for a team's alert is actually contained in the alert itself, and nothing special is needed for heartbeat events. Obviously, teams would probably still have team-specific routing in the central Alertmanager, but they don't need separate configuration for heartbeats, which might otherwise get forgotten, because it's not something that's touched all the time.
So what then happens is that this raises two alerts, in this case one from each of our availability pair, and those go to prommsd.

So let's actually see how this works. Over here I have some of the example configs that come with prommsd; it's just a configs directory, and I'm running four terminals here. First of all, I'm just running netcat listening on a random port: this is going to be the normal alert receiver, so we'll just see the HTTP requests that are sent to it. Inside prommsd, this configs directory has an Alertmanager config, an alert and a Prometheus config. So what I'm going to do is run Alertmanager using that pre-provided config, and also run Prometheus, so Prometheus will then be running. You'll notice I haven't yet started prommsd, so I also just need to do that.
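For reference, a demo Prometheus config of this shape would look roughly like the following; the file names and ports are assumptions, not the exact files shipped with prommsd:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s      # how often the heartbeat rule is evaluated

rule_files:
  - alerts.yml                  # the heartbeat rule shown earlier

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]   # the locally started Alertmanager

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]     # Prometheus scraping itself
```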
So we now have Prometheus, Alertmanager and prommsd all running. Let's first of all go to Prometheus here: if we look at the alerts UI, you can now see that this expected-alert heartbeat is active, and we can see all the activation settings I discussed. You'll notice in this case, though, that I've put the activation at one minute. You'll also notice that the alert isn't actually firing yet, because there's a 'for' threshold of 30 seconds, just to make sure the Prometheus instance isn't flapping, so the alert is still pending.
Hopefully I've spoken for long enough... I have, and that alert is now firing. That now means we have an expectation of that heartbeat, and it is firing, so what's happening to it? Well, it's going to Alertmanager, which conveniently I have running here, and we can now see there's an alert for the expected-alert heartbeat over here, with the relevant annotations on it. If we check where that's going, it's going to prommsd, and nothing has gone to our pager. Okay.
So then, if we go over to prommsd over here, we'll see we just have one Prometheus. In this case it's not running in Kubernetes, so there's no namespace or anything; it's just 'prometheus'. Obviously, in a real setup you would have a few more labels there, but for a demo this works. You'll see that it is saying it will activate in a few seconds, and I've actually got this set, I think, to repeat every five seconds.
So if I just sit here reloading this page, you'll see it never actually gets below about 55 seconds. So now let's go to where we're running Prometheus and I'll just kill it. Okay, that was pretty fierce: yes, I've now stopped Prometheus.
So actually that's interesting, because you'll see this alert is still active, and that's because Alertmanager over here still knows about the alert for now. I've actually set the evaluation interval in the Prometheus config to 15 seconds, so if I carry on talking for about four times 15 seconds, we should eventually find that the alert stops being sent. Luckily, this isn't live, so if this fails I'll just edit it.
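The timing here comes from the rule evaluation settings: Prometheus re-sends its firing alerts on each evaluation and stamps them with an expiry a few intervals ahead, so once it dies the already-sent heartbeat lingers in Alertmanager for roughly that long. A sketch of the relevant line of the demo config; the exact expiry behaviour depends on the Prometheus version and resend settings:

```yaml
global:
  # Rules are evaluated, and firing alerts re-sent to Alertmanager, every
  # 15 seconds; after Prometheus is killed, the last alert it sent expires
  # a few intervals later -- hence "about four times 15 seconds" before
  # the heartbeat disappears.
  evaluation_interval: 15s
```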
Okay, so obviously that's a very simple setup, and in reality you'd have a few more components involved, so a full architecture for deploying it might look something like this.
You have three teams running Prometheus instances for their applications, which talk to an Alertmanager cluster. The Alertmanager cluster routes to prommsd, as well as to things in the cloud for other alerting. There's also an infrastructure Prometheus, for example, that, rather than using the prommsd running locally in the cluster, uses something in the cloud, which could be another instance of prommsd running elsewhere, or it could be one of the cloud monitoring services mentioned earlier.
You'll also notice that prommsd doesn't need to depend on anything other than a webhook receiver, which could run on the same machine or, you know, even in the same pod as prommsd in Kubernetes; a sketch of that is below.
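One way to realise that co-location is a two-container pod. Everything in this sketch, including the image names and ports, is an assumption for illustration only:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: prommsd
spec:
  containers:
    # prommsd itself -- image name, port and flags are placeholders.
    - name: prommsd
      image: registry.example.com/prommsd:latest
      ports:
        - containerPort: 8080
    # The webhook/alert receiver that prommsd notifies when a heartbeat
    # goes missing, co-located so prommsd has no external runtime
    # dependency beyond this pod.
    - name: alert-receiver
      image: registry.example.com/alert-receiver:latest
      ports:
        - containerPort: 9000
```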
And, as mentioned, the infrastructure Prometheus has separate monitoring that is potentially in a different cluster or elsewhere, so the infrastructure team can be notified if everything is broken.
Application teams can be notified by an explicit alert if their Prometheus is broken, but if they're actually running in multiple clusters, maybe they don't need to be told that their application is down when it's really an infrastructure problem, because the probes don't fail; it means you don't get a critical "everything is broken" alert when actually it's not all broken. So there's flexibility in how you set this up, which means you can make sure the alerts are actually actionable, and so on.