From YouTube: Prometheus Deep Dive - Ben Kochie, GitLab
Description
Join us for Kubernetes Forums Seoul, Sydney, Bengaluru and Delhi - learn more at kubecon.io
Don't miss KubeCon + CloudNativeCon 2020 events in Amsterdam March 30 - April 2, Shanghai July 28-30 and Boston November 17-20! Learn more at kubecon.io. The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects
Prometheus Deep Dive - Ben Kochie, GitLab
After the Intro session we will go into a mix of advanced use cases, news, and open Q&A with all Prometheus maintainers who are at CloudNativeCon.
https://sched.co/UahS
A: So, welcome to the Prometheus deep dive at KubeCon. My name is Ben. I'm a site reliability engineer at GitLab, and I've also been a contributor to the Prometheus team for quite a number of years now. Most of my work is not on Prometheus itself, but on all the exporters and integrations that people use. If you have questions, one of our other Prometheus developers will be handing out mics; if you have a question in the middle of the talk, feel free to interrupt me.
A: So now that you've got Prometheus installed, what do we do here? Well, there's a lot of good reading material on monitoring itself and why it's good to use monitoring to let you know when your systems are working properly. I'm a big fan of the RED method and the USE method, which are kind of two sides of the same coin. They're all about looking at your metrics from the perspective of your users, because your users don't care if you're out of memory or your CPUs are overloaded; they care whether their requests are going through, and going through quickly. There's a lot of great material on that. And for Prometheus itself, there are a couple of really great books you can read; they go through all the detail of getting you into Prometheus.
A: Prometheus itself is what I call an intentionally uncoordinated distributed system. The Prometheus design came from a need where the monitoring system had to be the most reliable thing on the network, which meant that Prometheus itself needed to have the least number of dependencies on anything else on your network. So as long as it's up and running, has a little local disk, and can reach the network, it can monitor, versus other monitoring platforms. So if you've got a thousand pods, and 500 of them are for one service and 500 are for another service, you split your Prometheus servers so one covers the 500 for one service and one covers the 500 for the other. So it's a minimal-dependency, super robust piece of software. Prometheus includes its own built-in time series database. It uses a write-ahead log for reliability during operation and restarts, and the time series database itself is written out immutable, so it's hard to corrupt, and it works pretty well.
A: Prometheus is collecting float64 data all the time, and the best practice is to count your events and expose those as counters. Because Prometheus is a polling-based system, if you lose a data point you don't want to completely lose the information that was contained in it. So you have a count of, say, one, then a scrape happens and now the count is two, and another scrape happens and now the count is three, and so on.
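The polling model above can be sketched in a few lines of Python (a toy model, not any Prometheus client library): because the counter is cumulative, a lost scrape costs only timing resolution, never the events themselves.

```python
# Toy model of a cumulative counter under a polling (scrape) model.
# All names here are illustrative, not Prometheus APIs.

def scrape_samples(counter_values, drop_indices=()):
    """Return the scraped samples, skipping any lost scrapes."""
    return [v for i, v in enumerate(counter_values) if i not in drop_indices]

def increase(samples):
    """Total increase between the first and last retained sample."""
    return samples[-1] - samples[0]

# The counter as seen at each scrape: 1, 2, 3, 4 events so far.
counter = [1, 2, 3, 4]

full = increase(scrape_samples(counter))        # every scrape arrived
lossy = increase(scrape_samples(counter, {2}))  # one scrape was lost

# The cumulative value carries the history, so the missed scrape
# does not lose any of the counted events.
print(full, lossy)  # 3 3
```

With a gauge, by contrast, a lost scrape really is lost, which is why counting events as counters is the recommended practice.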
A: That's going up, somewhere around fifteen hundred or so every fifteen seconds, and you can see that the scrape interval is fifteen seconds. So we get four data points within a minute: there's one at four seconds, the next one at 19 seconds, and so on. And that's just one instance of this server. This is from an HAProxy web reverse proxy, and of course we have more than one HAProxy to load balance all of our systems.
A: So here are the data points from a different HAProxy. It's not going up as fast, but also notice that the samples are offset differently. You tell Prometheus to scrape all of your targets every 15 seconds, but it's not doing that on a clock basis; it's actually taking all of your targets and spreading them out over the scrape interval, so that you get a nice, even flow of ingestion into Prometheus.
A: It actually extrapolates the two data points to find out what the total increase would be if there were more samples within that range, and this is a little bit confusing for some people, because they see a counter go up by one and say, "but I got a number that was 2.5."

Another interesting thing: what if I asked for the increase over a whole minute? I don't have any more pretty pictures, thank you, good gnuplot. Say the blue circles at the top of the graph, the first two data points, were missing. If I didn't do extrapolation and was missing those two data points, I would only get about 2,000 on the increase, versus the total of 6,000, which is what's really going on here.
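The extrapolation can be sketched as follows. This is a simplification of what PromQL's increase() does; the real function also clamps the extrapolation at series boundaries and handles counter resets, both of which this sketch ignores:

```python
def increase_extrapolated(samples, window_seconds):
    """Scale the raw delta between the first and last sample in the
    window up to the full window, assuming a constant rate.

    samples: list of (timestamp_seconds, value) pairs inside the window.
    """
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    return (vn - v0) * (window_seconds / (tn - t0))

# A counter that went up by exactly 1, with samples covering only
# 24 s of a 60 s window, is reported as an increase of 2.5:
print(increase_extrapolated([(0, 10.0), (24, 11.0)], 60))  # 2.5
```

That scaled-up estimate is where the surprising non-integer results come from: the raw delta was 1, but the query assumes the counter kept rising at the same rate across the parts of the window it couldn't see.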
A: The next big question a lot of people ask is: okay, now I've got a Prometheus and it's overloaded, or I now have two Prometheus servers and high availability, and we're going to start talking about scaling. The first question people ask is: well, how do I capacity plan?
A: Well, it depends mostly on the rate of ingestion of your data, because as Prometheus collects data it's got to store it somewhere, and it currently needs about 1.5 bytes per sample. So it's pretty easy to do a little bit of capacity-planning math, and it also depends a little bit on the data that you've got. So usually the recommendation is that you start out by scraping some of your data, and then you can start to build a capacity-planning story for your Prometheus. For example, we have a server in our production environment that's doing about a hundred thousand samples per second; multiply by about one and a half bytes per sample, for 60 seconds, for 60 minutes, and it's about half a gigabyte an hour. That's a bit of data, but it's not too bad, and a hundred thousand samples per second is quite big.
A: Now the question is: I've got all these multiple instances, so how do I scale it? There's a bunch of different ways. Prometheus has a technique called federation, where you take a single Prometheus that is collecting data from a lot of targets, say a bunch of individual auto-scaled pods, where the individual pods don't matter and what you really care about is the cluster-level data. So you use what Prometheus calls recording rules: a recording rule takes a Prometheus query and stores the result as a new metric, and then you have a federation server on top of that which pulls in just the recording rules and ignores the individual pod data. That's a simple hierarchical system. Of course, if you tried to pull everything into that federated server, it would blow up; federation is not a method for replicating everything.
A: So yeah, Prometheus uses an inverted index, so when you add a new label it doesn't really take up that much more space, because the index itself only needs to store the key from that label into the metric, and it's relatively efficient. So it's totally okay to have a bunch of different labels, especially if they're important to you.
A: You do want to make sure that you don't have tons of values for those labels, because the more values, the bigger the index is and the longer it takes to scan those indexes when you're walking a metric. But if you've got tens of data centers or hundreds of clusters, that's a reasonable amount of cardinality.
B: It also depends on how things are correlated. If you have an instance label already, and then you add a cluster label, but in one cluster you always have the same instances, that's actually not increasing cardinality; it's essentially a free label. But if you add something fully orthogonal, you get multiplicative growth of cardinality.
B: Should I take it? Sure. That sounds like retroactively evaluating recording rules, is that it? Okay, that's not yet a feature, but the plumbing is all in place, so it's on the roadmap. For whoever works on it, it's actually pretty straightforward; not much to do there, technically.
B: The TSDB is structured in a way that makes inserting individual time series later a really expensive operation, but it is implemented as an operation, and now we just need the tooling that re-evaluates a rule over old data, or that backfills data you got from another source, and this will all happen fairly soon.
A: So, GitLab runs Thanos because it was really, really nice to deploy. Before we were using Thanos, we were smaller, and as we were growing we were adding more Prometheus servers, and it was tedious to set up dashboards that pulled from multiple sources. We use Grafana for our dashboards, and it was tedious to have one data source for this cluster and another data source for that cluster, and mixing data sources was really annoying. So we added Thanos as an overlay proxy to make it easier to query, but we weren't actually using any of the storage; we actually had six months of data in our Prometheus servers, and that was working quite well. Thanos was just an add-on to do the overlay layer and deduplication. Between our Grafana and our Prometheus servers we had a little nginx failover: it would pick the first one, and if that one was down it would grab the second. I forget what the nginx config for that is, but it was just a simple failover, and it would sometimes produce weirdness, because of course the two servers have different gaps. So adding Thanos to do the gap fill was really, really nice. Then, after we rolled that out, we started to experiment with pushing our TSDB data into object storage and using the Thanos store for that.
A: We're using it for our public dashboards, so we're using Trickster, and it works okay. But I've been talking with the Thanos developers about adding a caching layer; specifically, there's talk about bringing the Cortex query caching layer into Thanos, and there's some code shared between all these projects.
F: Prometheus Alertmanager is a nice tool for monitoring alerts and such, and I like its simplicity. But do you have any other integrations that might empower the ops personnel collecting Prometheus data to tune their alerts without necessarily having to create configurations at the code level, like PromQL and such?
G: Yeah, any recommendations or resources on how to go about comparing and contrasting the various storage solutions, between Cortex, M3DB, or Thanos, and how to choose one for your environment?
A: I don't have anything specific. Darren, do you have any recommendations for picking a storage layer? It kind of depends on what your network layout is, what kind of storage systems you have, and how much complication you want. Thanos is really interesting because it only requires object storage, and it works really well.
H: You asked us what's our greatest problem with Prometheus right now. For us, it's that the Jenkins plugin doesn't work: we installed it and it crashed immediately. Now, before I go fix that and submit a PR myself, what is the plan for things like Jenkins or other integrations that are popular but might not be in the core of what you guys are working on?
A: Yeah, we actually have a separate GitHub organization called prometheus-community, and we're slowly trying to build that up as the place where popular things can go, get additional maintenance, and be more official than just some random other GitHub org. We're slowly trying to get things on board and help them, so check out the prometheus-community GitHub org.
I: Maybe this is a bit more of a starter question. We've been using the Prometheus Operator and getting all the metrics for our cluster, and that's working really well; we've got Grafana running all sorts of interesting dashboards. As soon as we did that, the application developers started saying: hey, I want to use that for some application-specific metrics we want to put in. I'm wondering if you could talk a little bit about what we have to do to do that. I know there are SDKs for Java, Go, and so on.
I: But now, are we running an additional web server on each one of our pods to host all the metrics? Are we sending them somewhere? Are we registering them with something? I mean, the Prometheus Operator was great for the cluster, but now, to build this in, how complicated is that going to be?
A: The Prometheus metrics protocol is a simple HTTP GET, and all the Prometheus client libraries include hooks into whatever web servers are available for the language. If you've got Java, it uses the Java HTTP server; if you have a Go client, it uses Go's HTTP server. You just use the Prometheus client libraries in your code, and if you've got something like a request router, you can register the Prometheus metrics registry, which is the usual name for the internal metric counter trackers, and just add a route to /metrics. Prometheus can then scrape that data, collect it, and put it into the database. There are cases where you might want to put it on a different port instead and set up service discovery so that you've got your main API port and your metrics port, but it's up to you.
A: We have a large Rails app, and we're moving the Prometheus metrics from an inline controller to its own dedicated port on the Unicorn controller process, because it turns out that whenever we get slammed with traffic, Unicorn queues up and then we lose monitoring. So we've moved it to a separate port because of the limitations of Ruby; in Go it's all goroutines and it's no big deal. Got one on the right?
A: That's more of a question for things like Cortex and Thanos. I believe Cortex takes the HA pair and automatically throws away one of them, whereas Thanos keeps both around. Prometheus itself is not a remote-write receiver, so you can't stream one Prometheus to another; that's more of a question for Thanos and Cortex.
A: So yeah, the Prometheus web interface is intentionally simple; it's mostly there to provide a basic debugging interface, for getting started and testing queries. My usual workflow is to be on the Prometheus or Thanos interface, do a bunch of test queries, figure out what I want, and then copy and paste those into Grafana. But we actually just started a rewrite of the web interface.
A: Other people, you know, there's the operator stuff, and you can use jsonnet; there's a lot of jsonnet, and it's super interesting. We're actually moving all of our dashboards into Grafana via jsonnet, so instead of hand-clicking to make all of our dashboards, we auto-generate them, so they're all consistent across all the services.
B: Prometheus is super opinionated in many aspects, but this is intentionally up to you, to pick your poison. Having said that, a bunch of projects keep jsonnet mixins in subdirectories, just as examples of how you could create rules and dashboards with jsonnet, but that's not meant as the one canonical way; it's just a way for us to document possible rules and possible dashboards. We have time for a few more minutes, so...
M: As an infrastructure or operations person, I think Prometheus makes a lot of sense, and the scalability aspect of it is very logical. But there's a bit of an overhead, I think, when you ask regular day-to-day application engineers to start using it, because there are some non-intuitive concepts in there; they're different from, say, a metrics-as-a-service kind of company. Do you have any recommendations on how to onboard people mentally?
A: There are the Prometheus books, and we have a bunch of how-to guides and tutorial material on our website; we're always looking to improve those. We also have a YouTube channel: every year we have the Prometheus conference, with lots of great talks, and after we get the 2019 talks posted online I'm going to put together a Prometheus 101 playlist on YouTube. I think that's all we have time for.