From YouTube: Observability 101
A: Awesome, thank you very much. Everyone, welcome back from the break. We'll be continuing with our next session, from Richard. Richard is very well known as a personality within the open source industry, and in this session he'll be speaking to us about observability. When you are deploying your infrastructure, managing your services, or whatever you have in your enterprise, one thing that needs serious attention is observability, and he will be introducing it in Observability 101. We did some workshops, including a workshop on Grafana yesterday and last week, so this will give more context to understand it and other things around observability better. Over to you, Rich.
B: Let's get started: Observability 101, with a focus on Prometheus and beyond. Let's start with the buzzwords, because observability is an absolute buzzword, and as per usual, buzzwords tend to have a core of truth, a core of meaning, but often they're just applied to whatever you already have, which is understandable.
B: It is somewhat dangerous, though, because you need to actually understand why a term has become a buzzword and why there is so much industry attention on that term or concept. There is a concept of cargo culting, which is basically just replicating what you perceive others to be doing without actually looking into the details of it, and then not getting the outcome which you would actually like to get, which is obviously not what you want.
B: In this context, monitoring is the old term, more or less, and it has taken on a meaning of collecting a lot of data but not necessarily using it. There are extremes: either you just toss everything into a data lake and don't really use it, at least not in a monitoring and observability context, or you build full-text indexes of everything, which is just hugely expensive.
B
You
don't,
or
I
mean
you,
want
to
have
the
complex
systems
and
you
want
to
have
the
benefits
of
those
complex
systems,
but
you
still
want
to
enable
humans
to
understand
those
systems
and
also,
at
the
same
time,
you
enable
machines
to
understand
complex
systems,
which
means
you
can
automate
a
lot
of
things
like,
for
example,
alerting.
B: I distinguish between two types of complexity. One is fake complexity, which is just bad design, or legacy design, or what have you. Maybe there were design constraints before which are not there anymore; it doesn't matter. Often the things which are complex in a system are not inherent to the system, and that complexity can be reduced, and it should be reduced, because if you just have complexity for complexity's sake, you're making your own life harder and making it more expensive to run your service, which is again not what you want.
B: The other is inherent complexity, which you cannot remove; you can only move it around. We had monolithic and mainframe designs, we had client-server, we have microservices. Prometheus itself is a monolith, for example, so you can see that you can make different decisions, and even within the cloud native context it can make sense to run monolithic services like Prometheus. But you cannot make this complexity go away; you just move it.
B
It
must
be
comparison,
mentalized,
a
different
name,
for
this
is
service
boundaries,
like
your
hard
drive,
is
insanely
complex,
but
it
it
has
a
clearly
defined
interface
and
your
operating
system
in
your
main
board
can
just
address
your
hard
drive
same
for
your
cpu
and
such
those
are
also
super
complex,
but
it's
compartmentalized
away,
so
you
don't
have
to
deal
with
it
on
the
level
which
you're
dealing
with
with
whatever
service
you
have
same
for
cloud
instances
and
everything,
of
course,
and
ideally
it
should
be
distilled
in
a
meaningful
way
that
you
can
actually
extract
what
you
want
from
that
complexity
to
to
understand
what
is
happening
where
you
need
to
to
understand
it,
and
else
you
can
just
more
or
less
ignore
it
unless
you're
part
of
of
the
team,
which
is
actually
responsible
for
running
that.
B: The core meaning of SRE is to align incentives, because you want different people and different teams to actually work together and not against each other. A lot of what you can see in the Google SRE book and such, if you distill it to its essence, is basically about making people think about the same things and aligning their incentives, so that they automatically, without having to discuss and fight about it, do the same thing, or go in the same direction.
B: Hugely important here are SLI, SLO, and SLA: service level indicator, service level objective, and service level agreement. The indicator is just what you measure. The objective is what you do not want to go above. And the agreement means that where you do go above it, you actually have to pay, or you break a contract, or whatever. A specific example: if you have error budgets for your service, this allows your developers, your operations people, your product managers, everyone, to optimize for shared benefits.
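As a back-of-the-envelope illustration (my own sketch, not from the talk), here is what the error budget arithmetic looks like, assuming a simple availability SLO measured as the ratio of failed to total requests:

```python
# Minimal error-budget arithmetic for an availability SLO.

def remaining_error_budget(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget left: 1.0 = untouched, below 0 = blown."""
    budget = 1.0 - slo_target                      # e.g. 0.001 for a 99.9% SLO
    observed_error_ratio = failed_requests / total_requests
    return 1.0 - observed_error_ratio / budget

# Example: 99.9% SLO, 10M requests in the window, 4,000 of them failed.
print(remaining_error_budget(0.999, 10_000_000, 4_000))  # ~0.6: 60% budget left
```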
B: If that service is super stable, everyone gets to do their A/B testing and their new features and everything. But if that error budget is used up, the operations people can say: okay, we cannot push any updates unless they're super well tested, which puts load on the developers, and they can't ship their new features, which they don't like.
B: What can this mean specifically? Everyone using the same tools and dashboards would be a good thing. Of course, you then have the shared incentive that everyone invests in the same tooling. Everyone works on the same dashboards, so they share a language, because if you have only one dashboard, or five, the terminology is the same, so they share this language automatically. They share the understanding of how to look into the service, and since all their tools and such work the same way, that also pools your institutional knowledge.
B: Some people in your org will care more about external customers, but I would argue that internal customers within the org are just as important, because they provide services to other, external customers. So treating yourselves within the org as your own customers, between different teams and service owners, makes absolute sense in my opinion. You could also call this layering, and the internet wouldn't exist without proper layering, where you have your layer 2, your layer 3, everything, and you can fully parallelize the work on those different layers.
B: Of course, each one is a different layer with clean interfaces, so you could just do your work on one layer, and no one designing IP at the time had to think about whether there could ever be wireless. It just still works. I already talked about CPUs, hard disks and such. Even your lunch: in the common case, you will not be doing everything yourself. You won't be growing all your own wheat and blacksmithing your own tools to actually grow that wheat; you will be buying certain bits and pieces. So no matter how much you cook yourself, you still have those service interfaces everywhere in your life. Your customers don't really care about your internal things. They don't care if half of your database nodes are down; they care about their database service being up and quick, and that is how you need to think about those services.
B
You
need
to
think
about
them
from
the
perspective
of
the
paying
customer
who
doesn't
really
care
about
any
of
your
journals.
They
just
care
to,
so
that
the
service
works,
something
which
you
will
not
see
very
often
but
which
I
think
is
hugely
important.
You
need
to
discern
between
between
different
types
of
slis.
B: If a disk is filling up, deal with it during business hours; don't wake someone up for it! If your customers are not able to access the system because that disk has filled up, that is the reason to alert, not just that a disk is full or something. So, let's look at tools. First and foremost, obviously, Prometheus. Many of you will know it, but still, let's walk through the 101. It's inspired by Google's Borgmon. It's a time series database which internally stores the values as 64-bit numbers.
B: It has concepts of instrumentation and exporters. Instrumentation means modifying your own source code, or other people's source code, to emit metrics directly from within the system. Exporters are basically proxies where you can take SNMP, or a database, or something, and rewrite it into something which Prometheus can understand. It is not meant for event logging, and dashboarding happens via Grafana.
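To make instrumentation concrete (my own sketch, not from the talk), here is a minimal example using Python's prometheus_client; the service and metric names are made up for illustration:

```python
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Hypothetical metrics for a toy web service.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method"])
IN_FLIGHT = Gauge("http_requests_in_flight", "Requests currently being served")
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

def handle_request() -> None:
    IN_FLIGHT.inc()
    with LATENCY.time():                       # records elapsed time on exit
        time.sleep(random.uniform(0.01, 0.1))  # pretend to do some work
    REQUESTS.labels(method="GET").inc()
    IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
```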
B: There is service discovery, so Prometheus knows what's out there, and there's integration with pretty much every cloud provider, or at least every major cloud provider, with more coming all the time. You just point Prometheus at the endpoint of your cloud provider, and it knows about the services and starts scraping them. You don't have a hierarchical data model; you have an n-dimensional label set. So you don't have region, country, customer, where you break the hierarchical tree model as soon as you want to group by customer. You can just select by the label customer="whatever" and you're done. There's a language, PromQL, used for everything: processing, graphing, alerting, exporting, everything. You need to learn it, and it's a new language, but it is insanely powerful. Prometheus itself is quite simple to operate, and it's super efficient, most likely more efficient than anything you saw which is older than Prometheus. That is not so common anymore, but there are still people who see it as a new thing.
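For example (my own sketch, not from the talk), a label-based selection might look like this, assuming a Prometheus server on localhost:9090 and a hypothetical metric carrying a customer label:

```python
import requests

# Instead of walking a hierarchy (region -> country -> customer),
# select by the one label you care about, regardless of the others.
query = 'sum by (region) (rate(api_requests_total{customer="acme"}[5m]))'

resp = requests.get(
    "http://localhost:9090/api/v1/query",  # Prometheus HTTP query API
    params={"query": query},
)
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"])
```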
B
Other
selling
points
it's
pull
based,
which
gives
you
nicer
properties
about
certain
types
of
alerting
and
and
consistency
checks.
We
have
the
concept
of
black
box
monitoring
where
you
look
at
stuff
from
the
outside
versus
white
box.
Monitoring
where
the
box
is
is
completely
open
and
you
can
look
into
the
inside
of
of
that
box.
B: Individual events, like a function being called, are usually merged into counters or histograms, for things like latency. Changing values, like your temperature or your memory usage, are gauges, and they can go up and they can go down. Typical examples, which you have probably already read about: access rates on web servers would be a counter, temperatures would be a gauge, service latency would be a histogram. It's super easy to emit and parse.
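To illustrate how those types are queried (again my own sketch, reusing the hypothetical metrics from the instrumentation example above):

```python
import requests

# Counters are usually queried as rates; histograms via histogram_quantile
# over the per-bucket rates. Both queries assume the metrics sketched earlier.
queries = {
    "request rate": "rate(http_requests_total[5m])",
    "p99 latency": "histogram_quantile(0.99, "
                   "rate(http_request_duration_seconds_bucket[5m]))",
}

for name, query in queries.items():
    data = requests.get(
        "http://localhost:9090/api/v1/query",
        params={"query": query},
    ).json()
    print(name, data["data"]["result"])
```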
B: Kubernetes is the equivalent of Borg, which is what Google runs their services on, and Prometheus is basically the equivalent of Borgmon, though the APIs are more of the Monarch type. And while Kubernetes and Prometheus were not started with each other in mind, they are inherently designed for each other because of their shared heritage. Also, if Kubernetes changes anything about their kube-state metrics and such, that's always agreed with the Prometheus team, because we have people overlapping between the two projects. Raw numbers: the highest we know of is 2.5 million samples per second into one Prometheus server, which comes out differently depending on how you tune it.
B: In a recent test, I got 260k samples per second per core; in the test before that, we got to 100k. We can compress those 16 bytes per sample down to 1.36 bytes per sample, which says a little bit about the efficiency. The largest Prometheus we know of has 125 million active series. There are two long-term storage options: one is Thanos, one is Cortex. Historically, Thanos was easier to run and scaled its storage horizontally, whereas Cortex was harder but has become a lot easier; Cortex started out by scaling the ingesters and queriers horizontally.
B
They
experiment
differently,
but
still
they
are
closed.
I
hope
that
at
some
point
they
merge,
but
probably
not,
but
I
would
hope
so.
The
official
format
for
prometheus
is
called
openmetrics.
It's
basically
an
independent
standard
of
prometheus,
but
permeated
uses
it
as
its
official
standard.
B: That name was chosen in part for political reasons. It's also about putting all of this into the IETF, so you have a real, official, independent standard.
B
There
is
a
concept
of
three
pillars:
metrics
logs
and
traces.
Of
course,
they
usually
have
the
metrics
and
logs
are
the
easiest
and
cheapest
in
many
ways,
and
traces
are
just
where
you
go
with
your
application
monitoring,
which
is
why,
which
is
why
those
are
super
tight
super
tightly
coupled
and
in
particular,
tying
metrics
to
traces
or
lobster
traces
is
super
easy
with
with
ex-employer.
B: An exemplar is a way to attach the IDs of traces directly to your metrics or your logs. The reason is that you don't have to have the full label set on your traces; you can just use this one direct pointer, which has a few other nice properties. In particular, you already have all the context when you jump into your trace; you already know what's wrong. And yes, I'm absolutely serious about that one.
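At the instrumentation level, that can look like this (my own sketch; Python's prometheus_client accepts an exemplar on counter increments and histogram observations when metrics are exposed in the OpenMetrics format, and the trace ID here is a made-up placeholder):

```python
from prometheus_client import Counter

REQUESTS = Counter("api_requests_total", "Requests served")

def handle(trace_id: str) -> None:
    # Attach the current trace ID as an exemplar on the increment, so a
    # spike on a dashboard can link straight to one concrete trace.
    REQUESTS.inc(exemplar={"trace_id": trace_id})

handle("0af7651916cd43dd8448eb211c80319c")  # placeholder trace ID
```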
B: I did start OpenMetrics to change how the world does observability, so that you have metrics, logs, and traces all with the same data model and the same underlying assumptions, which makes it easier to jump between those things. Speaking of which: Loki. Loki is basically like Prometheus, but for logs. It has the same label-based system as Prometheus. You don't need a full-text index; you just index your labels, and everything else is an opaque string, which makes it super quick and super cheap to run.
B
Your
yeah,
your
your
logs,
would
have
the
same
label
sets
as
your
metrics.
I
already
said
that
which
makes
it
a
lot
easier
to
just
jump
between
the
two
and
you
can
also
easily
extract
metrics
from
your
logs.
If
that
looks
familiar,
that's
because
it
is,
you
have
your
timestamp,
which
is
mandatory
in
in
logging,
but
else
you
have
the
same
label
set,
and
then
you
just
have
your
your
opaque
string.
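As a hypothetical illustration (not from the talk), querying Loki looks much like querying Prometheus, assuming a Loki server on localhost:3100 and an app label shared with the metrics:

```python
import requests

# Same label-selector style as PromQL. The |= operator filters log lines,
# and rate(...) extracts a metric from the matching lines (LogQL).
log_query = '{app="api"} |= "error"'
metric_query = 'rate({app="api"} |= "error" [5m])'

for query in (log_query, metric_query):
    resp = requests.get(
        "http://localhost:3100/loki/api/v1/query_range",  # Loki HTTP query API
        params={"query": query},
    )
    print(resp.json()["data"]["resultType"])
```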
B: That leaves us with traces. Tempo is one of the implementations, and it is designed precisely for this exemplar-based world. It's backed only by an object store: you don't have to run any expensive services in the backend, you can just use an object store. It's fully compatible with OpenTelemetry tracing, Zipkin, Jaeger, all those things. Because it is so efficient, you don't need to sample your traces, so if you have an interesting trace ID, you can actually jump to it; you don't just lose it. And Prometheus, Cortex, Thanos, Loki: they all support exemplars, so you can do this jumping back and forth.
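For example (my own sketch, not from the talk), once an exemplar hands you a trace ID, the lookup is a single direct fetch, assuming a Tempo server on localhost:3200:

```python
import requests

# With exemplars you arrive with a concrete trace ID in hand, so there is
# no search step: just fetch the trace directly by its ID.
trace_id = "0af7651916cd43dd8448eb211c80319c"  # placeholder ID
resp = requests.get(f"http://localhost:3200/api/traces/{trace_id}")
print(resp.status_code, len(resp.content), "bytes of trace data")
```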
Some numbers on scaling, for what we run internally: we have 1 million samples per second and retain 100% of those, and as we go for 14-day retention with three copies stored, we have a cost of roughly 200 CPU cores, 300 gigs of RAM, and 40 terabytes of object storage, for 1 million samples per second for 14 days. We did a 10x jump recently, and we already have plans for the next 10x jump.
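As a back-of-the-envelope check (my own arithmetic, reusing the 1.36 bytes per sample figure from earlier), the raw compressed samples account for only part of that object storage; the rest would be indexes, chunk overhead, and headroom:

```python
# Rough sample-volume arithmetic using the figures quoted in the talk.
samples_per_second = 1_000_000
bytes_per_sample = 1.36      # compressed size mentioned earlier
retention_days = 14
copies = 3

raw_bytes = (samples_per_second * bytes_per_sample
             * 86_400 * retention_days * copies)
print(f"{raw_bytes / 1e12:.1f} TB of raw compressed samples")  # ~4.9 TB
```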
B
Those
numbers
are
already
a
few
weeks
old.
I
think
we
already
have
better
numbers
now
bringing
all
of
this
together.
This
allows
you
to
jump
from
your
logs
to
your
traces
directly.
This
allows
you
to
jump
from
your
metrics
to
your
traces,
from
your
traces
to
your
logs,
and
all
of
this
is
open
source.
You
can
run
it
yourself.
B: Yeah, absolutely, it's a hard requirement in my opinion. If you look at previous systems, where you had one service running on one machine or some such, you basically had a lot of the same underlying complexity, but it was well hidden behind the operating system and behind more traditional tools which already allowed you to do all that debugging. That changed with cloud native.
B: Previously, you maybe had your server, and then you got more users and you had to buy a bigger server, and you contained a lot of this within that system. But now, if you run everything in the cloud and a lot of users jump in, maybe you just scale out to two, three, ten times the amount of whatever your service is, and this leads to an absolute explosion of the information about your system as it is running. With this immense amount of data, you're no longer able, as a human, to just go through a few log lines and figure out what's happening. It's just impossible, because you have so much stuff going on at the same time. So you don't have a chance to run a service properly unless you have a chance to understand how that service is running.
B
Not
at
all
tracing
is
part
of
observability.
There's
like
there
are
different.
There
are
different
approaches
to
how
you
do
tracing
within
within
observability.
B: If you want me to talk about this, I can easily do it, but the high-level reply is that it's one of the signals which you need for proper observability, at least where you have access into your software, which in cloud native and such is pretty common. If you run more traditional services, or even servers, machines, and network routers, you usually don't even have access to those places; you just cannot trace them. But as soon as you do have access, you should absolutely make tracing part of your observability story.
A
Okay,
awesome
yeah.
Thank
you
very
much.
I
think
we
still
don't
have
any
questions.
I've
checked
the
live
stream.
Also,
there
are
no
questions,
but
I
believe
the
participants
have
seen
your
contact
details.
You
can
reach
richie
on
either
twitter
or
send
him
an
email.
If
you
have
any
questions
or
if
you
need
more
clarification
on
observability
or
tracing
he's
an
expert
in
it
and
can
definitely
point
you
in
the
right
direction.