From YouTube: From Monitoring to Observability: Left Shift your SLOs with Chaos - Michael Friedrich, GitLab
Hello everyone, my name is Michael, I'm a senior developer evangelist at GitLab, and today I want to dive into "From Monitoring to Observability: Left Shift your SLOs with Chaos". Before we dive in: I'm dnsmichi (pronounced dns-mi-chi) on the internet, if you want to reach out or have a chat, and I'm looking forward to actually seeing you in Hamburg at ContainerDays this week.
So this is the virtual talk I'm providing to you today, and I want to start with a little tale, building some bricks and things like that, and for that we want to turn back time a little bit. When we started with monitoring, we had traditional service monitoring, state-based black box monitoring.
There were some service level demands and reporting requirements, so state changes have been tracked over time. We then had metric data points and trends, and at some point we kind of added metrics into everything, making it white box monitoring. One of the things I discovered in the past was that Prometheus actually created this format of having /metrics exposed on an HTTP server, Docker added that natively into the application in 2016, and from there it was not far to saying: hey, a web application should actually have that.
Now, when we are defining SLAs (service level agreements), service level objectives and service level indicators, we need to keep in mind that while the agreement might be 99.5 percent availability, the objective is much higher, something like 99.9 percent, and the indicators are things like errors and latency, something which gives an indication of what is going wrong. In order to really do that, we also need to define error budgets: how much is allowed to be violated while still meeting the service level objectives.
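To make that concrete: a 99.9 percent objective over 30 days leaves an error budget of roughly 43 minutes of unavailability. A minimal sketch of how such an availability SLI could be recorded with a Prometheus recording rule, assuming a hypothetical http_requests_total counter with a status label (this metric is not part of the talk's demo):

```yaml
groups:
  - name: sli-recording
    rules:
      # Availability SLI: share of non-5xx responses over the last 5 minutes.
      - record: job:http_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
```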
We then iterated into the golden signals defined by the Google SRE handbook, the Google SRE book, which allow you to immediately see that something is going wrong. The four golden signals are latency, traffic, errors and saturation. But it didn't solve everything, because at some point we needed code instrumentation, and SRE didn't really solve everything either, but it was a great way to get started.
Now, speaking a little bit about myself, to bring you to the idea of why we need observability and chaos engineering, I want to start with a developer's tale of my own.
Maybe we can do that in C++ as well, we thought, and we discovered a library using coroutines, working stackless, putting the function pointer on the heap and then using stack unwinding for continuation. This is rather complex, but at some point we figured it out. The thing is, it didn't turn out that good. We had crashes happening in production, but only when there were around a thousand API clients, and the problem was that it was not reproducible on the developers' machines or in test environments. The memory was corrupted, or maybe exhausted.
There were specific crashes, there were stack guards, and the security scanner triggered the crashes, and basically it was a mess. It kind of led to burnout for myself as well. In retrospect, I would have loved to have something like a service level indicator, which could have been the memory usage or memory usage level, and the service level objective should be checked on every commit or merge request being created.
So the idea was simple: we had the signing hardware, there's a state machine of steps verifying things, and then everything gets deployed and DNS is safe. Basically, on a Friday afternoon, the script which was responsible for the deployment was changed and deployed to production, and then, for some reason, no more signing happened, which also meant that no DNS updates were being pushed: the domain delegations, the nameserver updates, the zone records and so on.
A
Well
and
we
we
had
monitoring,
but
let
me
tell
you
well,
we
also
monitored,
for
example,
the
zone
serial
h
and
the
problem
was
there
were,
like,
I
think,
25
or
30
name
servers
alerting
at
the
same
time,
so
the
first
alarm
came
at
on
saturday
at
3
am
via
email.
A
A
Now, we figured out that the change was persisted in Git, which we thought about as a corrective action, but the thing is, it was rolled directly into production, so there was no CI/CD or quality gate in place which could have helped prevent the situation. From there, again thinking about what could have been: maybe having a staging signing hardware where everything can be simulated or tested properly, and all the changes should be rolled out with infrastructure as code or GitOps workflows, with regards to service level objectives and indicators.
Using the zone serial as the indicator, the service level objective is, for example, that the zone serial is not older than one hour. With regards to adding some chaos to the observability ideas: maybe deny zone transfer updates with a proxy in the middle, or just return different zone serials, and then see how everything behaves.
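A minimal sketch of what such an alert rule could look like in Prometheus, assuming a hypothetical gauge that exposes when the zone serial last changed (the metric name is an illustration, not from the talk):

```yaml
groups:
  - name: dns-slo
    rules:
      - alert: ZoneSerialTooOld
        # Hypothetical gauge: unix timestamp of the last zone serial change.
        expr: time() - dns_zone_serial_last_change_timestamp_seconds > 3600
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Zone serial has not been updated for more than one hour"
```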
It's super fun, but it doesn't stop with these problems. At some point I switched into a DevOps role, and we had containers, which also brought some problems, because when you're consuming services which are free, at some point they might not be free anymore; there might be some limits. Docker Hub announced rate limits in September 2020, and at that point we didn't know what was affected.
We thought about it: CI/CD pipelines are using containers; cloud native deployments and Kubernetes, for example, are using containers; organizations behind a NAT are effectively using a single IP address for rate limiting; and obviously cloud providers within their networks, within their deployments and everything else. The problem was that we had a known state, so the limits had been applied. There were things like an API: simulating a pull, getting a header response, parsing it, and back then I also wrote a Prometheus exporter for that.
It was pretty efficient how it worked. The only problem was: the monitoring was there, but we also had the unknown state, like Docker pulls in an environment. What happens in a Kubernetes cluster, what happens in the CI/CD pipelines when developers are waiting for the deployment, for fast feedback, for code reviews and so on, with the only option of having 429 Too Many Requests in the logs, maybe?
So this is super hard to debug, and one of the problems we also figured out was that there might be a problem with the deployment itself, because the application had been rolled out to one third of the customers, showing a different price, just because the release deployment failed, because a Docker pull failed in production.
So this could be really expensive, and I thought: maybe we could have treated it differently, with more long-term planning, knowing what's actually going on, and defining a service level indicator like the pull counts remaining, or the limit. The service level objective could have been that the remaining pull count should be more than 10 (which is an arbitrary number; in huge environments maybe 100 or something like that), combined with a quality gate.
We shouldn't even start a deployment when the pull limit is too low, because essentially those who are consuming the deployments and the CI/CD pipelines, like the developers, are blocked, and when nothing is working because of a rate limit, you need to wait for something like six hours; you cannot be productive or get anything done.
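As a sketch, such a quality gate could be backed by a Prometheus alert on the exporter's data; the metric name below is a hypothetical one for a Docker Hub rate-limit exporter, not the exact name from my exporter:

```yaml
groups:
  - name: dockerhub-slo
    rules:
      - alert: DockerHubPullLimitLow
        # Hypothetical gauge: remaining pulls reported by Docker Hub rate-limit headers.
        expr: dockerhub_limit_remaining_requests < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Fewer than 10 Docker Hub pulls remaining; hold deployments until the limit resets"
```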
Now, we've talked a lot about stories, things going wrong and going slow. Using service level objectives is another story, because how do I start, what are my service level objectives, and somehow it relates to monitoring.
Prometheus allows you to monitor things, and it also allows you to query the metrics data which is being stored, do calculations based on that, and also compare it. It's a pretty easy-to-learn format, and we can also use it to define SLOs; in the screenshot there is an example of querying an HTTP server.
Now, in order to really understand what is going on in our environments, we have different metric sources. They can come from the infrastructure, like memory, CPU and I/O, where we have the node exporter on the Kubernetes node, running in a pod inside the cluster. Then we have services which have their own Prometheus exporters, hopefully, and at some point we also need to do instrumentation in the code, adding our own metrics and exposing our own metrics.
This is something to keep in mind for defining SLOs, because when there are no metrics, we cannot define any SLOs. When it comes to describing an SLO, we can use PromQL and alert rules from Prometheus, which is pretty nifty. In order to do that, we calculate something, and then we define the condition which should be met, for example saying: hey, this needs to be actively violated for 15 minutes.
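A minimal sketch of such an alert rule, reusing the hypothetical availability SLI recorded earlier, where the alert only fires after the objective has been violated for 15 minutes:

```yaml
groups:
  - name: slo-alerts
    rules:
      - alert: AvailabilitySLOViolated
        # Fires when availability stays below the 99.9% objective for 15 minutes.
        expr: job:http_availability:ratio_rate5m < 0.999
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Availability below 99.9% for 15 minutes"
```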
In the end we can build on that. The thing is, in order to add more metrics, I would advise you to learn with playful examples: create your own Dockerfile, CI/CD, build images, container registries, inspect the Prometheus Operator and the ServiceMonitor custom resource definition, and then start inspecting metrics with Prometheus and, later on, with OpenTelemetry.
Now, my talk is also about left-shifting SLOs. So what does this actually mean? We've talked about Prometheus, we know what service level objectives are, and these can be calculated.
In CI/CD environments, we can actually deploy Prometheus in the staging environment and, for example, use a quality gate with Keptn to measure the SLOs, and when an SLO is violated, the feedback to the CI/CD pipeline is: hey, I'm in a failed state, you shouldn't be deploying, because something is going wrong within the staging deployment. Keptn on its own is an observability platform; it's not tied only to CI/CD quality gates, but one can use it this way.
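As a rough sketch of how such a gate can be expressed, this is roughly the shape of a Keptn slo.yaml; the objective names and thresholds here are illustrative, not taken from the talk's demo:

```yaml
# slo.yaml: evaluated by Keptn's quality gate against SLI queries (e.g. PromQL)
spec_version: "1.0"
comparison:
  compare_with: "single_result"
  number_of_comparison_results: 1
objectives:
  - sli: error_rate          # hypothetical SLI name, mapped to a PromQL query in sli.yaml
    pass:
      - criteria:
          - "<=1"            # pass if the error rate is at most 1%
    warning:
      - criteria:
          - "<=2"
total_score:
  pass: "90%"
  warning: "75%"
```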
Now, with a quality gate, not everything is solved. We also need to ensure that, next to continuous delivery, we kind of simulate a production incident for applications: when DNS is going wrong, what happens? And the idea is to add chaos to the staging environment and also to the production environments.
Now, how can we actually do that, left-shifting with chaos? In order to really do that, let's remember what we said about metrics: we have cloud native, we have clusters, we have deployments, and so-called chaos frameworks have been built in the open source community.
One example, one of the tools, is Litmus Chaos, which allows you to fail your infrastructure or cluster, see how the application behaves, see if the SLO is still matched, and then define your actions and improvements. It provides experiments and workflows out of the box, and it's easy to get started with. Another tool is Chaos Mesh, which we will be looking into in a little bit.
It works kind of the same way: you're failing Kubernetes or hosts, we have chaos experiments which either run once or continuously on a schedule, and failing DNS is always a good idea, because sometimes it really is always DNS. From the usability perspective, Chaos Mesh provides a UI and a CLI.
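For reference, a minimal sketch of a Chaos Mesh DNSChaos experiment, based on how I recall the DNSChaos resource; the domain pattern, namespace and duration are placeholders rather than the exact values from the demo:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: dns-error-demo
spec:
  action: error          # return a DNS error instead of a real answer
  mode: all              # target all matching pods
  patterns:
    - "example.com"      # placeholder domain pattern to fail
  selector:
    namespaces:
      - default          # placeholder target namespace
  duration: "60s"
```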
It has a preview and some scheduling strategies, and I find it pretty straightforward to use. Just to reiterate on all the stories I talked about before, chaos engineering could help this way. For example, as an SRE we can simulate CPU overload, or stress the CPU, and maybe the HTTP requests get blocked, following the golden signals to figure out latency, errors and so on. As a developer, I want to have many, many API clients which are not closing their connections, so something is potentially leaking, and the same thinking applies to an ops role.
On the other side, when you're thinking about your own chaos, there are limits. You should be evaluating the resource usage, because when you're simulating something which leaks memory, at some point maybe you're paying too much. Also define a maintenance window, or a work window, where the chaos engineering happens, so that it doesn't harm existing workflows or teams, and also know that it won't immediately solve all the reliability issues you might be seeing.
But it can help you with new perspectives, combining chaos engineering with SRE, for example. Now to my own chaos, which was, at the beginning, as a developer: we do have this DNS connection problem which leaks memory, so I wrote some C++ code as a demo which basically leaks some memory, but only when the DNS resolution fails or doesn't provide any results. This is a special kind of thing, but it shows what happens when something is going wrong with DNS:
we are leaking memory, and this is something I would never have found out in my dev environment. For the demo system, everything is deployed into a Kubernetes cluster: kube-prometheus and the Prometheus Operator are monitoring things, Chaos Mesh is installed, and Prometheus alerts and SLOs have been defined to actually trigger something. For this demo we will look into this setup; it is currently running in a cloud-managed Kubernetes cluster, and I just want to show you what's currently going on.
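To give an idea of the kind of alert behind this demo, here is a minimal sketch of a memory usage rule on top of kube-prometheus; the pod name pattern and the 100 MiB threshold are placeholders, not the exact values used in the recording:

```yaml
groups:
  - name: demo-memory-slo
    rules:
      - alert: DemoPodMemoryUsageHigh
        # container_memory_working_set_bytes comes from cAdvisor via kube-prometheus;
        # the pod regex and the threshold below are placeholders for the leaking demo pod.
        expr: container_memory_working_set_bytes{pod=~"o11y-demo.*", container!=""} > 100 * 1024 * 1024
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Demo pod memory above 100 MiB, the DNS chaos may be leaking memory"
```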
We can switch into the terminal for a moment: kubectl get pods shows what's currently running, hopefully, yes, and then we say kubectl logs and follow one of the o11y.love demo pods, which is actually just trying to resolve this domain, and yeah.
In order to now see what else could be going on, or what is happening in my Prometheus on my Kubernetes deployment, I've prepared the Prometheus interface and the DNS chaos schedule already, so it's generating some DNS chaos. It generates an error; it doesn't drop the request, the response is an error, so we can fail everything which follows these patterns for these domains, and we want to start this one, because, yeah, we love chaos.
Now that this has been started, we should actually be seeing some errors already, which is great; I just need to scroll a little bit, oops, a little bit up, maybe, not there, yeah, over there it was. At some point the chaos experiment will stop; I think I configured it for one minute or so. And next to seeing that nothing is working, we should also be seeing an increase in memory, actually, yeah.
Actually, we are already leaking some memory, and maybe, to show this a little more in scope, we can see that the memory is going up and up, and then there is a certain limit for this memory where we should be seeing an alert. Actually, let's quickly inspect the Alertmanager over here and refresh: do we have some new alerts which need to be handled? Not yet, but maybe this is coming in fast.
Let's see if something is over here, and you can see that I've been testing before doing this talk recording, yeah, but no, it's not yet triggered. This is the demo problem all the time.
Let's see if we do have some more memory yet: it's consuming much more memory over time, so it's still failing and then continuing. What is going on now? Everything is fine again, so this schedule, the experiment, paused, and then the next experiment will be starting, and at some point we can actually see that memory will be leaked, and hopefully we can see, yeah, we have some alerts being generated, which is pretty accurate. Yeah, it's UTC, it's a little late actually to be doing this talk recording, but anyway.
We can see that we triggered the alert, and potentially we should be seeing the alert over here as well. Yeah, and for some reason it's pretty slow; it's just for testing, so I need to work on the payload and things like that. But this is an iterative process. If we navigate back to our use case, we have kind of proven that the DNS chaos is leaking memory, and we can trigger our SLO, and everything is fine.
What else can we do? This was the screenshot of what I did before. When we're moving from monitoring to observability, we also need to keep in mind that there are not only metrics: there are logs and events, there is distributed tracing, there is continuous profiling. There are so many observability event types which we need to keep track of, and we also potentially need to shift from a monolith to microservices, but you need to stop somewhere and start focusing on metrics and tracing first; everything else can follow.
Okay, traces are a little different to logs: spans with a start and end time, more metadata in the context, and code instrumentation might be needed. We do have OpenTelemetry, which is growing fast. It needs you to bring your own backend, like Jaeger, Elasticsearch, ClickHouse and so on, you can build your own distribution, and it also has auto-instrumentation for certain languages, which is great.
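As a small sketch of what "bring your own backend" means in practice, here is a minimal OpenTelemetry Collector configuration that receives OTLP traces and forwards them to a Jaeger endpoint; the endpoint and exporter choice are assumptions, since Collector exporters vary between versions:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlp:
    # Placeholder endpoint; recent Jaeger releases accept OTLP directly on 4317.
    endpoint: jaeger-collector:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```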
Another idea is using eBPF, maybe combining it with SLOs, so implementing observability on the kernel level could also be an interesting idea; I'm just leaving this idea with you for now. And in the end, we want to shift left.
I want to see the problem, in the code or somewhere else, and for that I would recommend using boring solutions: starting with metrics, monitoring and Prometheus, adding tracing with OpenTelemetry, maybe metrics in OpenTelemetry at some later point, and collecting more observability data to really decide what is going on, to see things. Left-shifting the SLOs means observability needs to be everywhere, so we need to collect everything, we need to learn app instrumentation, and we need to educate everyone.
Team onboarding is super important; integrate that into DevOps workflows, so SLOs and alerts in CI/CD, merge requests, alert channels, incident management, and also make sure that you're using the cloud native benefits. With everything in Kubernetes clusters, we do have Litmus Chaos, we have Chaos Mesh, we have OpenTelemetry, we do have Prometheus, and we have a great community around it.
Have the observability data and chaos engineering out of the box, available for everyone to consume, and combine it with all the OpenTelemetry ideas into use cases. As a recap: we start with app instrumentation, with metrics and traces; we have PromQL and SLOs; and we have quality gates with Keptn and Prometheus.
That's about it. If you want to learn more, dive into o11y.love, which is a knowledge base I've been building in the past months, and I would love to connect with you and talk about everything around observability, chaos engineering and SRE. Thanks for listening, thanks for your attention. If there are any questions, hit me up, either now or at dnsmichi on social media. See you around, and happy observability!