Description
Developers and SREs instrument applications and apply observability workflows with metrics, traces, logs, and beyond. The first service level objective (SLO) is defined - now what? Wait for the first production incident?
Think of day-2 ops: SLOs need to be well understood and simulated early in the development process, because false-positive alerts can lead to on-call fatigue. How do you simulate an incident? Add chaos to production and simulate network failures, broken apps, etc. - and validate the SLOs. Developers can add their own chaos experiments too.
Join this talk to learn how SLOs can be shifted left with chaos, and get inspired by new tools and workflows for your production environment.
Transcript

Servus! Today I want to dive into shifting left your SLOs with chaos, along with an SRE tale. We started adopting SLOs - service level agreements, objectives, and indicators - with different aspects of tracking availability, and also defining error budgets. From there it was on to the so-called golden signals defined by Google's SREs: latency, traffic, errors, and saturation as indicators. And at some point, we needed code instrumentation.
So maybe you'd say SRE solved everything? Yeah - maybe, maybe. But let's switch roles a little bit into a developer. You might remember from my talk last year, where I told you: hey, I needed to debug software that crashed a lot, but only in production.
I really would have loved to have the ops requirements met: monitoring the stack and heap memory, defining the SLI as the memory usage level, and the SLO as, say, a maximum 10% increase throughout the development process - in CI/CD, merge requests, and later on in production deployments. I was also thinking of maybe using some sort of chaos engineering, and fuzzing for API requests.
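
As a minimal sketch, such a memory SLO could be expressed as a Prometheus alerting rule. The metric and the one-hour comparison window are my assumptions here, not something prescribed in the talk; process_resident_memory_bytes is the standard client-library metric, so swap in your heap and stack metrics as needed:

    groups:
      - name: memory-slo
        rules:
          - alert: MemorySLOViolated
            # Fire when resident memory grew more than 10% compared to
            # one hour ago (assumed window; adjust to your release cadence).
            expr: |
              (process_resident_memory_bytes / (process_resident_memory_bytes offset 1h)) > 1.10
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Memory usage increased by more than 10% within one hour"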
A
In
2011
I
was
maintaining
the
it
top
level
domain
and
we
were
installing
dns
sec.
We
had
signing
hardware,
we
had
a
state
machine
typically
friday
afternoon
deployment
something
broke,
production
was
broken
over
the
weekend.
We
had
no
more
signing,
no
dns
updates
and
monitoring
was
going
wild.
So
the
thing
is,
monitoring
has
alerts
and
someone
is
getting
paged.
Someone
is
getting
called
so
the
dns
zone
serial
was
out
of
date.
A
The
first
alarm
was
fired
by
email,
nobody
read
it,
then
it
got
into
sms,
you
get
woken
up
and
at
5
am
in
the
morning.
Debugging
is
not
fun.
Today we might use infrastructure as code and GitOps workflows for that, and define the SLI and SLOs for the zone serial age: hey, when it's older than one hour, we need to alert, we need to do something about it. We could also think a little bit about chaos engineering with the name servers - denying zone updates, returning different zone serials - and dive into that. So, when thinking about these stories: how can I really get there?
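
A minimal sketch of that zone serial SLO as a Prometheus alert rule; dns_zone_serial_timestamp_seconds is a hypothetical exporter metric I'm assuming here, not a standard one:

    groups:
      - name: dns-zone-slo
        rules:
          - alert: ZoneSerialTooOld
            # Hypothetical metric: the last-update timestamp behind the
            # zone's SOA serial. Alert when the zone is older than one hour.
            expr: time() - dns_zone_serial_timestamp_seconds > 3600
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "DNS zone serial is older than one hour"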
So what is needed to really adopt SLOs? As an SRE or DevOps engineer, where do you actually start with SLOs, and then with chaos engineering for SLOs? Let's look into monitoring: we have metrics, we have keys or tags with values, and more. One of the things that got me hooked was: okay, let's use Prometheus and PromQL for queries.
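
For example, an availability SLI can be derived from HTTP request metrics. A sketch, assuming the common http_requests_total counter with a status label from typical client-library instrumentation:

    groups:
      - name: sli-recordings
        rules:
          # Ratio of non-5xx requests over the last 5 minutes - a simple
          # availability SLI, recorded for reuse in alerts and dashboards.
          - record: sli:http_availability:ratio_rate5m
            expr: |
              sum(rate(http_requests_total{status!~"5.."}[5m]))
                /
              sum(rate(http_requests_total[5m]))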
We need to add that into the code using app instrumentation. Once we have the metrics, we need to do something with them, and using PromQL for alert rules which trigger on the SLO is one way to do it: you define the allowed errors in the error budget, define when the SLO is violated, and alert. This is basically how we can adopt SLOs. Now, what can we do with an SLO?
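
Continuing the sketch above, an alert rule could fire once the recorded SLI drops below a 99% objective - that is, once errors exceed the allowed budget (the objective value is an assumed example):

    groups:
      - name: slo-alerts
        rules:
          - alert: AvailabilitySLOViolated
            # 99% objective: page when more than 1% of requests fail,
            # i.e. the error budget is being burned.
            expr: sli:http_availability:ratio_rate5m < 0.99
            for: 10m
            labels:
              severity: page
            annotations:
              summary: "Availability SLI below the 99% objective"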
We can add chaos to our staging environments, and also to our production deployments, and see how the application behaves. This brings me to shifting left with chaos. Similar to metrics and SLOs, we need to understand where to go: as an SRE, DevOps engineer, whatever your role title might be, start thinking about cloud native clusters and the deployments inside them.
One example is to use LitmusChaos, which is a Cloud Native Computing Foundation project. It helps you fail your infrastructure and your Kubernetes cluster. For example, you can see how the application behaves - whether it crashes, lags, or does something else.
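
A minimal LitmusChaos sketch: a ChaosEngine that randomly deletes pods of a target deployment. The namespace, label, and service account are placeholders; the pod-delete experiment itself ships with Litmus:

    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosEngine
    metadata:
      name: app-chaos
      namespace: default            # placeholder namespace
    spec:
      appinfo:
        appns: default
        applabel: "app=my-app"      # placeholder target label
        appkind: deployment
      engineState: active
      chaosServiceAccount: litmus-admin
      experiments:
        - name: pod-delete
          spec:
            components:
              env:
                - name: TOTAL_CHAOS_DURATION   # seconds of chaos
                  value: "30"
                - name: CHAOS_INTERVAL         # seconds between kills
                  value: "10"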
You can measure the metrics and the SLOs and see if they still match, and from there you can define actions and improvements to iterate on your observability workflows and on other specific things you might be missing.
Now, thinking about the stories I told you before: what is needed to really add chaos to your deployments? As an SRE, maybe add something like a CPU overload, or HTTP requests getting blocked.
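
As a sketch, the CPU overload could reuse the ChaosEngine pattern from above with the pod-cpu-hog experiment that ships with Litmus; the core count and duration are placeholder values:

    # Excerpt: swap the experiments list in the ChaosEngine above.
    experiments:
      - name: pod-cpu-hog
        spec:
          components:
            env:
              - name: CPU_CORES            # cores to stress per target pod
                value: "1"
              - name: TOTAL_CHAOS_DURATION # seconds of induced CPU load
                value: "60"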
A
So
the
golden
signals
become
available
again
as
a
dev,
who's
struggling
with
rest,
apis
and
building
software,
maybe
simulate
something
where
api
clients
are
not
closing
the
connections
correctly
or
something
is
intercepting
the
dns
traffic
or
something
else
for
ops,
think
about
intercepting
dns
traffic,
obviously
or
just
break
it
somehow
and
ensure
that
it
doesn't
resolve
and
see
how
the
deployment,
the
application,
the
kubernetes
cluster
and
everything
else
is
behaving
super
interesting
by
the
way
and
for
devops
think
about
container
registries
not
allowing
to
pull
or
something
else.
So you want to define and measure the SLOs before they are actually deployed. You can also benefit from cloud native environments and the cloud native ecosystem, because deployments and other things happen with auto-scaling in the Kubernetes cluster, for example. There are many projects out there which integrate with each other, and you can learn from their best practices: for example, Kubernetes added Prometheus monitoring and metrics, and you can inspect the source code.
You can look at the pull requests and learn from them, and get into the documentation to see how it's being used. The same goes for OpenTelemetry, LitmusChaos, and so on.
So when you want to shift left with chaos: bring chaos into observability, see how it behaves, verify the SLOs in quality gates, ensure that reliability is there, and iterate and innovate from there.
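
What such a quality gate could look like in CI is, of course, up to your pipeline. A hypothetical sketch in GitLab CI syntax (assumed; the talk doesn't prescribe a tool) which injects chaos and then fails the job when the SLI from the earlier recording rule drops below the objective; PROMETHEUS_URL and the manifest path are placeholders:

    slo-quality-gate:
      image: alpine:3.19
      script:
        - apk add --no-cache curl jq kubectl
        - kubectl apply -f chaos/pod-delete-engine.yaml   # inject chaos
        - sleep 120                                        # let the experiment run
        - |
          SLI=$(curl -s "${PROMETHEUS_URL}/api/v1/query" \
            --data-urlencode 'query=sli:http_availability:ratio_rate5m' \
            | jq -r '.data.result[0].value[1]')
          echo "Availability SLI after chaos: ${SLI}"
          # Gate: fail the pipeline when the SLI is below the 99% objective.
          awk -v sli="$SLI" 'BEGIN { exit (sli < 0.99) ? 1 : 0 }'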
On my wish list: correlate things and use machine learning to maybe make it easy for us; add chaos out of the box and make it accessible for everyone; and ensure that OpenTelemetry gets widely adopted, for example for CI/CD observability.