Description
When trying to iterate quickly on our code, reliability tends to be overlooked and given lower priority than getting the latest features out, until a large incident comes and knocks our applications out of service.
This talk will give a quick introduction to the Site Reliability principles, and look into how they can be applied to cloud applications, regardless of the size of the organization.
All right, so let's get started. I'm Marga. As Lexi said, I'm a staff software engineer at Kinvolk. I work on the Flatcar Container Linux product, which is one of our open source products. Before working at Kinvolk, I was a site reliability engineer at Google for almost eight years, so reliability is a very important theme for me and I've learned to make it part of my life. And I moved from working at a big tech company, where there were whole teams of engineers working on reliability, to working at a small company.
Everyone is wearing many different hats: we have very few people and we need to get things done with a small team. So some days I'm a software engineer and I'm developing software, other days I'm a test engineer and I'm writing tests, and other days I'm a reliability engineer and I'm focusing on reliability. This talk is for people that are in that position.
So it's not for Google employees who have huge teams, but rather for people that are trying to balance a lot of different balls at the same time and still care about reliability.
So let's start by explaining what reliability is, because this talk is about reliability and we need to be in agreement on what it means. If you look this up on the web, you will see that it's the quality of being trustworthy or of performing consistently well. Okay, so that's what the dictionary says, but what does it mean when we move this into the context of cloud applications?
We will care about some specific properties of our application, like availability, which is whether our users can reach the site or not. If we have a super cool website or app and half of the time our users can't reach it, they will move on to a different application, one that is actually reachable, even if it's less fancy or less cool. And similarly with the other measurements: we want our site to be fast.
If we have a great application that does everything perfectly but it's super slow, users will go away. Or if it's super fast but it gives you the wrong answer, it's not useful. And reliability is also about data safety, so we also care about doing the right things with our users' data, having backups, etc. All of that is included. These are the generic themes, and depending on what your application or service does,
you will maybe care about other things as well, not just about these four, and that's all right, because there's a wide range of things that you may care about; these are just some basic examples. But there's a theme in all of this, which is that if our application is not reliable, users will go away. This is why reliability matters: if an application is not reliable, users move on to a different one.
So maybe it's not so much fun to build reliability into our applications, but it's something that lets us keep our users, and an application without users is not relevant, so we need to make it reliable. All of this seems pretty obvious, so why are we even talking about it? Isn't it super clear that everything needs to be reliable?
The problem is that there's a conflict of incentives. When I'm writing software, when I'm developing features, I get this rush of developing something and then seeing it in action, and it works, and it's great. If you're a programmer, you've experienced this: you have an idea, you write the code, and then it works, and it's so much fun. And of course, when you're a software developer, you want to do this all the time.
You want to create new features, launch a new version, make your application do some new cool thing, and do this as fast as possible. Maybe you have a weekly sprint or a two-week sprint, and you want to keep releasing new features. This is all nice, but the other side of the coin is the person who's maintaining the software and keeping it running, who can be you, right?
It can be you at a different time of the day. And when you're trying to keep the system running, you don't want to have any outages, you don't want any headaches, you don't want to do any firefighting, and the more changes you introduce, the more chances there are of an outage. So here's where the conflict of incentives comes in: if you want to release features as fast as possible, you will cause outages.
So we need to balance releasing features with keeping the service reliable. How do we do that? The first step is going to be measuring how our application is doing, because if we don't have data, if we don't know how our application is behaving, whether well or badly, then we can't do anything. So how do we get the data? We get it through monitoring.
In the cloud context, if I say monitoring you probably hear Prometheus, and that's fine, that's a tool and it's a valid tool. But in this talk I'm not talking about Prometheus in particular, but rather about the concept of monitoring, and the tool that you use depends on your infrastructure and your other needs. You could just as well use Google Analytics for getting the information that you need. As long as you get the information that you need, it's fine; there's no single best tool that everyone has to use.
The important thing is the metrics. The important thing is getting the metrics that matter for your service or your application, and you really want them to be the ones that matter for your application. Whatever system you're using might come with a set of default metrics, and those default metrics might or might not serve a good purpose for you. If you're maintaining a website, you probably do care about the error 500s that your website is serving; you probably do care that those are low.
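As a rough sketch of what tracking that kind of metric can look like, here is a minimal example using Python's prometheus_client package; the metric name and the record_response helper are made up for illustration, not something from the talk.

```python
# Minimal sketch: counting HTTP 500s so a monitoring system can scrape
# them. Metric name and helper are hypothetical; adapt to your service.
import time

from prometheus_client import Counter, start_http_server

HTTP_500S = Counter("http_responses_500_total",
                    "HTTP responses served with status 500")

def record_response(status: int) -> None:
    if status == 500:
        HTTP_500S.inc()

if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for scraping
    while True:
        record_response(200)  # stand-in for real request handling
        time.sleep(1)
```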
You also want to measure your application from the outside, and that's what probers are for, which you can run yourself or get a third party to run for you. They are a way of measuring how your website is responding from the outside. And if your application is a global application that is reached by people from all over the world, you want to have probers that are also all over the world, not just in your country, because you need to know: does it work for someone connecting from South Africa?
Does it work for someone connecting from Brazil? If you're trying to reach a global audience, you need to check it globally.
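A minimal sketch of what such a prober could look like, assuming a hypothetical health-check URL; real setups would typically use a managed probing service or something like Prometheus's blackbox exporter, run from several regions.

```python
# Minimal external prober sketch; TARGET is a hypothetical endpoint.
import time
import urllib.request

TARGET = "https://example.com/healthz"

def probe(url: str, timeout: float = 5.0) -> dict:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 400
    except Exception:
        ok = False
    return {"ok": ok, "latency_s": time.monotonic() - start}

if __name__ == "__main__":
    print(probe(TARGET))  # run this from multiple regions, not just one
```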
All right, so I've spoken a bit about monitoring. Let's assume now that we've deployed our monitoring infrastructure and we have a super cool dashboard with the right metrics.
Are we going to get someone to look at the dashboard all day? No, of course not. Looking at dashboards is very boring and it's not a good use of anybody's time. So what we're going to do is set up alerts.
If a problem just goes away on its own and you don't do anything, alerting on it is not useful either. We need to trigger alerts for events that require a human to intervene. Alerts can suffer from false positives and false negatives.
A false positive is when there's an alert and there's nothing to be done, like what I was just saying: either it went away on its own or it's actually not a problem, it's just a fact of life that that thing is happening. And the problem with those types of alerts is that they create alert fatigue, so people get used to ignoring them.
If every day you get an alert saying that you are getting too many error 500s, and then when you go to look at the system it just went away and you don't know why, you start ignoring the alert. And then one day there's actually a problem, but because you were ignoring this alert all the time, you just think: oh okay, it's the same error as always, it will just go away on its own. And it doesn't. So that's really bad.
That's why, when we have an alert that's triggering but is not useful, we need to fix it so that it gives us an actual signal: either disable the alert or make it less sensitive, whatever makes it useful.
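To make that concrete, here is one way an actionable alert condition could be expressed, as a sketch; the threshold and the minimum-traffic guard are illustrative values, not recommendations from the talk.

```python
# Sketch of an alert condition tuned to avoid false positives:
# only page a human when the signal is sustained and meaningful.
def should_page(errors_5xx: int, total_requests: int,
                max_error_ratio: float = 0.01,
                min_requests: int = 100) -> bool:
    if total_requests < min_requests:
        return False  # too little traffic to be a trustworthy signal
    return errors_5xx / total_requests > max_error_ratio

# e.g. 3 errors in 50 requests: no page; 300 in 10000: page a human.
```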
And then false negatives are alerts that should have triggered but didn't. They didn't trigger, we had an outage, everything broke, whatever our application was doing, users were unhappy, and we didn't catch it in time because the alerts didn't trigger. Usually we realize that we have these false negatives
after the outage, and it's important that we follow up and fix it: we add the alert that's missing, we add the metric that's missing, so that we don't have another outage for the same reason, which would be very embarrassing. All right, so we covered monitoring and alerting.
But how does this help us solve the incentive conflict that I mentioned earlier? To help us solve that conflict, we need to introduce one other concept, which is service level objectives. Service level objectives (SLOs) are metrics that help us assess how our service is behaving.
So it's important that they are metrics, so they are numbers, and we can say, for example, that our service is available 99% of the time; that's a typical SLO. Or we could focus on latency: the latency of each request, or the 99th percentile of our requests, things like that. So we want to measure how our service is behaving, we measure it with these metrics, and then we set goals for how we want our service to behave.
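As a small illustration of the percentile idea, here is a sketch of computing a p99 latency from raw samples with the nearest-rank method; in practice the monitoring system computes this for you, often from histograms.

```python
# Nearest-rank p99 over raw latency samples (illustrative only).
import math

def p99(latencies_ms: list) -> float:
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))  # nearest-rank method
    return ordered[rank - 1]

samples = [12, 15, 11, 14, 250, 13, 16, 12, 11, 13]
print(p99(samples))  # -> 250, the slowest of these ten samples
```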
So let's use availability because it's easy, and we can say we want our service to be available 99% of the time. And so we can measure whether our service is available or not, and we can say whether we are meeting that target.
Okay, so I was talking about SLOs and metrics, and I was saying that the SLOs need to capture the users' expectations and the developers' expectations. If our users expect our site to be up, we need to fulfill that expectation, but we also need to let developers release features. So that's how they help us find this balance. And SLOs need to be achievable: there's no use aiming so high that we can't achieve it, because then they are not really providing any help.
Yes, okay, all right, so it's working. So this table helps us understand what the availability number that I was talking about means in terms of days or hours that our service can be down. If we say that we want to have an SLO of 99% availability, that means we can have 3.65 days a year that our service is down, over the course of a whole year. If we look at it over the course of a month, it's 7.2 hours, and over the course of a day it's 14.4 minutes.
Usually services don't go down per day; usually you don't go down 14 minutes every day, but sometimes you have an outage. And so that's why you pick the window that makes the most sense for you. Say you picked per month: you have 7.2 hours per month that your service can be down, and that's for an availability of 99%. 7.2 hours can be a lot or can be very little; it depends on what your service does.
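The table's numbers fall out of simple arithmetic; a quick sketch of the conversion, assuming a 30-day month:

```python
# Allowed downtime for an availability SLO over different windows.
def allowed_downtime_hours(availability: float, window_hours: float) -> float:
    return (1.0 - availability) * window_hours

for window, hours in [("year", 365 * 24), ("30-day month", 30 * 24), ("day", 24)]:
    print(f"99% over a {window}: {allowed_downtime_hours(0.99, hours):.2f} hours")
# -> 87.60 hours (3.65 days), 7.20 hours, 0.24 hours (14.4 minutes)
```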
But if you are developing a banking application, and in those 7.2 hours your users were not able to reach their bank accounts, they are probably not very happy. So that's why we need to figure out the right targets for our application. And the perfectionists like me may say: why don't we just aim for 100% reliability? Why 99%, 99.9%, 99.99%?
Is it really worth it? For some applications it might be, but for most applications it's not. So unless you are developing medical devices or aviation devices, which really should be as reliable as possible, you probably want to aim a little bit lower: something that is high but achievable. And then, how do you use this number? Here comes one of the most interesting concepts, which is error budgets.
So this is how we solve the conflict of incentives. Say we have an error budget for a month of our service being down for seven hours; we said we would aim for 99% availability, about seven hours of downtime. We can reach this target by never being down, of course, or maybe by having one hour of outage one day, another hour of outage another day, and maybe three hours on a day where the outage was really bad. That's still under seven hours, so we are inside this error budget.
But if we go over that, that's when we say we've reached our error budget: we don't have any budget left, so we can't keep releasing new features, and we need to spend time on reliability instead. And this, for me, is the key concept that helps us fix this problem of incentives: when we have so many problems that our application is no longer as reliable as we decided it should be, that's when we need to stop and say, okay, no more features, we work on reliability until everything is working again.
So what do we do? How do we fix that? Here is how we start mitigating the risks of putting out features, so that we reduce the chances that we will have an outage. We can never reduce them to zero, but we can reduce the chances so that we stay within our error budget.
The first thing is the testing infrastructure, and when I talk about testing, I'm not just thinking about unit tests or integration tests, which is what people think of first when they talk about testing, but about a bigger set of tools that are related to testing.
Of course you want good test coverage, but after that come the more interesting things. Continuous integration is something that probably you've all heard about, and it's very nice. And then pushing on green, which means releasing only the things that pass the tests, is also very nice, but a lot of people don't actually apply it, or they give themselves a pass, and then things start to turn out badly.
So if you have continuous integration and your tests always pass and are always green, everything's fine. But if you have flaky tests that sometimes pass and sometimes fail, or worse, tests that always fail, and you've taught yourself to ignore these tests, whether they're flaky or always failing, but they are still part of the continuous integration, then you're no longer pushing on green. You are just clicking some override button or whatever to push a release that didn't pass
all the tests. Then you're basically ignoring all the testing infrastructure that you have, and it's very likely that you will make mistakes and release stuff that isn't good, because you are ignoring the tests. So having a good testing infrastructure implies not having flaky tests and not having tests that you just ignore, because a failing test needs to be actionable.
Otherwise, your testing infrastructure is just wasting resources and it's not helping you. But on top of unit testing and integration testing, we also need to do other kinds of testing, like load testing, to check that our servers will be able to handle the load that we expect, and even more than the load we expect. It would be great if our application is super successful and then needs to handle more load.
Of course you want to do this not in production but on your test servers, or at least don't start doing it in production; do it on your test servers first. You don't want all your production servers to suddenly crash. So release canaries are a strategy to test in production, but without actually breaking production.
The origin of the name "canaries" is a bit morbid, but it's useful to understand it to know what we are talking about. It comes from the canaries that were used by coal mine workers to know whether there was enough oxygen: they brought canaries with them, and if the canaries died, they knew that they had to get out of there because there wasn't enough oxygen. So yeah, it's a little bit sad for those old canaries.
But we kept this name for sacrificing some of our servers, or our instances, with the new versions of our software. So we check whether the new version is working correctly or not by running it on a subset of the servers, and if we see that it's not working successfully, we roll back to the previous version, and only a few users were affected.
This idea of having a subset also allows us to use less of our error budget. Say you deployed a new version to 10% of your instances and this version was not working, and it took you one hour to realize that it wasn't working and roll back to the previous version. So you had a one-hour outage, but it only affected ten percent. So instead of 60 minutes of your error budget, it's actually six minutes of your error budget, because only 10% of your users saw the problem.
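The arithmetic behind that, as a quick sketch; the assumption, which the talk makes implicitly, is that budget impact scales linearly with the fraction of users affected:

```python
# Budget cost of a partial outage, scaled by the affected fraction.
def budget_cost_minutes(outage_minutes: float, fraction_affected: float) -> float:
    return outage_minutes * fraction_affected

print(budget_cost_minutes(60, 0.10))  # -> 6.0 minutes, not 60
```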
So it wasn't a 100% outage. But it's important that we use this correctly, by checking: we deploy the new version to the new instances, we check that it's working correctly, and if it's not, we roll back. Rolling back is what saves us and what allows us to go back to the previous working condition.
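Putting the canary idea together, here is a sketch of a staged rollout with roll-back-first behavior; the stage fractions and the deploy, check_healthy, and rollback hooks are hypothetical placeholders for your release tooling and monitoring.

```python
# Staged (canary) rollout sketch: widen only while healthy, roll back
# immediately on trouble. All hooks are hypothetical placeholders.
STAGES = [0.01, 0.10, 0.50, 1.00]  # fraction of instances per stage

def rollout(version, deploy, check_healthy, rollback) -> bool:
    for fraction in STAGES:
        deploy(version, fraction)
        if not check_healthy():
            rollback()       # roll back first ...
            return False     # ... investigate and fix later
        # in practice, leave bake time (often a day) before widening
    return True
```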
(What happened? No, we rolled back the slide.) And then, if it's working correctly after some time (it depends on what it is that you are deploying, but usually you wait for one day), you deploy it to more instances.
It depends on how big or small the application or service is. All right, next slide. And then another source of problems is the humans. If we are doing this kind of canarying, we may need a human that is pushing it to one percent, two percent, 25 percent, and then the human may make mistakes.
So it's important to try to remove humans from the loop as much as possible, in other words, to automate as much as possible. And this includes release automation, like the canary process that I was talking about, but it also includes things like automatic rollbacks. This may sound like black magic, but it might be possible to have your monitoring infrastructure detect that your service is now suddenly responding with a lot of 500s, decide that this doesn't seem right, and roll back automatically.
So how can we prepare ourselves for outages? The first step, step zero, is to accept that outages will happen. Even if we feel like they shouldn't, they will happen, and once we accept that, we can prepare for them. So have playbooks for the most common problems. Playbooks are basically documentation of what to do when a problem occurs: if we get an alert, what do we do with this alert?
And you might think that it's very obvious what to do with the alert, but when you're under pressure, when the system is down and the phone is ringing and people ask why you haven't fixed it yet, having a clear step-by-step process of what you need to do, even if it's obvious, is really helpful.
Also, I mentioned this already, but it's very important, so I'm repeating it: have this "roll back first, fix later" philosophy engraved in your mind, because it's very common for software developers to want to fix it in production and do a hotfix. You see the code and you think: oh yeah, there's just a plus one missing here,
I will just add the plus one and push it. And then it turns out that the plus one you added had unintended consequences, and you went from some users seeing a problem to all users seeing the problem, because you didn't realize.
So: roll back first, fix later, even if it's tempting to hotfix. And then also have a process, like a meta-process, for handling incidents: who's going to communicate with the customers, who's going to escalate, who's going to write the postmortem, whatever. Have this meta-process ready, because when you are in a very stressful situation, it's really hard to think on your feet, so the more things you can just follow from a checklist, the better. All right, and I mentioned postmortems, and that's our next slide.
Postmortems, also called root cause analyses by some people, are documents that explain what happened: what happened before the outage, what happened during the outage, how it got fixed, who did what and why, etc. But it's important that none of this is about blaming people, so postmortems should be blameless. Even if you say "Marga ran this command", it shouldn't be "Marga is stupid and ran this command"; it should just be "Marga ran this command". And it's important that we remember that whatever mistakes were made,
even if a human made a mistake, the problem lies in the system, the system that allowed the human to make that mistake. Because we are all trying to do the best we can, and if the system allows us to do things that are wrong, the system is at fault, not the humans, because humans will make mistakes.
We need to engineer a system that will not let those mistakes cause outages. And so the goal of the postmortem is to learn from all of this, not to blame anybody, and to list the action items that will help us prevent these outages from happening in the future. And it's important to follow up on these action items, because it's no use listing all the things you need to do if you then never do them.
All right, so I've given similar versions of this talk before, and usually when I get to this point people are very anxious, because they feel like there's a lot of information and they don't know where to start. So let's look into how you can get started. Next slide, yeah, all right.
So the first step, and it has to be the first step, is to monitor your service. If you don't have monitoring, you need to deploy monitoring, and as I said, it doesn't need to be Prometheus; it can be something simple.
And then, as you start getting this information from your monitoring, your SLOs, and your alerts, you can decide to invest time in testing and automation as needed. And as I said, it doesn't make sense to aim for 100% reliability; invest as much as makes sense for your service. And finally, have a plan for outages and learn from the mistakes that you make.
Mistakes are really valuable in helping us deploy better services, so make sure that you don't just paper over them, but that you spend the time to learn from them. All right, that's it. We have a questions slide, and hopefully there are questions now.
B
Yeah, cool, thanks a lot for the talk, I found it very interesting. I would just like to ask something; maybe you can share some ideas on this side of the problem. In my case, I come from an operations team point of view. We have a lot of the things that you have mentioned, like metrics, alerting, SLOs; we do postmortems, and many teams from the development side also support us. Of course we cannot do this on our own, so many of our applications already support this, but I still find it sometimes not so easy to convince developers why this is important. So, any ideas on this side? To be fair, we don't do the error budgeting yet; we are thinking about it, and this might maybe be something.
A
Maybe have something simple where developers can see the results of their work. If the service is down, if there are too many errors, whatever the problems are that arise from bad features, if it's visible it's easier to communicate and to send the message, and in the end it helps. I understand the struggle.
So maybe try to send the message from that point of view: users are not happy with the service being down all the time, or with it crashing, or with half of the requests returning errors, or whatever issues your application is having.