A View from the Trenches
A
Hello, everyone. Welcome to Cloud Native Live, where we dive into the code behind cloud native. I'm Annie Talvasto, I'm a CNCF Ambassador, and I will be your host tonight. Every week we bring a new set of presenters to showcase how to work with cloud native technologies and the stories behind them.
A
They will build things, they will break things, and they will answer all of your questions. You can join us every Wednesday or, like this time, sometimes on other days as well. This week we have an amazing set of presenters here to talk with us about designing and operating reliable cloud services:
A
A View from the Trenches. So it's very exciting to have this special program today. As always, this is an official live stream of the CNCF and, as such, it is subject to the CNCF Code of Conduct.
A
So please do not add anything to the chat or questions that would be in violation of that code of conduct. Basically, please be respectful of all of your fellow participants as well as presenters. With that done, I'll hand it over to our amazing presenters to introduce themselves. Who wants to go first?
B
I'll go first. My name's Anurag Gupta. I'm the founder and CEO of Shoreline.io and one of the founders of reliability.org. Both relate to my lifelong interest in building highly reliable systems.
C
My name is Niall Murphy. I am CEO and founder of a teeny, tiny startup in the SRE/ML space called Stanza Systems. If you know my name, though, it is probably because of the Site Reliability Engineering book, the SRE book, and/or possibly the MLOps book, which are both explorations of what it means to build...
D
I am Stephen Townshend. I am currently a developer advocate for a company called SquaredUp, who do unified dashboarding and visibility. I did performance engineering for many, many years, I moved into SRE more recently, and I have a podcast called Slight Reliability where I share my learning journey in the reliability space.
A
Perfect, an amazing group of people here with us today. Today we have a kind of panel format where I will ask questions and then we're going to get amazing answers. But, as always, you in the audience can still ask your questions as we go along, and we'll probably have some time for Q&A at the end as well. So, first question to everyone here: what does good or great look like when discussing reliability?
C
So we only have an hour, right? Yes. So, just off the top of my head, I don't think there's a context-insensitive answer to this. I don't think there is one answer which just fits everything, and I expect most people in the industry wouldn't be surprised to hear that. I will say there are a couple of fundamental questions
C
you have to be addressing, or addressing to a sufficient level, in order for this question to mean anything to you. The first one, of course, is how much the org cares about reliability. If it doesn't care at all, for whatever reason (there might be arbitrary reasons), then good or great doesn't really matter. I would say, though, that making sure the basics are done well is always going to be a foundational thing you have to have before figuring out
C
what it is that you're doing. Have you decided organizationally how reliable you should be? If you haven't decided that, and you're just getting, I suppose, whatever the native platforms will deliver to you, then you have a huge decision to make about precisely what to improve and where. And are you resourcing it sufficiently? Do you have the right number of people, the right kind of people, the right resources, et cetera?
C
B
A
B
You know, you can always get better. Now, here I'm talking about reliability, not availability. Sometimes people confuse the two.
B
They talk about their fleet-wide availability, but that ignores things like: am I getting errors back from the system? Is the performance out of whack? I often find that people take a fleet-wide availability goal and then claim they did really well even if, say, one region was down for an hour. That has happened to me with some of the services I used to work on at AWS, with the services I depended on. Your customers just don't care about your fleet-wide availability. That's an internal goal.
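To make the fleet-wide versus per-region point concrete, here is a minimal sketch. The 20-region fleet, the region names, and the 30-day month are assumptions chosen for illustration; they are not figures from the panel. It shows how a one-hour outage in a single region can still leave the fleet-wide number above 99.99% while that region's customers see less than 99.9%.

```python
from dataclasses import dataclass


@dataclass
class RegionStats:
    """Uptime bookkeeping for one (hypothetical) region."""
    name: str
    total_minutes: int
    down_minutes: int

    @property
    def availability(self) -> float:
        return 1 - self.down_minutes / self.total_minutes


def fleet_availability(regions):
    """Availability averaged over every region-minute in the fleet."""
    total = sum(r.total_minutes for r in regions)
    down = sum(r.down_minutes for r in regions)
    return 1 - down / total


# Assumed scenario: a 30-day month, 20 regions, one region down for an hour.
MINUTES_IN_MONTH = 30 * 24 * 60
regions = [RegionStats(f"region-{i}", MINUTES_IN_MONTH, 0) for i in range(19)]
regions.append(RegionStats("region-19", MINUTES_IN_MONTH, 60))

fleet = fleet_availability(regions)
worst = min(regions, key=lambda r: r.availability)

print(f"fleet-wide availability: {fleet:.4%}")
print(f"worst region ({worst.name}): {worst.availability:.4%}")
```

The design choice worth noting is reporting the worst region (or worst customer cohort) alongside the average, since the average is exactly the number that hides the experience described above.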
B
What they care about is their particular experience. Were they able to check out? Were they able to perform their task in their expected time, without any drama, and ideally without having to code against a bunch of error cases due to inefficiencies in your software?
A
D
I think the ultimate goal of reliability is to support an organization to meet its objectives, whatever they might be, because reliability just for the sake of being reliable, I don't think, is necessarily enough. You know what I mean? It needs to have a purpose behind it. So I think maybe good reliability is when the org is achieving against its business goals,
D
the customers are able to achieve the outcomes they set out to achieve, and the ops team enjoy their work, hopefully, or are at least not totally stressed out by constant incidents and fires burning all the time. The technology is easy to operate. Incidents are manageable. And, of course, the technology is also reliable and available and performs well enough to do what it needs to do. So it's maybe a bit fluffy, but that's my take.
D
A
D
Oh, I'm going to follow on, so I have a story. It's actually from my performance testing days. It was a system of record that used to be a mainframe. Over the years it got upgraded and got put into a mainframe emulator running on a Windows server, and at one point we needed to take some of the data that system used and put it in a big cloud repository.
D
So this mainframe system, which was emulated, would call to the cloud to retrieve, you know, customers or something like that. It turned out, and we didn't know this at the time, that for the mainframe emulator, which also had its own database (a SQL Server), the cloud was considered an external component, so it would have to make calls externally to the cloud. And there was a performance issue: it would take something like 15 seconds to retrieve a customer.
D
The thing is that those calls were single-threaded, so when it was making a call to the cloud to retrieve a customer, the entire system would freeze for every single user for 15 seconds. It couldn't even read its own database to do anything. It was the worst thing I've ever seen. When I think back on it, yes, it was a failure of design, but it was probably more a failure of too much technical debt.
D
B
So I spent about eight years at AWS, so I saw a lot of failures there, and I'd say the biggest ones were almost a failure of imagination, where the system had an error path to deal with a resource becoming problematic, but it wasn't able to deal with the at-scale case where the failure cascaded. For example, the week before I joined AWS, there was a huge outage in Dublin due to a lightning strike, and because so many different machines went out at exactly the same time,
B
we had this re-mirroring storm, where every machine was trying to re-mirror to some other machine that was itself trying to re-mirror, and there just wasn't enough capacity and nothing was going through. The reason I call it a failure of imagination is that we can all see and code against the everyday things that happen all the time. The question is really thinking about what's the largest scale
D
B
Know,
but
you
know
designing.
B
is really, I think, incredibly important when you're designing reliable systems. You must have seen a bunch.
C
Again, the margin of this call is too small to contain all the things. Actually, in a rare example of trumpet blowing, I run a podcast called Getting There with Nora Jones, the CEO and founder of Jeli.io, which
C
There's an emerging movement about how folks actually analyze and respond to incidents, which is making quite a lot of inroads into the SRE community in particular, but also in hospitals and chemical engineering and aviation and a whole bunch of other fields, where the standard reaction to incidents can be a little bit more blameful than we might like, and the organization doesn't learn as much as it would otherwise. But in terms of bite-sized stories, yeah, I've loads. I suppose there's the time that somebody dereferenced a pointer incorrectly and everyone saw the data for customer zero.
C
That was pretty cool. Well, that was not so good because, obviously, customer zero did not want to show their data to customers one through umpty million. But it was relatively simple from a contributing-factor point of view.
C
The code got changed and pushed. The difficult thing was figuring out what the correct privacy and legal response was, now that you have this data which shouldn't have been accessed by other folks. Then there are the ones where the technical thing is maybe clear, but you're working, internally or externally, with folks who might have some trust issues with the framework of authority that you're dealing with, and so you have to establish trust in the middle of an active incident.
C
I had one in the public sector in Ireland, where the technical cause was actually relatively clear from day zero and was something to do with traffic routing, but folks on the call weren't in a position to believe that a large organization could make guarantees about how their systems were performing that the smaller organization was ready to really absorb.
C
So that was primarily a social conversation rather than a technical conversation. And then, a little bit like Anurag's story, I suppose: early in the days of S3, the AWS system, it used a gossip protocol to tell other machines where the data lived. So if, for example, a data center lost power and all of the machines came back up at one time, they would go:
C
"Hi, I'm machine X, I have chunks 1 through 70 million," and "Hi, I'm machine X plus one," and they would all tell each other and completely flood the network. And they would flood the network even worse because there was an assumption that this network, which was actually split between two data centers, had exactly the same bandwidth characteristics on every point-to-point link, which of course is not the case if you're going between two data centers. So that was fun.
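The arithmetic behind that kind of flood can be sketched in a few lines. This is a toy model and not a description of how S3 works: the machine count, the one-message-per-peer announcement, and the random start delay (jitter) are all assumptions for illustration. Jitter is one commonly used mitigation for simultaneous restarts; the panel did not say it was the fix applied here.

```python
import random
from collections import Counter


def peak_messages_per_second(machines, jitter_seconds, rng):
    """Peak one-second message count when every machine announces to all peers.

    Each machine sends one announcement to each of the other machines,
    starting at a time drawn uniformly from [0, jitter_seconds].
    A jitter of 0 means every machine starts at t=0 (the power-restore case).
    """
    per_second = Counter()
    for _ in range(machines):
        start = int(rng.uniform(0, jitter_seconds)) if jitter_seconds > 0 else 0
        per_second[start] += machines - 1
    return max(per_second.values())


rng = random.Random(7)  # seeded so the run is repeatable

# Hypothetical fleet of 1,000 machines.
no_jitter = peak_messages_per_second(1000, 0, rng)     # everyone at once
with_jitter = peak_messages_per_second(1000, 60, rng)  # spread over a minute

print(f"peak without jitter: {no_jitter} messages in one second")
print(f"peak with 60s jitter: {with_jitter} messages in one second")
```

The all-at-once case grows with the square of the fleet size, which is why a behavior that is harmless for one restarting machine becomes a network flood when a whole data center comes back together.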
B
Let's start. My first take is that you should automate everything that you can. We've done a decent job over the last 20 or 30 years in improving quality, and so I think it's now time to apply that same sort of engineering rigor to reliability as we did for quality: have things go through pipelines, inject failures, see how they're handled, simulate large-scale events to see if the system can heal. I feel like the more this
B
is software, the more it scales and the more you can manage it like it's software, as opposed to people and processes, which are intrinsically hard. People change; processes are sometimes followed and sometimes not.
D
I think maybe the biggest challenge right now is just how distributed our systems are and how many components need to talk to each other. We talk about zero trust for security; I think, in a way, we need to start thinking about reliability in a similar way. So maybe being careful about the vendors that we choose to connect to and making sure that they're reliable, because at the end of the day you are only as reliable as the dependencies that you depend on.
D
That's a whole other way of thinking, I think. And also realizing that there are going to be maybe dozens of components that are out of your control that you depend on, and building sort of degraded service around that. So if this thing goes down, have a backup plan or provide a degraded service. I think that's one thing, and I guess
D
the other thing is, of course, having really effective observability. It's easy to say that about observability, but being able to understand and see clearly what's happening in a really complex distributed system is hard. It's hard to get right, but when you get it right, I think it makes a big difference too in improving reliability.
C
There's an absolutely gigantic amount of stuff we could say here; it's even a struggle to keep it to 56 minutes. I think I'd start off by saying there's a bunch of things you could do at different levels, and some of the levels are maybe relatively tactical or code-focused or whatever. One thing that, believe it or not, contributes a lot to reliability
C
Is there some kind of resiliency you can think about there on a design level? There's something I'm quite fond of, not quite a programming language, more a kind of modeling language, called TLA+. Hillel Wayne and a bunch of other folks do popularizations of it. It's a Leslie Lamport production, of course, from the same person who gave you Paxos and other leader election kinds of primitives. TLA+ allows you to model state transitions in a distributed system and say, okay,
C
if this goes to here, and we message this thing back with this other information, can we prove that this is correct? Or can we fuzz it enough that we can look at some cases where reliability might potentially be under threat? Amazon actually used TLA+ on a bunch of stuff and discovered that there's some 39-level-deep stack of stuff that actually ends up being problematic.
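TLA+ itself is not shown here. As a rough stand-in for the idea of exploring every state transition and checking a property, the following is a toy breadth-first state explorer in Python. The system being modeled (two workers that each read a shared counter and then write back the value plus one) and the invariant are assumptions picked for illustration; a real specification would be written in TLA+ and checked with its own tooling.

```python
from collections import deque

# State: (counter, pc_a, tmp_a, pc_b, tmp_b)
# pc values: 0 = about to read, 1 = about to write, 2 = done.
INITIAL = (0, 0, 0, 0, 0)


def next_states(state):
    """Yield every state reachable in one step (either worker may move)."""
    counter, pc_a, tmp_a, pc_b, tmp_b = state
    if pc_a == 0:    # worker A reads the counter
        yield (counter, 1, counter, pc_b, tmp_b)
    elif pc_a == 1:  # worker A writes back its stale copy plus one
        yield (tmp_a + 1, 2, tmp_a, pc_b, tmp_b)
    if pc_b == 0:    # worker B reads the counter
        yield (counter, pc_a, tmp_a, 1, counter)
    elif pc_b == 1:  # worker B writes back its stale copy plus one
        yield (tmp_b + 1, pc_a, tmp_a, 2, tmp_b)


def invariant(state):
    """Once both workers are done, the counter must equal 2."""
    counter, pc_a, _, pc_b, _ = state
    return not (pc_a == 2 and pc_b == 2) or counter == 2


def find_violation(initial):
    """Breadth-first search; returns a shortest trace to a bad state, or None."""
    parents = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parents[state]
            return trace[::-1]
        for nxt in next_states(state):
            if nxt not in parents:
                parents[nxt] = state
                queue.append(nxt)
    return None


trace = find_violation(INITIAL)
for step in trace or []:
    print(step)
```

Even in this tiny model the checker finds the interleaving where both workers read before either writes, leaving the counter at 1. The appeal of a model checker is that it finds such traces mechanically, including ones that are far too deep to spot by reading the design.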
A
Perfect. How are you feeling, do we want to take an audience question here in the mix? Oh, perfect, enthusiasm, that's nice! So we have Lauren George asking: how do small/medium-sized organizations protect themselves from outages from big public cloud providers without breaking the bank? Is it even possible today?
C
Sure, I'll give an initial answer: it really depends on what your app is trying to do, and it really depends on what your dependencies are. There's actually a fascinating piece of work from, I think, Walmart, who are in no sense a small or medium-sized company, who do a kind of multi-cloud spot pricing for instances, and that's almost the definition of not breaking the bank, I suppose, or breaking other banks. But the basic deal is: figure out what you're depending on, and figure out
C
if, in your application, you can use specific cached results or algorithmic fallbacks or hard-coded fallbacks of various kinds. Note that all of these may potentially have reliability outcomes that you were not expecting at some future point in time. But if us-east-1 disappears and you are using some subset of stuff, but you can limp along with cached results for a while, that's actually kind of a win. A lot of people are pushing multi-cloud at the moment, which there's been a lot of discussion about in the industry, and basically the feeling is
C
it doesn't make a whole load of sense to go copy-paste your infrastructure from one provider to another. They work hard at making sure that you can't actually duplicate that kind of thing, a little bit like mobile telco billing models, I suppose. But the real deal is: if you are using a specialized thing that's only available in one provider, make sure it's something you can do without for a couple of hours if you need to, like a BigQuery equivalent. But I'll stop talking now.
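The "limp along with cached results for a while" idea can be sketched as a small wrapper around a dependency call. Everything named here (`CachedFallback`, `fetch_price`, the two-hour staleness limit) is hypothetical and chosen only to illustrate the pattern; it is not an API from any particular cloud provider or library.

```python
import time


class CachedFallback:
    """Serve the last good result when the dependency fails, up to a staleness limit."""

    def __init__(self, fetch, max_stale_seconds, clock=time.monotonic):
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._cache = {}  # key -> (value, time stored)

    def get(self, key):
        try:
            value = self._fetch(key)
        except Exception:
            # Dependency is down: fall back to the cache if it is fresh enough.
            if key in self._cache:
                value, stored_at = self._cache[key]
                if self._clock() - stored_at <= self._max_stale:
                    return value, "stale"
            raise  # nothing usable cached, so surface the original failure
        self._cache[key] = (value, self._clock())
        return value, "fresh"


# Demo with a fake dependency and a fake clock (both assumptions for illustration).
now = [0.0]
healthy = [True]


def fetch_price(sku):
    if not healthy[0]:
        raise ConnectionError("dependency unavailable")
    return {"sku": sku, "price": 10}


prices = CachedFallback(fetch_price, max_stale_seconds=7200, clock=lambda: now[0])

first = prices.get("abc")   # dependency healthy: fresh result, cached
healthy[0] = False
now[0] = 3600               # one hour into the outage
second = prices.get("abc")  # served from cache, marked stale
now[0] = 10000              # past the two-hour limit
try:
    prices.get("abc")
    third = "served"
except ConnectionError:
    third = "failed"

print(first, second, third)
```

Returning the "fresh" or "stale" marker with the value is a deliberate choice: as noted above, fallbacks can have reliability outcomes nobody expected, so callers should be able to tell when they are running on old data.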
B
Yeah, I think there's a lot to unpack in Niall's answer; let me add on to it a little bit. I think the question really becomes how to think about reliability the same way that we used to think about security, in terms of threats, except now the threats are when you lose availability of your dependencies. So it's: for how long can you tolerate that?
B
How do you degrade when that happens, rather than fail entirely? If you're using Lambda and Lambda goes down entirely, you're kind of out of luck. If you are using VMs and a VM goes down, you can probably spin up another VM, particularly if you've got a warm pool already out there. So
D
B
You know, it's really a question of tolerating failures, and you can't just duplicate everything, right?
D
Yeah, I guess my two cents is that if you're a small to medium organization and you're thinking about going multi-cloud, have a serious think about the risk versus the cost of what you're trying to implement and the complexity of it. I think it would be a hard case to make that it's actually worth the effort and the cost. And yeah, there are other things you can do, like Niall was saying.
A
Great. We had a few more questions, but I think we're going to go through some of the pre-decided questions first and get back to a few of those. So Andrew, I saw your question about chaos engineering.
A
We're going to get to it eventually. Lauren also asked about books or articles, and I think that's a great question to maybe wrap up with towards the end, if any of our panelists have great resources for everyone to jump into next. I also saw the podcast question, and I'm going to send the links to everyone in the chat, so no worries there. And then to the next question, which is: how do you identify and resolve potential reliability issues before they become your customers' concern?
B
You know, at AWS one thing that we all did, even Andy Jassy did, was monitor Twitter, because at least at the time there were people
B
asking "Is S3 down?" or something like that, and your whole goal was to make sure that by the time someone was asking that question, you already knew, you already had the event started, and you were already working on it. Maybe you hadn't identified the root cause, but you were working on it. So I think it's
B
an important question, and it's a really important goal we should all have, because what you're doing as a cloud provider is taking on the responsibility that your customer would otherwise take on themselves, and so they need to trust you. They need to trust you to care about them in a way beyond what they would do on their own. So you just need to be really good at this stuff.
D
Well, obviously, I think great observability is important. I think one of the keys there, let's say you're starting from scratch, is to start with making sure the customer can use the service, rather than getting lost in the myriad of technical metrics and events that you could be tracking and looking at. Because if you can't answer that question, can customers consume the service, then nothing else really
D
matters, in my opinion. I think that SLOs are a good way to potentially do that, but I don't think you have to do SLOs either. That's contentious, but this is my particular opinion.
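For readers who have not met SLOs before, a minimal sketch of the "can customers consume the service" measurement follows. The checkout counts and the 99.9% target are made-up numbers for illustration, and the function names are hypothetical rather than taken from any monitoring product.

```python
def availability_sli(good_events, total_events):
    """Fraction of customer-facing requests that succeeded."""
    if total_events == 0:
        return 1.0  # no traffic, nothing failed
    return good_events / total_events


def error_budget_remaining(slo_target, good_events, total_events):
    """Share of the allowed failures that is still unspent (negative if overspent)."""
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else 0.0
    return 1 - actual_bad / allowed_bad


# Assumed month of traffic: 1,000,000 checkout attempts, 999,400 succeeded.
sli = availability_sli(999_400, 1_000_000)
budget_left = error_budget_remaining(0.999, 999_400, 1_000_000)

print(f"checkout SLI: {sli:.4%}")
print(f"error budget remaining: {budget_left:.0%}")
```

The point of the sketch is the choice of event, not the arithmetic: it counts whether customers could complete a task, rather than any of the many technical metrics underneath.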
D
The other thing, coming from a performance testing background, is to be testing reliability during delivery, especially for a new product which maybe you can't just go live with immediately and have load on, because there are no customers at first. Testing is a great way to shake out, not everything, because real customers in the real world do unexpected and wonderful things, but you can shake out a lot of issues and understand your solutions and systems and services better.
D
So those are, I think, ways you can identify issues before customers do. But beyond that, there are all the other things we can do to mitigate the impact: the way that we deploy, doing deployments like blue-green or canary deployments, or rapid rollback, things that can reduce risk there. And learning from incidents: not just having incidents occur, but gaining something from them, because incidents are fantastic in terms of learning and growing as an organization. And also, I guess
D
C
C
No, thank you very much. And then you figure out what went wrong, and then you turn that into an observability rule, and you just do that a billion times and eventually everything is covered, hooray. Except, of course, everything is covered, and so you have a billion things to monitor, and it's not necessarily clear which seven of the billion things actually matter. So that's a question that you can also resolve with this other magic trick, which is... we're pretty much big back-end people, right?
C
D
A
Great, a good suggestion there. Getting to the next topic, which is reliability.org: why do you see the need for a new reliability.org community?
A
B
The reason we started reliability.org, which is a non-profit and has nothing to do with selling any of our own goods, is that building highly reliable systems is something of a black art, and it's mostly informed by just bitter experience. That's okay at a hyperscaler, because they've got lots of people with lots of bitter experience and they get better.
B
But it's a problem for the rest of us, right? There's just no good place that you can go to offer your thoughts or to get advice. Twitter kind of used to be that, but it's less so now. So you want a safe place you can go, without a lot of noise from a lot of vendors, where you can do this. So I asked Niall, I asked other people like Stephen to join, and yeah, it's early days, but I'm actually
D
I actually wasn't aware of any other sort of reliability communities out there that were not based around a particular technology or an open source project. I hadn't found one before, so for me it was a new thing, but maybe that's just because I wasn't aware of what else was going on.
D
I also think that there's generally been a split between open source communities, who are very active from what I see, and these sort of commercial communities of people that are built around a technology like AWS or whatever. So bringing those together, I think, is quite exciting, and getting those different perspectives. I think cross-company collaboration is important as well.
D
Yeah, I think we can talk more about that, but I know someone in New Zealand who's an SRE in quite a large, important organization, and he's the only one. He's a lone SRE, so he has no internal community whatsoever. The only chance for him to get ideas and to bounce ideas is to go out to the industry.
C
On the other hand, he's living in New Zealand; I mean, it could be worse. I hang around a lot in either role-focused or conference-focused or sometimes technology-focused Slacks, which is what it tends to be these days rather than mailing lists or whatever. So I like the idea that there would be a kind of cross-role, cross-company, cross-industry conversation about this that isn't tied to any one particular thing, and that seems to me to be a good thing.
A
Great. Now that this great new community has started, how do you see the reliability.org community growing and evolving in the future?
B
Let me start. Communities are hard. They take nurturing, they take constantly adding useful or thought-provoking content, and it takes creating a safe place where you can offer your opinion even if there's an expert like Niall hanging out there, who might just say, "Well, in my experience, the answer is 12," and
B
C
The answer is 13. Let's just get that right. Yeah, I mean, I don't know, communities are hard. Things are hard. Maybe it'll be
C
I don't know, but I will say that I'm increasingly anxious about the future of the SRE profession in a world which is, I suppose, growing increasingly unreliable, if that's a good word for it. In a sense, I have this whole thing I've talked about at SREcon a number of times, about weaknesses in the intellectual foundation for justifying the value of reliability and so on and so forth. So I think those questions are unanswered.
C
D
C
D
I also see the potential in the community for mentor/mentee relationships, potentially something we could extend upon. And, as has already been said, just the idea of having a place where you can put out an idea and bounce it off people with a whole wide range of experiences is fantastic. I can't overstate how important that is.
A
Great. And then Stephen, as you were one of the first people to join the reliability.org community, what made you jump in right away?
D
Yeah, so I work for a smallish company, around 100 staff. We're building a new product, and we haven't hit that inflection point yet where reliability is the key thing; it's more about building the right product at this particular point in time. So I was really excited to have a place where I could go and keep my finger on the pulse and hear what's happening in the reliability world, because I'm not getting the chance to do the work every single day.
D
So that's a great thing, and I just really like the vendor-neutral nature of the community. As I said before, most other communities that I've been part of or have heard about have been around a project or a technology, and this is great. I'm just excited. I don't know what's going to happen, but it's great.
A
Amazing. So, to everyone here: how can people get involved with the reliability.org community?
B
A
Great, good answer. Since no one wants to add anything, I guess, perfect. Now I think we're going to grab one audience question before we go to the next question here. Andrew asked before: how trendy is chaos engineering, the practice Netflix pioneered a few years back? And he adds that of course Kubernetes wasn't as popular then as it is today. What are your takes on chaos engineering?
C
So, if you'll forgive me for putting on my copy editor hat, or parser hat, and going "how trendy is chaos engineering?": do you want a scalar answer, like 13, or are you asking how important it is that you should know about chaos engineering, or how relevant it is in the industry? I have only my opinion here; I don't have strong survey data or anything like that. I think one credible group of people who are doing this are verica.io, if you've come across those.
C
They also run the VOID database (Courtney Nash, from the VOID), a database of incident data, out of Verica as well, but
D
C
the main thing is that chaos engineering is really useful; it's a really fundamental technique. Instead of just waiting for things to arbitrarily break in your production and tidying up after it, you go: okay, I'll break a tiny bit of it all of the time, and if I break the right tiny amount of it in the right place, I'll hopefully learn something that I can make progress on with respect to improving reliability in my production, without actually having a complete wipeout event.
C
So if you think of outages as a kind of tree of possibilities flowing from some single node, then chaos engineering helps you to depth-first explore a bunch of potential failure modes that you might otherwise only really encounter after they've set something serious off.
C
It has one particular downside, as I understand it (I have no direct evidence for this), which is that people go, okay, so you're going to break my production, and they don't like that bit. They go: I would much rather just wait for it to break completely than break it a little bit all of the time, because
B
C
the moral failure is somehow not directly connected with my actions, which is not true at all, of course. But there is an issue with having the case for it resonate with certain kinds of audiences. That's all I can tell you.
C
I suppose you could make that argument. It depends, because often chaos engineering is quite complex to set up. You can set up a simple bot that goes around zapping arbitrary VMs every so often, but if the subset of VMs that get zapped aren't performing different functions, you only ever learn the same thing that you were going to learn anyway, so it's not adding to your stock of knowledge.
C
So in order to be really useful for the organization, the chaos engineering has to be doing something nasty and new to you every time. But if you're getting nasty and new things happening to you all the time anyway, that's not much additional value. So why don't we just do the thing that we're doing right now, which is nasty and new, until it stops being so new, at which point we can introduce chaos engineering again.
B
So my quick thought is that chaos engineering can be done chaotically, where you're just doing random things, the Chaos Monkey kind of case, and I don't personally find that terribly useful. It's kind of like fuzzing compared to thinking very carefully about your testing framework or security framework.
B
What I do think is incredibly useful is fault injection, where you really think deliberately about what percentage of things you treat as a fault as you call a subsystem, and how well you deal with those things in terms of retries, in terms of redirects, whatever it is. That, I think, can be done in a very careful, methodical way that you can actually get some use out of, because it's very hard to get useful knowledge out of randomness.
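A deliberate fault-injection experiment of the kind described can be sketched as a wrapper that fails a chosen percentage of calls to a subsystem, paired with a retry policy whose effect you then measure. The names here (`FaultInjector`, `call_with_retries`, `lookup_customer`) and the 20% fault rate are hypothetical, chosen for illustration; this is not the API of any real fault-injection tool.

```python
import random


class InjectedFault(Exception):
    """Raised in place of calling the real subsystem."""


class FaultInjector:
    """Fails a fixed fraction of calls so the caller's handling can be tested."""

    def __init__(self, fault_rate, rng):
        if not 0.0 <= fault_rate <= 1.0:
            raise ValueError("fault_rate must be between 0 and 1")
        self.fault_rate = fault_rate
        self.rng = rng

    def call(self, fn, *args):
        if self.rng.random() < self.fault_rate:
            raise InjectedFault("injected fault")
        return fn(*args)


def call_with_retries(injector, fn, *args, attempts=3):
    """Retry on injected faults; re-raise after the final attempt."""
    for attempt in range(attempts):
        try:
            return injector.call(fn, *args)
        except InjectedFault:
            if attempt == attempts - 1:
                raise


def lookup_customer(customer_id):
    """Stand-in for the subsystem being called (hypothetical)."""
    return {"id": customer_id}


def failure_rate(fault_rate, attempts, calls=10_000, seed=42):
    """Fraction of calls that still fail after retries, for a seeded run."""
    injector = FaultInjector(fault_rate, random.Random(seed))
    failures = 0
    for i in range(calls):
        try:
            call_with_retries(injector, lookup_customer, i, attempts=attempts)
        except InjectedFault:
            failures += 1
    return failures / calls


no_retry = failure_rate(0.2, attempts=1)
three_tries = failure_rate(0.2, attempts=3)

print(f"20% faults, no retry: {no_retry:.2%} of calls fail")
print(f"20% faults, 3 attempts: {three_tries:.2%} of calls fail")
```

Because the fault rate and the seed are fixed, the experiment is repeatable, which is the contrast being drawn with purely random chaos: you choose what to break and can compare runs before and after a change to the retry logic.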
A
Great. If there are no additional comments, hopping on to the next topic, which is: what are the top causes of major site outages, and how can people avoid them?
C
Yeah, so there's some old data from the SRE book suggesting that around 70% of outages flow from change of some kind, like config change or binary change or whatever. So stop changing and everything will be fine. Oh, hang on, actually, I'm very sorry: it turns out you can't stop changing. Okay, so what we have to do instead is to change in a particular disciplined fashion, so we don't trigger the unexpected interactions between attribute sets A, B and C, and so on.
C
D
This is purely speculation, but I feel like an increasing number of outages are going to come from the dependencies that we have, because we've got so many dependencies and they're growing all the time. So I feel like that's going to be an increasing source of incidents and outages.
C
D
I used to work with a guy, and he used to tell me that SRE is DevOps without empathy.
B
C
B
We used to have a two-hour meeting every week where we'd go through the prior week's outages in some important services. If I think about the collection of, shall we say, greatest hits that were on replay across those weeks, there were always things like database issues, bad deployments, expired certificates, and misconfigured network settings. It's very similar to what Niall was saying. What's common across that set is that there's widespread, severe impact, because they have a large blast radius, and that they take time to resolve.
B
So
how
do
you
deal
with
that?
Well,
one
thing
is
that
you
plan
the
deployment
roll
out
so
that
you
control
the
blast
radius.
You
automate
the
rollback
of
changes
so
that
you
can,
you
know,
minimize
the
time
of
failure,
because
it's
basically
an
integral
of
the
severity,
the
duration
and
the
breadth
of
impact
right,
and
so
if
you
can
reduce
any
one
of
those
dimensions,
I
think
you
make
progress.
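A rough sketch of that reasoning, assuming the impact integral can be approximated as a simple product of constant severity, duration and breadth. The function names, the stage percentages and the 5-minute and 60-minute durations are illustrative assumptions, not figures from the discussion.

```python
def staged_rollout(stages, is_healthy):
    """Widen a deployment stage by stage and stop at the first unhealthy stage.

    stages: increasing percentages of the fleet, e.g. [1, 10, 50, 100].
    is_healthy: callable taking the current percentage, returning True or False.
    Returns (exposed_percent, rolled_back).
    """
    exposed = 0
    for percent in stages:
        exposed = percent
        if not is_healthy(percent):
            # An automated rollback would be triggered here; the sketch only
            # records that it happened and how far the change had spread.
            return exposed, True
    return exposed, False


def impact(severity, duration_minutes, breadth_percent):
    """Approximate the impact integral as a product of constant factors."""
    return severity * duration_minutes * breadth_percent


def always_unhealthy(percent):
    # Stand-in health check for a bad change that fails wherever it lands.
    return False


# Staged rollout with automated rollback: caught at 1%, reverted in 5 minutes.
staged_exposed, staged_rolled_back = staged_rollout([1, 10, 50, 100], always_unhealthy)
staged_impact = impact(severity=1, duration_minutes=5, breadth_percent=staged_exposed)

# All-at-once rollout with a manual fix: 100% exposed for 60 minutes.
big_bang_exposed, _ = staged_rollout([100], always_unhealthy)
big_bang_impact = impact(severity=1, duration_minutes=60, breadth_percent=big_bang_exposed)

print(staged_impact, big_bang_impact)
```

The same bad change costs far less when the rollout limits breadth and the rollback limits duration, which is the sense in which shrinking any one dimension shrinks the whole.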
B
I
mean,
we
pretty
much
stopped
using
databases
internally,
because
you
know
it's
at
least
Charlie
Bell
used
to
say
like
it's
like
putting
a
switchblade
in
your
baby's
crib.
You
know.
Just
don't
do
that.
You
know
it's
complex
software,
that's
easy
to
use
and
you
should
stop
using
it.
Just
use.
Definitely
TP
opinions
may
vary
I've
written
a
lot
of
the
databases
over
the
years.
So
something
of
a
fan.
C
I
like
SQLite,
it's
awesome
actually
and
the
unit
test
framework
for
SQLite
is
really
awesome.
Yeah
but
like
type
safety
is
for
wusses,
I
think,
general
idea.
Yes,
sorry,
I'll
stop
there.
A
No,
no!
It's
good!
It's
great!
It's
good
to
have
some
discussion.
So
how
do
the
best
people
out
there
manage
their
Cloud
environments?
C
So,
like,
there's
a
lot
of
nuance
behind
that
question,
but
I
would
say
that
a
lot
of
the
things
we
actually
talked
about
earlier
in
this
session,
with
respect
to
understanding
your
dependencies,
figuring
out
observability,
figuring
out
critical
user
journeys,
making
sure
you
can
roll
back
all
of
those
best
practices
are
things
that
"the
best
people"
are
either
doing
or
they've
got
a
good
excuse
for
why
not,
and
sometimes
it's
a
question
of
picking
your
excuse,
yeah.
B
Resilient
architectures,
which
can
handle
it.
So,
for
example,
you
know
I
built
Amazon
Aurora
and
it
effectively
injects
a
large-scale
event
every
week,
because
it
does
a
deployment
which
takes
one
out
of
six
elements
of
the
quorum
out
while
it
does
the
deployment,
but
it
handles
it
without
any
drama,
and
that's
just
because
it's
designed
to
deal
with
that
failure
and
dealing
with
that
failure
means
you
can
also
deal
with
AZ
failures
or
disk
failures
or
network
failures,
and
you
know
blah
blah
blah.
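The quorum arithmetic behind that claim can be sketched in a few lines. The six copies come from the discussion; the write quorum of four is an assumption made for illustration, since the speaker does not state the quorum size here, and `quorum_available` is a hypothetical helper.

```python
def quorum_available(total_copies, copies_down, quorum_size):
    """Return True if enough copies remain reachable to form a quorum."""
    return total_copies - copies_down >= quorum_size


TOTAL_COPIES = 6  # from the discussion: six elements in the quorum
WRITE_QUORUM = 4  # assumption for illustration; not stated by the speaker

# Zero to three copies unavailable at once.
availability = [
    quorum_available(TOTAL_COPIES, down, WRITE_QUORUM) for down in range(4)
]

for down, ok in enumerate(availability):
    print(f"{down} copies down -> writes {'continue' if ok else 'stall'}")
```

Under these assumptions, the copy taken out by a weekly deployment is indistinguishable from a failed disk or an unreachable node: the same margin absorbs all of them, which is why rehearsing one failure mode every week also exercises the others.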
B
But
I
think
that
kind
of
designing
for
the
fragility
of
the
environment
in
which
we
operate
is
important.
A
Great
so
now
let's
grab
another
question
from
the
audience,
so
we
had
Luther
asking:
my
company
embeds
an
SRE
team
in
rotation
with
different
teams
in
hope
to
work
closely
with
them
to
improve
reliability
and
monitoring,
occasionally
in-house
workshops.
Any
downsides
to
this
approach?
B
How
do
they
escalate?
I've
seen
orgs
that
have
put
in
SRE
functions,
but
everything
still
flows
to
engineering
to
fix
and
it's
just
a
bump
in
the
wire
and
that's
useless
right,
and
you
know,
having
someone
look
over
your
shoulder
and
just
tell
you:
okay,
Implement,
you
know
you're
not
doing
processes
well
enough,
you're
slouching,
you
know
you're
not
typing
correctly.
You
know,
that's
not
helping
me
make
things
better.
B
So
what
does
help
me
is
if
they're
actually
there
shoulder
to
shoulder
fixing
things
which
I
think
you
know
Luther's
point
kind
of
touches
upon
the
notion
of
embedding
together,
rather
than
creating
it
as
a
cascade
or
treating
it
as
a
separate
retrospective
function.
Niall,
you
know
all
about
this
yeah.
C
I
mean
there's
a
huge
amount
of
nuance
to
this,
depending
on
what
the
definition
of
team,
rotation,
different
team,
etc,
etc.
All
of
those
could
have
a
huge
impact
on
what
the
end
result
ends
up
being.
I
will
say:
I
am
most
familiar
with
the
single
individual
embedded
model,
rather
than
the
team
to
team
embedding.
Like,
team
to
team
embedding
seems
weird.
That's
like
that's
a
partnership
model,
not
an
embedding
model.
I
think
the
weakness
of
the
embedding,
the
single-individual
C
Kind
of
model
where
you
would
have
an
SRE
that
would
go
and
sit
with
the
team
that
has
a
particular
reliability
challenge
or
some
kind
of
knowledge
deficit
or
whatever
for
six
months,
say.
That's
a
pattern
that's
very
common
in
Facebook
production
engineering,
last
I
was
aware.
The
difficulty
with
that
model
is
like
the
thing
it's
good
at
is
responding
to
particular
emergencies
or
lacks
quickly.
Yes,
these
17
teams
have
some
problem.
We
will
send
staff
out
there
and
they
will
fix
stuff
and
so
on.
C
But
what
tends
to
happen
is
if
you
do
a
lot
of
these
kinds
of
rotations,
you've
no
real
team
identity,
at
least
not
one
that
lasts
longer
than
the
period
of
the
rotation,
and
that
turns
out,
even
though
you
might
not
think
it's
that
important,
it
actually
turns
out
to
be
really
important
with
respect
to
giving
people
the
idea
that
they
can
build
a
career
and
have
a
kind
of
a
long-lived
contribution
to
long-running
services,
etc.,
all
of
which
are
kind
of
necessary
sub-components
of
promotion,
amongst
other
things.
C
So
that's
kind
of
the
upsides
and
downsides
of
embedding.
The
other
question,
I
suppose,
that
kind
of
hides
behind
some
of
that
is:
why
does
the
team
in
question
feel
they
can't
do
this
themselves?
Like,
are
they
looking
for
warm
bodies?
Because
if
they're
looking
for
warm
bodies,
that's,
generally
speaking,
not
a
good
sign.
It's
like:
I
don't
care
who
they
are
just
get
them
cranking
at
the
code
now.
C
Well,
actually,
that's
maybe
not
good.
Or
are
they
looking
for
a
specific,
guided
expertise
on
something
in
which
case
that
can
sometimes
be
a
bit
better.
Yeah,
it
depends,
he
said,
shockingly.
D
They
were
too
busy
worrying
about
all
this
other
stuff
they
had
to
do.
So
if
the
priority
isn't
there
then
embedding
as
in
that
sort
of
enablement
is
pretty
hard.
A
Right
so
now
we
can
enter
the
audience
Q&A,
not
that
we
haven't
taken
audience
Q&A
already
here,
but
if
you
have
any
questions
in
mind,
now
is
the
perfect
time
to
ask
them,
because
we
have
a
bit
less
than
10
minutes
left
so
type
them
away.
And
while
we
wait
for
people
to
type
their
questions,
if
they
are
now
frantically
typing
away,
let's
ask
the
last
question
from
my
side,
so
tell
us
about
the
best.
Oh,
we
have
immediately
a
question.
A
So
let's
jump
into
that
and
leave
my
question
a
bit
later.
So,
have
you
found
that
those
with
certs
such
as
Azure,
AWS
and
so
forth,
are
better
at
thinking
through
designing
and
operating
reliable
services?
C
I
think
it's
cloud
certification
programs
or
applications,
yep.
What's
your
view?
Yeah,
so
I
have
a
definite
view
on
this,
which
may
not
be
shared
by
other
people,
he
wanted
to
warn
everyone.
Yeah,
like
you
say,
Lauren,
I
have
no
certs.
In
11
years
in
Google,
for
example,
I
never
met
a
single
person
that
had
a
vendor-related
cert.
I
think
that's
true,
yeah.
And
in
general,
I'm
not
saying
I'm...
C
Like,
I
know,
back
in
the
very
old
days,
like
the
networking
certifications
that
Cisco
used
to
do,
CCNA
and
all
that
kind
of
stuff,
you
would
spend,
you
know,
36%
of
your
life
learning
about
the
difference
between
Type
2
and
Type
3
LSAs
in
OSPF.
And
how
relevant
is
that,
really?
Probably
not
that
relevant.
I
suppose
it
helps
to
distinguish
you
from
the
larger
mass
of
people,
formally
in
some
sense,
who
have
no
experience
with
these
ideas.
B
Yeah
so
I
kind
of
agree
with
you.
I
feel
like
you
can
kind
of
have
three
tiers
of
people.
You
know
there's
the
tier
who
actually
kind
of
don't
know
what
they're
doing
and
then
there's
a
tier
where
you
know
they've
shown
through
some
sort
of
certification
that
they
kind
of
know
what
they're
doing
and
then
you've
got
this
tier
that
you
really
want,
where
they
really
know
what
they're
doing,
because
they're
working
at
it
and
they're
way
too
busy
to
get
certifications,
and
so
the
problem
is:
B
How
do
you
distinguish
between
the
top
tier
and
the
bottom
tier?
And
you
know,
assuming
you
can
do
that
I
don't
think
certs
matter.
If
you
can't
do
it,
it
matters
a
lot
right?
I
mean,
you
know,
suddenly
everyone's
been
an
AI
engineer
for
the
past
30
years
and
I
kind
of
doubt
it.
But
you
know
that's
how
they
describe
themselves.
A
Okay,
anything
to
add,
Steven?
Anything?
And
then
we
have
one
question
from
the
audience,
but
I
also
want
to
address
you're
asking
about
the
profile
as
a
community
member.
I
think
there
are
going
to
be
good
resources
online
about
how
to
change
your
profile
around.
I
think
we
don't
have
the
time
maybe
to
handle
that
question
here,
but
I
do
want
to
get
back
to
Lauren's
earlier
question
about
any
grade.
C
Yes,
so
there
is
another
book
yet
another
book
called
the
service
level
objectives
book,
primarily
written
by
Alex
Hidalgo,
the
the
book.
It
has
a
chapter
in
it
about
SLO
monitoring,
which
happens
to
be
written
by
somebody
called
Niall
Murphy
who
I've
never
met,
and
this
Niall
Murphy
wrote
about
how
to
do
pretty
concrete
steps
with
respect
to
SLO-based
monitoring
and
observability.
I've
been
told
it's
a
really
good
introduction.
That
might
be
a
place
to
start.
C
There
are
a
lot
of
other
monitoring
resources
out
there
and
Honeycomb
has
an
observability
book,
which
I
think
is
very
good.
Actually,
yes,
excellent,
Steven's
holding
it
up
now
and
written
by
many
of
my
favorite
people
and
there's
also
some
sections
in
the
SRE
books
about
monitoring
as
well,
and
something
from
James
Turnbull
I
think
a
couple
years
ago,
called
The
Art
of
Monitoring,
which
I
also
think
is
freely
available
online.
D
Yeah
loads
of
places,
there's
an
annual
conference
called
o11yfest.
You
can
still
go
to
o11yfest.org,
o11y
with
the
ones,
you
know
and
I
think
you
can
still
watch
them
all
for
free
and
there's
tons
of
really
good
content
there.
If
you
want
to
watch
those.
Monitorama,
yeah.
B
The
other
thing
I'd
say
is
that
pretty
much
everyone
who's,
you
know,
a
luminary
in
the
SRE
community
is
sitting
on
Twitter
and
you
could
just
reach
out
to
them
with
your
questions
and
they're
pretty.
B
You
know
so
I
followed
a
lot
of
people
there
and
they're
pretty
generous
with
their
time.
So
yeah,
you
can
do
better
than
reading
some
dusty
book.
A
Yeah
perfect,
but
I
think
that's
all
that
we
have
time
for
today
but
great
to
see
so
many
amazing
questions
and
answers
from
everyone
here
and
thank
you,
as
always,
everyone
for
joining
the
latest
episode
of
Cloud
Native
Live.
It
was
great
to
have
a
session
about
reliability
here
today
and
also
I
love
the
interaction
and
the
questions
from
the
audience,
and
you
can
always
tune
in
in
the
coming
weeks.
We
have
more
great
sessions
coming
up
and
thanks
for
joining
us
today
and
see
you
all
in
the
coming
weeks.