From YouTube: Observability: What Got You Here Won’t Get You There
Description
In the first wave of DevOps, practitioners embraced change to incorporate development into design and build processes. To successfully build the next generation of software, practitioners will need to catch the second wave of DevOps to focus on controlling and fine-tuning evolving architectures. Honeycomb CEO, Charity Majors, discusses how to ensure buildability, empower developers, and make a truly observable architecture that’s primed for success.
Learn more about Kong: https://konghq.com/
Well, what I'm going to talk about is observability in high-performing teams. As the lovely gentleman just said, my name is Charity. You can tell I'm from ops because that's how I feel about software: the only good diff is a red diff.

I've worked in a bunch of places. These days I work at Honeycomb. We just announced this week that we raised another round, so we're going to be around for two more years.
Have you all read Jez and Nicole and Gene's amazing book, Accelerate? I feel like we've been cargo-culting a lot of best practices for as long as I've been in the industry: someone saw something work pretty well at a company that they used to work at, and so it becomes almost like dogma, like "oh, that's better!" Anyway, they surveyed thousands of teams, and they realized that you can actually tell how high-performing a team is by asking just four metrics.
How often do you deploy? How long does it take between when you merge to master and when your code is live? How quickly can you recover from an outage? And how many of your deploys fail? That just makes it much more concrete and clear for us, right? It's not, "well, it feels good, you know, we're spending like 30 percent of our time..."
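A minimal sketch of what computing those four metrics could look like, assuming you keep simple deploy and incident records; the record shapes and field names here are invented for illustration.

```python
from datetime import datetime
from statistics import median

# Invented record shapes: a deploy knows when its change merged to master,
# when it went live, and whether it caused a failure; an incident knows
# when it started and when service was restored.
deploys = [
    {"merged": datetime(2019, 9, 1, 10, 0), "live": datetime(2019, 9, 1, 10, 20), "failed": False},
    {"merged": datetime(2019, 9, 2, 14, 0), "live": datetime(2019, 9, 2, 14, 15), "failed": True},
]
incidents = [{"start": datetime(2019, 9, 2, 14, 30), "end": datetime(2019, 9, 2, 15, 5)}]

def four_key_metrics(deploys, incidents, period_days):
    return {
        # 1. deploy frequency
        "deploys_per_day": len(deploys) / period_days,
        # 2. lead time: merge to master -> live in production
        "median_lead_time": median(d["live"] - d["merged"] for d in deploys),
        # 3. time to restore service after an outage
        "median_time_to_restore": median(i["end"] - i["start"] for i in incidents),
        # 4. change failure rate
        "change_failure_rate": sum(d["failed"] for d in deploys) / len(deploys),
    }

print(four_key_metrics(deploys, incidents, period_days=7))
```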
It's really just these four things. And what that means is that if we put energy and pressure into these four things and make them better over time, the ripple effects for the rest of your engineering organization, for the rest of your company, can be huge, right? Oh, and PS: teams that use platforms as a service are one and a half times more likely to be elite. The way they defined elite is, you know... well, here, I'll show you. Deploying on demand:
very much achievable, right? But it really, really, really pays off, because if you look at the delta between teams that are high-performing and teams that aren't, you can see how quickly you can get out-competed by teams that can move faster and with more confidence.

So what does it mean? What are elite teams made up of? Well, obviously, they're made up of all ex-Facebook employees and MIT grads, says the music school dropout.
It doesn't matter how much will and desire you bring to this if you can't actually see what's going on. So instead of "elite," I really like to use the term "excellent," right? What we need is production excellence. My coworker Liz Fong-Jones has a great talk about this, which you should all watch. We need it because it means happier customers and happier teams.
Like I said, two constituencies. And a lot of times people, and they'll be very hard-nosed business folks, will say it's all about users: "we're obsessed with users." Right? Amazon: their mission is all about being obsessed with users. They never say anything about being obsessed with happy teams, and this is short-sighted. This is short-sighted, and in the end it will damage you, because engineering excellence is tightly correlated with happy, well-rested, empowered engineers.
So next, let's talk about the way things are changing and why this suddenly matters. Systems are changing, problems are changing, and thus people have to change. This is always scary, right? We hate change, as human beings, as animals; it's scary. So let's walk through the hows and whys. System complexity is going through the roof, as everyone here knows, I'm sure. Our systems are now ephemeral, dynamic, loosely coupled, distributed.
Basically, if you look at the architectural diagrams: on one side we see our humble LAMP stack; in the middle is a diagram of Parse, from when I was working there a few years ago; and on the other side is the national electrical grid. When you're building systems these days, I would argue that what you should have fixed in your mind is that grid. You should be building systems like they're the national grid: far-flung, loosely coupled, distributed, with no ability to predict what's going to go on. Failure is a given.
You have to make friends with it. You know, some problems are going to be only hyperlocal, right? You have to zoom way in to see that, say, a tree fell over on Market Street in San Francisco. Could you have predicted that? No. Should you have written a monitoring check for it? Also no. Should you invest in gathering detail at the right level of abstraction, so that you can ask any question of your systems and understand any state they can get into, whether or not you could have predicted it?
Yes, right. Some problems are hyperlocal. Some you can only see if you zoom way, way out: say, every bolt that was manufactured in 2013 is rusting ten times as fast as all the other bolts. This happens all the time in our data centers; last year's batch of RAM, let's say, has to be proactively replaced a lot faster. It happened at the Facebook data centers. This means that our tools for understanding these systems have to shift away from the known unknowns.
With a LAMP stack, you know, I could look at it, eyeball it, and guess eighty percent of the ways it would ever fail. So I'd write a bunch of monitoring checks for those things, right? I would page myself, I would write up runbooks, and then over the next six months we'd kind of bake it in and I'd learn the other 20%. And maybe twice a year I'd be really, truly stumped: what the hell is going on in my LAMP stack? Very, very rare.
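That known-unknowns workflow in miniature: you predict the failure mode up front and encode a threshold. Here is a sketch of a classic Nagios-style check; the thresholds are invented.

```python
import shutil
import sys

# Classic known unknown: "the disk will fill up eventually."
WARN, CRIT = 0.80, 0.90  # invented thresholds

usage = shutil.disk_usage("/")
fraction = usage.used / usage.total

# Nagios-style exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL
if fraction >= CRIT:
    print(f"CRITICAL: disk {fraction:.0%} full")
    sys.exit(2)
if fraction >= WARN:
    print(f"WARNING: disk {fraction:.0%} full")
    sys.exit(1)
print(f"OK: disk {fraction:.0%} full")
sys.exit(0)
```

This model works exactly as long as the list of failure modes stays short enough to enumerate in advance.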
A
Well,
it
happens
every
goddamn
day
with
modern
systems
right
in
fact,
it's
so
much
so
that,
like
it's,
this
savings
of
shift
from
monitoring
to
observability,
obviously
right
from
a
world
where
you
could
predict
what
was
going
to
happen.
You'd,
write
checks
and
you
write,
run
books
to
a
world
where
you've
pretty
much
automated
those
things
away
and
every
time
you
get
paged
should
be
something
new
that
takes
your
creative
engineering
brain
like
that
again
shouldn't
be
that
should
be
I'm
stumped.
every time you get paged, right? Well, this means that our toolset needs to change in other ways too: to make it more explorable, so that we can start by assuming we know nothing and quickly break things down to figure out exactly what's going on, every time. Because, you know, we're just bad at understanding our systems. We're really bad at understanding our systems. Every one of you, I assume, has had the experience of getting paged, starting to try and figure out what was wrong,
getting the recovery, and everyone kind of looks at each other like: we don't know what happened. Are we going to invest the rest of our day into trying to figure it out? Probably not; very rarely, right? It's fine, it's probably fine. But we really don't understand what's happening most of the time, and I will illustrate this. Here is a problem: photos are loading slowly for some people. Not everyone.
I could do this all day. You know, that one in Romania was probably one of my favorite outages ever. We start getting these reports: "pushes are down." And I'm like, pushes are not down; I'm getting pushes, and they're in a queue, so pushes are not down. And days later they're still very upset, so we go and start figuring it out. I don't know if this is still true, but it used to be that Android devices all have to keep a socket open to,
you know, the push notification service, so they could subscribe to pushes. So we would run an autoscaling group, you know, load balanced using round-robin DNS, so they'd all find their way to this service. And so we add capacity one day. No big deal; we do this all the time. It all came up, started serving traffic, looked normal. Turns out the round-robin DNS record had exceeded the length limit.
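For context on that failure mode: a classic DNS response over plain UDP is limited to 512 bytes, so a round-robin record set that grows past that gets truncated, and clients have to retry over TCP, which some resolvers handle badly. A hedged sketch of checking a record set for this, using the third-party dnspython package; the name and resolver here are placeholders.

```python
import dns.flags
import dns.message
import dns.query

NAME, RESOLVER = "api.example.com", "8.8.8.8"  # placeholders

query = dns.message.make_query(NAME, "A")
# ignore_trunc lets us inspect a truncated reply instead of raising
response = dns.query.udp(query, RESOLVER, timeout=5, ignore_trunc=True)

size = len(response.to_wire())
truncated = bool(response.flags & dns.flags.TC)
print(f"{size} bytes on the wire, truncated={truncated}")
if truncated or size > 512:
    print("over the classic 512-byte UDP limit; clients will retry over TCP")
```

Adding one more host to the pool looks completely harmless right up until the response crosses that line.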
So I ask you: what exactly am I supposed to monitor for, right? Instead of having a few things to monitor for that are well understood and that we can write runbooks for, it's like we have this infinitely long, thin list of things that almost never happen, so once they do... Or, you know, it's five impossible things that all have to converge in order for this one bug to get triggered. Yes, this makes staging basically useless.
They're all completely unknown unknowns: they may never have happened before and will probably never happen again. So, the workflow that we've had. For years we've been like: okay, we encountered a new error, cool. Let's have a retrospective, let's post-mortem it, let's write a monitoring check for it, let's build a graph for it so we can find it immediately the next time I'm looking at this one dashboard, and then we'll write a monitoring check so that when we get the page, we know what it is, right?
Okay, fast forward a few years later: we're getting page bombs of things that are in no way related to the actual problem, and we have tens of thousands of dashboards, and you can't jump directly to the one that describes this exact problem, because whatever was serving that status probably stopped shipping it at some point, and it just doesn't work. Yes, welcome to distributed systems. We are all distributed systems engineers now; I think that means you get to ask for a raise.
Look at the LAMP stack: it's famous for having the database, you know, the one database, and the application, right? And because it had those characteristics, we grew up with monitoring and metrics, which are inherently incapable of handling high-cardinality data. So, anyway: if you ever start using Datadog or whatever, and you think, "it'd be nice to be able to group my metrics by a hostname tag," that works until you have more than a couple few dozen hosts, and then you've suddenly blown out the cardinality of the keyspace, and Datadog is like, pay us billions of dollars, and it still won't work well. Cardinality is important, it turns out. Now everything is a high-cardinality problem, because there are many of everything.
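To make the blow-up concrete: a metrics system stores one time series per unique combination of tag values, so the series count is the product of each tag's distinct values. A back-of-the-envelope sketch, with made-up counts:

```python
# Made-up distinct-value counts for the tags on one request-latency metric.
tags = {
    "hostname": 200,    # a modest autoscaled fleet
    "endpoint": 50,
    "status_code": 12,
    "build_id": 30,     # a month or so of deploys
}

series = 1
for tag, distinct_values in tags.items():
    series *= distinct_values
    print(f"grouped by {tag:<12} -> {series:>12,} time series")

# 200 * 50 * 12 * 30 = 3,600,000 series for ONE metric. Add a user_id
# tag and it is game over. A wide structured event sidesteps this:
# all of these are just fields on one row per request.
```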
We mostly felt... we had the illusion of control, because we felt like we had a handle on the ways that our systems were going to break. We saw failure as a thing that should be prevented, right? We put all this energy into, like, preventing it.
We taught people to fear production; we taught them to leave it to us, right? Which led to all of the, yeah, the self-abuse that ops people are known for. Masochistic on-call culture, that's what I'm looking for. And we treated deploys like they were scary, like they were things to be feared and prepared for and gripped really tightly. I think of it as: we built this glass castle, and it was beautiful.
But it was fragile, a forbidding edifice, and it was very hostile to exploration and experimentation. Well, the world is changing in many, many, many ways, right? So let's look at the technical aspects and cultural associations of distributed systems. We've got many storage systems. I'm not saying that all of these are good things; as someone with a database background, it makes my eyes bleed how many databases people like to use these days, but fine, I'm down, I'm hip.
There are a lot of services. Every check should be for an unknown unknown; every alert should be a new thing. It's entirely an instrumentation game. In fact, this is a big part of what's driving DevOps: the people with the original intent in their head, who are writing the code, have to see it through. You have to see it all the way out, to watch users using it in production.
There is so much that can go wrong in the process. And then deployment: instead of being these big bangs, where we pull hard and heave it over the wall, it's more like baking cookies. You have to think of deployments more like you're putting them in the oven and beginning the process of gaining confidence in your code. It's just the start, right? It's just the start. You should never trust your code when you're putting it out in prod.
This is why we build, you know, progressive deployment stuff and automated checks, auto-rollback when errors exceed a threshold, you know. All of these safety things that we baked in to do these things automatically are done because we know we can't trust this code, right?
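A minimal sketch of that gaining-confidence loop: a canary rollout that widens traffic in steps and rolls back automatically if the error rate exceeds a budget. The hooks here (set_traffic_fraction, error_rate, rollback) are hypothetical stand-ins for whatever your platform actually provides.

```python
import time

STEPS = (0.01, 0.05, 0.25, 1.0)  # fraction of traffic on the new build
ERROR_BUDGET = 0.01              # roll back past 1% errors (invented)
SOAK_SECONDS = 300               # let the cookies bake at each step

def progressive_deploy(build, set_traffic_fraction, error_rate, rollback):
    """Widen the rollout step by step; bail out on the first bad signal."""
    for fraction in STEPS:
        set_traffic_fraction(build, fraction)
        time.sleep(SOAK_SECONDS)
        observed = error_rate(build)
        if observed > ERROR_BUDGET:
            rollback(build)
            raise RuntimeError(
                f"rolled back at {fraction:.0%} traffic: "
                f"error rate {observed:.2%}")
        print(f"{build}: healthy at {fraction:.0%} of traffic")
    print(f"{build}: fully rolled out")
```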
But I think it's almost like the development process is smearing all the way over into production, and the production process is smearing all the way back into dev, so that it's more like a continuum, right? Production is where your users live. Failures have to become your friend, frankly, just because there are too many of them. We have to embrace them. We have to embrace them; we have to make friends of them.
A
We
have
to,
you,
know,
become
a
at
home
with
the
idea
that
in
reality,
you've
got
so
many
things
failing
right
now
that
you
don't
even
know
about
so
sleep
well,
you
know
like
it's
a
different.
It's
it's
it's
kind
of
black
humor,
but
it's
also
true
that
your
software
is
way
more
broken
and
it
is
correct
at
any
point
in
time.
That's
fine!
That
means
deploys
opportunities.
That
means
every
outage
and
opportunity.
I
find
this.
A
Actually,
very
cheering
I
I
mean,
like
I,
believe
that
this
is
the
only
way
that
we
move
forward
to
a
more
humane
industry
is
by
just
accepting
the
human
frailty
of
all
of
our
systems,
best
practices,
you
build
it.
You
run
it
three
years
ago,
when
I
started
talking
about
putting
software
engineers
on
call
people
got
very
angry.
You
were
very
upset
and
now
I
think
that
debate
is
actually
over
it's
more.
The only question now is how, right? But the trade-off here is that, in exchange for everyone agreeing to be on call to support their code, management has an equal responsibility to make sure it does not suck. It should not. Those of us who are over 30, who don't want to get woken up in the middle of the night anymore: we should be able to not plan our lives around being on call, right? It should not be life-impacting.
Don't reward bad management with the gift of your labor. So here's the dirty little secret: the next generation of systems is not going to be built and run by burned-out, exhausted, tired people who can't communicate with each other, and they're not going to be run by teams who are just following orders, who just check in at the beginning of the day and check out at the end and don't really bring their full creative selves. These systems are just too complex now; they're too hard, they're too chaotic. You can't just learn the system and then watch it.
It should be lovable, too, right? But I like this, because when I think about our tools: our tools were designed for a much more predictable world, and you might have looked into this future and thought that this would be scary and terrible and doom, doom, doom. And it could be. But I actually think that all of the incentives are kind of lining up in the right direction to make it a better place. Anyway, philosophical blah blah blah.
Let's talk about tools. One of the things about tools is that the last couple of generations of tooling for understanding our systems, like monitoring, metrics, and logging, were really organized around the idea of answering questions quickly. If you had a question, they would answer it fast, and they do that very well.
But when I think about the kinds of problems that I'm having, I'm usually dealing with a situation where I have maybe some problem reports that I may or may not trust. I have a hazy idea that something over there reminds me of something I might have seen once, which might be wrong. I've got some leaps of fancy; I've got some suspicions.
What I don't have is a question. And in fact, by the time I've worked my way to the point where I have a question that I can ask, I probably know the answer, right? Because finding the answer, it turns out, is much easier than figuring out what the question is. And so we need to shift tooling to a much more exploratory, open-ended way of interacting with our systems, and we also need to pull all the state about these systems out of our heads and put it into a tool where everyone can access it.
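A toy sketch of the difference: if you keep raw, wide events instead of pre-aggregated answers, any hunch can become a group-by after the fact; the question is decided at query time, not at instrumentation time. The event fields here are invented.

```python
from collections import defaultdict
from statistics import median

# Raw wide events, one per request; fields are invented for illustration.
events = [
    {"endpoint": "/photos", "region": "eu", "build": "v212", "duration_ms": 3400},
    {"endpoint": "/photos", "region": "us", "build": "v211", "duration_ms": 90},
    {"endpoint": "/login",  "region": "eu", "build": "v212", "duration_ms": 80},
]

def breakdown(events, field):
    """Median latency per value of *any* field, chosen on the fly."""
    buckets = defaultdict(list)
    for e in events:
        buckets[e[field]].append(e["duration_ms"])
    return {value: median(durations) for value, durations in buckets.items()}

# "Photos are loading slowly for some people," so start slicing:
for hunch in ("region", "build", "endpoint"):
    print(hunch, breakdown(events, hunch))
```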
First of all, we can't fit them in our heads anymore. Did I mention they're ephemeral and dynamic and changing every 15 seconds? If you're trying to reason in your head about how a request is making it through the system, you've forgotten something; you've left something out, right? And more to the point, what's in your head is not available to your team, and this will hold us back.
I think it's really important that we kind of let go of the... like, I love being the genius, you know. Because I remember I could look at a dashboard at Parse and just by the curve of that graph I could tell you the fault is Redis. Like a freaking genius. I kind of miss that, I do, I'm not going to lie. But what I don't miss is the fact that when I was on my honeymoon, I got called like half a dozen times, because nobody else could fix MongoDB, right? There are upsides, there are downsides; on balance, I think it's fair. So next I want to talk about how observability directly leads to these high-performing teams, and, in fact, why you can't really do it without observability. And listen:
I came up with a maturity model, and we distilled everything that we've ever seen or experienced about high-performing teams into these five things. I'll go through them really quickly. They come with pairs of lists, right? One list is: if you're doing well in this area, then this should resonate with you; you should be like, "yes, that describes me," in which case, cool, move on to the next one. Because the other list is pathologies.
Those are, you know, signs that you're not doing well. So, for resiliency: if you're like, "yes, all of that describes me," cool, you don't need to invest in resiliency. We all have a limited number of engineering cycles, and part of what the next generation of systems involves is learning to think ruthlessly about where to spend those cycles. This is why I'm always telling people to kill staging, you know? Because it's a black hole for your engineering time.
It's probably not going to get you the answers that you want, and by the time you do get answers, they're probably out of date. Spend your scarce engineering cycles on production. So, for resiliency: ultimately, on call is not extremely stressful, and that leads to low turnover, which is nice. On the flip side, if your outages are frequent, if you have alert fatigue, if troubleshooting is hard, then this is a place to invest. And observability helps here, because it gives you context.
There is such a difference between being on a team that has invested in instrumentation, where you don't have to have experienced a thing before to debug it, versus teams that are just pattern-matching based on shared trauma, or the last outage. You know, they see one spike in a graph and they're like, "I know what this means," and so they've only got a corner of the answer.
A
It
is
honestly
it's
really
hard
to
describe
to
people
with
this.
What
this
looks
and
feels
like
because,
like
we
found,
we've
had
like
zero
churn
ever
a
honeycomb,
the
heart,
the
heart
part,
is
getting
people
to
experience
it
on
their
own
systems,
because
everybody
knows
that
demos
can
be
faked.
Everybody
knows
that
you
know
fake
data
isn't
real,
but
once
you've
experienced
the
ability
to
interact
and
like
explore
and
just
ask
any
question
understand
any
any
result.
It's
like
night
and
day.
It really is. It's like putting your glasses on before you drive down the road. I'm really blind, and you do not want me to drive without glasses. That's how I feel about having observable systems. High-quality code is another one, right? If you're doing well, you spend your time on customer happiness, not on customer bugs (subtle distinction), and cascading failures don't bite you all the time.
The thing where you're like, "oh, there's a bug here," and you go to fix it, and suddenly your week is shot because nobody understands the code, it's a mess, you can't reproduce anything. When you're not doing well, this also manifests as fear of the deploy process, because any time something goes wrong, you basically have to go learn the code base from scratch, right? It's not intuitive, it doesn't have a consistent look and feel, and you're debugging problems that are nowhere near your area of expertise. And observability helps:
people should be able to answer the question, "how will I know if this isn't working?" That's the bar for instrumentation: how will I know if this thing that I'm about to push is or is not working? And then you go and you look at it: is it doing what you wanted it to do, and does anything else look weird? That catches like 85% of all problems before users can ever even notice them, right? Because that original intent is going to decay from your head.
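One hedged sketch of what meeting that bar can look like: stamp every event your code path emits with the build and flag it ran under, so "is my change working?" becomes a single query right after you push. The emit() helper and the field names are invented.

```python
import json
import sys
import time

BUILD_ID = "v213"  # invented; imagine it stamped in at deploy time

def emit(**fields):
    """Invented helper: one structured, wide event per unit of work."""
    fields.update(build_id=BUILD_ID, ts=time.time())
    json.dump(fields, sys.stdout)
    sys.stdout.write("\n")

def resize_photo(photo_id, new_resizer=False):
    start, error = time.time(), None
    try:
        ...  # the change you are about to push lives here
    except Exception as exc:
        error = repr(exc)
        raise
    finally:
        emit(op="resize_photo", photo_id=photo_id, new_resizer=new_resizer,
             error=error, duration_ms=(time.time() - start) * 1000)

# After deploying: compare error and duration for new_resizer=True versus
# False, and for build v213 versus v212. That answers "how will I know?"
```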
If it's a week from now and you're trying to figure it out, who knows if you're going to remember, let alone someone else who's trying to debug what you just did, right? You know, the best way to catch bugs: at Honeycomb, we have everyone in one on-call rotation, and about once a week someone will get paged out of hours, and we've got a giant multi-tenant system with unpredictable traffic spikes and everything. And I'm not even talking about getting woken up, just paged out of hours.
No one is admitting it? Excellent. Yeah, I mean, sometimes that's a legit step to take temporarily, right? But you should be kind of ashamed of it. You should be like, "here's a band-aid that we have to use for now," and you can't wait till you've progressed to the point where you don't have to do it anymore. Because the number one thing that will make your deploys terrible is batching up multiple changes at once.
Do not do that. All of this fear that people feel about deploys is actually legitimate; it's just that what they should be afraid of is batching changes together in a deploy, not deploys, period. If you're just shipping one change at a time, oh my god, they're not scary. People should not avoid doing deploys. And observability helps here, because everything about this can be instrumented; everything about this should be transparent. It should be receptive to a curious person exploring.
Complexity and tech debt need to be managed; that's the fourth thing. You should be able to answer any question without shipping new code, because if you have to ship new code, well, you have to restart everything and the bug is going to go away. You know, the haunted graveyard: that was my favorite thing. This absolutely helps you do the right work at the right time, and honestly, this is what helps teams win.
You need to be up to your elbows in prod every day, right? Understanding how users are interacting with what you're doing, improving things, making sure that the work you're putting in is having an actual impact. I don't know how people expect to pick the right thing to do, the right work, if they aren't constantly interrogating production. And finally, understanding user behavior: this is everyone's job, right? This isn't just the PM's job, because your code meets reality where users interact with it, in production. That's a magical place.
A
That
is
the
only
thing
that
matters
it
does
not
matter.
If
your
test
can
pass
if
it
breaks
and
prod
it
doesn't
matter
nothing
matters,
but
production
teams
should
share
useful
for
a
view.
A
single
view
of
reality
and
observability
will
ground
you
in
that
reality,
and
sometimes
people
will
protest,
but
I
don't
have
the
time
to
invest
in
it,
which
is
a
chicken
and
the
egg
problem,
and
in
fact
you
can't
afford
not
to
because
you
waste
so
much
time.
Let
me
tell
you
about
all
the
time
that
you
waste.
There was a Stripe developer report recently, where they surveyed thousands of developers, and it showed that about 42% of every engineer's time goes to doing things that do not move the product forward, things that are often frustrating and time-consuming. You know, trying to figure out where in the code the right piece to fix is: fixing it is probably pretty easy; figuring out what needs to be fixed, that can take all week, right?
A lot of that could just be cut out if you could just see what you're doing and where you're going. You really can't afford not to. Engineering quality of life is tightly linked to high-performing teams and resilient systems, and if you're into all of these newfangled things like chaos engineering, observability is kind of a prerequisite for them. In conclusion, thinking about the future: I think we now have consensus that on call is everyone's job. Everybody has to support what they're writing in production, and it's actually key to doing a good job
as an engineer: actually observing your code all the way out to the end of its life cycle. And on call must not be miserable. I think on call is going to be less like a heart attack and more like diabetes, something you manage. It might not ever be pleasurable, although I have seen it be pleasurable; it can be a really nice break in the routine. It can be permission to go off on six other little things that you can't really justify doing during your normal week.
Right? I think it's a good idea to have a policy of: you do not do normal project work during your on-call week. That is a week where you get to work on all the other things. Work on your deploy scripts, work on your deploy pipeline, work on all of the little things that kind of bug you about everyone else's stuff (but, you know, don't take that too far). Work on operability, work on resiliency, work on the things that contribute to on call not being a nightmare.
Often people ask me how to think about instrumentation, and honestly I just say: look at how serverless does it. That's the correct lens, right? You should be observing the world through the lens of the instrumentation that you write. You can instrument almost any code, and deployless is coming.
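One way to read that serverless lens: the invocation is the unit of work, and everything you will ever know about it has to ride along in the one event you emit for it. A sketch of that shape as a decorator; the handler and the event sink are invented.

```python
import functools
import json
import sys
import time
import uuid

def observed(handler):
    """Wrap a handler so every invocation emits exactly one wide event."""
    @functools.wraps(handler)
    def wrapper(request):
        event = {"request_id": str(uuid.uuid4()), "handler": handler.__name__}
        start = time.time()
        try:
            result = handler(request)
            event["status"] = "ok"
            return result
        except Exception as exc:
            event["status"], event["error"] = "error", repr(exc)
            raise
        finally:
            event["duration_ms"] = (time.time() - start) * 1000
            json.dump(event, sys.stdout)  # stand-in for a real event sink
            sys.stdout.write("\n")
    return wrapper

@observed
def get_photo(request):  # invented handler
    return {"photo_id": request.get("photo_id")}

get_photo({"photo_id": 42})
```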
I don't know if y'all have been paying attention to the stuff that Dark is doing. So cool. They're just writing code live on production. It saves it and runs unit tests, and, oh, this is the cool stuff that I've seen people doing: it'll embed a sparkline graph, so when you're modifying a function it shows you an estimate of how long it took to run, and it estimates how long it will take to run after the change that you just made. It's so freaking cool. Tightening these loops is better for everyone, right? Invest in your deploys. Democratize access to this data. Don't be scared by regulations.
That was a mistake. We should have built the playground. We should have built a place where people feel invited to come and experiment and see the consequences of what they've done, of what they've built. Because, as human beings, we deeply crave impact, right? We crave autonomy, we crave meaning, we crave impact, and production is where you get all of these things.
Make production a playground, with guardrails. I mean, we expect children to get a bloody nose at the playground; we don't expect them to die. So invest in tooling to make it safer (we're all partners in this), but guardrails encourage curiosity. Emphasize ownership. Don't punish people for making mistakes, because you make mistakes too; we all do, and our stuff is failing all the time. It's fine, it's probably fine, and it'll be fine if we practice failures. And finally, senior engineers: your job is to amplify the costs that are hidden, right?
Your job is to surface things. Decision-makers are making the best decision that they can with the information that they have, and they often lack what's in your head. A lot of the time, when you're grumbling about something, about a decision being bad, that's because you know something the decision-maker did not. They did not know that they were building something that was going to be impossible to maintain. They did not know that they weren't allocating enough time to testing, or whatever. They didn't know.
You know, they didn't know these things, and it's your job not to just say it once, but to be a squeaky wheel, to be constantly providing the right amount of signal so that the decisions that get made are good. Communication is absolutely your job. So, in conclusion: all this stuff is changing, but I actually think it's going to be all right. I think it's going to be good.
I think the shape of the changes all points to a more democratized, more empowered, more necessary labor force of engineers, who get to come to work and bring their full creative selves and feel very invested in what they're doing. Every engineering org has two constituencies, and this really matters: nines don't matter if users aren't happy. But great teams build great systems, and burned-out people do not make great teams. So everyone should get enough sleep, everyone should feel like an owner, and we have the opportunity to make things better.