From YouTube: Cloud Native Social Hour - April 26, 2019
Description
Join @apinick & friends at 2:00pm PST as we talk about what it means to introduce chaos to your environment and how you can use it to improve your performance.
A: I know it's sad. That being said, but yeah, the Seattle office is going away very shortly, and so this is the last time we're doing it here, and I know it's a nice space. We've got a pile of junk over here that we never really ended up using, but I think that'll be joining us in Bellevue somewhere, stuffed in a corner, and yeah, we'll be doing it out of various conference rooms in Bellevue from now on.
A: We're going to be moving around then, but also I'll be going on vacation sometime around when we're moving, and so logically this is the last time. Yeah, too bad. It'll be good to kind of have a new home when we get settled in there. All right.
A: So this time on the Cloud Native Social Hour, we are talking about chaos. As I said in an internal email, we're eschewing the gaudy trappings of stability and introducing a world of randomness to our infrastructure.
A: I actually haven't had too much of a chance to look into a new project or anything that really intrigued me. Actually, the last time we kind of went over this segment of things that interested us, it was around chaos. All right, what's up, Mark? It was around chaos, and actually, Duffy, you're going to be doing a demo on that very topic. Oh yeah.
C: Hot topics for me lately: I've been having a bunch of weird network timeouts. So of course, you know, networking in cloud native is always good, but also we're talking about storage lately, and persistence.
C: And then also there's a thing coming up through the CNCF; they're doing some sort of webinar about persistent services. So that's me kind of getting into that a little bit. Oh, I know there's Rook out there, which is, I mean, it's on top of Ceph, but I was wondering if there was anything else out there.
A: That is actually good; I have a good answer to that. I've actually never gotten Rook to work. I know that the team has a lot of really smart people and I think that it's a great tool, though personally I've never gotten it to work. So one that I have turned to if I need something like Ceph is OpenEBS. I keep talking about it, and I'll continue to keep talking about it: OpenEBS is a really cool storage tool.
A: Basically, the way it works is it uses the ephemeral storage of each pod to create a storage cluster, and what happens is, when a pod dies, a new one comes up and the storage cluster migrates the data and makes sure your volumes stay consistent. So you have this kind of cloud native storage solution that can grow and expand with the number of pods in your cluster. That being said, it also takes compute and other resources that your workloads might be competing with, but I think it's a really handy tool, particularly for prototyping.
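[Editor's note: for context, here is a minimal sketch of what consuming OpenEBS looks like from an application's side, a PersistentVolumeClaim against an OpenEBS storage class. The class name and size are assumptions and depend on how OpenEBS was installed.]

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-data
spec:
  storageClassName: openebs-jiva-default   # assumed OpenEBS class; verify in your install
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
```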
D: There are others out there. I'll second the vote for OpenEBS, and for Ceph; they're both pretty decent. Really good if you're looking to pay somebody for storage, which is probably not a terrible idea, because it's storage, so it's obviously going to be pretty darn important to you in the long run, or in the short run. Either way, there are some other ones that I've seen customers have pretty reasonable success with, things like Portworx.
D: You know, yes, actually, so another one: they probably talked about it a little bit earlier this week, but there was a blog post on the Kubernetes blog that we found.
D: I'm super impressed by this idea. I haven't played with it yet; it's definitely on my list of things to do, but I love the idea of it, because effectively what kube-iptables-tailer tries to do is: if you get a log entry or log line for failed packets at the underlying node, they've written into this daemon set a mechanism that would allow you to see the event of a dropped packet for any given pod.
D: So if you're seeing packets being dropped, maybe on purpose because of network policy, or maybe you're seeing packets get dropped because of, you know, any number of different things, this seems like a pretty interesting way of exposing useful information up as the event stream. So I was kind of impressed by that, and they're going to have it exported too. I was really kind of impressed not only by this idea, but also just by the design pattern.
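[Editor's note: since kube-iptables-tailer surfaces drops as Kubernetes events on the affected pod, you would read them with ordinary event tooling. A sketch, with a hypothetical pod name:]

```sh
# Packet-drop events raised by kube-iptables-tailer show up alongside the
# pod's other events; "my-pod" is a placeholder name.
kubectl describe pod my-pod
kubectl get events --field-selector involvedObject.name=my-pod
```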
B: If you look at that same page, there's another blog post that was written about what Box is trying to do for the control plane. That's a pretty nice function; they were looking at doing things like reporting errors from the control plane to the applications using it. It's kind of up there, yeah, this one.
B: ...if your application is using this much CPU and this much RAM, then put a dollar value on it, so that it can be charged back to whichever of your lines of business is deploying that particular application. So that's another avenue, exploring this showback. I also came across a small company doing Kubecost; Kubecost is another way of doing showback in Kubernetes. And if you look at the in-house solutions, we have something, what is that called...
A: And I'm glad to see that there's building action in the community on that, because this has been an open question. When I said this is a classic question, it's something that really hasn't been solved yet. A lot of people handle it through Prometheus or Grafana metrics: you know, you can take the Prometheus metrics and multiply them by some amount, and that's the amount attributed to the resource.
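[Editor's note: the "multiply the metrics by some amount" approach might look like the following PromQL; a sketch only. The metric comes from cAdvisor via Prometheus, and the $0.05-per-core-hour rate is made up.]

```promql
# Rough per-namespace CPU cost: average cores consumed over the last hour,
# multiplied by an assumed dollar rate per core-hour.
sum by (namespace) (rate(container_cpu_usage_seconds_total[1h])) * 0.05
```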
A: So there are issues with that, because, remember, we were talking about this: Prometheus loses this data. If your step of data is big enough, it will just drop data, and it's not built for that purpose; it's more about floating-point time series. So I'm glad to see there are some other companies that are working on this. And Thanos, speaking of, Thanos is another player in this arena, right? It's another offering for Prometheus, it's like long-term Prometheus, yes.
B: Yeah, it also does the showback aspect of it, which is great, and that's also another solution. We were writing some custom scripts to pull data from long-term storage, like, if Thanos actually works, yeah, we were looking to use Thanos and then use custom scripts to pull out data. That would also be another solution to look at. But honestly, Operator Metering is in alpha, yeah.
A: There was a song last week called "Drunk Drivers/Killer Whales." The killer whales, it doesn't have much to do with the song; they just, like, scream it at one point. It's by a band called Car Seat Headrest, and it's just this really fun, kind of cool song. It is fun, I find it fun, but it's actually kind of about a person who is going through what sounds like a breakup, and they're convincing themselves not to drive drunk, and they're like, you know, screwed up and terrible.
D: Yeah, listen, Justin, I like blues stuff lately, but I wouldn't say I have any particular ones that jumped out at me. Yeah, I've been on kind of a Ray LaMontagne kick, which is nice; he's got some really good, really good stuff. I remember when I first heard Ray LaMontagne sing, I had a very different, I was trying to picture this dude in my mind, just by listening to his voice.
D: We try, we have, we try to categorize the things that we see here in terms that are reasonable for us, right? And I know they might be totally off. It's always kind of amazing when you think that you have something figured out and you're like, no, I wasn't even close.
A: I think that's a good point, Duffy, and it's going to drive us directly into the topic of the discussion, which is chaos engineering. A part of chaos, and a very important part of chaos engineering, is data collection and monitoring, and I think that's something we're going to talk about in just a second. If you have the items available, can you add them to the Zoom chat, so they can go into the part of the HackMD called "Twitter handles for gratuitous follow requests."
A: Right, so this week we're talking about chaos and chaos engineering. I first became familiar with chaos engineering when I heard about the way that Netflix handles their infrastructure. They have this kind of interesting idea, and I originally heard about this when I was working mostly with VMs, back when I was at Red Hat doing...
A: ...platforms and RHEV-M, the Red Hat Enterprise Virtualization Manager. One of the things that Netflix was telling people was the idea that no VM should last longer than a day, and they handled that by inducing chaos. They have a mechanism that would periodically just go and destroy random VMs and see what happens to the applications. Ever since then, that's been something that's been kind of on my mind to check out, and now, recently, I'm really starting to dive into it.
A: Most of us are familiar, but it's worth going over it again. With chaos engineering, the idea is that you introduce logical or predictable chaos, instability, into your infrastructure and into your applications so that you can see what happens. Basically, the idea is: how do you measure the stability of your application or your infrastructure if it isn't tested at all, right? So if you have infrastructure that never goes down, what happens when it goes down? Are you able to recover? What's the mean time to recovery?
A: All of these things are important to test out, and if you don't do it, if you're precious with your infrastructure or your applications, something catastrophic will happen, because we live in a wild world (by the way, that was a cat), and you will get the rug pulled out from underneath you. So this is something that I think is very important, and we don't see it a lot in practice.
A: I totally agree. There's something called day-two operations, and it seems to me a lot of people never get past day one, day one being setting up your infrastructure, setting up your clusters, deploying your applications. It seems like people don't really get past that; they never get into day two, day two being monitoring, logging, and so on. And it's something I want people to be braver about: jump into instability, have it be part of your day-one or day-zero operations.
D: I agree with most of what you said. I think another way that I kind of internalize what chaos engineering is trying to do is that all of us are in the field of computer science, and we should be applying science to the work that we're doing. We shouldn't just throw it out there and see what happens, see what sticks.
D: That's why I feel like the chaos engineering piece really kind of comes to the fore, because you might develop a theory, a pretty obvious theory, that if you have two instances and one of them goes away, you're going to lose half your traffic. And when you get into the science part of that, it's probably true that you're going to lose about half of the availability of that service, but for how long, and which servers?
D: And what does it look like empirically when the deployment spins up that new instance? Understanding the characteristics of these applications, especially at scale, is critical to your success in this containerized world. And that's where, I think, the chaos engineering really comes into play. One of the things that I'll point out, though, it's interesting: the engineering part of chaos...
D: ...engineering, I think, is also pretty important, right? Anybody can write something that will introduce chaos, or trouble, into a system, whether the thing you've introduced is that you're just going to randomly kill pods, or that you're going to increase latency, whatever those things are. That's all pretty easy to actually generate, to, you know, insert errors into a system.
A: Otherwise it's kind of silly, right? How do you know that the test occurred, or whether it was something that's just kind of outrageous, right? It's like, well, what happens if, you know, suddenly, how does our application handle it if we introduced a trinary counting system instead of a binary counting system? Well...
A: ...a counting system, yeah. If there's no reason for that piece of chaos to occur, or there's no way to know the chaos occurred, it's kind of useless, actually. It's just kind of acting like a child with a magnifying glass at an anthill, just killing things at random.
A: Everyone knows, and they're tracking everything: rebooting hosts, isolating networks, introducing latency, etc., and then they judge how those specific services react, and that team can then take that back and fix that stuff, right? Their first posts on it on their blog were around 2013, so they've been doing this for a while. It is a pretty cool process in that, like Duffy was saying, it's not random; it's a process that they do every Friday with specific services and teams.
A: Most teams introduce chaos by trying to deploy to production on Friday, and that's how they introduce it. But no, I like the idea that instead of deploying anything on Friday, they're going to try and break something. That's a lot of fun, I think, yeah. But it's not just breaking random things; a specific team's service is targeted, and they're running, you know, tests that are published in advance, kind of thing, right? Like, we're going to do these things to your service.
B: I think Rich Lander had done it where, if a pod doesn't get killed in that first time frame where there was a 25% chance, the next time there would be like a 50% chance that you would kill a pod. So they would introduce chaos into the chaos itself. Oh, that's interesting.
A: Yeah, I dig that, so you never know exactly when it's going to happen; that's pretty cool. Something going along the lines of what we're talking about is the need to measure the effects. Chaos for chaos's sake, as we said, is kind of useless; having some control around your chaos, like knowing the time frame in which it's going to occur, that's valuable.
A: It depends on the philosophy you're going to define: do you want to live in an unstable environment all the time, but make sure that these things are measured, or do you want to do it only during a set time frame, so you can get all that data very quickly? I personally kind of enjoy the idea of living in an unstable environment. At the same time, I also apparently don't like having free time, or, you know, a weekend or anything like that.
A: So I like the idea that if you are constantly living in that environment, the application that you're developing, if you're doing iterative processes properly, should be more and more bulletproof as you go on. I'm kind of adding to the agile methodology a little bit: if it doesn't survive this test this time, how do you get it to the point where it survives?
A: A lot of people use monitoring tools, like Grafana, or Datadog, or Prometheus under Grafana, or any of these other monitoring tools, and keep a track record: okay, during this time frame, when we knew that chaos would be happening, what happened to our application? Did we lose X amount of traffic?
A: Did we lose X amount of product served, those types of things? This is something I actually don't have a lot of experience in, I'll be honest. I'm only recently getting into chaos engineering, and also more into monitoring development. I've only really started playing around with Grafana. I love Grafana, I actually very much enjoy it, but I don't know the mechanism well enough to create, like, a good dashboard, and I'm kind of curious.
C: Errors, latency, throughput; there's one more I'm not remembering. But, like, you know, if you're monitoring for those things, the impact of the chaos will become pretty evident: errors start to go up, okay, so when we did this thing, you saw spikes, or, like, my 95th-percentile latencies have gone from, you know, 700 milliseconds to, like, five whole seconds.
A: There's the idea of, like, errors, these things you need to track, these kind of well-known paths: errors, latency, pod health, or just application health, what processes are running. These are the things that are kind of key triggers. But for each application you can get more specific around your chaos engineering, like: these are the metrics this application cares about, right? Did you introduce chaos into your database? How many rows changed over what time frame, right?
A: That sort of thing, so we're going to add more complexity to the monitoring. One of our colleagues, Bryan Liles, recently tweeted about how many people think monitoring, or observability, is just a product to buy, and how many people think that it's a mathematical function, kind of, you know, bringing the math back into observability. I saw that and I was like, hey, shut up, I don't do that.
A: How dare you call me out so directly, Bryan, damn. You know, but it's interesting to me, because up to this point observability has mostly been tools, but I like the idea that it's actually the math. So you see, like, aggregate errors over time, or even compute an expectation over time, doing this rigorous mathematics around your system to see truly what's occurring. I find that endlessly fascinating, really, really cool.
D: It's come to mean a few different things, and it kind of depends on, I guess, the consumer of what that term might be, right? So from the perspective of somebody like Jessie, or, you know, somebody who's doing, sorry, site reliability engineering, that sort of thing, observability means being able to trace some percentage of transactions that move through the system in a way that gives you the capability of understanding how each of the components that are part of that system is operating, so that you can actually find a needle in a haystack.
D: You can actually find where things are broken. It isn't just metrics, it isn't just monitoring; it's more than that. It's, you know, being able to actually instrument your code in such a way that the places where you make calls to systems that are external to your own are measured, right, and in a reasonable way you understand: okay, well, I know that there's an error in the system, my availability is down.
D: There's a ton of stuff in this space right now, and it's all pretty interesting. There's some interesting work being done by the OpenTracing folks, and there's Zipkin and Jaeger, and all of these things kind of fit into this space, trying to wire all that up. But I would say, yeah, for me it's definitely a loaded term. I think it's definitely more than monitoring, it's more than metrics, it's more than logging.
C: It's a startup still, but Charity Majors, who's talked a lot about monitoring and observability and stuff like that, you know, she's the CTO, and it's really the...
C: So where it came from: she was at Parse, which got bought by Facebook, and then once she moved to Facebook, she saw that they had this tool, I think it was called Scuba. Basically, once they had instrumented their application, they were able to diagnose problems within minutes, versus with the limited tooling they had at Parse it took them hours or days. And so she wanted to take that to the larger...
C: You know, she wanted to kind of make it available to everybody, and so she and a team got together and wrote a specific data back-end to do high-cardinality searches and queries and stuff like that, which is really what they're building their whole message on: you know, we're here to let you search high-cardinality data.
D: I know that I have a bunch of 200-level responses, but I don't know if it was a 201 or a 202 or a 301 or a 204; there's a lot of detail that gets missed because it's missing there. And so the value of a high-cardinality database is that you can do things like say: you know what, I want to gather all of the information at great depth and be able to...
D: ...query a specific metric for that application, for this brief time, and get into that detail. And this is where I think, you know, I think Honeycomb is on the right track, actually. Another good friend of mine, Ben Hartshorne, is there, and he's been there since the beginning, and he came from Linden Lab, kind of solving that problem there as well. So yeah, they're on the right track. They're looking at the problem, in my opinion, correctly, but it's tough.
A: Have we defined what the goal of chaos and monitoring and all these things is? It's confidence. Confidence is what we're trying to test for, right? Do you have the confidence to do your job the way that you want to do it, right? Do you have the confidence to be able to recover from an outage, or to withstand an outage, right? All of these things give you confidence, and a lot of times, I think, when we deploy applications, we don't have that confidence.
A: I woke up this morning and two of my nodes had restarted, and they were two of the three of my control plane nodes. And so it was just like, crap, I didn't have any time to go back and figure out why they restarted. Once they restarted, I also hadn't set up the mechanism to rejoin them to the control plane.
A: What if two of those nodes were my etcd nodes, right, and then suddenly my etcd is degraded? You have no confidence in that cluster being able to continue to run, because I have no monitoring around these events that occurred; why they occurred, no idea. So what got introduced to me was pointless chaos. Yes, yeah, no function behind it, and now I'm just sitting in the lurch. If I were going to production in this environment, I would be tearing my hair out right now.
A: This sucks, this feeling sucks, and it's just some dummy cluster I set up earlier this week. Imagine if this was around my entire business; an awful, awful feeling, to have no confidence in this. And so these tools and this practice help drive confidence and resiliency. As was just being said, chaos engineering is resiliency engineering, right? It's: how do you stay up?
A: That's the other aspect. There's a final aspect to chaos engineering that doesn't get executed on a lot, but I think is the most crucial part: improvement. It's constant improvement. So chaos engineering gives you confidence; for the example of my cluster that just went down, what do I do now? What do I do now, now that I've seen that this failure occurred? I want to resolve it, or I want to develop automation, to make it so that it doesn't happen again.
A: So if these nodes go down, my cluster stays running the entire time. Or if you introduce chaos and there are these threshold spikes, do your applications survive those threshold spikes, right, this infinitely growing use of network, or infinitely growing use of memory? How do you kill the pods that you don't want? How do you mitigate your traffic such that it avoids that erroneous endpoint? Chaos engineering should lead to improvement of your infrastructure and your applications through these tests.
A: Right, if your hypothesis is, like, "I believe that my cluster is stable," and then I test it through chaos and find that it's unstable, or it doesn't lead to resiliency, that means that my test failed and I need to introduce steps to fix it. Really, that's the part that I really, really like. I love the iterative growth that chaos can quickly give you, right: these are outages that you control yourself, versus outages that the world controls for you, right?
A: Let's say you have an application that's running in US East, it's 2013, and US East went down for three hours because somebody fat-fingered something. That is chaos that got introduced for you, and have you resolved that, right? That instance doesn't happen that often, but if it were me, I'd go: okay, cool, now I see that this application, or all of my applications, need to run in many regions, and then how do I facilitate that through automation, through infrastructure, right?
C: There was actually a lot to look at, especially around chaos engineering, with regards to that particular S3 outage you mentioned. Mm-hmm. Because, you know, if you think about it, it was done actually during the day, as part of a normal operation. But, like you said, someone gave it an input that caused the system to kind of go down. And people were like, oh, you know, you should validate input, you should be...
C: ...looking at the system. You know, when you start to introduce this chaos, especially if you've given yourself the allowance to maybe introduce some of this lack of reliability, right, I'm not going to say error budgets, but that's kind of what I'm talking about. Because what they also learned is, when they tried to get the underlying systems that supported S3 back in line...
C: ...they had to work under that type of situation. So this is kind of something else that comes out of chaos engineering: you're doing it during the day, you're doing it when everyone's there, versus with actual chaos, whenever it's going to be, at two o'clock in the morning, when you can't find the person who's on call for the thing that's not responding. Yeah.
C: The way that you get better at incident response is practice, right? That's why anyone who, like myself, has dealt with incidents in the past doesn't worry, and why we're, like, sort of capable of jumping into an incident, jumping into, like, an emergency, and being able to have kind of a calm approach to it. It's like: I've been here before, versus the frantic, oh my god, the world is on fire, in the middle of the night.
D: To brush on two of them real quick. One is: it's true, regardless of what you're doing, that you're going to be good at what you practice, right? This is how we become good piano players, good bicycle riders, good speakers: we practice it. You're absolutely dead on target for that one. The other piece that I would like to highlight is that I have found myself in the position over the years where I'm trying to teach someone how to troubleshoot, and I've never really thought about it from this perspective.
A: This is a moment where I don't necessarily intend to be a gatekeeper, but I kind of will be, a little bit. I feel that you should be hard-pressed to describe yourself as an SRE if you don't practice any of these things. If you don't embrace chaos, if you don't use the things that you need to maintain site reliability, I feel you'd be hard-pressed to actually describe yourself as an SRE, regardless of what your job title says.
C: It's not our job to be gatekeepers; it's our job to be enablers, right? You know, we're taking two different perspectives here. There's the perspective of, okay, we've got an engineering team that wants to do this, and then we've got the other perspective of...
C: ...if we can compromise a little bit, we could have it done in two, you know, and stuff like that. And also, you know, kind of relaying: we do it in this way because it provides a more reliable system. The thing that came out of this blog post is great for a dev environment, but won't work in a real, you know, production system; kind of advocating for the, you know, different qualities of a production system, and making sure that that's understood by everybody involved.
A: I haven't gotten through all of it, but every part that I've read is solid; it's kind of a page-turner for being a technical book. It's actually pretty interesting to read; it's written in an interesting way, I should say, as a lot of technical books can be dry and boring. One thing you kind of touched on briefly is the idea of error budgeting. This is something I want to touch on as well before coming out of my gatekeeping corner.
A: If you have an SLA or SLO of, like, X number of nines, and you don't measure any of those nines, where did that come from? What does that define? And if you say this thing is five nines, really what you're saying is that you want 100% uptime, but you don't want to say it, because a lot of times that's not cool anymore.
C: Yeah, I mean, I think everyone understands that stuff can't be available 100% of the time, that, you know, stuff sometimes just doesn't work. That's the way life is, right? Mm-hmm. But I think what is subtly understated, and just not understood, is best condensed into the phrase: if you want to add another nine, add another dollar sign. Because the more uptime and reliability that you're trying to get, the more it's going to cost, because, you know, that means now...
C: ...you have to run an additional replica of your database in a different region or a different availability zone, you know, and it starts becoming more and more costly. But with regards to that, I wanted to break down real quick SLAs, SLOs, and SLIs. So the SLA, that's the service level agreement; people in product and legal and all these people, who are mostly not engineers, come up with that number. They kind of come up with, you know...
C: ...when do we have to start giving people money back for being unavailable? That's usually the way that I think of an SLA. Engineers typically should not be thinking of SLAs, because their number is the SLO, the service level objective, and those are what you define as, like, you know, give yourself some padding between the two. You know, if you're at, say, 99.9%...
C: ...you figure out how to get the project back into budget. This is like a budget that resets itself every month, yeah. And so those SLOs, now, those have to come from somewhere, and that's where your SLIs come in, your service level indicators. These are the metrics, or, you know, whatever you're going to use to determine the number that the SLOs are based off of. And that's, I think, you know, where the error budget is, right? It's the area between where your SLO is, like...
C: ...you know, our requests in the 95th percentile can't take longer than, like, two seconds to render a page in a web browser. You know, something like that is what defines your SLO, which, you know, should be supporting the SLA, which is like: if no one can log into the site for more than 45 minutes a month, then we have to start giving people money back. You know, yeah.
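[Editor's note: the arithmetic behind those numbers: allowed downtime is $(1 - \text{SLO}) \times \text{period}$. For a three-nines monthly target, $(1 - 0.999) \times 30 \times 24 \times 60 \approx 43.2$ minutes per month, which is roughly the "45 minutes a month" figure above; five nines over a year gives $(1 - 0.99999) \times 365 \times 24 \times 60 \approx 5.3$ minutes per year.]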
A
Which
is
I
thank
you
for
breaking
that
down,
because
that
was
something
I
actually
forgot
about
this
Oh
eyes,
and
you
know
that
those
are
the
metrics
that
you
care
about.
So
when,
in
the
process
of
like
deploying
these
things,
requests
Masai
people
masa,
it's
like,
oh,
what
are
the
metrics
we
care
about,
like
I,
can't
define
that
for
you.
A: Well, I don't know what metrics your application cares about, or how to define these things. And so it's good to have that understanding, like, the breakdown of an SLA, SLO, and SLI, so thanks for running through that. That was something that I forgot about, and this is why I don't call myself an SRE, because I would totally be a bad one.
A: The kube-monkey tool is a take on Netflix's Chaos Monkey. Like we were talking about at the beginning of the show, it's the idea Netflix introduced where they restart their VMs every so often, and that's controlled by a tool called Chaos Monkey; basically, it's a monkey throwing a wrench into the operation. So kube-monkey...
A: Basically, what it does is, every so often, depending on how often you set it for, it will destroy X number of pods, or all pods, for an application that has opted in this way, basically every hour or however long you've set it. The reason why I think this might not be a particularly valuable demo, or, like, it might not have all the information I wanted to show, is that you can only set it up in increments of an hour, and I couldn't have it start running right when I wanted. I finally got...
A: ...my system set up, because I've been a busy boy, right around the time that we started the Cloud Native Social Hour, and you can't set the time for kube-monkey to start operating at the same time that it starts. And so I had to start it and have it run at some point within this, like, hour timeframe. So I don't know; hopefully it'll have killed a pod by the time we get to my demo, we'll see. But there are some pretty graphs to show you.
A: Alright, let's see how this works. Cool, alright. So here on the screen we have this Grafana page. This is something I set up using kube-prometheus. This isn't a demo of Prometheus, although it's pretty awesome; if you have a chance, check out kube-prometheus, it's in the CoreOS repo, and this is just using the example manifests that are there. It provides you with Grafana and Grafana dashboards that are fairly useful, kind of giving you the building blocks to set up your own monitoring system.
A: What that was, was me setting up the kube-monkey labeling, and I'll get into that in just a second. So I created the deployment, and then I modified the deployment to enable testing on these pods, and so the old pods had to die, and then the other ones came up again. Monitoring is really awesome: you can see just the kind of flow of your infrastructure with this very simple tool, and I find it endlessly fascinating. Anyway.
A: All right, there we go. Can everyone see that, can you read that pretty well? You can see what I'm running here. So, jumping into the kube config: this is the ConfigMap for kube-monkey, and this is the way you define how it works, essentially. It takes a TOML-type config file, and it starts running the destructive tests based on the run hours.
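[Editor's note: a sketch of what that TOML config looks like, with field names following the kube-monkey README; the values here are illustrative.]

```toml
[kubemonkey]
dry_run = true                  # log intended kills without deleting anything
run_hour = 8                    # build the day's kill schedule at 8am
start_hour = 10                 # earliest hour a kill may happen
end_hour = 16                   # latest hour a kill may happen
time_zone = "America/Los_Angeles"
blacklisted_namespaces = ["kube-system"]   # never target these namespaces
```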
A: So, when you deploy kube-monkey, there's a repo for it; it's pretty easy to find, and I'll add the link to the HackMD. The image is under the same repo, and it is just a simple deployment; they have the manifests there. So you deploy it to kube-system, and it'll run and, you know, output this log here. And so that's...
A: So this is memory usage. You'd see that the memory usage for one pod went down and the other came up, and regrettably, that didn't occur. But that's kind of the nature of randomness and chaos: it is unpredictable, and it's only been running for a very short time as a testing mechanism, so it didn't quite pan out. But the reason why I want to show you this is that this is actually kind of the beginning of chaos engineering and chaos tooling.
A: That kind of didn't really kick off, but a good first step, if you want to start introducing chaos into your infrastructure, is a tool like kube-monkey, where, in a controlled fashion, between these hours, on these days, in these namespaces, you can say which pods you want to destroy, and how you want to destroy them, or, like, how many you want to destroy.
A: That's right there, there we go. So here we go: for every deployment, and for every pod, that you want to be targeted by kube-monkey, you need to add these labels: kube-monkey/enabled, and the identifier, here monkey-victim. The kill-mode is fixed; the kill mode can also be things like a percentage, or random, there are a bunch of these configurations. Fixed basically means it'll kill only the kill-value number of pods, so it'll only kill one pod. And then mtbf is the mean time before failure, or between failures.
A: So that's a value in number of days that it'll wait before trying again to destroy pods in this application. You need to add them here in the labels for the deployment, and then down here in the template labels for the pod as well: kube-monkey/enabled, identifier as monkey-victim. The identifier can be any string, by the way; it just needs to be unique. And then, however you want it to be destroyed. Oops, and then my screen, what happened, oh no, Zoom suddenly...
A: As for my demo, it was just a very simple tool to use, but it's a very good first step to get into chaos engineering in your environment: very simple to use, very easy to control. But as we saw, there's an element of chaos involved in the use of it, and I wasn't able to demonstrate anything particularly...
A: I wasn't able to get that set up with kube-monkey, but what I wanted to show with the destruction, what you would have seen if it had shown off, is that the application itself, the website, should have been resilient through the destruction that I enabled; I just didn't get there.
B: Have you read about, you know, those iptables rules, and destroying those iptables rules? There is a value inside the manifest files, a default of about 30 seconds, before which a pod can be rescheduled, and not everybody, well, everybody just takes it as a default, right? You just don't go around messing with these manifest files. These chaos testing tools will tell you exactly what might happen to your production systems. If you kill kube-proxy, you will immediately know that for a pod coming up...
B: ...it's going to take like 30 seconds, the default, before you can have it up and running. So those are some of the things that you would uncover when you do this sort of chaos testing. There's a pretty nice blog post about Kubernetes chaos engineering lessons learned; I can probably link it. It essentially talks about the documentation around the various defaults, like, I think before 1.9 it's iptables, and now we have IPVS; what values can be changed, or can be defined, and what is your tolerance level, right?
B: The way it does rewrite stuff, the rules are refreshed in between 10 to 30 seconds. Sometimes the rules refresh in like 27 seconds, sometimes 25, whatever. But what is your tolerance level for that refresh rate? It will happen, it will refresh, but it's about your tolerance level. So you can go ahead and define: okay, I just don't want to wait 27 seconds; I want it to be refreshed at a much faster rate.
A: This is a perfect example of what not to do, because my monitoring had nothing to do with the chaos that I was involved with. I didn't show off resiliency in any capacity; it was just showing the effect, what hopefully would have happened. It was showing the effect of the chaos, but not the effect on the application, and so it was essentially useless. Well...
D: It can be deployed with Helm. Before I get too far into how to deploy it, I would like to share that in the HackMD I put a link to this repository, which is on GitHub, under github.com, slash my name, mauilion, under kind-chaoskube. So if you want to, you know, follow along at home, the documentation to do that should be here.
D: Okay, chaoskube: this is the repository, their GitHub link, linki/chaoskube, and inside of this they've got some pretty reasonable documentation about how to get started. They have a Helm chart that they maintain. They provide a number of interesting filters: you can filter by namespace, by label set, by annotation, by age. And you can have things like exclusions, right, so certain days, like, don't do it on Saturday or Sunday, just reserve your chaos for business hours, please. You can avoid times of day and days of the week here.
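[Editor's note: those filters translate into flags on the chaoskube binary. A sketch drawn from the linki/chaoskube README; exact flag names and defaults may differ by version.]

```sh
# attempt a kill every 10 minutes, business days and hours only,
# only opted-in pods, and leave pods younger than an hour alone
chaoskube \
  --interval=10m \
  --namespaces='default,!kube-system' \
  --labels='app=kuard' \
  --annotations='chaos=true' \
  --excluded-weekdays='Sat,Sun' \
  --excluded-times-of-day='18:00-08:00' \
  --minimum-age=1h \
  --no-dry-run
```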
D: It's got a pretty reasonable implementation. So that's their website. Before I get into the demo part of this, I want to show this page, which I thought was also pretty good, and it kind of speaks to what we were talking about, and I'm going to put this in the HackMD.
D: ...and it can actually also help you kind of instrument this thing in such a way that you can understand whether the application is responding in the way that you might expect. This blog post kind of gets into a lot of the topics that we talked about, which is, you know, like, what is chaos engineering, and how do you go about launching this. And the neat thing about this post is that it gets into, like, you know, the experiment idea behind it, right?
D: We're going to create an experiment, we're going to deploy our application, we're going to simulate steady state, specifically the load on the host, and then we're going to introduce chaos; in this case I think they actually ended up using one of the chaos tools we were showing earlier. And then they go back to Loadmill, and because it's actually the thing measuring the steady state for the application, it can provide this really pretty graph.
D: ...that shows what happened when interruptions happened to the application, which I thought was actually a nice little summary of chaos engineering the way that we've been talking about it in this episode: kind of, like, you know, generate a theory, test it, and if it didn't work out, make sure that your inputs make sense, dig into why it didn't work out, or why it did work, etc., etc. Good stuff. Now, a little bit about my setup: I am running kind.
D: The reason I'm running MetalLB in this case is because I needed a load balancer that would be in front of my service, right? I needed some mechanism like that so that I could show kind of the resiliency or failure model here. If I just went directly to the application, I would only be showing access to that one pod.
D: So in the repository I have a directory for MetalLB and how I applied it; that's all in here. And then also inside of this directory I have a directory called kuard, where I'm actually using kuard, the application from Kubernetes: Up and Running, with a health check and with a readiness check and all that stuff, kind of exposing it via the load balancer, and I'm annotating the pods.
D: ...with chaos equals true. And so in this case, this is actually how I configure chaoskube to kill these things, right? So I've deployed chaoskube and set it up in a way that it can find all pods that are annotated with chaos=true, and it will kill them. Now, side note: I actually confused the heck out of myself for a little while, as I imagine everybody does at some point, because I was placing this annotation at the deployment level.
D: ...which obviously doesn't do anything I wanted it to do, because, from the perspective of chaoskube, if it's not on a pod, it doesn't exist. Yeah, so I had to actually apply this down at the template level, and that took me a second to figure out. You know, it's a good flexible world we live in. So that's how that was.
D: The deployment kuard is actually the one that I'm running the chaos test against, and then kuard-safe is the one I'm not actually running any against, and you can kind of see that by the age, right? If you look at the pod age over here, for these four pods, I've got two that have been up for 37 minutes.
D: So what I was trying to prove, well, these error rates, and this is kind of getting back to sort of that histogram idea we talked about, and some of that stuff. Now, in my tests, in my environment, everything is local to me, so I'm kind of just beating up on my own little laptop here; you know, it's handling the load just fine. But there are a couple of things happening on the screen, on the left, that I wanted to kind of walk through. Okay, so...
D: You thought it was Vegeta, that's awesome, yeah. And then I actually got the jaggr thing from this page as well, because they were getting down into, like, how to produce graphs and stuff. I was actually looking at the real-time analysis piece here, and that's how I found jaggr, which is a way of taking the output and getting just the data you care about.
D: Let's kick those guys off and, you know, kind of look at what the output is, what we're looking at. And I have a couple of different interesting experiments to kind of walk through here, which I think will be fun for all of us. So on the left side, these top two, which are the kuard pods ending in -f4 and -d5, right, these two are running; you can see that up top, over time, and I can see the majority of my error codes, or my return...
D: I see a pod die, I see my 500 error rate go up. And so what this is actually testing is: what happens to the traffic that was being terminated on that pod when the pod suddenly goes away, when the pod just dies out of hand? What's going to happen from the perspective of, like, kube-proxy, or, I guess it would be kube-proxy...
D: What's going to happen when I have traffic that got routed to that pod, and that pod goes away during the middle of that call, right? So I'm issuing a thousand requests per second across a load-balanced IP address. That means, presumably, that when one goes away, my availability drops in half, and I see an error rate increase, because the traffic that was in flight for that endpoint is now erroring, and that's what I'm seeing in that error rate here: whenever we see a pod get killed, we see those 500s.
D: That is really awesome. So that was one experiment. The other thing that's neat, another great example of this, which I thought was actually kind of interesting, is that when I was setting this up, obviously, none of my seven nodes actually had that pod image, except, like, the two of them where these pods were initially deployed.
D: Yeah, and then down here on the bottom of my screen, which I should probably make a little bigger, what I'm doing here is I'm actually looking at the endpoints that are behind that VIP, right? So I'm doing a kubectl get endpoints here, which is a command that you can run against Kubernetes, and I'm...
D: ...looking at that particular service, and I'm watching those endpoints change. So as a pod goes away, or comes back, or a new one gets created, I'm watching the endpoints, you know, populate with a new IP. And this explains some of the other interesting behavior that we're seeing here, that we're not down for a really long time, we're not incrementing errors for a long period of time, because as soon as that pod becomes unready, we stop sending traffic to it.
D: It gets removed from the healthy endpoints of the service, and Vegeta, well, the load balancer, will no longer be able to actually route traffic to it, because it won't be in the rotation. So I thought that was pretty neat, and that was that part of it. Now, there's one more test that I wanted to share with you, which is actually kind of an artifact of the tooling that these folks wrote and of how Kubernetes runs things.
D: So on the left side, my theory was that if I kill a pod, I'm going to be able to capture some 500 errors, and I proved that I could. On the right, what I'm going to do here is actually force a pod to restart by having it fail a liveness check, and my theory is that I will not be able to see any errors, because as soon as it fails the liveness check, it will also fail readiness.
D: The pod will get rescheduled, it will come up on a new node, and it will enter the pool for readiness, for availability, again, which is a completely different lifecycle than what we saw on the left here, where I was not using health checks or liveness checks to actually handle this; I was just shooting it and seeing what would happen, right? And so, because it died out of hand, I was able to see those error rates increment, which was interesting. So let's try this out on this guy.
D: The kind-chaoskube repository, under my GitHub name, mauilion; it's all documented. Okay, here, you know, it's all out here. The Vegeta stuff is laid out here, how we're actually doing it, the reasoning; it talks about the instructions a little bit. It's not a super novel document, but I think it's enough to get started. So I hope that was interesting. Any questions, anything we want to try while we're here, anything crazy?
A: That was super interesting. I really liked that test. It was, I don't know, some really awesome way to show off, one, this tool, which is a really awesome tool, but also, like, one of the points of liveness probes and, like, the readiness checks, and all those things, which I think kind of get looked over a lot.
A: They're very valuable and useful tools in Kubernetes, the probes, and people don't leverage them as much as I think they could, and that was a cool way to show them off. But yeah, we simulated a failure, but the system itself was resilient enough to manage it in a way that the end user would not know about. So cool, really cool, dude.
A: We do, yeah, that's fine. Yes, thanks, everyone. By the way, if you haven't added your Twitter account, or anything that you wanted to talk about, or anything in the HackMD, that should be available, and we can add it to the channel. It's not useful to Jessie anymore; Jessie just bounced, yeah.
A: Happy Friday, everyone. I hope you all have a great weekend. Thank you for joining us on the Cloud Native Social Hour. We will be picking this back up in a little bit, usually it's every fortnight, but I will be on vacation next week, unless somebody wants to pick up hosting duty, and I certainly wouldn't mind that continuing. There was something else I was going to say, and I totally just blanked on it. Well, alright, thank you so much, and I hope you all have a great weekend.