From YouTube: Chaos Engineering w GlooShot - Scott Cranton, Solo.io
Description
Scott Cranton presents on how distributed microservices introduce new challenges: failure modes are harder to anticipate and resolve. In this session, he presents a "Chaos Debugging" framework enabled by three open source projects (GlooShot, Squash, and Loop) to help increase microservices' "immunity" to issues.
Event page: https://www.meetup.com/New-York-Kubernetes-Meetup/events/262237646/
Well, I'll apologize up front: I've got a chest cold, so if I start hacking up a lung or anything, I apologize. It's all part of the show.
I joined Solo about six months ago. Before that, and I guess relevant to this group, I was at Red Hat, where I led the North American solution architects for all of OpenShift, which was Red Hat's Kubernetes distribution. So I dealt with that for a long time, and I've been talking to people and customers about Kubernetes for a very long time.
A lot of that was about Istio, and what I'll talk about today, I guess, is almost sort of the post-Istio world: once you have something like Istio, or any kind of a service mesh, what are the interesting things you can do with it? That's really what I'm going to talk about today.
And on the topic of giveaway stuff: Christian Posta also works at Solo. He's been a big speaker on Istio for a couple of years now, and he's got a book, Istio in Action, that we're raffling off if you fill out that survey. Somebody put down "734" and "63rd"; I don't know if those are favorite streets, restaurants, or bars, but anyway.
So, just a quick thing about Solo to put a little bit of context around this. Solo has, I think at various points, been sort of the number two or number three committer to Envoy, which is the underlying data plane for Istio. Has anybody heard of Istio? Okay, everyone. What about Envoy? Okay, pretty much everyone else.
So we've been dealing with that for a while. We've been building products like API gateways, which are essentially a mini control plane on Envoy, dealing with the traffic-shifting part of it. That helps a lot of people who are migrating into the Kubernetes world from standalone deployments, because we can route traffic around.
Then there's the idea of multi-mesh. I know: why would you ever do that? But people have situations like a lot of load up on Amazon, where AWS App Mesh gives a deep integration with the Amazon infrastructure, but they might have something that also needs Istio. How do they deal with that? Or they might have multiple clusters with multiple service meshes deployed, and they need federated policy management across them. We also help with that.
What this talk is going to deal with is: once you've got that kind of world, there are all sorts of interesting things you can do. If you've got Istio, or any kind of a service mesh (I'll talk about service meshes in general) with sidecar proxies, suddenly you've got a lot of observability into the network because of all the sidecars.
You can do very interesting things with that, and it can actually help you a lot with making your larger-scale microservice environments more resilient. That's really what I'm going to talk about today: the problem of when you've got a microservice environment. I've taken my nice, happy monolithic app, a single process, and I've now broken it into tens to hundreds of pieces.
It's a little more challenging in terms of how I have to deal with that, because you've taken what could still be a big app, but a simple one from a deployment-topology perspective, and you've turned it into a full-scale distributed application.
So now you could have lots and lots of things, and when you're trying to actually debug this or understand it, you've got lots of interactions. And if people are doing a DevOps methodology, each microservice is moving at its own speed. The things that you're connecting to are changing at speeds that maybe you're not aware of; they're making changes you're not fully aware of, but you're still dependent on them. So how do you manage a system like that?
And I love this quote from Leslie Lamport. I believe he actually went to high school in the Bronx, down here, though he went up to college in Boston. He talks about how a distributed system is one in which you can't get your work done because some machine you don't even know about has failed. And that's basically what a distributed application environment is: something, somewhere, is always broken.
Think of Netflix's Chaos Monkey, which has been talked about a lot. Netflix was seeing cases where random services would disappear, and they had to make their system deal with it, so they built an application to randomly turn services off. Then they built Chaos Kong, which would randomly turn off regions; it would simulate entire region failures. Why would people do that? Well, if you look at these pictures, and I know these have been talked about a lot, the connectivity graph between any service and any other service is mind-blowing. There's no way any person could know what services you're talking to and how they interact, and that's what I mean by the scale of this problem. And, to inject a little humor: we've taken our monolith and turned it into a murder mystery. Who did it? What broke? I don't know. It kind of becomes a fun problem, except when you're debugging it at 2:00 in the morning on a Saturday; then it's not so fun.
The other part of this challenge is that a lot of our tooling has not caught up to this. As computer scientists and engineers, we have a lot of great tooling that can deal with single-process debugging: we can understand that stack, we can debug into it. Or we have language-specific libraries; Netflix did a lot of this, and Go's got things like Go Micro and others. So there are language-specific libraries that can help with service discovery, retry logic, and so on.
Things like Hystrix performing circuit breaking. That's great, but what happens if I've got some stuff in Go and some stuff in JavaScript or TypeScript? Okay, so I introduce some language-specific libraries to help with retries and such. Well, now you're debugging the fact that those libraries do it slightly differently; the circuit breakers trip differently. Who's dealt with that problem? I know I've hit it, and a few people here have too. So you know what I'm talking about: the tools are a problem, the language-specific stuff is a problem, and then there's code modification.
You're changing your code to deal with the fact that something somebody else just changed broke, and if you're trying to deploy quickly, you're never getting ahead of it. So it's a problem. Okay, so I've now made everyone depressed: there's no solution, we should all just start drinking more beer. Or, you know, there are ways to deal with this.
I don't have enough time to talk about chaos engineering in a lot of detail, so I'm going to use the analogy of vaccinations. If you think about a vaccination, the idea is that I am intentionally injecting a little bit of a virus into a person; a production system, my production system, right, this is me. I'm going to inject it with the idea that my body is going to generate antibodies that help me fight it off when a real virus comes along.
So I'm going to intentionally inject a failure, a virus, into a production system to help make it stronger. That's really the core of what chaos engineering is, and that's what we're going to take to our distributed application: we need to intentionally inject small faults in production so that we can see how the system behaves. Does it degrade gracefully? If I lose a back-end system that I depend on, okay, I can't get whatever that capability is, but is it going to take down the whole system? That would be bad.
So how do I manage that? It's the idea of intentionally breaking things, and in short, there are a few principles. Netflix has a nice short O'Reilly book on chaos engineering, and there's a bunch of material out there, but there are two principles I want to leave you with and talk about. One: you want to automate these experiments in production.
It's great, and you should do all of the testing you'd do on a normal system, and all your staging. But like those graphs showed, you don't know what the interactions with all of those other systems are until you get the system into production. It's only going to happen in production, so you need to get systems, and a culture, where you're actually going to test these kinds of failure scenarios in production, automating that and running it continuously.
The second thing you really need to think about is how to minimize the blast radius. If you inject a fault into your system and it fails, okay, our experiment "succeeded": it failed. But you don't want it to take the whole application down. You want it controlled, so that you can experiment safely. Now I'll segue for a second: who's seen the series Chernobyl that's out now? I'll use a current topic.
I'll move away from my vaccination analogy. Recent analysis of Chernobyl suggests that the immediate source of the failure was steam pipes bursting because of a coolant failure. They were running experiments to increase the power loads, and while they were doing that, automated systems were fighting against them, starting to put the control rods in place. So they were trying to increase the power load.
That was triggering automated systems to bring it back down, so they had this fight going on between systems that weren't communicating. A third party saw these behaviors, said "something is going really bad in our nuclear reactor," and triggered an entire shutdown, which caused the cooling system to fail. We all know what happened from there. It's a horrible example of bad blast-radius control: an experiment went bad and something horrible happened.
So when you're thinking about chaos engineering: one, you want to think about failure scenarios you've either seen in the past or want to experiment with, and ask how your system is going to handle those failures; and two, you want to make sure that when you run the experiments, they're not going to take everything down. Hopefully no one here is doing nuclear engineering and running a plant.
Please don't do this there. But if you're going to do it, make sure you don't take down your whole application; make sure your tests have ways you can control that. Very important. The other thing I'll talk about, bringing it back to how service meshes fit into all of this: I introduced the idea that this is sort of the post-service-mesh world.
You can do these interesting things. Service meshes, in short (I'm not going to do a whole talk on them, though I'd love to talk about service meshes during the social hour): the idea, with Envoy in particular as the core technology, is a data plane. It basically took a lot of that retry logic, circuit breaking, and rate limiting and moved it to the networking layer, or what I'll call the application networking tier. It moved it outside of your code.
All of that common logic from things like Netflix OSS or Go Micro moved outside your code, which lets you deal with any language: everything is just interacting at the network tier at this point. It also gave us a lot of visibility into the network, in terms of who's talking to whom. And because the mesh is proxying all of the network communication, you can actually introduce failures into the network without tampering with the back-end service itself; you're catching it at the network level. That's one of the interesting things service meshes enable, among many others, and relative to chaos engineering it lets you do some very interesting things. And, kind of the opposite side of all this: as a service developer, I don't have to think about it myself. All kinds of good stuff.
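To make that concrete: this is roughly what network-level fault injection looks like in plain Istio, the layer that tools like GlooShot drive for you. A minimal sketch, not from the talk; the service name and namespace are illustrative.

```yaml
# Plain-Istio fault injection: abort requests to the ratings service
# with an HTTP 500, without touching the service's code or deployment.
# A minimal sketch; names and namespace are illustrative.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ratings-abort
  namespace: default
spec:
  hosts:
    - ratings
  http:
    - fault:
        abort:
          httpStatus: 500     # the sidecar answers 500 instead of forwarding
          percentage:
            value: 100        # apply to 100% of requests
      route:
        - destination:
            host: ratings
```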
So, GlooShot is an open source project; anybody can try it, at glooshot.solo.io. Don't try it right now, because I need the network for a second, but you can try it in a few minutes. It allows you to do those controlled experiments. It assumes there's a service mesh in place that it can take advantage of, and it uses the new Service Mesh Interface (SMI) standard API.
That's an abstraction we helped pioneer, and because it's there, we can work with any of the meshes: any that do traffic shifting and have observability capabilities, we can deal with.
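For reference, since SMI just came up: this is roughly what SMI's traffic-shifting resource looks like. A sketch only; the weight syntax has changed across SMI API versions, and the service names are illustrative.

```yaml
# Rough sketch of an SMI TrafficSplit: shift traffic for one logical
# service between versioned backends. Weight syntax varies across SMI
# versions; services named here are illustrative.
apiVersion: split.smi-spec.io/v1alpha1
kind: TrafficSplit
metadata:
  name: reviews-split
spec:
  service: reviews            # the "apex" service clients call
  backends:
    - service: reviews-v1
      weight: 90
    - service: reviews-v3
      weight: 10
```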
So we can run these experiments, and you can set what I'll call stop triggers: if something bad is happening in your application, stop the experiment in an automated way. You can define metrics against Prometheus, so you can say: hey, if my application starts going really bad, stop whatever faults I injected, and hopefully your system knows how to recover from that.
That's kind of where we're at at this point, and this is part of the set of open source projects that we've done. To give you a sense of it: we've got GlooShot, and then there's Loop. Because again we've got observability in the network, if requests flowing through a set of services start returning failure codes, what Loop will do is actually capture the entire call chain, so it knows how the request flowed through all of the different microservices.
It can capture all of the headers and bodies of the messages, so we can save any failures. If things work, it just throws away the data, but on a failure it can save it, which allows you to then debug it. Again, relevant to running chaos experiments: something bad happens, you can capture what went bad and then replay it later. Hopefully you're not going to do the debugging in production, because setting a breakpoint in your production system is a bad idea.
All of this is documented, so if you do want to try it yourself, you can go download it and play with it. Have fun; we love feedback. There's a public Slack channel that all of our engineers listen to, so if you find everything works, great; if you find things that don't work, let us know and we'll try to fix them. We love feedback on that front. The other thing I've got going over here is the classic bookinfo app.
It's running, and I've also got a Prometheus console up here so we can see some of that, plus my command line hiding down there. A couple of things just to give a little bit of context, going back to that small diagram: the bookinfo app has a product page, which calls out to a set of reviews services, and those call a ratings service on the back end.
The reviews service is going to say whether people liked the book or not, and the ratings service is actually going to say how many stars. We introduced a fourth reviews service (I should update the picture) that we're going to test against, to see whether it fails gracefully or not. We're going to introduce faults on the ratings service and see how the reviews services behave in the face of a back-end fault. Okay, cool.
So let me set this up for a little bit of context, and also test that my system's running. In Prometheus, this little graph is our exit, our failure case. We're basically saying: if we're seeing too many 500s out of the reviews service, the thing in the middle, stop the experiment. Within a minute or so, if we get too many, it's: stop, things have gone really bad, stop.
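That stop trigger is essentially a Prometheus query plus a threshold. As a rough sketch (the field names are approximations; the exact Experiment schema is in the GlooShot docs), it might look like this, assuming Istio's standard istio_requests_total metric is being scraped:

```yaml
# Approximate shape of a GlooShot stop trigger (field names are a sketch;
# check the GlooShot docs for the real Experiment schema). The PromQL
# assumes Istio's standard istio_requests_total metric.
failureConditions:
  - trigger:
      prometheus:
        customQuery: >
          sum(rate(istio_requests_total{destination_workload="reviews",
                                        response_code="500"}[1m]))
        comparisonOperator: ">"
        thresholdValue: 0.1   # stop the experiment once 500s exceed this rate
```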
Again, we don't want things to go that badly, but we are going to intentionally fail the back end; this is our failure case here. And everything we do in the Solo projects that are out there deals with things using custom resources, so we'll build those out, and they help control the deployed services inside of Kubernetes. So I'm going to try to be brave and actually use our live docs: I'm going to copy the command and paste it into the command line.
It's going to work. So now I've got an experiment running, and I probably should have explained it first, so I'll explain it while the test is running. What this experiment does is: I define the failure case, and then I say, hey, on the bookinfo ratings service, introduce 500s 100% of the time. So I'm basically completely failing the ratings service.
we're
seeing
up
in
the
book
info
and
I
reviewed
it.
It
actually
failed,
both
the
review
and
the
ratings.
A
So
not
the
stars
went
away
and
actually
the
text
for
the
review
went
away
as
well.
So
it's
a
complete
like
catastrophic
failure.
Review
service
just
dropped
very
bad
I.
Don't
want
that,
but
I
clicked
and
it
came
back
pretty
quick.
If
I
had
gone
quick
within
a
few
seconds,
it
actually
brought
it
back.
If I update my graph, we suddenly see this: the graph of 500s coming out of the reviews service.
It is a live demo, so all kinds of crazy things could happen. Down here in the text, hopefully big enough that you can see it (and again, you can try it yourself): it stores the results in the experiment object itself, and we see in this case (it's probably a little too small to read) that this thing is actually reporting a failure. It's saying that this experiment stopped because the experiment's failure scenario was triggered. So it's telling us the reason why.
It's actually going to keep all the results, the time snapshots of what happened. But in this case it said, hey, we triggered the experiment's stop scenario, and it tells us that. Okay, great. There are also report objects; in this case, because it stopped quickly, we're not going to see a lot, but I'll show you real quick. Oops.
Let me show you the contents; that might be more interesting. There's a report object: there's the experiment itself, which has some status, and then there's a report. In this case it stopped pretty quickly, but it would otherwise keep time snapshots of all of the data, so you can pull them out of the custom resource and work with them. Obviously, all the data itself is also in Prometheus, so you can use that to do some analysis around: okay, I saw failures; what happened? Okay, so that was a case of... yes? Sorry? [Audience question]
So it'll keep, I mean, Prometheus is running in your app, and for the experiment it'll keep the stats about what you told it to watch, the trigger cases, so it keeps that information as well. If you want things like the full call graph, that's why we're starting to build projects like Loop and others: to actually capture the request headers and bodies and the full call graph. [Audience question]
I should have introduced that up front, that you get swag. So now I'm going to try and see if I can get some more interaction; I've only got one more experiment I'm going to run. Now people are going, hey! I'll be around later, so I'll give out more stuff later.
So let me go rerun the experiment. The story is: we ran the reviews service, we tried our experiment, and we realized it does not fail gracefully at all when ratings goes bad. So let's roll back now.
In this case, all of the reviews services are already deployed; that's part of the bookinfo app, the classic example for showing off Istio traffic shifting. There are multiple reviews services, and we can send different loads at them.
So we're going to go back and update that. This little snippet over here is actually using SuperGloo, which is that management layer, the implementation of the Service Mesh Interface.
It allows us to define routing rules, so we can say: hey, for any service mesh that's under management, we can change the routing rules and it will translate them. In this case I'm going to introduce a change that says: route to reviews version 3. Then we'll rerun the experiment and see if that one fails more gracefully.
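For reference, the plain-Istio equivalent of that routing change is the canonical bookinfo pattern below; SuperGloo's routing rule expresses the same shift through the mesh-agnostic abstraction.

```yaml
# Plain-Istio equivalent of the demo's routing change: send all reviews
# traffic to the v3 subset. Assumes bookinfo's usual DestinationRule
# defining the v1/v2/v3 subsets is in place.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
    - reviews
  http:
    - route:
        - destination:
            host: reviews
            subset: v3
```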
So let me run another experiment. It's the same failure scenario: we're still introducing 500s on ratings, so we're still going to fail the back-end ratings service 100% of the time. We could make it an intermittent failure, but in this case we're making it a catastrophic one, and we're going to run the experiment for 30 seconds. The reason I say that is that I'm hoping this experiment is going to go a little better.
While we're doing this: now I've got the experiment running, and if I refresh the page, we see up on the screen that the reviews are still coming up (it's The Comedy of Errors, "an extremely entertaining play"). The reviews are still coming, but the stars aren't, and we're actually seeing a graceful failure, because it says "Ratings service is currently unavailable." So this is a graceful, circuit-breaking kind of scenario: the back-end ratings service has completely failed, but the reviews service is still working. It hasn't failed entirely.
That's a good thing: we've gracefully failed our application. This is exactly the kind of thing we want to test in terms of the interactions. In this case I'm injecting 500s and completely failing it, but what if I introduced high latency? What if I introduced a 10% failure rate on ratings? How is it going to behave?
This is exactly why you do chaos engineering in production: to test what those interaction patterns are going to be. So, 30 seconds have gone by, our test has stopped, and we're back to the little stars. Let's go look and see, from this different experiment, what the results are.
In this case the result is that the experiment succeeded, which means it ran the full 30 seconds. And if we go look at Prometheus and re-execute the queries, we see (you can't really see it, but it's a flat line at zero) that we saw no 500s out of the reviews, so we didn't trigger any failure.
In that sense the experiment is considered a success, because we didn't trigger the experiment's failure scenario. But now we've got data about how it behaved, and we also have insight into how the reviews service, in this case, gracefully degraded when the ratings service went away. So it gave us some insight into how it's going to behave in a production environment.

[Audience question]
Custom resources; so those things I was showing that you couldn't really read on the screen? Yeah, we've got a bunch of custom resource definitions. In the case of GlooShot, we've defined an Experiment custom resource and a Report custom resource. That way you can configure things: what I was showing is that when you're configuring the experiments, you're just defining values in a custom resource.

[Audience question]
In theory, we could do that. At this point we were just looking at introducing network failures, since that's a very common scenario. Data corruption, we could look at some of that; some of it starts to fall into functional testing as well, because it's about how you're going to deal with that. But yes, that is a scenario you need to deal with, and it's certainly possible: Envoy allows you to do request and response transformations.