From YouTube: Chaos Engineering WG Meeting - 2018-10-09
A: Right, so the agenda: as she's not here, I'm trying to do my best, so hopefully, you know, raise your hand if I miss something or if I'm not clear enough. Today we're going to have one demo, around PCF, from you guys, Karun and Ramesh, and then we'll have a quick landscape and white paper update; I'm not sure there is much there, but let's keep the discussion going. I think Chris was meant to do a quick recap of Chaos Conf, or hopefully there will be someone who can, you know, tell us all about the conference. I heard it was great, so probably you can fill that role, if you don't mind. Right, so let's get started. I think we are ready to roll for the demo, guys, so I'll leave you to, you know, share your screen; I think that's probably best. Sure, I'll stop sharing.
C: Do you want to start?

D: Yeah, sure. Thanks, Kevin. Good morning, folks. My name is Ramesh; I'm the senior engineering manager for the platform engineering team at T-Mobile. First off, thank you for giving us the opportunity to present here. Very excited to know that there's a community around this, and that we can now tap into an extensive community network as well as get help with what we're trying to do at T-Mobile. A quick background on me: I've been with T-Mobile for 10 months.
C: So, my name is Karun Chennuri and I'm a senior software engineer working at T-Mobile. I joined this team in March 2018, so I'm fairly new to T-Mobile. Before that I was with Huawei, as a cloud security engineer for them. I have about 13 years of experience in the field of information security and enterprise security. So yeah, that's pretty much it. Here at T-Mobile we take care of Pivotal Cloud Foundry operations, a DevOps kind of role, and we also have a Kubernetes in-house cluster; these two come under Ramesh.
D: So one of the things that I'd like to start off with is the Joker's, you know, analogy, how he interprets chaos in the movie The Dark Knight, and we kind of spoke about that at our conference, at SpringOne. The Joker calls it a fair act: every time you disturb the harmony of systems, good things can come out of it, right? And that analogy kind of started my thinking as well. My team built all of these capabilities on top of a massive infrastructure, and behind the scenes there's compute, network and storage, and things will always go wrong, right? I'll get to the actual problem statement, but I always like to start off any presentation with who we are, what we do, and what services we provide in the stack, to kind of drive you towards the problem statement, and then hand it over to Karun.
D: One of the capabilities in the stack is containers-as-a-service, which is our, you know, offering for Kubernetes, so people can bring their own containers; and then a platform-as-a-service abstraction, where people can just give us their code and, you know, we'll run it within our abstraction layer, and they don't have to worry about other capabilities, right? It's like a self-driving car. We give them the bells and whistles of operating their code at scale, and at the same time, you know, they get the best experience in terms of dealing with live customer issues. So, depending upon the kind of abstraction you choose, different flavors come in. That's really what my team does, in a nutshell. Let's move on to the next slide. One of the things that folks have asked me is: okay, so what's the big deal? Every company is doing this.
D
You
know
what
what's
really
in
my
portfolio,
which
is
driving
the
need
for
chaos,
engineering,
I'll
focus
on
the
fact
that
we're
building
our
services
on
top
of
on
Krim
infrastructure.
Today
there
is
compute
network
and
storage
we've
gotten
the
business
used
to
agility
already
in
the
last
two
years,
roughly
4,000
applications,
500
active
users
per
day
for
a
31,000
containers
across
development
and
prod
their
production.
For
me,
even
though
this
development
environments,
business
has
gotten
used
to
the
ideality,
which
is
faster
applications
faster,
mean
time
to
respond
and
resolve.
D
The
last
iPhone
launch
event
saw
a
max
peak
of
like
16,000
requests
per
second
provider.
The
minute
iPhone
launch
was
launched
right
and
then
since
then,
it
has
been
trending
around
an
average
of
14,000
thing
on
that,
culturally,
when
to
move
on
to
the
next
five
weeks.
So
then
we're
moving
towards
the
feature
which
is
like
you
know
where
we
want
to
extend
our
capabilities
around
simplicity,
security
and
scalability,
we're
trying
to
deliver
net
new
capabilities
on
the
function
of
the
service
and
also
exploring
new
capabilities
as
the
platform
as
a
service
layer.
D
So
a
lot
of
work
that
is
already
being
planned
in
terms
of
a
Denarius
meant
enhancements,
all
of
which
entails
infrastructure
in
the
in
the
in
the
background.
The
next
slices
okay.
So
our
bit
of
the
explanation
of
the
problem
statement,
little
more
have
folks
here
on
the
call
seen
this
before,
like
blue
and
the
black
dots
anybody.
D
Okay.
So
let's
actually
go
through
the
animation
here
girl.
So
basically,
what
you
see
here
is
what's
called
as
a
death
star
diagram
and
it's
a
representation
of
the
kind
of
the
ecosystem
you
micro-services
deal.
You
know,
live
in
and
the
kind
of
interaction
that
they
have
independent
services.
The
snapshot
on
the
left
is
from
Amazon
from
2010
in
the
Netflix
that
star
diagram
is
the
blue
version
and
then
all
looks
a
little
more
less
chaotic,
but
we're
getting
there
in
terms
of
you
know
what
the
chaos
is
going
to
look
like
the
future.
D
So
the
key
message
here
is:
you
know
we
as
engineers
we
write
services,
the
services
have
a
back-end
systems,
they
interact
with
and
obviously
even
system
scale.
You
know
things
your
customer
is
going
to
take
the
impact
in
terms
of
like
any
customer
impacting
events.
So
what
we're
trying
to
do
here
is
like
us,
in
terms
of
our
digital
transformation
initiative,
we're
trying
to
think
about
failure
in
a
different
way
and
trying
embrace
failure,
because
we
know
failure
is
inevitable.
D
X,
like
this
I
mean
so
and
I
actually
start
off
with
this
problem
statement
with
karoon
and
a
couple
of
months
ago,
we
we
wanted
to
look
at
like
two
kinds
of
failures.
Obviously
there
is
the
platform
level
failures
that
I
care
about,
because
I
run
the
platform
and
what
I
mean
here
is
what
are
the
kinds
of
things
that
could
go
wrong
with
my
platform?
What
are
the
assumptions
engineers
make
when
they
build
services
on
on
top
of
an
infrastructure
right
think
about
things
like
about
and
that
would
be
homogeneous?
We
have
infinite
bandwidth.
D
The
fact
that
we
have
infinite
compute
resources.
All
of
these
assumptions
needs
to
be
validated
because
when
you
fail
to
validate
at
this
event,
problems
start
happening,
and
you
know
you
could
get
into
a
disaster
scenario
like
these
two
guys
on
the
left
side,
which
is
not
our
data
center,
but
somebody
else
is
there
a
center.
The
fact
here
is,
you
know
our
data
center
is
in
a
North
Quay
prone
zone.
You
know
and
anything
could
hack
happen
here.
D
We
have
active
volcanoes
in
this
region,
so
we're
trying
to
be
cognizant
about
the
fact
that,
okay,
if
things
fail
how
our
system
is
going
to
react,
how
can
t
mobile
continue
its
business
or
and
because
a
lot
of
the
business
critical
applications
run
on
this,
and
then
there
is
the
containers
that
are
running
within
the
platform.
Containers
have
applications,
and
it's
not
just
one
target
application
right,
there's
several
application,
so
you
want
to
launch
specific,
targeted
attacks
on
containers
and
just
affect
that
one
application
under
context
right
all
the
different
servers
it
interacts
with.
C: So, looking at the problem statement, we have two problems there. One is the platform-level attacks, and the other one is attacking the applications that are running on the platform. So we started exploring, you know, whether there are any existing tools, because we didn't want to reinvent the wheel. We started with an open source solution called Chaos Lemur, but all we could make work with Chaos Lemur is the killing of virtual machines. In fact, at the platform level we wanted to achieve killing virtual machines, killing a process, introducing latency into the system, and introducing memory and CPU hog, right? All of these come under the infrastructure-level attacks, but Chaos Lemur for us was more like a Chaos Monkey: it can only go and, you know, turn off a random virtual machine, which is definitely not something that we were looking for.
C: We were looking for a bigger solution, so we started looking at Gremlin as one of the commercial offerings as well, and what you see here is the version that we evaluated. Gremlin is a very, very powerful tool, I must say, because it comes with a very neat UI. There was some initial doubt whether Gremlin would work on the PCF environment or not, but we made it work, and it seemed to work fairly well.
C: You know, it performs the operations at the infrastructure level, like killing of virtual machines, killing of processes, introducing latency. But again, the version that we evaluated doesn't have application knowledge. It seems Gremlin is working on that: even in the recent intro from the CEO, Mr. Kolton, it seemed to be coming up, and they have added this capability in the latest release of Gremlin, which we have not looked at yet, right.
C: But again, Gremlin comes with a cost, and we are also very conscious about the cost involved, you know, in running on the infrastructure. So we looked at Turbulence as another alternative, and it's open source. As you can see, it performs fairly well, and it's very native to Cloud Foundry as well, which is pretty much what we were looking for: anything BOSH-hosted.
C: Chaos engineering attacks, or failure injection attacks, can be performed on BOSH-hosted virtual machines: killing virtual machines, killing processes, introducing latency, and CPU and memory hog. But again, it lacks application knowledge, right? So for us, as I said, as the animation explained, those are the two problem statements: the infrastructure-level, or platform-level, chaos engineering attacks, and the application-level attacks. So here is what we looked at: Chaos Toolkit, a safe framework.
C: It basically orchestrates multiple other solutions, like Gremlin and Turbulence, as drivers. So what we built is two drivers. One is a driver for Turbulence itself, and the other is a custom, homegrown driver built from scratch, which has application knowledge, so it can go and discover where your application is running within the cluster. If I have a cluster of two thousand nodes, within those two thousand nodes this driver can go and figure out that your application is running on those particular nodes. It can also figure out what its service dependencies are.
C: Let me explain more clearly what exactly we are talking about here with simulating failures at the platform level. You can see the component diagram for PCF, Pivotal Cloud Foundry. There are various components here; each component could be a virtual machine, and the multiple boxes here could be processes running inside a virtual machine. So failure can happen.
C: There is a lot of interaction happening, so it is obvious that, you know, failure can randomly happen at any point, and, you know, it might eventually lead to a disaster as well, right? So let us take a simple example here: the rep process going down. The rep process, which runs inside the Diego cell, is responsible for managing the life cycle of the containers running in it; the Diego cell is like a worker node in Kubernetes.
C: Let's say a set of apps have auto-scaling enabled, which means that based on the load, or CPU stress, or HTTP latency, the apps are going to scale up and scale down in terms of instances. For that, the autoscaler-as-a-service depends on the Cloud Controller. So what happens when there is huge traffic to the app and the Cloud Controller goes down: is the app going to scale up or scale down? You know, those kinds of failure injection tests can seamlessly be performed
C: via the driver that we are talking about here. So how do we perform this? This is Turbulence. Turbulence comes with an API server and an agent; the agent goes and sits in each of the virtual machines in your cluster, listening to the API server, which is the control plane. So we use CTK, the Chaos Toolkit, and initiate a few attacks, which go through the API server, and then the agents fulfill the request.
C: So, of the ones highlighted here, some of the attacks that we can perform are: killing a VM; killing a process; pausing a process, which will be one of the demo scenarios here; introducing stress into the system, by introducing CPU and memory hog; corrupting a disk associated with, you know, a virtual machine; and network delays, limiting the bandwidth, reordering of the packets. What happens if the packets are reordered? How is your system going to behave?
C: Obviously, how is the platform going to react? Firewalls: attacking the firewalls at the platform level, with targeted-level blocking; you can go and perform, you know, IP-table-rule-level failures as well; shutdown; blocking DNS; and duplication of packets. These are some of the features that come with Turbulence, and the highlighted ones are the ones that we have added and contributed back to open source. So let me show you the first demo. For this I would like to run the video from my desktop, so just give me a second here.
C: Yeah, so this is the demo I was talking about. What we will do here is demonstrate how Chaos Toolkit has been used with the Turbulence driver we added. It demonstrates two scenarios. The first scenario is pausing a process: here we are going to pause the SSH process. What happens if the SSH process on the Diego cell is paused, right? And then the other one is killing a particular VM itself, like killing a Diego cell.
C: What happens to the containers running in it, right? It's a very short video, and it's on YouTube as well; the reason I run it here is that it has better quality and the zoom-in effect. So first I go in here: pause-process.json. This is my experiment file in Chaos Toolkit, with a title and description, a steady-state hypothesis, and the configuration information that I am supposed to pass as part of the experiment.
C: These can again be enhanced: instead of putting a username and password here, they can come from Vault as well. What I'm doing here is on a one-box environment called BOSH Lite, which gives you a Cloud Foundry running on one laptop, and in the Turbulence deployment's virtual machines you can see the Turbulence API server running. As I said, you know, there is a Turbulence API server, and a Turbulence agent running in each virtual machine.
C: So that's the configuration we provided in the experiment, and then the method here is the attack: pause the SSH process for one minute, on deployment cf and group diego_cell, with limit one, which can be any Diego cell. Right now in this environment I have only one Diego cell, since it's a one-box environment, but we tested this successfully on the staging environment, with hundreds of VMs there.
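(For illustration: the experiment file itself isn't reproduced in this transcript, but a minimal Chaos Toolkit experiment of this shape might look like the sketch below. The chaosturbulence module, function and argument names are assumptions based on the description above, not the actual API of the T-Mobile Turbulence driver; only the top-level layout of title, description, configuration, steady-state hypothesis and method is standard Chaos Toolkit structure.)

    {
      "title": "Pause the SSH process on a Diego cell",
      "description": "Pause sshd for one minute on one Diego cell and observe the platform.",
      "configuration": {
        "turbulence_api_url": "https://turbulence.example.com:8080",
        "turbulence_username": { "type": "env", "key": "TURBULENCE_USER" },
        "turbulence_password": { "type": "env", "key": "TURBULENCE_PASSWORD" }
      },
      "steady-state-hypothesis": {
        "title": "Diego cell accepts SSH connections",
        "probes": []
      },
      "method": [
        {
          "type": "action",
          "name": "pause-ssh-process",
          "provider": {
            "type": "python",
            "module": "chaosturbulence.actions",
            "func": "pause_process",
            "arguments": {
              "deployment": "cf",
              "group": "diego_cell",
              "process": "sshd",
              "duration": "1m",
              "limit": 1
            }
          }
        }
      ]
    }

(Such a file would be run with the standard Chaos Toolkit CLI, i.e. chaos run pause-process.json.)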
C: There's the pause, you know, and it pauses for about sixty seconds, and we can go and check the UI as well; there's a UI aspect to Turbulence. It is saying the pause-process attack is in progress, and it will continue to, you know, spin until about the one-minute mark. So after about a minute you will see the lock is released, right? You can try again to do an SSH, and it's again, you know, very responsive after one minute, because there is no longer a lock on that SSH.
C: Since we could do it for the SSH process, you can do it for anything: you can do it for the rep process, or anything. The second scenario is killing a Diego cell, any Diego cell. Again, there is a separate experiment file for this; we go with the standard experiment file definition. The steady-state hypothesis is actually empty for now; you know, we are not doing much, but we can add some probes.
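(For illustration: a non-empty steady-state hypothesis could use Chaos Toolkit's built-in HTTP probe, as sketched below. The URL is a placeholder; the http provider type and the tolerance-as-expected-status convention are standard Chaos Toolkit.)

    {
      "steady-state-hypothesis": {
        "title": "Applications on the platform still respond",
        "probes": [
          {
            "type": "probe",
            "name": "app-responds",
            "tolerance": 200,
            "provider": {
              "type": "http",
              "url": "https://spring-music.example.com/",
              "timeout": 3
            }
          }
        ]
      }
    }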
C: The method there is the attack: kill a Diego cell which is running in the deployment cf, with limit one; so you kill one Diego cell, any Diego cell, for that matter. Then I'm going to run this experiment, kill-diego-cell: validating hypothesis, and the experiment ended with "completed". As you could see, there was a Diego cell running here; now let's go and print the VMs. Clearly, there's no Diego cell here, right? So it's killed, right.
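(Again purely as a sketch, with the same assumed driver module as before: the method section of such a kill-VM experiment might read as follows.)

    {
      "method": [
        {
          "type": "action",
          "name": "kill-diego-cell",
          "provider": {
            "type": "python",
            "module": "chaosturbulence.actions",
            "func": "kill_vm",
            "arguments": {
              "deployment": "cf",
              "group": "diego_cell",
              "limit": 1
            }
          }
        }
      ]
    }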
C: So what happens to the applications running in that Diego cell, or the containers running in there, right? That's another way of looking at things. Again, the UI shows that the attack has been successfully completed; that's the reason it is in green. And then the Resurrector: as I said, it's a BOSH-hosted environment, so BOSH will go and, you know, bring back the Diego cell again.
C: So that's the first half of the problem statement: how you perform the platform-level chaos engineering attacks, via Turbulence and the Chaos Toolkit driver for Turbulence. The second half of the problem statement, which is crucial for us, is application-level chaos engineering, because we have about 4,000 applications running on our platform. We are not a single-application company, right, where you just attack one application and all the components of that application and see what happens.
C: It's not that. We have different teams building different stuff every day, and those roughly 4,000 applications are not at all independent; they are interdependent applications, right? So we don't want to randomly go and kill a Diego cell: it would impact multiple teams within T-Mobile, and that's a big problem for us. We wanted to make very conscious, very targeted attacks, in such a way that, okay, a specific application is targeted for the chaos engineering; what would happen then?
D: If I may interrupt: we deal with an open support model, where we get a number of different questions, and I want to talk about four favorite questions here, as to how we can actually not just be enablers but also be guardians, right; not necessarily just relying on best intentions, but providing tools and capabilities that will also kind of, like, help with those best intentions when working with such a large internal customer base. The first one is "my app isn't picking up the latest configurations", and our first reaction is:
D: this is because of bad karma, and your app is misbehaving, right? The second one is "my app isn't connecting to Cassandra", and our first reaction is: because you can't, and that cluster was potentially decommissioned, or you must check with the Cassandra team. And then the next one we see is "my app works locally, but not on PCF".
D: That is likely because you misbehaved with the PCF team, it is assumed. And the last one we see is "it was working till Friday and then it stopped working", and this is because we believe you've not paid the bills, all right? But jokes aside, guys, these are some serious concerns, which we classify as "debugging as a service", and oftentimes customers like to start with us because we're very nice to them; we try to enable them.
D
We
try
to
like
make
them
some
sufficient,
but
that's
not
enough,
so
you
need
to
like
provide
tools
and
capabilities
which
will
also
provide
guardrails
for
them
to
operate
with
them,
and
that's
where
to
like.
This
is
going
to
come
in
all
going
to
be
very
effective,
which
is
it's
going
to
like.
You
know,
enlighten
them
as
to
what
they
can
do
to
actually
validate
some
of
these
off
soil
conditions
right
and
help
them
be
more
self-sufficient.
C: Perfect, sounds good. So, having said that, extending the same problem statement, I would like to touch on the cascading effect, popularly the butterfly effect, as well, right? We all know what a cascading effect is; I don't want to dig into the details, but quickly explaining this: we have "concert" and "weather" micro-services running in our platform, and in this case weather is dependent on a third party.
C: What happens if the third-party application, you know, goes down? It totally impacts weather, and that times out concert, right? And it may so happen that, you know, the database which concert is dependent on might also go down. All of this put together creates a cascading effect and gives an unfavorable experience in the web application running on the front end, to the client, right; the client will experience something very unfavorable. So, zooming in a bit:
C: what happens if these applications are running in a Spring Cloud kind of ecosystem? These are the Diego cells; imagine that, you know, the weather and concert micro-services are both scheduled and running on the same Diego cell. The Diego cell is a virtual machine, so both these containers are scheduled and running on one particular node.
C: So what I mean by a targeted attack is: what happens when you block the traffic to concert, coming from the Go Router or the load balancer, to the concert service only, and block the traffic from the weather service to the MySQL database? As you can see, the weather and concert services are both dependent on the MySQL database, so I'm doing a very targeted attack here.
C: Both are running on the same node, but for these two apps I can go and, you know, do fine-grained attacks: blocking the traffic to concert, and blocking the traffic to the database from weather. Again, the failure can happen at different levels as well; there is a lot of interaction happening here. Weather might lose the connection to the service discovery or the circuit breaker; concert can lose the connection to the config server as well.
C: All these attacks can be simulated, all right? So how we do that today is CTK CF blocker, a new driver that we wrote, and we target specific CF apps and application hosts. What it does is it discovers where your application is running, and then it discovers what services your application is bound to. In this case, my weather and concert micro-services are bound to the MySQL database, the config server, Eureka, and the Hystrix service; and now it can also go into a service instance. For example, it can...
C: So, Spring Music. Just to see what happened here: the app got loaded, and these are the breadcrumbs; this is the album information that got loaded from a database. You can see here which database it is: it is a MySQL database called music-db. The two spring-music apps are both bound to this music-db. So now what we're going to do is bring down the connectivity to music-db for one spring-music app.
C: And then the method here is blocking a service and unblocking a service. First we block the service for 60 seconds; the service that we are going to block, for the app spring-music, is named music-db, and we have to provide some information like the org and space names, which is specific to PCF. Then, in the rollback, unblocking the service, we again provide the org and space names and the app name, and then the service to be unblocked.
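(A hedged sketch of what such a block/unblock experiment could look like; the chaoscfblocker module, function names, and org/space values are illustrative assumptions, not the published API of the CTK CF blocker driver described here.)

    {
      "title": "Block spring-music's bound MySQL service",
      "description": "Block traffic from spring-music to music-db for 60 seconds, then unblock.",
      "method": [
        {
          "type": "action",
          "name": "block-music-db",
          "provider": {
            "type": "python",
            "module": "chaoscfblocker.actions",
            "func": "block_service",
            "arguments": {
              "org": "my-org",
              "space": "my-space",
              "app": "spring-music",
              "service": "music-db",
              "duration": 60
            }
          }
        }
      ],
      "rollbacks": [
        {
          "type": "action",
          "name": "unblock-music-db",
          "provider": {
            "type": "python",
            "module": "chaoscfblocker.actions",
            "func": "unblock_service",
            "arguments": {
              "org": "my-org",
              "space": "my-space",
              "app": "spring-music",
              "service": "music-db"
            }
          }
        }
      ]
    }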
C: So we block the service; and there is a verbose flag you can enable to get deep-dive stack-trace information, but we will not use that for this demo. What this timer is doing is trying to block traffic to music-db bound only to the spring-music app. Now it has found where the applications are running; the app has three instances, so it figured out all three instances of the app running in your cluster, and then it started attacking.
C: You can just refresh and, as you can see, there was a successful attack: there's no data now, right? Again, the other spring-music app, which is running on the same VM and pointing to the same database, can fetch the data; there is no issue there. But for the attacked spring-music app there's no data. So that proves a successful attack, and it takes about 60 seconds to, you know, bring the system back, because we have a rollback policy after 60 seconds, right.
C: So let's go back and see what's happening: the rollback, unblocking music-db, has kick-started, and there you go, you see that spring-music is back in action. Within those 60 seconds you can see what happens to your application; that's the actual goal here. And it can be any service: not just music-db, it can be anything, MySQL, or Eureka, or the Hystrix service, any of the services the app is bound to. The second scenario is blocking traffic to an app.
C: What we're going to do is block traffic to this spring-music app only; again, these two are running on the same virtual machine. Blocking traffic: let us look at the block-traffic.json experiment file, and you can see it's the same, you know, similar pattern of definition. The configuration goes in here, then the steady-state hypothesis; let me come straight to the method here. What we are doing is blocking traffic to the app spring-music; there is no service involved here, and that's fine.
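(Sketched the same way as before, with assumed module and function names: blocking all traffic to the app would simply omit the service argument.)

    {
      "method": [
        {
          "type": "action",
          "name": "block-traffic-to-spring-music",
          "provider": {
            "type": "python",
            "module": "chaoscfblocker.actions",
            "func": "block_traffic",
            "arguments": {
              "org": "my-org",
              "space": "my-space",
              "app": "spring-music",
              "duration": 60
            }
          }
        }
      ]
    }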
C: It's blocking all traffic to spring-music, so it's still working, but let's give it a moment... and there you go, you see the traffic has been blocked, and you see the 502 Bad Gateway. The gateway is aware of the route, but it doesn't know how to make a TCP connection to the app. So that counts as a successful attack for us, right.
C: And the unblocking has been initiated; you can see that, you know, the unblock happened, because there is no 60-second timeout here, so it happened very quickly. So that brings us to the end of the second demo as well. Let me go back to my slides; I have like two, three slides now, and that's all I have.
C: We have the Chaos Toolkit CF blocker driver and the Chaos Toolkit Turbulence driver; we want to bring these two under the Chaos Toolkit umbrella, and that's the effort we are putting in as well. These demo videos are available for you to go through again. Also, Turbulence is built by another person, you know; we are not the ones who created Turbulence. So we did a pull request with all the new feature add-ons, and the PR is still pending approval; we will wait and see.
C: We want to conduct some game days. This is still a slightly matured proof of concept right now; we want to make it work, productize it, and call out teams on a team-by-team basis and perform game days in a war room, randomly attacking our own infrastructure and seeing how their applications behave. So we want to build this capability and give it to the application teams to, you know, perform chaos engineering attacks on their apps. So that's all I have.
A: Well, thank you very much, Karun and Ramesh; that was a really sweet demo. It was really, really interesting to see; thank you very much for that. I'll carry on with the, you know, the slides, if you don't mind now. Should I stop sharing? Yes, please, yeah. Thank you, it was really a very interesting demo, very fun to watch. Alright.
A: I think that's all, right? So, yep, I think we are back to, you know, the usual discussion around the landscape and, you know, crafting the categories. I personally didn't get any chance to actually look at it yet, so if anyone has, fantastic. I see there is a pull request; I'll be looking at it as best as I can. But I think the idea, and we'll be talking about KubeCon probably, is to have that wrapped up in some fashion, you know, soon-ish.
A: Basically, this is where we are now as a community with chaos engineering. I think, at least I know, that I and Gillian, who presented two weeks ago, will be doing a joint demo, building on the one he started, and showing how we can actually automate that. Please ping Chris if you are interested in talking, or being there, you know, in some capacity; obviously we...
A: Reach out to him, yes; don't reach out to me, because I don't have any control over that, I'm just passing on the message. Yeah, please do talk to Chris about it. Alright, it would be fantastic if we could see, you know, as many people as we can, not just on screen this time but, you know, face to face, and, you know, just have a coffee or something; that would be perfect.
A: We'll see you there, definitely. Alright, this is where I'll be leaning on the people who were at Chaos Conf. I heard it was a very good thing, so I'm willing to listen a bit more about it. So who would be, you know, willing to talk about it? I think Michael, yeah; and I don't know if there's anyone else from the...
E: So, Chaos Conf was about a week and a half ago in San Francisco. We kicked off the day with Kolton from Gremlin talking about the sort of evolution of actual chaos engineering attacks, and he sort of concluded by talking about Gremlin's new product ALFI, which is application-level fault injection, where you can write various attacks into your application, which was pretty cool. Adrian Cockcroft from Amazon then spoke about the history of chaos, which I really enjoyed; it was a really good talk. We had a number of other presentations.
E: Later in the afternoon we had Tammy and Ana from Gremlin do a really cool demo of breaking AKS and EKS, which was pretty cool, and we finished the day with Jessie Frazelle, who was talking about containers and breaking stuff in containers. So it was a really good day. Sorry; no, no, go on, it was nothing. Yeah, it was a really cool day, good to meet more people in the community, and I look forward to next year; I hear there may be a new venue, we'll see.
A: That's back to what I was saying about the working group work. I need to pay attention personally to the state of the PRs and everything, not just the one I mentioned earlier, but the various ones on the white paper, and just try to merge everything into one document and see where it stands. Right now it probably needs a lot of polishing, so PRs are, you know, basically welcome. And that's about it for this week's, you know, meetup.
A: Anyone? OK. I think this usually ends with me saying that we welcome any demo; if anyone wants to do a new demo, or a better one, you know, whatever, I think it's always cool to have that. So that's the standing invitation to people that aren't aware of this working group. I think it's important; you know, I'm sure there are plenty of companies who are doing chaos engineering, or resiliency engineering, whatever you want to call it, and they should come and just show it to us. Right, all right.
A: A good question; yeah, if anyone wants to respond, I don't want to hog it all, you know. I should actually have done this with the screen showing, so we can see each other. Happy to let anyone answer that one, or I can go for it. I think the working group is not yet a working group; that's important to note from a CNCF point of view. It's not yet a working group; it needs to be proposed and accepted and blah blah blah.
A: The point of what we're trying to do as a community is to bring everyone across and basically work together, and for that we're trying to at least add a white paper. It doesn't have to be the complete, you know, comprehensive, canonical white paper for chaos engineering, but enough to showcase what it is and where it goes. I know Chris is looking at expanding on the landscape side of things.
A: Michael did mention that last time, and it's a good point, right. I think the idea is, you know: I don't work for the CNCF, I don't know that process, so all I can do is echo what we discussed before. But as far as I understand, the idea would be to put that, you know, GitHub repo in good shape, where we can, you know, agree on that white paper, at least as a, you know, phase-one, version-one.
A: Or anything like that, and increase the landscape, so that we have a good overview of what exists. As a milestone, I think it would be before the end of the year, and best would be if we could, you know, do it before KubeCon, because during KubeCon I think Chris is trying to raise more awareness and more, you know, promotion. Yeah, so earlier would be better. Okay.
B: Let me know, maybe Chris has better visibility, but the best case is to fix the target: to finish the landscape before the end of the year, and to motivate the group to grow into a real working group, with the target of writing specifications, or best practices, I mean, guidelines for chaos engineering, for all engineers working on services, I think.
A: I think that's interesting, because, as far as I understand, the CNCF tries to avoid specifications; initially they don't want to, you know, be governing us, at least as I understand it. So best practices, or something like that. My assumption here is that we do the tidy-up, you know, within like six weeks or something, so that it gets ready, you know, during November, so that at KubeCon we can meet up and probably discuss, for next year.
A: If we look at what the Serverless working group has done, it's subtly interesting; you were talking about specifications, and they came up with a specification on the side. I don't know if we need a specification, or I don't know what we need as a community, but KubeCon and, you know, the other meetups, for those who can come to KubeCon, are the good places. This is where I would like to go, and as a community we can basically decide that.
A: Every time we meet up in person, it's probably a good, you know... it's faster, it's more concrete, and, you know, it seems to work better. Even though we're all distributed and, you know, all over the place, somehow, when we meet up in real life, that seems to be faster.