From YouTube: Chaos Engineering WG Meeting - 2018-05-22
A: Agenda today: we'll do some introductions, then we'll have some community presentations, about 15 minutes each, from the Gremlin and Chaos Toolkit communities. We'll talk a little bit about the status of where we're at with the landscape — mostly me asking for help from everyone, trying to come up with some reasonable categories for how to categorize the different technologies in chaos engineering. We'll also talk a little bit about the white paper: based on community feedback I uploaded it to GitHub, so we can bang away at it via pull requests instead of Google Docs, which seems to be favorable to a lot of folks, and then we'll kind of end things out. But first off, it would be great if we could get some introductions from folks, especially if you're new on the call. So, any new faces from last time — feel welcome to speak up and say hello. Cool.
A: So there are simple ways you could categorize: hosted solutions versus, you know, maybe frameworks that run client-side, versus maybe chaos engineering tools that are focused on security. Having an idea of how to categorize this thing is definitely something I've been trying to tackle, so I'm asking folks. We could probably have a five-minute discussion on the call today, but there's a GitHub issue open, and if you have ideas of how to categorize things, they would be great to hear, because that's essentially what's going to drive the landscape that will be produced by CNCF for this work. So I don't know if anyone wants to take a stab at this. I know Sylvain, you had some thoughts that we chatted over in email, but I'd kind of love to hear from the group.
A: Well, I mean, it was a lot of bike shedding forever to come up with the categories. First we started with a list of serverless projects and things, right, and then from there we tried to break it up into categories that kind of made sense. So that's the approach we took — it took a long time. And I was trying to use that framework to apply to chaos engineering, because, you know, obviously there seems to be an upsurge of hosted offerings, with Gremlin etcetera, and then there's a bunch of tools that have different focuses — whether they're trying to do chaos engineering in a, what's it called, security context, or some other context. So, you know, I'm just asking how to categorize things — what are people's thoughts on this?
C: What gets applied versus what it's being applied to, I suppose — broadly, trying to see whether it's, you know, like what I might do, which is run on a snapshot of the environment, almost like a staging environment; whether you actually have, say, the Gremlin agent deployed and you're running it on production with a blast radius; or whether you're actually just doing something loose like chaos monkey, almost.
A: Yeah, yeah, I mean, you know, I'd also ask people to have some empathetic thoughts from an end-user perspective. If you're trying to evaluate the tools out there — like, "hey, I want this to work against AWS or Kubernetes" — making it easy for them to find that is super useful, so I'd like us to also make sure that's somehow possible from just an end-user perspective. There could be multiple axes of filtering. You know, I'm not asking for a complete solution, but for folks to put their thoughts on that GitHub issue and for us to keep iterating on this. Eventually I want to get to a state where we have a rough agreement, and then I could work with our design team to start sketching out how this is going to look, and we could continue to iterate on that.
E: With the other contributors, what we came out with — the main idea was we split everything into use cases. So: who is it for? Not "what does it do," but "who is it for." From there most people find the common terminology, can relate to keywords much more easily, and find a way to contribute to the list as well. So, yeah, it will involve the community around it.
A: Yes, so this would definitely be done in an open-source fashion, where anyone could contribute their thing to the landscape. If you go to l.cncf.io, you can kind of see what we've done for the wider cloud native landscape. It works pretty well, because community members tend to police themselves, which is beautiful — it's a beautiful thing to watch, making sure that things are categorized properly. But it's on us to come up with the categorization scheme. So, yeah, that's basically all the time I want to spend on this one, just because we have demos and I'm really excited to see them. All I ask from the group is to throw some ideas on the GitHub issue that I've linked to, and we'll kind of go from there and try to do that work stream in as async a fashion as possible.
A: All right, so moving on. Next up we have two community presentations, to kind of see how people are doing chaos engineering in the wild. First off we'll have Eugene from Gremlin talk a little bit about what he's up to, and then we'll have Sylvain talk about Chaos Toolkit. Okay, let me stop sharing my screen so Eugene can get going. Oh, thanks.
I: Thanks, everyone. My name is Eugene; I've been at Gremlin since July last year. Just one — whoa, it's the slide. Okay, seeing a little bit of artifacts around — you got it, good now, very cool. So the company itself was founded in 2016. Kolton left at that point in time and, you know, recruited Forni to be CTO, to start working on this product to make chaos engineering available to basically the rest of the world — since we've all seen it work at places like Google with their DiRT exercises, at Dropbox similarly with Tammy and her DiRT exercises, with Kolton at Netflix doing chaos engineering, and then Kolton and Forni, you know, building a chaos engineering tool internally at Amazon retail. So we know the value at the corporate level, and we're just trying to make it available to everybody else. Gremlin itself installs in a myriad of ways. One of them is a Linux package: all you really do is pull down our repo and then install the package. Or you can install us as a Docker container — pull us from gremlin/gremlin on Docker Hub — or as a Kubernetes DaemonSet; I've seen a lot of our customers do that as well. We have a CLI interface, so you could SSH into any host to run Gremlin experiments.
I: We have an API as well for automation, as well as a web app, just to make it really easy for people to dive into all the types of failure modes that we can introduce into your ecosystem, and otherwise to properly scope your attacks. One of the things that I find very valuable when helping our customers scope out chaos engineering experiments is that they go, you know, "where do I start?" Well, if you don't have this nice rich UI that gives them all the parameters, it's really hard to get started — because, you know, I could just blow up a whole auto scaling group in AWS and see what happens, right? Well, start small, and we can help you scope that out properly. We also have a built-in scheduler. You know, I think one of the things that we all hear a lot about is automation, and so putting things into the scheduler really helps you maintain that floor of resilience in any of your applications. Alternatively, just to stop things from going wild — say your client loses its connection to our control plane — we have TTLs, which on our client will kick in and roll everything back to steady state automatically, such that you don't have any uncontrollable chaos within your ecosystem. So that's this slide; let me go ahead and get into the demo really quickly.
I: Works for me. All right, so when you log into our service, this is the UI that you're going to get. If you have clients already hooked up to our control plane, you can already begin to create your first attack, or otherwise put things in the scheduler, and finally manage your clients, or the users connected to us. I already talked about the ways of installation — I usually talk about that with our prospects and customers — so I'll just skip that and go right into the types of attacks that we can run right now. The client itself is focused on infrastructure-level attacks: things that happen on your host, things that happen on your operating system, things that happen on the network. For all of those, we have a good suite of attacks that you can run with Gremlin. So, resource attacks, first off, are things that happen on your host, right?
I: We can consume host cores — out of the amount of cores that you have on your host or your instance, you can specify the exact amount that you want consumed. We can fill a specified amount of disk space — what happens if your log rotation doesn't happen? We internally got bit by that; you should check out our blog post at gremlin.com. We can also introduce disk read/write activity; for those of you that have a lot of disk-intensive tasks, this might be a good gremlin to run if you have a lot of heavy I/O operations. And finally, we can also consume gigs of memory. Notice that every single one of these attacks can also be saved as a template, for when you have some attack that you find you're going to recall fairly frequently. Some of these attacks might be a little bit more specific and highly targeted, so as a result the configuration can take a little bit of time — save it as a template.
I: That way you can bring it back in the future, or otherwise throw it into the scheduler so that you can just automate that particular attack. State gremlins alter the state of your operating system. So, for example, we have a process killer here, where we can string-match the process that you send to us and we'll just kill it perpetually. What happens if your web server, like httpd, were to go away, or your Java or Tomcat app were to go away? Does a health check pick that up and then otherwise terminate the host or start recovering from it? A good thing to test with this. We also have a shutdown gremlin. So what happens in a public cloud, right, like AWS — a failed health check on your instance? Good, it's going to get terminated. Or melt down your AWS account with a rolling reboot across your fleet.
I: You don't know when it's going to hit, but it's going to hit — use the shutdown gremlin for that. One of the key values that we have here is that if you use this in conjunction with our scheduler down here, and you say, oh, run this during business hours — right, 9:00 to 5:00 we're running the shutdown/reboot gremlin five times a day — well, you basically then have that chaos monkey experience right out of the box, for yourself, right? The final state gremlin that we have is time travel, where we'll break NTP and introduce clock skew. Many times I hear from our customers that when you introduce this in, say, your Cassandra cluster, terrible things happen. Maybe you want to see what happens in your own world, for your data layer. Otherwise, things like daylight savings time are also something to consider, or if a certificate on your host were to expire.
I: That's a good thing to simulate as well, or a leap year. Now, the final set of gremlins that we have are the network ones, and these tend to be the most valuable and most powerful ones, because in a distributed system — in AWS or any kind of cloud provider — the network tends to be the most fragile point. As you're breaking your applications up from monolithic to microservices, your network is now basically part of your application stack, right? You expect it to be high-performing, but really, some things happen because, well, cloud happens. Network devices have built-in entropy to them, so you definitely want to test for that. Similarly, what happens when things become degraded or otherwise unavailable? The blackhole gremlin here drops all packets going from one place to another, so you can simulate something like full service unavailability. Notice that all these network gremlins have the most arguments that you can pass to them, and I really just want to highlight the concept right here that we can actually simulate full service outages. Now, some of you might remember this great outage.
I: That happened last year, called the S3 outage — and we can simulate that for you out of the box, just by adding S3 as a service provider right here. The next gremlin that we have is the DNS gremlin, and we can break DNS for you. You want to test what happens if your primary DNS server were to go away — do your hosts actually fall back to your secondary? Definitely something worthwhile to check. Otherwise, you can simulate bigger DNS outages, like the Dyn DNS one that happened a few years ago, or maybe Route 53 being unavailable as well. These last two gremlins, latency and packet loss, are the ones that I would call gray states of failure. You know, your systems are running, but due to things like a noisy neighbor, or having to traverse through a lot of Internet traffic, things become slow or otherwise degraded, so the system is not operating at its most efficient point.
I: The problems usually manifest themselves in the form of latency — things become a little bit slower than what you're expecting them to be. You've seen the menus here, so I'll just go ahead and talk a little bit about them. You want to dial in how long you want to run the attack for; sometimes your observability or your monitoring tools take a little bit longer for the metrics to show themselves, so you can definitely specify how long you want the attack to run. You can specify things at the IP address, IP address range, or CIDR block level; at the device level, such as your eth0 or your eth1; by hostname; or by endpoint — say you want to just inject some latency going to google.com, you can just type it in. Or if your service has third-party dependencies, something for messaging like Twilio, you can do that as well. If you want to whitelist particular traffic, it's just a caret right here — so maybe you want to whitelist your monitoring, just so you have some observability into this chaos experiment. We also support ports and port ranges, and finally, you can specify which protocol you want when defining the attack.
I: Once you're done defining the attack, you now want to specify the targets that you want to attack — and by this we're talking about the blast radius, if you will. To help you with this, we pull down your AWS instance metadata here, where you can click on any of these bubbles to filter by, say, region or availability zone. And finally, we also support services that you can pass on to us, so that if your hosts are serving up a particular application, you can specify that. One of the things that I like to highlight here is our concept of random. While most of the tools that we see say, you know, do it at random, do it in prod, we kind of say you want to still be a little bit targeted. So our concept of random here is that we take all of the clients that you have installed with us, and then through filtering — say I only care about the things serving our API, for example — you'll filter down to all the clients that are serving that particular service, and then you can specify, well, I only care about maybe two hosts here, so go ahead and use those as my targets. Or, otherwise, we can also support a percentage of your environment being impacted, right? So maybe I'll say 50 percent of it is going to get impacted.
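The "random, but still targeted" selection Eugene describes — filter the registered clients down to one service, then impact only a slice of them — can be sketched in a few lines of Python. The client records, tag names, and hostnames here are invented for illustration; they are not Gremlin's actual data model.

```python
import math
import random

def pick_targets(clients, service, percent):
    """Filter registered clients down to one service, then impact only
    a percentage of them (rounded up), rather than the whole fleet."""
    pool = [c for c in clients if service in c["services"]]
    count = max(1, math.ceil(len(pool) * percent / 100))
    return random.sample(pool, count)

# Hypothetical client inventory, as a control plane might see it.
clients = [
    {"host": f"host-{i}", "services": ["api"] if i < 8 else ["batch"]}
    for i in range(10)
]

random.seed(42)  # deterministic for the example
targets = pick_targets(clients, "api", 50)
print(len(targets))  # 4 of the 8 "api" hosts
```

The point of the two-stage filter is that the randomness only ever applies within an explicitly chosen blast radius, never across the whole inventory.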
I: Now, if you want to do container attacks, like in your Kubernetes environment or something along those lines, you can send us your labels — maybe the ones you have put on your pods — and we'll just go ahead and attack the matching pods within those hosts. We won't use that right now, so let me just go ahead and kick off this attack. Once you've finished specifying your attack and kicked it off, it'll take us over to our attacks page, where you see all the current attacks and also all historically run attacks. We eat our own dog food, so you're going to see a lot of attacks in the Gremlin account.
I: For example, once you click in, you get to see the full attack definition. All client logging comes back up to us, so that you don't have to remote into a host to see what's going on. At any given point in time, if you feel like you've done enough damage — or "I need to roll this back because I made a mistake, fat-fingered my attack," for example — we have a halt button right here. Our client will pick that up within seconds and go back to steady state within seconds. Everything I've shown you, again, has full feature parity with our API, right? We don't circumvent ourselves via the web app or anything of that sort. So you can go ahead and orchestrate your own tooling, or otherwise put it into your CI/CD pipelines, such that alongside your smoke tests and your regression tests you'd spin up a canary cluster, install Gremlin, and run some resilience tests on it as well, to get some confidence in the resilience of your systems. So that's the demo for Gremlin; I hope you all enjoyed it.
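Driving attacks from a CI/CD pipeline, as Eugene suggests, comes down to POSTing an attack definition to the control plane's API. The sketch below is illustrative only: the endpoint path, auth header, and payload shape are assumptions for the example, not Gremlin's documented API.

```python
import json
from urllib import request

API = "https://api.example-chaos.test"  # placeholder, not a real endpoint
TEAM_KEY = "REPLACE_ME"                 # would come from your CI secret store

def build_cpu_attack(percent_of_hosts, service, length_secs=60):
    """Assemble a small, scoped attack request: one CPU impact, limited
    to a percentage of the hosts serving one service (shape assumed)."""
    return {
        "command": {"type": "cpu", "args": ["-l", str(length_secs)]},
        "target": {
            "type": "Random",
            "percent": percent_of_hosts,
            "services": [service],
        },
    }

def run_attack(payload):
    """POST the attack to the (hypothetical) control plane."""
    req = request.Request(
        f"{API}/attacks",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Key {TEAM_KEY}",
                 "Content-Type": "application/json"},
    )
    return request.urlopen(req)  # returns the HTTP response

payload = build_cpu_attack(percent_of_hosts=10, service="api")
print(json.dumps(payload, indent=2))
```

In a pipeline you would call `run_attack` after the canary cluster is up, wait, then assert on your monitoring before promoting the build.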
L: Definitely something we've thought about. You know, we're sort of just getting past the POC stage right here, so for the things we want to do, it's a "the squeaky wheel gets the grease" sort of thing. So containerization has been a real big part of, you know, our product roadmap. Okay.
B: Can you go through a little bit of the telemetry that you get after you've run an experiment? I saw there were some logs on your dashboards — do you get any sort of visualizations or anything like that?
N: Cool, my question was actually centered around monitoring as well. So how do you interface — or do you interface at all — with things that are tracking SLOs or SLAs, to, like, put them in a quiet mode? Or would you even want that? And, kind of a side note, I'm new to this, so maybe that is something that you want to see — whether your SLAs are affected by such a thing.

N: That's fine — like, I heard your statement on core competency, and I think that's great. But being able to tie into something like Prometheus, or maybe making it so it's not going to go page an entire engineering team while you're doing tests like this — at least in a controlled setting. But, you know.
L: You're totally right. I guess what we usually tend to advocate is overcommunication in that regard. So if somebody does get a page, they know that they're getting paged because we're running testing. And yeah, you can turn off your paging, right — you can turn off PagerDuty, or silence these sorts of things. Often, though, we actually want to do this to see that a page actually goes off when something happens, right? So if you peg a bunch of cores, you expect pages to go off.

A: Fair point, yeah. No, that makes a lot of sense. Thank you.

I: Right, I do agree. For me, at this point, many times when I run game days with our customers, it's not so much finding faults — more so making sure that they have their observability, their monitoring, their paging dialed in and tuned. A lot of times they go, "this happened, and I never got a page for it." What you found here, you want to fix real quickly to get that dialed in, because otherwise it's going to bite you — this is already in production.
L: Go ahead, Forni. — I'm not sure I understand the question entirely. I mean, if you want to trigger your ASG cycling pretty quickly, you can just use the shutdown gremlin on a loop, I suppose. I'm not sure that there's a direct integration in terms of CloudWatch Events. I'm not sure what your hypothesis is here that you're trying to accept or disprove, I suppose.

L: People have asked — I don't know if you know engineers, but they ask for everything; they ask for everything under the Sun, yeah. It's been asked. You know, we're slowly rolling out more integrations and, sure, prioritizing what our customers ask for. They definitely ask for that; they definitely ask for other sorts of things as well.
A: Cool. I just want to make sure that we're sensitive to the time, but thank you, Eugene and Matthew, for the presentation — that was super cool. All right, so we've got about 20 minutes left, so that's about 15 minutes for Sylvain to present, with some five minutes for questions. So, Sylvain, are you there? — Yes, I am. — Yeah, all right.
F: The idea was roughly that we saw that tools like Gremlin, Pumba, you know, chaos monkey obviously were out there, but as we were trying to figure out how to apply the experimental pattern that we had read about in the Chaos Engineering book, we felt that those tools, while actually delivering the goods, weren't helping us forge the experiment, if you will. So we decided to create the Chaos Toolkit to do that, basically.

F: So the Chaos Toolkit by itself does nothing, unlike Gremlin and the others. What it does is it helps you declare your experiment, and then you decide what tool or what API you want to drive to actually inject the chaos. So the Chaos Toolkit wouldn't actually provide anything itself — it's not like Gremlin — but it could actually drive their API, if you wish to use Gremlin for that matter. So, basically, it's just an open API for your experiments.
F: It's a CLI-driven tool; we felt like we wanted something we could automate. That's why we didn't care for a UI at first, initially. And simplicity — I'm talking about the code itself — we wanted something that other people could actually contribute to, and we tried hard to make things as simple as we could. So, basically, it's just a bunch of functions glued together in Python — well, it's a bit more than that, but that's the rough idea. If you want to contribute, you do need to know Python, to a very basic level. What it does is orchestrate existing tools, or the APIs of existing tools. That means that if you have a binary that you want to drive from the Chaos Toolkit, you can call it; but equally, if you want to call an API, you can also just call it, by passing all the parameters the API requires, and it will simply call that for you. We've already actually implemented a set of drivers.
F: We call them drivers, but they're just extensions to the Chaos Toolkit, really. We don't claim to support all the APIs of those providers — that would be foolish and, you know, just a lie — but we try to target the APIs that people don't necessarily trigger very much, like, yes, stop a service, or remove a Kubernetes service, or things like that. Basically, anything that you probably don't call, except if you're a developer doing that on a daily basis — but now with the idea of chaos engineering in mind: you're stopping something to see it restart after that. Well, those ideas are very powerful in production, or pre-prod, or wherever you want to run them, to actually impact your system, if you will. So, for example, if you want to remove a service, or if you want to terminate something, you just call it — and basically, that's it.
F: We have that for all sorts of providers. Azure Service Fabric is a bit different, because they already actually have chaos services native to the platform, so you don't actually call anything yourself — you just call start chaos or stop chaos. Much like Istio, I think, which has fault injection. And then we realized, you know, causing trouble is fine, but we really need some probes as well — like you guys said earlier about monitoring. Basically, you can query Prometheus, or Humio if you use that as a central logging platform, and all of that is contained in the file that I'll show you in a minute. We've got a bunch of plugins for creating reports and sending Slack notifications. And finally, the future of the Chaos Toolkit: we're going to go more native in Kubernetes with cron jobs, so that you can schedule things and just let them run.
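A Prometheus probe of the kind Sylvain mentions boils down to calling Prometheus's HTTP instant-query endpoint (`GET /api/v1/query`) and checking the returned samples against a tolerance. A minimal sketch — the threshold and the canned metric are invented for the example:

```python
import json
from urllib import parse, request

def query_prometheus(base_url, promql):
    """Call Prometheus's instant-query endpoint and return the result list."""
    url = f"{base_url}/api/v1/query?" + parse.urlencode({"query": promql})
    with request.urlopen(url) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]

def within_tolerance(result, threshold):
    """Steady-state check: every returned sample value stays below threshold."""
    return all(float(sample["value"][1]) < threshold for sample in result)

# Canned response in Prometheus's wire format, so the check can be
# exercised without a live server.
canned = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"job": "app"}, "value": [1527000000.0, "0.02"]},
]}}
print(within_tolerance(canned["data"]["result"], threshold=0.05))  # True
```

In an experiment file, the same idea appears declaratively: the probe names the query, and the tolerance is the value (or range) the result must satisfy.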
F: We're looking at operators as well, to actually control the experiments a bit more when we run natively in Kubernetes, but we're just starting to think about that. The drivers and those runtimes we run are in Python, but, you know, we love everything — so if you want to run your extensions in Go, or anything else, we're going to try to make it easy to actually call them from the Chaos Toolkit, as best as we can.
F: Yes, that's right. So that's the website. Like I said, it's CLI-driven, so nothing very fancy to show here. The idea is just to walk through the important bits. Like I said, we tried to create an open API, so the open API is, you know, just a definition of the various elements of a Chaos Toolkit experiment. That's what it looks like, briefly: you've got a set of metadata. What's interesting is you've got the steady-state hypothesis here — what is "normal" in your system? What we do is use a bunch of probes — you can have as many as you want — to query for some things in your system, and see if any of them fails. The tolerance here is just a boolean; it could be something else.
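The control flow Sylvain describes — verify the steady-state hypothesis, abort if it doesn't hold, otherwise run the method and verify again — can be sketched generically. The probe and action callables below are stand-ins, not the Chaos Toolkit's actual internals:

```python
def verify(hypothesis):
    """A hypothesis holds when every probe's value meets its tolerance."""
    return all(probe() == tolerance for probe, tolerance in hypothesis)

def run_experiment(hypothesis, method):
    # Bail before injecting anything if the system isn't normal:
    # a deviation afterwards would be impossible to interpret.
    if not verify(hypothesis):
        return "aborted: system was not in its normal state"
    for action in method:
        action()
    # Re-verify after the turbulence: a failure here is a finding.
    if not verify(hypothesis):
        return "deviated: potential weakness found"
    return "completed: steady state held"

# Toy system: a health flag the action flips, and a boolean tolerance.
state = {"healthy": True}
hypothesis = [(lambda: state["healthy"], True)]
method = [lambda: state.update(healthy=False)]  # the "chaos" action

print(run_experiment(hypothesis, method))  # deviated: potential weakness found
```

The two verification passes are the whole point: the first one guards the experiment's validity, the second one is the measurement.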
F: If any probe fails its tolerance, we bail the experiment — because, you know, if your system is not normal, at least in what you expect to be normal, there's no point actually in going through with injecting the chaos, because you won't be able to read, analyze, and make sense of what you see. So we bail early. Then we run that steady-state hypothesis again once we've caused trouble, to see if we deviated. If it doesn't pass again, that means one of two things: either you have some questions to ask, or you found a potential weakness. At that stage, what you want to do is basically go into the report, see what happened, and make sense of it as a team. Now, you've got the method: once you've run the steady-state hypothesis, you run the method, and it's just a bunch of actions or probes.
F: Usually you've got one or two actions, because you want to make sense of what's happening, so you don't change too many things. I guess it's similar to what Gremlin is doing — you try not to create attacks that conflict with each other, otherwise it's probably much harder to actually make sense of what's happening. And then you've got rollbacks. We tend to call them remediations these days, because to us "rollback" is a strong promise that you can come back to your normal state, which is not always the case if you've really broken your system; but the idea is that sometimes you want to come back to, you know, the steady state. Now, with Kubernetes, usually the rollbacks are empty, because Kubernetes is meant to actually support and deal with failures automatically, so you don't actually do anything there, right?
F: So that's it — it's a JSON file that is declarative. What happens is you define your probes and actions, and they all have the same format. So let's pick that one: it's a provider, it's in Python, it takes that module and that function. In this case it doesn't actually have any parameters or arguments; in this other one you can actually specify the name or the label, you know, things like that. They are just functions, basically, in a Python module somewhere, but you declare them, and you can share them on GitHub. We wanted to have something in a file so that you can really use it inside your CI/CD pipeline, as usual — it's just another step. So, although I'm trying not to make it sound like a test, it only, you know, overlaps with that tooling in some fashion. And you can reference existing probes, so that you don't have to actually duplicate things.
F: Sometimes we do have pauses. I personally dislike that, but I couldn't see any other fashion, except if I was doing synchronization with the system telling me when it was done — so sometimes it's a bit freaky, that thing. So, you know, contributions are very welcome to actually improve the API, definitely. All right, let's try to show an experiment here; we're going to use a very stupid demo.
F: We've got this application, which is this one here — it does nothing but serve data that is pulled from a Postgres database — and what we want to see is, under some medium load, what happens if the database master actually dies. Behind the scenes, what we're using is Patroni from Zalando, which has a leader and followers for Postgres and which should switch from one to the other if the master dies. And that's what we want to prove, because we expect that if the master dies, we don't actually impact our users.
C
F
Sweet,
so
what
we
have
it's
exactly
the
same.
What
I
showed
you
before
we
select?
You
know
the
application
that
were
interested
in.
That's
we
check
that
the
body's
alive,
the
application
must
written.
You
know
it
does
respond,
and
if
that
does
happen,
if
that
doesn't
happen,
you
know
the
the
experiment
bales
immediately.
Otherwise
it
goes
to
the
method.
F: And here what we do is terminate the DB master. We don't actually have a function doing exactly that; what we do is terminate the pod — the pod that has that label — and luckily enough, we only have one; it's a demo. The "rand" here means nothing, because again we have only one pod that matches that label, but if you had many, it would actually, you know, pick one, per the documentation. And then we've got a bunch of probes. Now, you might wonder why I am, you know, going and fetching logs during the experiment — it doesn't actually do anything, but it's interesting when you do the analysis, because you can come back and look at whether you had the logs that you were looking for in your application, or in the various parts of the system. So, for your analysis, sometimes it's nice to actually go and fetch them as you run the experiment. And that's basically it. So, let's pray that this works — you just run "chaos run" with the experiment that you're going to run.
F
I don't actually see it, but there is a notification on the top right, because we send that to Slack saying it started. It basically doesn't do much: it runs things in the order it reads them from the file, and that's why it looks like tests. In this case it should fail, if I'm correct, because the application will collapse. There you go, so here's the reason it fails.
F
What's interesting here is that we see the first check did succeed, so the system was normal, so we went on and killed the DB master. But it failed when we ran the check again, and the reason is that in that specific case my code was not good enough to actually reconnect, if you will, to the new database master. So when I called the application, it failed, because the connection at that stage was stale.
F
F
There you go, it's right there, and now we're going to run what I did. But I don't have the setup right, you know, properly. Oops, it's not set up yet. So in that case, what you saw is the steady state failing your experiment initially, because the system was not normal. I went too fast and the previous system was not yet ready.
F
It is now, there you go. But what you'd probably do is look at that with kubectl or whatever and say, well, okay, I've produced the fix, rerun it, all those things that you would do with any sort of test, basically. And hopefully the fix does work, and you've proven you had a weakness: you found it, you fixed it, and you try it again. That's the before and after that you would want to see from a chaos engineering experiment.
F
Now, there you go: it shows that the steady state now is met. So that means our application is now able to actually sustain that kind of loss. If the master goes away and the connection is lost, we are able to sustain that error. That's the kind of failure handling that you would want to see. That's a basic one, and if I have time (I don't know, perhaps not) there is another one I would want
F
to show you. I'll stop now, but I don't think there is time. It would be: let's say you're using GKE or something like that, and you realize one of your nodes, the virtual machine, actually has, I don't know, a security issue or something. What you want to do is roll out a new node pool, a new set of machines with a fix, but you want to see whether or not switching from one node to another is going to impact your users.
F
Well, that's an experiment you can run with the toolkit. You begin to bring up a new node pool with new machines, drain the old one through Kubernetes, see the load spread from one side of the node pool to the other, and see if your application is actually impacted. That's the kind of thing you can do with the toolkit, because all we do is drive existing APIs; we don't try to create new sorts of tooling, because they already exist. And I invite questions like that. Right, that was my demo. Cool.
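A node-pool swap like this could be expressed as method actions in a Chaos Toolkit experiment. This is a rough sketch under several assumptions: a GKE cluster, the chaostoolkit-kubernetes extension's `drain_nodes` action, and Chaos Toolkit's process provider for shelling out to gcloud; the pool names, cluster name, and node labels are all placeholders.

```json
[
  {
    "type": "action",
    "name": "create-patched-node-pool",
    "provider": {
      "type": "process",
      "path": "gcloud",
      "arguments": "container node-pools create patched-pool --cluster my-cluster --num-nodes 3"
    }
  },
  {
    "type": "action",
    "name": "drain-old-nodes",
    "provider": {
      "type": "python",
      "module": "chaosk8s.node.actions",
      "func": "drain_nodes",
      "arguments": {
        "label_selector": "cloud.google.com/gke-nodepool=default-pool"
      }
    }
  }
]
```

With the same steady-state hypothesis as before wrapped around these actions, the experiment answers exactly the question posed: does the load moving to the new pool impact users?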
L
F
Yeah, it's a good question. In the steady state we use probes only; in rollbacks you can use actions. Basically, all you do is try to revert something. For example, if I had the time to show you the node pool one: I actually did create a node pool, which is just a bunch of virtual machines really, and when that runs I kill the node pool in the rollbacks. But in the example I showed you, because I'm using Kubernetes, it takes care of the rollback.
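In experiment terms, the rollbacks section holds actions (never probes) that undo what the method did. A sketch of how the node-pool rollback just described might look, assuming Chaos Toolkit's process provider and a GKE cluster; the pool and cluster names are placeholders. In the pod-kill demo this section can stay empty, since Kubernetes restores the pod by itself.

```json
{
  "rollbacks": [
    {
      "type": "action",
      "name": "delete-temporary-node-pool",
      "provider": {
        "type": "process",
        "path": "gcloud",
        "arguments": "container node-pools delete patched-pool --cluster my-cluster --quiet"
      }
    }
  ]
}
```

Rollbacks run after the method, whether or not the experiment deviated, which is why they should be idempotent reverts rather than new experiments of their own.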
L
A
C
G
D
B
D
A
F
A
All right, well, that's good for now. We have a couple minutes left, so I just want to be sensitive to people's time. Thanks everyone for showing up. I want to continue to do all the white paper and landscape work as quickly as possible; hopefully people can give their input on that while we build it out. I think other than that, that's it. Anyone else have anything to say? Otherwise we can thank our presenters and meet again in a couple weeks.
A
So, I love forcing functions, so picking a date and hammering towards it generally works well. I think it's going to take a lot of consensus building, but I would love to get something out, probably in a one to two month time frame. We're also going to have to give the designers on the CNCF side about two weeks to kind of make things pretty, and that's.