From YouTube: Chaos Engineering for cloud native systems
Description
Kubernetes Community Days Bengaluru'21
Tired of recurring production outages that nobody benefits from? You aren't alone! Introduced as a tool to test the resiliency of its infrastructure in 2011 by Netflix, Chaos Engineering is one of the top 5 technologies to watch out for in 2021 per CNCF. This talk covers all the important aspects of Chaos Engineering from a Cloud Native perspective & will focus on LitmusChaos, an open source framework helping orchestrate Chaos on Kubernetes. Towards better cementing of concepts, we shall also have a live demo of the tool in action.
Slides: https://drive.google.com/file/d/1gbFu9kGC-I8L8nLxF45DySur1mYRXC15/view?usp=sharing
A
Hello, everyone, and hope you are having a fabulous time at Kubernetes Community Days Bengaluru so far. Saranya and I are super stoked to be speaking with you all about chaos engineering for cloud native systems. But before we go ahead, it's only fair that we introduce ourselves to you. So Saranya, why don't you go first and introduce yourself?
B
A
Awesome. So hello, everyone. I am [inaudible], and as part of my day job I am a team lead with HSBC. And probably, from the slide that you're seeing on your screen right now, you've deduced that I'm an avid open source enthusiast.
A
I've contributed for the past one and a half years to the Kubernetes and LitmusChaos projects in different capacities, and I'm also one of the co-organizers of the CNCF student user group, which is owing to my passion project of getting more folks into open source and having them benefit from the community and the learning that we have here. So today, as I mentioned before, we are going to be speaking about chaos engineering for cloud native systems.
A
Now, before we get there, we obviously do need to understand how the situation was before chaos engineering came into the picture, which is where most of us are currently. Then we need to look at why there was a requirement for chaos engineering to come into the picture, what exactly chaos engineering is, how it can be extended to the cloud-native context, and some benefits. Thereafter, in the second segment of the presentation,
A
I shall hand it over to Saranya to speak about the various chaos tools and platforms. Since we both work on the LitmusChaos project, she'll be doing a quick deep dive into LitmusChaos as a product with a really cool demo. And towards the end, we'd be completely remiss if we did not lay out what we think is the future roadmap for chaos engineering as a discipline.
A
So, that being said, without further ado, let's dive right in. Before chaos engineering, which is where a lot of us are at even right now, even though chaos engineering is not a very new discipline, so to say. So before chaos engineering was implemented in a lot of organizations, and where we are currently, there are customer- and service-impacting outages. That is something that's there.
A
Irrespective of whether chaos engineering is there or not, to be really honest. But when those customer- and service-impacting outages used to occur, the mean time to resolve, or the mean time to recover from such an outage, used to be high, because more often than not during such an outage you required SMEs, and getting them on the call,
A
Or coordinating with them towards recovery, was somewhat of a struggle. Not because the processes weren't laid down or you did not have recovery documents in place, but because it simply wasn't, or isn't, a streamlined process altogether. And like the Shaolin monks say, the more you bleed in practice, the less you bleed when you actually have a problem. I'm paraphrasing here, but when an actual outage occurred, we weren't really prepared for it,
A
If chaos engineering was not something we considered implementing in our piece of infrastructure. And of course, after this outage occurred, there would be what we term blameless postmortems, but they really are,
A
Far from being blameless, really, because it's always about finding the root cause, but we tend to target the symptom instead of the actual cause. So these would eventually end up being the burden of your SREs or your infrastructure engineers, as opposed to diving deep inside your particular code base or the way your application is designed.
A
All of this information is gathered from your monitoring and observability infrastructure, and in the absence of a discipline that streamlines this process, that was really rudimentary. It's only in the recent past that monitoring, observability, distributed tracing, and logging have started gaining importance towards better recovery processes and towards ensuring that the applications we deliver are highly resilient and highly available.
A
So we can clearly see why there was a need for a discipline, because, although we knew what was required, we did not have all of it streamlined in a proper way. And chaos engineering, let's look at what it is. According to principlesofchaos.org, chaos engineering is basically just experimenting on your systems towards building confidence among your customers that those systems can withstand turbulent conditions in production as and when they occur.
A
How do we do that, and why is this really any different from testing? How we do that is through our experiments, which is something Saranya is going to demonstrate in the latter half of our presentation. And how is it any different from testing? That is a very good question, because it's something that I myself struggled with when I started off learning about chaos engineering. In testing, you are basically aware of what you are giving as an input and what you get as an output.
A
So if your system is subjected to particular conditions, it should give you either an A, B, or C result, and if it doesn't, it is failing the test. I mean, you get the point.
A
Testing is basically about the system's input being known as well as the output being known, correct? Experiments, on the other hand, don't really have a fixed output, because when you're experimenting, you are just subjecting the system to unstable conditions. You do not know what the outcome of that experiment is going to be. It could be anything, and that is what we are looking to learn more about. Chaos engineering is jumping into the unknown and taking a leap of faith towards understanding your systems better.
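The contrast between a test and an experiment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `flaky_service`, its 500 ms timeout, and the 99% steady-state threshold are all hypothetical and do not come from the talk or from any real tool. The test asserts a known output for a known input, while the experiment injects latency and simply measures what happens.

```python
import random


def flaky_service(injected_latency_ms: int, rng: random.Random) -> str:
    """Hypothetical service call that times out when total latency reaches 500 ms."""
    timeout_ms = 500                      # assumed timeout, for illustration only
    jitter_ms = rng.randint(0, 200)       # normal background latency
    return "ok" if injected_latency_ms + jitter_ms < timeout_ms else "timeout"


def unit_test() -> bool:
    # A test: known input (no injected latency), known expected output ("ok").
    return flaky_service(0, random.Random(0)) == "ok"


def chaos_experiment(injected_latency_ms: int, requests: int = 1000, seed: int = 42) -> float:
    # An experiment: inject a fault and observe the success rate.
    # The outcome is not asserted up front; it is what we want to learn.
    rng = random.Random(seed)
    ok = sum(flaky_service(injected_latency_ms, rng) == "ok" for _ in range(requests))
    return ok / requests


if __name__ == "__main__":
    print("test passed:", unit_test())
    for latency in (0, 400, 600):
        rate = chaos_experiment(latency)
        # Assumed steady-state hypothesis: at least 99% of requests succeed.
        print(f"latency={latency}ms success_rate={rate:.2f} steady_state_holds={rate >= 0.99}")
```

The design point is that the experiment returns a measurement rather than a pass or fail verdict; comparing it against a steady-state hypothesis is a separate step.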
A
Now, when we speak about chaos engineering, it's a common myth that a lot of us really believe that chaos engineering is only for cloud native systems. There's often a question thrown at me when I speak about chaos engineering as to whether chaos engineering is only cloud native. So the answer is no.
A
Chaos engineering is definitely not only cloud native because, as aforementioned, we have experiments done on various systems, and systems really do not have this distinction of judging amongst themselves whether they're cloud native or not. It's us humans who give them that context. So chaos engineering can be performed on cloud native and non-cloud native systems alike.
A
The only plus one here in terms of cloud native is that the availability of tooling is much greater compared to the others because, as a sector, IT is moving towards the cloud native side of things. So it's just generally evident that the chaos engineering tools being developed cater more towards that move, rather than being stuck to the way we used to do things. So you absolutely can have chaos engineering performed on
A
Your non-cloud native infrastructure, which is in your own data center. But the only thing is, you will have a lot of challenges because of the lack of a lot of associated monitoring and observability tooling. And obviously, if you are trying to do it by yourself, it's going to be a little more difficult, because with the available tooling you are leveraging a lot of associated monitoring and observability tooling. With your own customized scripts or your own customized tool,
A
You probably might not be able to leverage that. If you create something from scratch, it's a little more difficult to actually get anything related to chaos done. So it's not impossible, but yes, doing it by yourself is going to be a problem. And I'm sure Saranya is going to cover some of the available tools in the market, both cloud native and non-cloud native, open source and proprietary, in the section that she is going to take.
A
That being said, I've been preaching about chaos engineering, what it is about, and how it can be extended to the cloud native context. But really, when you're talking about something, you need to understand what the benefits are and how it impacts you as someone who's from the application development team or someone who designs applications.
A
So in that case, this slide is a very good overview of how it benefits you. The very first impact, and I'm going to go top to bottom, is that because of resilient systems and better processes, your systems aren't going to fail as frequently. They will fail, which is a given, but they are more resilient towards failure because you have experimented on them and have learned where their vulnerabilities lie.
A
You are aware of what could possibly go wrong, so you are in a better position to recover from such failures. And when you are in such a good position to recover from failures, you have better MTTR, which also means that customers have to spend less time being frustrated. Obviously, resilient systems directly correlate to having fewer outages, which is again pretty good for a customer, because nobody likes their apps down.
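For reference, MTTR (mean time to recover) is just the average of the recovery durations across incidents. A minimal sketch, assuming each incident is recorded as a hypothetical `(detected_at, recovered_at)` pair of timestamps; the sample data is invented for illustration:

```python
from datetime import datetime, timedelta


def mean_time_to_recover(incidents):
    """Average recovery duration over (detected_at, recovered_at) datetime pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    # Sum the individual outage durations, starting from a zero timedelta.
    total = sum((recovered - detected for detected, recovered in incidents), timedelta())
    return total / len(incidents)


if __name__ == "__main__":
    # Invented sample incidents: one 90-minute outage and one 30-minute outage.
    sample = [
        (datetime(2021, 6, 1, 10, 0), datetime(2021, 6, 1, 11, 30)),
        (datetime(2021, 6, 8, 14, 0), datetime(2021, 6, 8, 14, 30)),
    ]
    print(mean_time_to_recover(sample))  # 1:00:00
```

Tracking this number before and after introducing chaos experiments is one way to check whether recovery is actually getting faster.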
A
Right. Then comes the business. When we look at the business side of things, we are looking at better processes, and when we look at better and more efficient processes, we are definitely looking at efficient incident management. And when we are saving on so many overheads in terms of outages, coordination, and recovery times,
A
We are definitely aiding in the prevention of losses. Last but not least, technical staff are the most important in this ecosystem, because they are the ones who ensure that all this is working as per requirement. So when your technical staff are happy, rest assured that the above two layers are going to be happy as well, because when we look at chaos engineering being implemented in organizations,
A
We are looking at better and more reliable systems, because during these experiments you will discover what vulnerabilities are there within your systems and you're able to address them ahead of time. And when an actual outage occurs, you are going to be better prepared towards understanding how to recover from it, which directly means for the technical staff that they do not have to be present 24/7, in case you're a subject matter expert or even the on-call for that particular application.
A
So, that being said, I think it's time for me to hand over to Saranya to talk about the various tools available in the ecosystem and maybe walk us through a quick demo of the LitmusChaos project. Over to you, Saranya. I'll just stop my screen share.
B
B
B
Then, coming to Litmus: Litmus is an open source, cloud native chaos engineering toolkit which provides all those features that Divya has already talked about while explaining the major principles of cloud native chaos engineering. As you can see, it is open source, and it supports community collaboration. We have kept the API and lifecycle management open to maintain transparency, so that users can know what the APIs are and how they are being managed.
B
So we have GitOps enabled: Git being a primary tool for developers, integrating it with Litmus has proven to be very useful. Then we have open observability. As of now we are using Prometheus and Grafana, but since it is open, you can bring your own observability tool and add it to your solution. Other than that, Litmus is a CNCF sandbox project, and we have been getting a lot of traction lately.
B
So we have around 56 chaos experiments with more than 100k installations, and we believe that in the upcoming years Litmus is going to be the go-to tool or platform for practicing chaos engineering, which is something to look forward to. So let's get started with how we can start inducing chaos in our system. We can install Litmus using a Helm chart, and to induce chaos,
B
We can either use custom templates or induce our own custom workflows. We can source our chaos either from the public ChaosHub, which contains all the experiments that are publicly available, or we can have our own private ChaosHub, connect it, and use it within Litmus.
B
I'll just give a demo of how we can induce chaos. But let me tell you, there are actually two ways: we can induce it using the terminal, or we can use Litmus Portal, which is a web UI. It provides better visualizations of workflows, and it comes with a lot of other features such as teaming, analytics, etc. So I'll be giving a demo of Litmus Portal itself. Without much delay, let's get started.
B
So, as you can see, I am running it on my own minikube, but you can use any other platform such as kind, k3s, or any public cloud. I have applied the latest Litmus 2.0 beta YAML and, as is visible, the pods are up and running. This is the frontend pod, this is the server pod, and this is MongoDB.
B
And if I take the port number of the frontend service and the minikube IP, I am able to open Litmus Portal. So this is the login screen of Litmus Portal, and by default the username is admin and the password is litmus. Once I log in, a default project is created for me. This is the onboarding screen, where I'll be asked to change my password, so for now I'll be keeping it one two three.
B
So once I have onboarded, this is the main home page where, as you can see here, this is the default project that has been created for us. If you are part of any other projects, you should be able to see them here. These are the user-level details, and you can edit your profile by going to My Account under Settings. Then, if you are an admin, you should be able to create new users.
B
So once the users are created, I should be able to invite people to my own project. Currently we support just the viewer and editor roles. This is the teaming section, where you can change the project and where we get the list of members, the invitation status, any active invitations, and any active projects that you have been a part of. Other than that, we have the GitOps section.
B
So this is the analytics section, where you can see the analytics of the workflows that you have created, and this is the agent section. When you install it, a self agent gets created by default and the chaos operator gets installed here, and we should be able to run any chaos experiments on this target agent. I can actually connect multiple external agents by following this procedure, but for now I'll be using my own self agent.
B
So then comes the ChaosHub, as I have already mentioned. This contains all the experiments that are publicly available for use. Then comes the workflow page. A workflow is a single unit that contains several experiments, which can be sequenced in parallel, serially, or in a combination of both according to our own requirements. Here we can see all the details related to any workflow that has already been run or is currently running.
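The idea of a workflow as experiments sequenced serially, in parallel, or both can be sketched as follows. This is not how Litmus implements workflows; it is a small Python model under assumed names. `run_experiment` is a placeholder that injects nothing, and the experiment names are used purely as labels.

```python
from concurrent.futures import ThreadPoolExecutor


def run_experiment(name: str) -> str:
    # Placeholder: a real runner would inject a fault and collect a verdict.
    return f"{name}: completed"


def run_workflow(steps):
    """Run steps one after another; experiments inside a step run in parallel.

    `steps` is a list of lists of experiment names, for example
    [["a"], ["b", "c"]] runs "a" first, then "b" and "c" together.
    """
    results = []
    for step in steps:
        # One worker per experiment in the step (at least one worker is required).
        with ThreadPoolExecutor(max_workers=max(1, len(step))) as pool:
            # pool.map returns results in the same order as the input names.
            results.append(list(pool.map(run_experiment, step)))
    return results


if __name__ == "__main__":
    workflow = [["pod-delete"], ["pod-cpu-hog", "pod-network-latency"]]
    for step_result in run_workflow(workflow):
        print(step_result)
```

Modelling a workflow as a list of parallel groups keeps serial ordering explicit while still allowing a combination of both, which mirrors the drag-and-drop sequencing shown in the portal.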
B
And then this is the schedule page where, if I have scheduled any workflow, I should be able to see all the details regarding that. So I'll go ahead and schedule a workflow. First of all, I need to choose the target agent, which is the self agent for now. While choosing a workflow, I'll be given four options. The first one is to choose any predefined chaos workflow template. Second, if you have any existing workflow, you can clone it and use it.
B
Then you have the option to choose experiments from MyHubs. For now, since I have just the ChaosHub, I can just choose it and go ahead. The fourth option is: if you have any workflow YAML, you can just drag and drop it and get started. For now I'll be choosing podtato-head, the predefined workflow. Then, if I hit next, this is the page where you can change any descriptions according to your own requirements. And this is the actual visualization of a workflow:
B
How the experiments will take place. You can actually edit the sequence if you want by just dragging and dropping, but I'll keep it as it is for now. If I hit next, I can tune the experiments of my workflow according to my own requirements. Once this is done, I'll be asked to schedule the workflow. We have two options: either we can schedule it right now, or we can have a recurring schedule. I'll go with option one for now.
B
So this is the summary page of the workflow that we are going to schedule, and if you want to make any changes, you can just go back and make them. Once I hit the finish button, the workflow should get created. So now we are in the browse section, where we can get all the details. As you can see, it is currently running. If I click on this particular workflow, this is the graph view of how the workflow is running.
B
So, as you can see, it is running right now and the application installation is pending. If you click on each of the nodes, you can get all the details related to it. Here it is pending right now, and you can get the logs here. So this is the graph view, and we also have the table view, where you can get all the details related to the current workflow.
B
B
Yeah, so Litmus in a nutshell. We have the central Litmus Portal, and all we need to do is pick any predefined workflow or any custom workflow and run it against a target agent. As I have already mentioned, we can have multiple agents. Once the experiments are run, the chaos metrics are exported to Prometheus and the chaos analytics are pushed back to Litmus Portal.
B
So we can monitor the chaos results using Prometheus, with the help of chaos-interleaved analytics. Litmus can also be run on bare metal environments or on any public cloud such as AWS, Azure, GCP, etc.
B
B
So it is very likely that these tools will be integrated with machine learning or artificial intelligence tools, so that they can help automate the analysis activities and also help improve the monitoring of applications when the experiments are being run. And who knows, they might even be able to detect some threats or faults that have gone unnoticed before. So that's all from our side. We hope that you enjoyed this session. Thank you.