From YouTube: Using GitOps to Increase System Resiliency with LitmusChaos by Amit Das and Saranya Jena
Description
Chaos engineering has come a long way from its early days at Netflix. While companies are widely adopting Kubernetes for their microservice architecture solutions, for the much-required competitive edge, it is important to ensure the resiliency of these systems for ensuring reliable services for everyone. As a large number of systems are moving towards the Kubernetes-Native approach, an important problem arises, testing these applications the right way to prevent outages in production. In this talk, we will introduce LitmusChaos and how we can leverage GitOps to increase the system resiliency. Further, we'll talk about how GitOps can be used by engineers and SREs to automate Chaos in their applications.
B
Hey everyone, I'm Amit Kumar Das. I'm a senior software engineer at Harness and a core contributor to LitmusChaos. I've been contributing to the project for the past two years, and I'm very excited to be a part of KCD Chennai. That's pretty much about me, and I'm looking forward to it. Thank you.
A
Thank you, Amit. Here is today's agenda. First, we'll talk about chaos engineering and why it is required. Then we'll introduce LitmusChaos and talk about its core components and features. After that, Amit will take us through GitOps and give a small demo of how you can use GitOps to increase system resiliency with Litmus. So, without further ado, let's get started.
A
First of all, we need to know what resilience is. Resilience is basically a system's ability to sustain a fault and bring itself back up. For example, let's say a pod gets evicted from a node. What is its state? Is it healthy or not? Does it bring itself back up? If it does, then it is resilient, and that period, from going down to bringing itself back up, is the resilience.
A
The same is the case with nodes and memory leaks as well. Talking about downtime: downtime is expensive, not just in terms of money but in other aspects as well, such as customer confidence.
A
There's a loss of customer confidence, then damage to brand integrity, then loss of productivity and employee morale as well. These are some of the aspects, and considering all of them, we definitely want to avoid downtime at any cost. One way to do this is by adopting the practice of chaos engineering. Chaos engineering is the process of testing a distributed computing system by injecting faults intentionally.
A
The goal here is to identify the weaknesses in the application through controlled experiments, to check whether it can withstand unexpected situations or not. So how is it done? First of all, you have to identify the steady-state conditions. Steady-state conditions are the desired behavior of the application in a given scenario when it is healthy. So first you identify that. Then you introduce a fault in the application. Then you check.
A
If the steady-state conditions are met, then the application is resilient. If not, you can go and fix it, and if a similar case happens in production, you are already covered.
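The loop described here (verify the steady state, inject a fault, verify the steady state again) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_chaos_experiment` and `ToyService` are hypothetical names invented for this sketch and are not part of Litmus or Kubernetes, and the toy service recovers instantly rather than after a real restart.

```python
from typing import Callable


def run_chaos_experiment(
    steady_state_ok: Callable[[], bool],
    inject_fault: Callable[[], None],
) -> str:
    """Check the steady state, inject a fault, then check the steady state again."""
    if not steady_state_ok():
        # Do not start chaos on a system that is already unhealthy.
        return "aborted"
    inject_fault()
    return "resilient" if steady_state_ok() else "not resilient"


class ToyService:
    """Hypothetical stand-in for an application (assumption, not a real API)."""

    def __init__(self, self_healing: bool) -> None:
        self.self_healing = self_healing
        self.healthy = True

    def kill_pod(self) -> None:
        # In this toy model a self-healing service recovers immediately.
        self.healthy = self.self_healing

    def is_healthy(self) -> bool:
        return self.healthy


if __name__ == "__main__":
    for healing in (True, False):
        service = ToyService(self_healing=healing)
        print(healing, run_chaos_experiment(service.is_healthy, service.kill_pod))
```

In a real setup the steady-state check would typically poll the application for some time after the fault instead of checking once.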
Next, talking about the foundations of cloud-native chaos engineering and how it can be practiced effectively: the cloud-native definition itself includes some mandatory principles, such as requiring declarative configs and being scalable, flexible, and cross-cloud.
A
So these are some of the major principles which we have been following for the past few years. Cloud-native communities and technologies revolve around open source, and a chaos engineering framework being open source gives it the benefit of becoming better and gaining more and more features. The chaos experiments need to be very simple to use, highly flexible, and highly tunable, with little or no chance of false positives or false negatives.
A
Then, with more and more people getting involved in chaos engineering, changes happen very frequently as the requirements are altered. So it becomes very important for the chaos engineering framework to enable proper management of the chaos experiments, and that too in the Kubernetes way. Then, as you start practicing chaos and fixing the little issues that come up, more and more chaos scenarios come into the picture, and gradually the set becomes very large and comprehensive.
A
So these chaos scenarios need to be automated and triggered whenever changes are made either in the application or in the chaos experiments, and tools around GitOps are one way to achieve that. This is a section we'll be talking about in detail, along with a demo. Lastly, there's open observability, which is also one of the principles: introducing chaos engineering should not require any new observability system.
A
The existing ones should fit in perfectly. So that's that. With that, I would like to introduce LitmusChaos, which is an open-source, cloud-native chaos engineering framework. It also has cross-cloud support, it is currently a CNCF incubating project, and it has adoption across several organizations. Now, talking about the features that ChaosCenter provides, starting off with chaos workflows.
A
A chaos workflow is a collection of several experiments, which can be clubbed together either sequentially or in parallel, in any manner. These workflows can be created using custom templates that you upload, or you can create your own custom workflows from ChaosHub, which is the repository where all the chaos experiments are present. You can choose from there, or you can use some pre-created YAMLs that are also there.
A
You can choose from them as well. Then you can schedule your workflows either as recurring ones, the cron workflows, or as a single one-time workflow. Lastly, you can attach a priority to each of the experiments in a particular workflow, according to your own requirements. Next is workflow management with GitOps. This is the section we'll be talking about a bit later.
A
Then, Litmus allows you to use your own custom images from your own image server, which can be public or private. Once the chaos injection is done, you can measure and analyze the resilience score of each workflow. You can analyze how your application performed in that particular chaos workflow. So that's that.
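The talk does not spell out how the resilience score is computed. One plausible reading, given that each experiment in a workflow carries a priority or weight, is a weighted average of per-experiment success percentages. The sketch below assumes exactly that; the function name `resilience_score` and the input shape are hypothetical and chosen only for illustration.

```python
def resilience_score(results):
    """Weighted average of per-experiment success percentages.

    `results` is a list of (weight, success_percentage) pairs.
    Assumption: the score is a weighted average; this is a sketch,
    not a statement of how Litmus implements it.
    """
    total_weight = sum(weight for weight, _ in results)
    if total_weight == 0:
        raise ValueError("at least one experiment with a non-zero weight is required")
    weighted_sum = sum(weight * success for weight, success in results)
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Two equally weighted experiments: one fully passed, one half passed.
    print(resilience_score([(10, 100.0), (10, 50.0)]))
```

With equal weights this reduces to a plain average; raising the weight of a critical experiment makes its failure pull the score down more.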
Litmus also supports multi-tenancy, which means you can create your own team.
A
You can add or invite other users to your team with viewer or editor permissions. It has fine-grained role-based access controls, which give the necessary privileges to the users. Scope support is also there: you can install it in namespace scope or cluster-wide scope. For authentication, you can choose local authentication or OAuth. So that's that. Coming to monitoring and observability: you can connect your own data source and monitor the workflows.
A
There are graphs where you can visualize the workflow run statistics or the schedule statistics. Once the workflows are running or have completed execution, you can also compare how two or more workflows performed. In case you do not like the interleaved dashboard that is present, you can upload your own dashboards from those available in the community. You can edit them and tune the dashboards according to your own requirements.
A
And lastly, you can monitor the chaos in real time with the interleaved events and metrics from the Prometheus data source.
A
Then, with LitmusChaos you can not only target Kubernetes applications, but you can also inject chaos on infrastructure resources, such as bare metal or a virtual machine, as well.
A
Lastly, GitOps for chaos. It integrates with any Git-based source control manager to provide a single source of truth, provided that you have enabled GitOps. Once GitOps is enabled, it kind of switches off MongoDB as the database for this data, and Git then acts as the single source of truth.
A
This is also bi-directional in nature. All the workflows are stored in the Git source, so if any change happens in either ChaosCenter or the Git source, both of them will automatically be in sync. It also provides an event tracker service as a microservice, which launches the subscribed workflows automatically.
A
If there's any change in the application, such as an upgrade, it automatically launches the chaos workflow. So that's that. Now Amit will be talking about GitOps in more detail and giving a demo. Over to you, Amit. Thank you.
B
It follows the principle of infrastructure as code, where the managing and provisioning of infrastructure is done through code rather than manual processes. Now, moving on to the main question: why do we need chaos engineering with GitOps? Chaos engineering with GitOps enables a vast scope of automation with CI/CD pipelines. Currently, chaos engineering is performed in a closed environment or in a pre-production stage, but what we...
B
The third point is better security. Git is a very secure platform because it is very strong with its cryptography, and the ability to sign your changes provides ownership of the change to the source code. It also improves auditing. Since GitOps uses Git, we can keep track of the audit logs and know about any change that is going into the Git repository with the...
B
So it improves auditing as well. Now, moving on to the demo: I have set up the Litmus ChaosCenter. For this demo,
B
I have installed ChaosCenter on GKE, and along with it I'll be using two cloud-native applications: the Bank of Anthos application and the Online Boutique application. Bank of Anthos is a banking application where we can perform operations like sending a payment or making a deposit. Similarly, Online Boutique is an e-commerce application.
B
You can see a lot of products listed here. We have a catalog, we have functionality to change the pricing according to different currencies, and we have a cart option. We'll be performing some chaos engineering on these two microservice applications, and for this I'll be using ChaosCenter. Enabling the GitOps functionality of ChaosCenter is very simple to do.
B
In the settings tab, we have a tab named GitOps. Simply select the Git repository option, and I will provide a Git repository. Moving here: this is an empty repository which I have created for this demo, and to connect this Git repository I'll use the repository link.
B
Next is the branch where I will be pushing all my changes, which is the main branch. We can use one of two authentication methods, access token or SSH. I have my access token with me, so I'll be using it.
B
I'll delete my access token later. I'll just click Connect, and it will take a few seconds. So we have successfully enabled GitOps for our project. To verify this, we can go to the Git repository again, and if I refresh it, I should see a litmus directory being created. The directory structure shows me the project ID here: this 205ed is actually my project ID. We can also verify it from here.
B
The project ID is given here. So we have successfully configured GitOps within our application, and now we'll start to do some chaos engineering. Let's get started with the Bank of Anthos application.
B
I have deployed this application along with all its services in the namespace called bank, and here we can see a lot of services, like balance reader, contacts, load generator, and transaction history. What I'll try to do is delete this pod, the transaction history pod, which shows all the transaction history within this application. So let's get started with it. I'll schedule a workflow: I'll click on the self agent, and here we have four options to create a chaos workflow.
B
We have the option to run a predefined workflow, or we can clone an existing chaos workflow, or we can use ChaosHub, which is a marketplace of all the chaos experiments, and we can also import a workflow manifest YAML. For now I'll just use ChaosHub. I'll click Next and provide a name here, delete transaction pod. I'll click Next, and now I will add the pod-delete experiment. Here it is. To target the pod,
B
I have to select the namespace, which is this bank namespace, and we have the transaction history label here, so I'll select this one. For the time being, I'll not add any probes; I'll just continue to tune the experiment. Here I can provide different environment variables to my experiment. For this experiment, I will set the total chaos duration to 60 and the chaos interval to 30 seconds.
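To make the two tunables concrete, here is a small Python sketch that builds the environment-variable list for the experiment and gives a rough estimate of how many deletion rounds fit into the chaos window. The variable names `TOTAL_CHAOS_DURATION` and `CHAOS_INTERVAL` follow the pod-delete experiment's tunables; the helper functions and the "duration divided by interval" estimate are assumptions made for illustration, and the real number of deletions depends on how the experiment is implemented.

```python
def pod_delete_env(total_duration_s, interval_s):
    """Build the env list for a pod-delete style experiment (values as strings)."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    if total_duration_s < interval_s:
        raise ValueError("total duration must be at least one interval")
    return [
        {"name": "TOTAL_CHAOS_DURATION", "value": str(total_duration_s)},
        {"name": "CHAOS_INTERVAL", "value": str(interval_s)},
    ]


def estimated_rounds(total_duration_s, interval_s):
    """Rough estimate (assumption): one deletion round per full interval."""
    return total_duration_s // interval_s


if __name__ == "__main__":
    # Values used in the demo: 60 seconds of chaos, one round every 30 seconds.
    print(pod_delete_env(60, 30))
    print(estimated_rounds(60, 30))
```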
B
I'll click Next, and here I can select the weights of the experiment. I'll select the Schedule Now option and verify all my changes. It's the delete transaction pod workflow, and I'll check that the labels are correct: the namespace is bank and the label is transaction history. I'll just finish my changes here. We can see that the workflow has started, and if I click here I'll get an Argo graph which shows the live changes taking place in the workflow.
B
It will take a few seconds or a few minutes for the workflow to complete, and meanwhile we can observe the chaos which will be happening in this Bank of Anthos application. We can see that the pod-delete experiment pods have just started up, and if we go to the litmus namespace, I can confirm that the pod-delete runner has just started and the transaction history pod is actually terminating.
B
If I refresh this page, I should see that this service is under chaos and we don't have any data related to the transaction history. Once the transaction history pod is back in its running state, we should see the details over here. Let me refresh this page again: it's still under chaos. Once the workflow is finished and this service is in the running state, we should get the details.
B
We can see that the workflow has completed and the pod-delete experiment has also run successfully. So we'll go back to the Bank of Anthos application and just refresh, and we can see that the transaction history is now available. To cross-verify this, we can also see that the transaction history pod is now back and running. So we have induced chaos on this service, the transaction history service of the Bank of Anthos application.
B
Currently, in this manifest, we can see that the chaos duration was 60 and the chaos interval was 30 seconds. But what if I need to change these environment variables? Instead of creating a completely new workflow, what I can do is go to my Git repository and simply update these values in my workflow manifest.
B
So let me go here and change the environment variables. I'll change the chaos duration to 100 and the chaos interval to 50 seconds, and now I'll commit these changes. So in our Git repository we have made the required changes, and it will take a few minutes to get synced with ChaosCenter. Let's wait for a few minutes.
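The same edit could be scripted before committing, which is useful when the change comes from a pipeline rather than the Git web editor. The sketch below is a minimal, text-based illustration: the function `set_env_value` and the `MANIFEST` fragment are hypothetical, and it assumes each variable appears as a `- name:` line immediately followed by a `value:` line, which may not match every workflow manifest.

```python
MANIFEST = """\
env:
  - name: TOTAL_CHAOS_DURATION
    value: "60"
  - name: CHAOS_INTERVAL
    value: "30"
"""


def set_env_value(manifest_text, name, value):
    """Rewrite the `value:` line that directly follows `- name: <name>`."""
    lines = manifest_text.splitlines()
    for index, line in enumerate(lines[:-1]):
        if line.strip() == f"- name: {name}":
            following = lines[index + 1]
            if following.strip().startswith("value:"):
                # Keep the original indentation of the value line.
                indent = following[: len(following) - len(following.lstrip())]
                lines[index + 1] = f'{indent}value: "{value}"'
                return "\n".join(lines)
    raise KeyError(f"environment variable {name!r} not found")


if __name__ == "__main__":
    updated = set_env_value(MANIFEST, "TOTAL_CHAOS_DURATION", "100")
    updated = set_env_value(updated, "CHAOS_INTERVAL", "50")
    print(updated)
```

A YAML parser would be more robust than line matching, but the line-based version keeps the sketch free of third-party dependencies.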
B
If I refresh the page and load the manifest again, I can see that previously the values were 60 and 30 seconds, but since I've changed the environment variables, the updated values can be seen here. These are the values which I provided in my Git repository: the total chaos duration is 100 and the chaos interval is 50. These changes are now available in my ChaosCenter.
B
To run this workflow, I just have to do a quick rerun, and the same workflow will get started with the updated values. We can cross-verify it from our manifest: we can see that the chaos duration value is 100 and the chaos interval value is 50. So all the changes from my Git repository, as well as from my ChaosCenter, are synced together.
B
So I have this workflow, the delete catalog workflow, which will target the Online Boutique shop. Here we can see that the namespace is shop and the app label is product catalog service. Instead of configuring it from ChaosCenter itself, what I'll do is add a new file over here: I'll upload a file, drag and drop this file here, and provide a workflow name for it.
B
The delete catalog service. I'll raise a PR to the main branch, so "Add files via upload" for the PR, and I'll create a pull request. The pull request has been successfully created, and once I merge these changes into my main branch, we should see a schedule getting created over here, as well as the workflow getting started, since it's a one-time workflow. So let me merge this pull request.
B
Now we can see that since the PR got merged and the changes are now in the main branch, it has triggered the GitOps operations. We can see that a schedule named delete catalog service from PR, which is the same as the file name over here, has been created, and similarly the workflow run has also started. So let me click here and see all the related information.
B
We can see that it's currently installing the chaos experiments, and in a few minutes we should see the catalog service going down. Let me just show all the services over here. This is the shop namespace, where I have all the services running, like the cart service, the currency service, the frontend, the email service, the payment service, and the catalog service. With the current experiment, we'll be terminating this catalog service, and we can see that its status is Terminating.
B
If I refresh this, I see that something has failed, with some details below for debugging, and the service is down. If I refresh this again, I think it should be down, but I guess it's back to its original state, since the chaos injection time was pretty low in this case. So we can see that we have injected chaos from the Git repository, and it was visible in the application as well.
B
So these were a few operations which can be performed from ChaosCenter. The major scope for GitOps with LitmusChaos is to add this GitOps functionality to your CI/CD pipelines, or you can use it in your GitHub Actions to run chaos within your CI/CD stage.