Description
Blog: https://everyonecancontribute.com/post/2021-06-23-cafe-35-litmus-chaos-engineering-kubernetes/
Litmus: https://litmuschaos.io/
Twitter thread: https://twitter.com/dnsmichi/status/1407731465509654530
A: We're now live on YouTube for our 35th Everyone Can Contribute cafe, and today we're not breaking something by accident; we're breaking things on purpose. We're doing something around chaos engineering, and I'm super happy that we have our friends from Litmus here, showcasing, or rather teaching us, what is new and what is coming in chaos engineering, what it is, why we want to use it, and many more insights.
B: Sure, Michael, thank you so much for introducing us, and thank you so much for inviting us to this Everyone Can Contribute show. We are really, really glad to be a part of it. With the cloud native world having come so far, and chaos engineering as a technology having come so far, it's an honor to present in front of this audience.
B: First, I'd like to talk a little bit about how this journey has come together and how the community has responded to it. Back when I started last year, the number of users of cloud native chaos engineering was pretty small; we had a community of 40 to 50 members. Chaos engineering adoption in general was picking up, but the cloud native chaos engineering segment still had few users and wasn't getting that much traction.
B: But a few days ago I saw a hundred thousand installations, and a lot of folks taking keen interest in the community, in the events that are happening, and in the support from CNCF, KubeCon, and all these things. Today I can claim that chaos engineering as a technology has come really far and has become an interest of many, many people, not just SREs. Initially we thought it was an interest of just the SRE persona, or the testing side.
B: That's right, thanks. So we'll be starting off. As you can see, the presentation at hand is "Introducing LitmusChaos: Cloud Native Chaos Engineering", and we are glad to be a part of the GitLab Everyone Can Contribute cafe. I hope the folks who are joining in will have a good run through this presentation and the demos that we have. Starting off, I'll just introduce myself.
B: My name is Prithviraj, and I'm working as a community manager at ChaosNative, which just spun off from a company called MayaData. I had been at MayaData since last year, working as a community manager for the LitmusChaos project, which is, as you know, a CNCF sandbox project. Alongside that I work as an organizer or co-organizer of meetups and events; as you can see, Chaos Carnival is an event based on chaos engineering.
B: Chaos engineering has grown so much that there's an event around it, and a lot of folks, from Red Hat to Intuit, from different companies around the world, came in and talked about various chaos engineering use cases. Then we have our chaos engineering meetups, which happen every last Saturday of the month, so folks who want to join in can come and join the chaos engineering meetups.
B: We pretty much have an informal session, or some agenda items, then pick up Q&As and try to bring this journey of chaos engineering adoption further. The next community meetup, the Bangalore edition, is happening this Saturday, in case anyone wants to join. Kubernetes, again, is seeing adoption at a level that even chaos engineering hasn't reached yet: where Kubernetes has crossed the chasm into major market adoption, chaos engineering is still coming up.
B
So
moving
on
the
agenda
that
we
have
in
hand
today
I
mean
there
are
a
lot
of
things
we'll
be
talking
about,
but
starting
off
we'll
be
introducing
chaos.
You
you
can
ask
your
questions.
You
can
put
your
questions
in
the
stream
chat
as
well,
and
then
we'll
move
on
to
cloud
native
chaos,
engineering
and
then
introduce
litmus
to
you.
B
So
what
is
chaos?
Engineering?
That's
that's
what
you
all
wanted
to
learn.
What
is
chaos
and
how
did
it
come
up
so
I'll
just
start
with
the
story
and
then
perhaps
I'll
move
on
to
what
chaos
engineering
is
so
so
in
india,
or
you
know
around
the
world,
there's
there's
this
black
friday
sale
around
the
world
or
in
india
we
have
great
indian
festivals
or
a
big
billion
day
sales.
B
These
are
these
are
huge
sales
organized
by
flipkart
or
amazon,
these
these
companies
and
what
we
noticed
was
when,
when
back
when,
I
was
not
introduced
to
chaos.
What
chaos
is,
I
used
to
wonder
how
these
spikes
are
actually
causing
an
outage.
What
exactly
is
an
outage,
and
why
are
resiliency
goals?
Not
being
met
in
spite
of
so
much
being
spent
on
quality
and
assurance
penetration,
so
white
box,
black
box,
so
many
types
of
testing
is
around
there.
But
still,
why
is
this
one
factor
which
is
not
available
to
meet
these
resiliency
goals?
B: We'll talk about it. But moving on: let's not just talk about the spike in users, but about what it actually causes, which is downtime. Your system goes down, and there's not just a loss of time but a loss of money as well. Billions and billions of dollars are lost by a huge enterprise if the system goes down due to spiking users or due to some outage that happens.
B: With these systems, it was pretty essential to actually understand how they are functioning. So downtimes are expensive, and that is where chaos engineering came into play. What is chaos engineering? It's nothing but deliberately inducing an outage in the system, in a controlled way, so that you understand what a future vulnerability or outage could be: how exactly your system reacts when there is an actual future outage in production. And when the system goes into production, that is where the DevOps feedback loop is activated.
B: That's something we learned through the chaos-first principle. A lot of people ask: what is the chaos-first principle? It's nothing but this: why wait for an outage? Why not test first? Test your systems first, before moving into production. That is how chaos engineering came into play.
B
So
what
has
been
the
state
of
chaos?
Engineering
till
as
of
now,
as
I
mean
there
are
standard
practices
that
exist.
It
started
all
all
of
this
started
with
chaos
monkey,
but
it's
it's
still
a
limit.
Not
everyone
is
practicing
chaos.
Not
everyone
has
moved
on
to
practice,
chaos
and
what
exactly
chaos
is
doing
is
still
the
the
amount
of
unawareness
that
is
out.
There
is
something
which
is,
I
mean
mind-boggling
for
me,
but
it's
very
important
for
you
to
understand
how
how
to
start
chaos,
engineering
practices-
or
this
is
very
important.
B
This
is,
this
should
not
just
be
limited
to
experts
or
large
deployments,
but
this
should
come
as
a
practice
for
each
and
every
one
whose
looking
forward
to
resiliency
so
as
of
now
those
who
have
burned
their
hands,
those
who
have
gone
through
an
outage
for
them.
Chaos,
engineering
has
been
a
resolution
and
you
can
see
a
lot
of
folks
from
amazon
to
netflix
to
disney
hbo.
These
these
folks
have
started
working
on
chaos
already.
There
are
some
chaos
engineering
stories
already
out
there
from
them.
B
So
how's
it
done.
Typically,
usually
it's
it's
done
in
some
these
ways,
but
this
is
not
what
it
should
be
limited
to.
This
is
not
what
it
should
be.
You
know
just
should
be
the
ways
to
follow
chaos.
I
mean
they
are
done
through
chaos
engineering
game
days.
Usually
there
is
a
game
day
where
one
of
the
applications
is.
I
mean
chaos.
Experiments
are
run
in
in
one
of
the
applications
and
a
few
kiosk
tests
are
carried
out
to
see
how
how
the
application
reacts
it
can
be
specific
to
to
the
company.
B
It
can
be
specific
to
the
use
case,
and
basically
this
is
how
usually
the
chaos
engineering
experiments
are
done,
and
then
it's
rarely
integrated
to
the
ci
or
the
cd
of
your
system.
It's
it's.
I
mean
everyone
hasn't
even
thought
about
chaos
in
your
ci
cd.
Why
not?
I
mean
case?
Chaos
can
be
anywhere.
It
can
be
in
your
ci
cd
in
in
your
developers,
environment.
It
can
be
in
your
pre-staging
staging
production
environment,
so
kiosk
engineering
has
to
come
in
everywhere
in
some
form
or
the
other.
B
As
I
said
before,
the
start
of
the
presentation
that
only
sres
are
focusing
on
chaos.
As
of
now
the
developer.
Persona
is
not
still
not
adopted.
Chaos
in
a
huge
way.
Everyone
has
not
adopted
or
engaged
in
chaos,
been
practicing
chaos,
engineering,
but
slowly
slowly.
It's
it's
coming
up
and
I
mean
there's
a
manual
planning
and
execution
that
is
going
on,
but
you
have
to
understand
that
chaos.
Engineering
is
also
a
part
of
automation.
It's
it's.
I
mean
with
systems
getting.
B
I
think
you
need
you'll
be
able
to
run
automated
chaos
as
well.
That's
that's
the
goal
eventually,
so
the
manual
planning
and
execution
is
good,
but
slowly
you
need
to
move
on
towards
the
automated
way
of
doing
it.
Where
you
can
schedule
them,
you
can
create
workflows.
You
can
see
how
various
chaos
tests
together
function
and
observability
is
key.
I
mean
you
have
to
see
what
is
happening
to
your
system
right.
It's
it's
not
a
commodity.
It
is
actually
every
enterprise
that
everyone
has
to
see.
What
is
exactly
what
is
going
down?
B
What
is
going
up
and
how
is
your
system
being
affected
when
a
chaos
test
is
induced?
And,
lastly,
I
mean
there's
a
lot
of
things
to
see
your
results
and
how
they
increase
your
reliability.
I
mean
there's
a
custom
measurement
process
to
managing
all
these
things,
but
the
typical
way
needs
to
change.
That
is
what
the
agenda
is
of
talking
about
all
these
things.
I
mean
that
these
typical
practices
are
good.
I
think
the
adoption
which
has
come
in
has
been
good,
but
this
typical
way
has
to
change.
B
I
mean
more
and
more
tests
need
to
be
curated
and
you
need
to
create
more
and
more
use
cases.
I
mean
write
your
own
experiments,
think
about
the
system
and
and
it
should
come
out
in
the
community
as
well.
I
think
this
event
talks
about
that.
I
mean
such
such
streams
and
such
events
are
very
important.
So
what
are
the
benefits?
What
what
exactly
are
the
benefits
of
chaos?
Engineering,
I
mean
the
first
and
foremost.
Is
you
run
your
services
without
an
outage
chaos
tests
help
you
and
you
you
test
repeatedly.
B
It
is
a
process,
it's
just
doesn't
happen
once
you
you
test
repeatedly.
What
happens?
Is
you
need
to
understand
the
steady
state
of
your
system?
What
exactly
is
the
steady
state
and
how?
How
does
it
function
and
then
you
need
to
run
your
services
without
an
outage.
You
need
to
actually
take
a
look
at
repeated
testing,
so
I'll
just
perhaps
take
you
through
just
just
give
me
a
sec
I'll
I'll,
take
you
through
something
which
which
shows
how
how
it
is
tested.
Basically,
a
workflow.
B: ...and running alongside that, it can be MongoDB; in the cloud native world you run all these applications, like Kafka or TiKV, and then come the services you know: CoreDNS, Prometheus to get your metrics, or storage, where OpenEBS is one example. Then there are the Kubernetes services in an application, and then come the platform services. So it's like a pyramid, right?
B
It's
not
just
your
application,
but
there
are
various
layers
which
are
also
functioning
alongside
that.
So
what
is
very
important
is
that
each
layer
should
you
know
the
resiliency
of
each
layer
should
be
maintained.
Each
layer
should
be
tested
so
that
all
the
components
and
infrastructure
are
strong.
All
the
components
and
infrastructure
become
reliable
because
resiliency
doesn't
just
depend
on
one,
but
it
depends
on
all
the
components
and
how
is
it
done
so?
Basically,
if
you
can
see,
I
I
mean
you,
you.
B
The
steady
state
you
identify
what
is
the
steady
state
of
your
system?
I
think
I
mean
this
diagram
is
going
here.
I
don't
know
why
and
then
you
introduce
a
fault,
you
introduce
one
test.
It
can
be
anything
I
mean,
for
example,
just
giving
an
example
in
a
kubernetes
application.
It
can
be
a
part
delete,
so
you
introduce
a
port
delete
experiment
and
then
you
see
if
the
steady
state
conditions
are
required
or
not.
These
are
part
of
the
principles.
B
The
principles
of
chaos
tell
you
that
first,
you
have
to
identify
what
are
the
steady
state
conditions
of
your
system
and
if
the
steady
state
conditions
are
regained,
then
yes,
your
system
is
resilient,
but
you
go
on
testing.
It's
not
it's
not
a
single
step
process.
Then
you
inject
a
new
fault.
It
can
be
upon
cpu
hog.
B
Just
another
example
that
if
a
body
elite
experiment
works,
then
you
induce
another
fault
to
identify
the
steady
state
conditions
and
know
if
if
your
system
is
not
resilient,
then
there's
a
weakness
found
you
fix
it
and
then
you
again
introduce
a
fault,
and
this
is
a
process
which
continues
going
on.
This
is
a
repeated
step
and
because
you
never
know
when
in
an
outage
occur,
so
this
this
is
how
a
chaos
engineering-
this
is
basically
a
diagrammatic
representation
of
the
principles
or
how
your
system
gets
resilient,
with
chaos
being
induced
into
your
system.
B: Service-level agreements and all these things are very, very important for running your services to meet your business goals or your individual goals. We just had external confirmation of how important SLOs have become and that everyone is thinking about them. So chaos engineering and SLOs go hand in hand: you're able to meet your business-level SLAs and SLOs by running these chaos tests. And then scalability, as I talked about: your services are scaled...
B
According
to
these
tests,
I
mean
these
tests
help
you
understand
the
resiliency,
and
that
is
how
you
scale
your
services
and
earn
demand
and
then
upgrade
them.
According
to
your
requirement,
however,
you
want
add,
adding
more,
I
mean
systems
to
it
and
doing
it
without
an
outage.
That
is
what
the
benefit
of
chaos
is
and
why
we
talk
about
it.
As
of
now
it's
I
mean
there
are
a
lot
of
projects,
as
you
can
see
these.
B
These
are
the
projects
out
there,
there's
kremlin
there's
vmware
manual
and
then
some
open
source
projects
as
well
like
yours,
mesh
kiosk,
plate,
cube
invaders,
chaos
cube
and
then
also
on
the
enterprise
side
of
things.
Also,
there
it's
a
trending
technology.
Cncf
has
called
it
one
of
the
top
five
technologies
to
look
forward
to
in
2021
and
with
all
these
projects
coming
out,
I
think
the
the
options
are
pretty.
B: The chaos engineering ecosystem is developing fully. So why exactly has chaos engineering been given so much focus? Because reliability is a budding challenge. And where does chaos engineering come into play in these container-level challenges? I think all of them have some role for chaos to play. Complexity, for one...
B
Of
course,
chaos
engineering
helps
coming
to
understand
where
your
system
is
getting
complex,
then
security
challenges
again.
There
are
32
security
challenges.
If
you
talk
about
reliability
there
again,
12
are
reliability,
container
level
challenges,
so
I
think
and
cultural
changes.
I
think
one
thing
is
we
just
conducted
a
survey
recently,
and
I
mean
people
talked
about
a
lot
of
challenges
they
face,
and
that
was
all
part
of
cultural
challenges.
Someone
doesn't
have
observability
in
place
that
its
chaos
is
not
in
their
road
map
or
their
systems
are
not
ready.
B: So what is cloud native chaos? Karthik will be explaining a lot more about it with Litmus, but I'm just going to point out a few principles that, in our belief, are part of cloud native chaos engineering. According to us there are a lot of principles of chaos, and the majority of them are laid out at a website called principlesofchaos.org; I'll forward the link here as well. As you can see, the principles of chaos...
B
Engineering
are
already
out
there
and
it's
it's
an
open
source
repo.
I
think
he
has
put
up
the
principles
and
these
are
the
major
principles.
You
build
a
hypothesis
around
your
steady
state,
as
I
mentioned,
and
then
you
vary.
Your
real
world
events
run
experiments
in
production
and
then
automate
them.
So
that
is
how
you
it
continues.
And
lastly,
you
minimize
your
blast
radius.
While
you
continue
experimenting
in
production.
B: While we move ahead with the demo and the presentation, we'll go in depth into all these principles of cloud native chaos engineering, how they build its foundation and how they came up, and that will be explained by Karthik; these are mostly the principles which were followed in Litmus as well. So, moving on to what Litmus is: this was the best time to introduce it. Litmus is nothing but an open source toolset, part of the CNCF sandbox, which is used to practice...
B
These
chaos
engineering
practices
in
a
cloud
native
environment
just
not
limited
to
kubernetes
by
the
way
in
case
you're
thinking,
it's
just
limited
to
communities.
It's
it's
also,
I
think
I
mean
the
the
with
the
the
community
demands
coming
in.
It's
it's
more
than
kubernetes
for
aws,
or
I
mean
azure
and
all
the
non
kts
scenarios
as
well.
This
comes
into
play
in
a
very
huge
way,
with
various
experiments
and
obviously,
while
I
talk
about
the
features
it
provides,
I
think
you'll
get
some
more
idea
on
it.
B: As of now there are 58 chaos experiments; you can find them on the ChaosHub, and Karthik will take you through the ChaosHub. These are a few stats. As I mentioned, it helps you identify weaknesses and potential outages by inducing chaos in a controlled way, and Litmus is the early leader, given the amount of adoption we have seen and the number of use cases that have come up with Litmus.
B: We founded it in late 2017, early 2018, while testing another project we were working on: OpenEBS, a project for cloud native, container-attached storage. We realized we needed a toolset to actually validate our resiliency goals; we needed chaos testing back then, and we started writing it out as just a chaos testing tool.
A: Can you maybe share the URL to the Litmus SDK? Is it the litmus-go repository on GitHub?
B: Thanks a lot, thanks a lot, everyone. I think Karthik can share it.
E: All right, thanks. Thanks, Michael. Let me go ahead and share my screen.
E: Yeah, let me introduce myself: I'm Karthik, one of the maintainers of the Litmus project, and I've been having a blast trying to contribute to this project and maintain it.
E
So
what
we'll
do
as
part
of
this
segment
of
this
session
is
we'll
try
to
go
through
the
basic
architecture
of
fitness
and
the
written
version
that
is
being
used
by
folks.
All
over
is
one
dot
x
that
is
1.13.6.
That's
the
version
that's
being
used,
there's
also
some
efforts
going
on
to
come
up
with
litmus
2.0.
E: How do you expect your application or infrastructure to behave when it is in its optimal operational state? Then you inject a fault, right, and you check whether your steady-state hypothesis is met. The steady-state hypothesis you're checking can basically be anything; it could be a lot of parameters.
E: If not, then you've found a weakness, so you go back to the drawing board and fix the business application, or you might fix something in your deployment environment, and you essentially repeat the experiment, see whether things are as per expectation, and then move on to the next fault, the next resiliency test or experiment, as you would call it. So this is the typical flow. Now, why did we do it the Kubernetes way?
E: You can implicitly read that as the community's way of doing things, because Kubernetes is the predominant platform driving cloud native innovation today. Everything on Kubernetes is declarative...
E: Everything is declarative; it is basically YAML, and it is reconciled. And when you come to resilience checks, resilience testing, or chaos intent, you would want that also to conform, to adhere, to this user experience that people are having with Kubernetes. We talked about how Litmus was brought up for the resilience-testing needs of OpenEBS, and OpenEBS, being another cloud native, container-attached storage solution, offered everything in terms of Kubernetes resources.
E
So
when
we
were
trying
to
test
open
ebs-
and
we
were
trying
to
get
the
users
of
open
eb
is
to
test
their
environments
using
weakness,
we
were
trying
to
give
them
that
homogeneous
experience
and
trying
to
put
chaos
intent
in
a
decorative
way.
So
apart
from
this
huge
dependency
tree,
that
you
saw,
the
pyramid
that
was
explaining,
where
in
case
of
kubernetes
and
dense
environment,
contains
your
platform,
services
and
kubernetes.
On
top
of
that,
and
then
you
have
all
your
tooling
from
the
cloud
native
ecosystem
that
you
so
dependent
on
from
your
services
successfully.
E: Then you have your application dependencies, database, storage, etc., and then you have your middleware and application services, and the front-end, user-facing services. There are so many points of failure, so that is one reason for doing chaos. There are more good reasons for doing chaos in Kubernetes environments, better done in pre-production environments, and when you do that, you would like to do it in a cloud native way, using a declarative approach. I have mapped out a couple of diagrams here.
E: So there needs to be a way to define your steady-state hypothesis, or application validation, as part of your experiment in a declarative way, not just the fault. Then you put all that information inside a resource which you can consume in an easy way, in another CR, and you basically go ahead and generate useful reports. That's what we try to do with Litmus, and at the heart of Litmus are some chaos resources.
E
We
will
take
a
look
at
them
as
we
do
our
demonstration
in
a
second
chaos.
Experiment
is
a
template.
It
is
pre-built
and
it
contains
granular
definition
of
your.
E: ...chaos intent, and the experiments run as Kubernetes jobs, like I was saying. All the Kubernetes chaos experiment templates are available in this ChaosHub; you can pull them, and there are categories defined here. "Generic" contains most of the experiments that you would run on a day-to-day basis on standard Kubernetes.
E: ...and then there is the SDK. The ChaosResult holds information about your run: what happened to your hypothesis intent, what the current state of the test is if it's a long-running one, and what the verdict of the experiment is. The verdict is something that is optionally consumed: if you are doing your experiments in a freestyle, exploratory manner, you don't need to depend on the verdicts that Litmus provides so much...
E: ...you have your own observations to make. But if you're running it in an automated way, you would like to know what happened to the experiment, whether the expectations were met or not, and that's where the verdict is going to be useful. There are some other, auxiliary CRs, like the ChaosSchedule, which helps repeat experiments on a schedule that you would like.
E
You
can
define
start
or
end
time
stamps
or
you
can
define
miranda's
intervals.
You
can
do
it
strictly
scheduled
or
randomly
etc.
So
the
the
flow
of
the
litmus
one
dot
x
is
something
like
this.
As
a
devops
engineer
or
an
sre
or
a
qa
or
a
developer
or
whoever
the
persona
is
you
have
your
application
on
your
cluster?
E: You pull the experiment from the ChaosHub, you create the ChaosEngine to apply this experiment against a particular application, or maybe some infra component like a disk or a node on your cluster, and then you apply the ChaosEngine. The chaos operator is then going to create some child resources: the experiment runner and the job.
E: All right, so yeah, the experiment CR looks like this. You have an image here that actually runs the chaos, and you can see spec.definition, which holds information about the experiment. spec.definition.permissions just states what minimum permissions are necessary to run this experiment; it's indicative, and you can create an RBAC using that information.
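For orientation, a skeleton of such a ChaosExperiment CR might look like the following. This is a sketch modeled on the pod-delete experiment; see the ChaosHub for the authoritative manifest, and treat the image tag and env values as illustrative:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
spec:
  definition:
    scope: Namespaced
    permissions:                 # indicative minimum RBAC for the run
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["create", "delete", "get", "list"]
    image: "litmuschaos/go-runner:1.13.6"
    imagePullPolicy: Always
    command: ["/bin/bash"]
    args: ["-c", "./experiments -name pod-delete"]
    env:                         # bare-minimum tunables, overridable from the ChaosEngine
      - name: TOTAL_CHAOS_DURATION
        value: "15"
      - name: CHAOS_INTERVAL
        value: "5"
      - name: FORCE
        value: "false"
```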
E
There
are
some
tooling
on
the
right
side
that
can
help
you
do
that,
but
you
could
create
it
yourself
and
then
you
could
define
how
the
job
runs.
What
is
the
image
that
is
going
to
be
used?
What's
the
pull
policy
and
there's
some
entry
point
here
and
then
some
minimal
environment
variables
provided
the
bear?
Minimum
ones
are
necessary
ones
that
are
just
enough
to
carry
out
the
chaos.
The
mandatory
inputs
are
provided
here,
which
you
will
eventually
override
from
the
chaos
engine,
because
that's
the
instance
thing.
E
That's
the
dynamic
entity
that
you're
going
to
create
on
a
per
chaos
basis.
This
is
something
that's
global
to
the
name:
space
global
at
the
name,
space
level-
or
you
could
put
it
at
a
cluster
wide
level
as
well.
So
in
case
of
air
gapped
environments,
you
could
have
your
own
images
coming
from
your
own
registries,
and
these
crs
can
be
maintained
in
your
own
repositories.
E: You can just create a private fork of this repository, called chaos-charts. Let me just show you: the hub here is a canonical front end for the manifests stored in chaos-charts. You can clone this and maintain your own set; it's heavily inspired by Operator Framework's OperatorHub. We have the experiments defined here in their respective folders inside each category.
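Cloning the repository is a one-liner; the per-experiment layout sketched below (experiment CR, recommended RBAC, and card metadata) follows how the hub renders it, though exact file names may vary by release:

```shell
# Fork or clone chaos-charts to maintain your own set of experiment manifests.
git clone https://github.com/litmuschaos/chaos-charts.git
ls chaos-charts/charts/generic/pod-delete/
# -> experiment.yaml, rbac.yaml, engine.yaml, chart metadata, ...
```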
E
The
generic
category
contains
the
experiment
folders,
and
there
is
an
experimental
that
you
can
pull
and
there
is
a
recommended
r
back
that
you
can
use
to
run
this
experiment
that
is
going
to
be
consumed
by
the
job.
That
runs
the
experiment
and
there
is
just
some
metadata
files
here
to
render
the
information
that
is
on
the
chaos
all
the
metadata
that
you
see
within
a
card.
So
you
could
run
this
by
pulling
these
manifests
into
your
own
repositories.
C: Yes, thank you. One other question before we start: the operator and the jobs it spawns, are they privileged? Do they need extra rights? Because I saw some host-level experiments, so are they privileged, or...?
E: Great question. Some of them are not: some of these experiments use purely the kube API and just go ahead and run without needing to be privileged containers.
E: Then there are some experiments, for example the ones that make use of certain low-level utilities on the host, or within the system, to inject network loss or stress chaos. For example, you're trying to simulate resource exhaustion within a pod by running some stress processes, CPU-burn or memory-burn processes, or, in the case of network experiments, you're trying to create egress packet loss.
E: To do that, we make use of the runtime APIs, containerd, Docker, or CRI-O, and for that we need to be living on the same node where your target application resides. We inject these chaos processes into the network namespace of the target pod, or the process namespace of the target containers, and in doing so we need to be able to run them in a privileged manner.
E: So those are the few experiments which do need privilege escalation for the jobs that run. Most of the others, the pod-delete ones, and the node-related experiments that drain nodes or cause eviction, things like that, don't really need privilege escalation. Does that help?
E: Yeah, and since we're talking about privilege escalation: one of the differentiating features of Litmus is the fact that, as you can see in this architecture diagram...
E: ...there might be some gatekeeper policies, which we may document in time, that help you create security policies and RBACs that make use of that security policy, rather than a service account you might be using for other purposes on your cluster, and thereby find means to minimize the impact of running something like this.
E: Okay, coming back to the demonstration of Litmus 1.x: we're going with the "hello world" of chaos, so to say. The pod delete is like the hello-world program of chaos engineering, a very popular test. In the later part of this demo, if we have time, we can show how effective it can be by adding the right kind of intelligence into the experiment. Deleting a pod, simple as it sounds, can unearth a lot of impact on the cluster or on the applications.
E: Right now I just have a GKE cluster with me, a single-node cluster, and I'm going to install the operator. I've come to the Litmus docs at docs.litmuschaos.io; this is the getting-started section here.
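The install step boils down to applying the version-specific operator manifest from the docs. A sketch, assuming the 1.13.6 manifest URL pattern; check docs.litmuschaos.io for the current command:

```shell
# Install the Litmus chaos operator (cluster-wide mode); it lands in the litmus namespace.
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.6.yaml
kubectl get pods -n litmus
```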
E: If you are in some kind of service model where you would like the admin persona to put the Litmus operator there, individual service owners or developers can create their own service accounts to run their own chaos; that's the model we're going to use right now. And the operator itself can be installed in a cluster-wide mode, like we did now, or in a namespace mode, where it works only within a particular namespace.
E
That's
also
supported
now
that
I
have
got
I'm
just
going
to
verify.
If
my
crds
are
there,
my
api
resources
are
created
sometimes
on
some
clusters
or
some
distributions.
It
takes
times
for
time
for
this
api
endpoints
to
be
available.
Now
that
we
have
them.
My
next
step
would
be
to
pull
the
experiment
cr
from
the
chaos
hub,
I'm
just
going
to
pull
them
from
the
hub
dot
atmospheres,
dot,
io
and
I've
just
been
the
cube
ctl
command.
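Roughly, the verification and the hub-provided pull command look like this (the hub URL is the pattern rendered by hub.litmuschaos.io for the 1.13.x generic chart; adjust the version and target namespace to your setup):

```shell
# Verify the chaos CRDs and API resources are registered.
kubectl get crds | grep chaos
kubectl api-resources | grep litmus

# Pull the pod-delete ChaosExperiment CR into the application's namespace.
kubectl apply -f "https://hub.litmuschaos.io/api/chaos/1.13.6?file=charts/generic/pod-delete/experiment.yaml" -n nginx
```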
E: All right, so I have an application; I just selected nginx, because it's a popular Kubernetes deployment. I have this pod which I'm going to try to kill; that's what we're going to do, and I already created this nginx ahead of time. We have a note here that asks us to create a sample application, and I've already installed the experiment, as I showed you. My next step is to create an RBAC for running this pod delete.
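The RBAC he applies is modeled on the recommended rbac.yaml shipped with the pod-delete experiment; here's a trimmed sketch (use the one from chaos-charts as the source of truth):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-delete-sa
  namespace: nginx
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-delete-sa
  namespace: nginx
rules:
  - apiGroups: ["", "batch", "litmuschaos.io"]
    resources: ["pods", "deployments", "pods/log", "events", "jobs",
                "chaosengines", "chaosexperiments", "chaosresults"]
    verbs: ["create", "list", "get", "patch", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-delete-sa
  namespace: nginx
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-delete-sa
subjects:
  - kind: ServiceAccount
    name: pod-delete-sa
    namespace: nginx
```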
E: That part is left to the user and the use case. I have this file copied onto my harness machine, so I'm just going to go ahead and apply it, and that's going to create my service account in my nginx namespace. I'm just going to confirm that.
E: You can see that we're running the pod-delete experiment, the fault, against an application instance as defined by the label, namespace, and kind. There are a lot of options you can provide here. The auxiliary app info is there in case you want to check whether some other application, beyond the one you selected here, is good and ready and healthy; sometimes you want to do those checks on downstream applications. And you can provide the service account of your choice.
E
We've
selected,
the
one
that
we
just
created
and
the
job
cleanup
policy
is
just
to
state
whether
you
want
to
retain
the
chaos
parts,
the
job,
essentially
that
rand
default
or
you
can
clean
things
up
automatically,
and
then
we
have
some
env
here.
These
are
the
overrides
for
the
tuneables
we
saw
within
experiment
cr.
I
want
to
do
a
few
iterations
of
the
kiln.
Sometimes
may
not
be
so
meaningful,
but
this
is
just
for
illustration
purposes.
E
E
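Put together, the ChaosEngine being described looks roughly like this (field names per the Litmus 1.x API; the nginx label and the durations are illustrative):

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: nginx
spec:
  appinfo:                      # the application instance under test
    appns: nginx
    applabel: "app=nginx"
    appkind: deployment
  chaosServiceAccount: pod-delete-sa
  jobCleanUpPolicy: retain      # keep the chaos pods around after the run
  experiments:
    - name: pod-delete
      spec:
        components:
          env:                  # overrides for the tunables in the experiment CR
            - name: TOTAL_CHAOS_DURATION
              value: "30"
            - name: CHAOS_INTERVAL   # seconds between successive pod kills
              value: "10"
```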
E: And you'll be able to track the kill of the nginx pod. You saw that it was already killed; it has come up again just now, but since we gave more than one iteration of pod delete, it's going to keep doing that a few more times and then complete. You could create similar chaos engines for a wide variety of faults: resource chaos like CPU or memory hog, or the network suite of faults like loss, latency, or corruption, etcetera.
E: You could eat up some disk space, you could do pod IO stress, and you could do things at the node level. Node-level resources can be eaten up too; there are some very subtle differences between the pod-level and node-level ones, in terms of how the resources are consumed and the use cases where they're employed. And you could also drain your nodes, or taint them for eviction and move all the pods out of that node.
E: ...something like an ungraceful loss of the node. You could do restarts, DNS errors, etcetera. So these are the kinds of faults you could run with a very similar approach to the one I showed. And now the experiment is completed; these pods continue to live here because we selected "retain" as the cleanup policy, and the chaos result can be checked here.
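The verdict lives in the ChaosResult CR; assuming the usual `<engine-name>-<experiment-name>` naming convention, it can be inspected like this:

```shell
# Phase, verdict, and probe status of the run.
kubectl describe chaosresult nginx-chaos-pod-delete -n nginx
```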
E: It basically says that the phase of the experiment is "completed" and the verdict is "pass". So you may ask: on what condition did we call it a pass? Right now, by default, Litmus is going to check whether the fault was injected successfully, and whether the application that was subjected to the fault is ready and available after the fault ceases to be injected. By "ready and available" we mean the pod phase should be running...
E
The
containers
listed
within
the
part
should
be
ready,
and
once
that
is
satisfied,
we
go
ahead
and
say
it
is
pass,
but
you
can
define
complex
conditions
here,
which
we
will
see
in
later
part
of
this
demo,
where
you
can
look
for
custom
conditions
around
kts
resources
or
maybe
some
rest
calls.
You
want
to
make
to
check
some
health
of
applications
of
downstream
applications,
or
you
might
check
some
metrics
that
you
are
trying
to
get
on
prometheus
or
some
such
observability
mechanism
and
use
all
that
information
to
arrive
at
this
passive
page.
E
And
then
you
also
have
some
history,
and
this
is
the
resource-
that's
actually
going
to
be
parsed
or
queried
by
the
chaos
exporter,
which
is
another
component
which
will
provide
prometheus
matrix
around
experiments,
and
this
is
the
source
for
all
that
information,
all
the
metrics
that
it
generates.
So
this
is
what
we
had
in
one
dot
x
and
what
we
try
to
provide
is
a
declarative
way
of
putting
out
your
chaos,
indent
and
preview
mention
something
about
by
oc.
E: ...you can create your own experiments, your own faults: wrap them up in an image, define that in a ChaosExperiment CR, test it out as a job, and then get it orchestrated by Litmus. That's essentially how you would do BYOC. So we have a declarative construct for defining chaos intent and for gathering information about what happened during chaos, and it basically makes it simple to run chaos engineering for Kubernetes. But there are a lot of requirements on top of this that you would need, right? Let me go back to my slide.
E: Sometimes you would want to simulate complex scenarios. So how do you do it, how do you stitch faults together? That is one thing. Here's another: let's say you want to run some benchmark or load job or performance test as you do chaos, and you would like to see what's happening under those stressful conditions. You don't want to inject chaos under idle, utopian conditions of the app; you want to do it when there are active loads.
E: How do you factor that in, how do you weave those capabilities into an experiment? And then there is this need for visualizing chaos: not just running it, but knowing what is running, how it is running, and what's happening as you run the experiment. We talked about observability, and observability is really important for deriving benefit out of an experimentation process; you need to know what's happening when you run chaos and how the application is behaving.
E: You need to be able to visualize that. You need the chaos framework to generate observability information, and also to consume the observability information that gets generated, so that you can correlate that data with observability you already have set up around your application or infrastructure.
E: Say I have a fleet of clusters and I want to do chaos against all of them, manage them centrally, and also visualize what's happening with them; compare chaos reports over a period of time; associate an experiment with some quantifiable resiliency metric, and thereby associate it with my applications. A three-way association.
E: You have your application or infra, you have your experiment or fault, and then the resilience metric that relates these two: your application or infra is resilient to this degree against this chaos workflow or scenario. You want to be able to do that; it's a more useful story that you can use in your organization. All these requirements prompted us to come up with the next version of Litmus, Litmus 2.x, which is in advanced beta right now.
E: The agents are the ones that actually carry out chaos: they speak to the portal, get information on what chaos to do, and then go ahead and do it on your clusters. You can connect, or register, your clusters to a centralized portal through the agents and get a single pane of glass for managing all your chaos environments, so to say. And we also have Prometheus metrics that are collected from each of these clusters or namespaces.
E: Depending on what kind of target you connect to your portal, you can connect either an entire cluster, which means the agent is in cluster mode, or a namespace, which means the agent is in namespace mode. The agents collect some metrics around your experimentation; they derive these metrics from the ChaosResult resource and populate them, and you can basically scrape them with Prometheus and instrument...
E: ...your Grafana dashboards to see what happens to your applications as chaos is happening; you could use Grafana annotations and things like that. And with this model of execution we also started supporting chaos against non-k8s applications. So let's say you have a hybrid infrastructure, where you have some legacy components and you also have some k8s services that you would like to do chaos on.
E: ...we can use the AWS SDK or the Google Cloud SDK and go ahead and do out-of-band chaos against instances living on those clouds. In case you have a hybrid environment, or you have some services residing on vanilla EC2 instances or GCP instances, you could run them from the portal: the experiments, or jobs, will run in Kubernetes, but the impact, the subject of chaos, will still be outside, and we make use of the APIs provided by those providers to carry out...
E: ...the chaos. So these are some of the things we can do, and this is just another schematic which is trying to reinforce the same thing. The architecture here: you have the Litmus portal, which comprises the GraphQL server, with MongoDB to store the chaos state of your workflows; then you have the auth server.
E: Then you have the agents sitting on your cluster; the green block here represents the cluster where chaos is happening. The subscriber speaks to the Litmus portal and creates the chaos workflow resources, and this workflow resource, which is essentially an Argo workflow, embeds the Litmus ChaosExperiment and ChaosEngine resources within it; they are applied as part of individual steps within the workflow. And you could string the chaos engines or chaos experiments together in different orders, in parallel or in sequence, etcetera.
E: When you apply that, the operator, which is also one of the installations that happens when you do the cluster registration, along with the other agents, takes care of reconciling this ChaosEngine, and the same process that we saw in the 1.x demo is going to play out.
E: Awesome, thank you. So I've just opened up the Litmus portal documentation here; it's in the Litmus readme, the Litmus repo is litmuschaos/litmus, and I've gone inside the litmus-portal folder. I'm selecting this particular command to install my Litmus beta, and before that let me just go ahead and create a namespace called litmus and apply the manifest. As you can see, it has created these control-plane components, which I'm going to check by looking at the pods in this namespace.
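A sketch of that install sequence; the manifest path here is an assumption, so use the exact command from the litmus-portal README for your version:

```shell
kubectl create ns litmus
# Apply the 2.0 beta portal manifest (path illustrative; see the litmus-portal README).
kubectl apply -f https://raw.githubusercontent.com/litmuschaos/litmus/master/litmus-portal/cluster-k8s-manifest.yml -n litmus
# Watch the control-plane components come up.
kubectl get pods -n litmus -w
```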
E
The
last
time
we
did
something
similar
on
one
dot
x.
You
could
see
that
we
had
just
the
litmus
operator,
and
this
is
the
cluster
on
one
dot
x.
I
had
this
operator,
I'm
just
going
to
close
this
now
for
ease
of
navigation,
so
we
have
the
litmus
parts
that
are
going
to
come
up
and
we
also
have
these
services,
which
we
are
going
to
make
use
of
by
default.
It
is
going
to
use
node
port
output
is
not
recommended
for
production.
E: It's going to take you through a very, very simple set of steps to configure the Litmus portal. Typically you will be asked for some user information and to reset the password, things like that; I'm going to use the default credentials, and I'm going to keep this namespace in watch mode so that I can see what new things get installed.
E: ...as I proceed with my project setup. I have a little bit of a slow internet connection, so it's probably going to take some time to load; just bear with me and give me a few more minutes. Once the portal comes up, we're going to set up a project. Each user is given his or her own organization, or project, within the Litmus portal, into which you can actually invite some team members, and when you set up the project, the cluster on which the portal is installed...
E: ...that is this one, the GKE cluster where I have the portal installed, automagically registers itself as a target of chaos: the agents get installed the moment you set up the project. That means you could start doing chaos against some applications that are already residing in this cluster, but you could also connect external agents to the Litmus portal using litmusctl, which is a very useful CLI tool that helps you install agents on remote clusters and subscribe them to the portal.
E: So let me go ahead and use the default credentials to log in to the Litmus portal; the defaults are admin and litmus. Once I get in, I can take a look at the project dashboard, and you can see there are some tabs on the left-hand side.
E: The workflow is the unit of execution of chaos in the Litmus portal. In Litmus 1.x it was the ChaosEngine, which you are all aware of; the ChaosEngine is what we basically ask users to create, and while that remains the central piece even in 2.x, by default the workflows are the overarching manifest that is going to do your chaos, with all the dependencies baked in and a lot of faults stitched together. You have the option...
E: ...you will also have the option of creating a pure ChaosEngine using the portal, without selecting an Argo workflow; that's coming up in beta 9, an upcoming version of the beta. So, going back: you can see I don't have any workflows right now. I can schedule one, and I have this ChaosHub embedded inside the portal; you saw the ChaosHub on the outside, but you have the same thing available here.
E: You can provide that here: through an access token or SSH you can connect a private hub, which will get rendered here, and then you can pick from it to construct your workflows and select faults for a given workflow. And you can see that there is an agent that's automagically registered, the "self" agent, meaning the same cluster where the portal resides. And now you can see...
E: ...the workflow controller is going to basically reconcile the Argo workflows and launch a set of pods that carry out the individual steps within a workflow. One of those steps will be to install the experiment and install the engine, and when you install the engine, the operator is going to reconcile it and carry out the chaos. The exporter is going to provide metrics, and the event tracker is something that is going to help you do event-triggered chaos.
E: That's something we'll hold for a later discussion; I don't want to bring in too many things here. So let me go ahead and also show you a few other things: you have an analytics tab.
E: ...by running workflows, constructing them from the portal and running them, or running them by hand, as you saw in the previous demo, etc. But sometimes you want chaos to be triggered in an automated fashion, as a response to certain events that you see happen on your cluster, and we have an event tracker policy, a CR which you can go ahead and define. Right now it's configured to look for image changes. So let's say you have Argo CD or Flux...
E: ...that's actually upgrading your applications on your cluster as a result of your GitOps flow, and you want to check the sanity of your new change once it's deployed onto your cluster. The way you could do that is to have this application subscribe to a predefined workflow that you have stored in Git. The portal actually has this thing called a GitOps flag, a setting you can enable, so that workflows created or constructed in the portal...
E: ...if you have GitOps enabled, get committed to a Git repository, each with a workflow ID. Applications can subscribe to one such workflow that you have stored, and if an application change happens, maybe a change to a particular image tag or something like that, the event tracker will go ahead and watch the applications for those changes.
E: And if the change has happened, then it's going to trigger the workflow which has been subscribed to by that app, and you can track the progress of that workflow within the portal and analyze it there. Right now we're in the process of enriching what policies you can fit into that event tracker. Beyond image changes, there are so many other things in the cluster that you might want to trigger chaos in response to, and not only for GitOps flows.
E: You might want to do it for non-GitOps flows as well, and that's where we're looking at different kinds of inputs, or triggers, for chaos. Events are a great one; we're thinking along those lines, but right now we don't have that.
E: Going ahead, let me create a workflow; I want to explain a simple use case. I'm going to show you two things with this workflow: I'm going to do a network loss, just for variety, and show you a very interesting use case. There's this Bank of Anthos application, which you might all be aware of.
E: ...it shows you how simply you can construct an experiment from the portal, run it, and see the impact of chaos. Once I do this, the next part will be to see how you can visualize the metrics of chaos, and for that I'm going to use the podtato-head application: we're going to kill it, and I have set up a blackbox exporter to check the availability and access latency of this particular service.
E: I have a dashboard for it, and I have annotated it with some chaos annotations. I'm going to run an experiment which kills a single replica of this particular service; it has been deployed with just one replica, and you're going to see the application basically change here. I hope my screen continues to stay shared.
E: I might have just flipped something that disabled it, sorry about that. Yeah, so we will see these parameters changing in Grafana, and we'll also see an interleaved dashboard; it will have some area here covered by the chaos metrics. But before that, let me show you how you can run a simple experiment. The Bank of Anthos has some money here, and I'm going to inject network loss, a black-hole attack, against the balance-reader service, so that I cannot read my balance: even if I deposit something, it cannot show what my balance is.
C: One question in the meantime: with version 2, do we have to use the portal, is that the default, or can I just use the operator, like in version 1?
E: You could do both. So I just selected the cluster; that's the same cluster where I have the Bank of Anthos as well as the portal. Let me go ahead and select next, and I'm just going to select my hub. There are different options: you could create a workflow from a predefined template that's sort of burned into the image, or you could clone an existing workflow and reuse that template, or import a YAML file that you've constructed by hand, or you can select experiments and construct your workflow newly from the wizard.
E: So I've just selected the public ChaosHub that's embedded into the portal, and I'm just going to call it "black-hole-bank-of-anthos". The workflow is going to run in the litmus namespace, and I'm just going to select an experiment.
E: I'm going to select pod network loss; that's the one we're going to do, and we're going to go ahead and tweak some tunables here. I just clicked on the experiment, I'll hit next, and let's see where I have my application: I have it in the default namespace, and it's of kind deployment, and I'm trying to see which application I have. There is a deployment by the name balance-reader, which you can also see in the default namespace.
E: That is the application here; this is the one we select, and I'm going to keep the cleanup policy as "retain". I'll select next, and I can choose some probes. Probes are a concept we'll come to in a while; I'm just going to say finish, and with this I'm going to take a look at how my YAML has been constructed.
E: You can see this is the workflow, and it pulls the network loss experiment from the hub as the first step. The second step is to install the engine, and the engine is defined like this: I have selected the default application, the app is balance-reader, the kind is deployment, and we're doing a pod network loss. I'm going to do it for 60 seconds at 100% packet loss, and I've just provided some socket paths and runtime details, and yeah.
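The embedded engine amounts to something like the following. This is a sketch using the pod-network-loss tunables from the generic chart; the application label and the runtime details are assumptions matching the demo cluster:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: black-hole-bank-of-anthos
  namespace: litmus
spec:
  appinfo:
    appns: default
    applabel: "app=balancereader"   # assumed label of the balance-reader deployment
    appkind: deployment
  chaosServiceAccount: litmus-admin
  jobCleanUpPolicy: retain
  experiments:
    - name: pod-network-loss
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: NETWORK_PACKET_LOSS_PERCENTAGE
              value: "100"          # full black hole
            - name: CONTAINER_RUNTIME
              value: "docker"
            - name: SOCKET_PATH
              value: "/var/run/docker.sock"
```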
E: ...it's going to multiply the points that you gave to an experiment by the success factor of the experiment, the probe success percentage which we saw earlier, and it's going to take the summation of that over all the faults, divided by the total points available. That is going to give you a resilience score. It's like a weighted average that tells you how resilient your application is to this overall scenario.
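In other words (my notation, not Litmus's), with $w_i$ the points assigned to fault $i$ and $p_i$ its probe success percentage:

```latex
\text{resilience score} = \frac{\sum_i w_i \, p_i}{\sum_i w_i}
```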
E: So I just have one experiment; I'm just going to give all the points to it. And I could do a recurring schedule, or I could do a one-time schedule, so let me just go ahead and finish, go ahead and apply it. That's going to start it. We have a workflow visualization graph, which is going to show you the progress of the workflow as it happens, and you will see some experiment pods get created in the litmus namespace.
E: In order to carry out the experiment, there are some pods that will get created alongside the Bank of Anthos ones. The first step, as you saw, was to pull the pod-network-loss experiment from the ChaosHub, and it's going to apply that; the next step would be to apply the ChaosEngine and launch the actual chaos process.
E: So once this starts, and this one is not yet created, which is why you basically see this, it's going to turn blue, which means it has started doing the network loss experiment, and once it starts doing that, we will be able to go ahead and visualize some of the impact that happens here.
E: What we've done to the vanilla Argo workflows is instrument them with some Litmus images to carry out these steps of creating the engine, tracking its status, waiting for its completion, etc. These images understand the Litmus API, so they've been used within these workflows. So it's basically going to go ahead and create that; it's still doing it. It's a fresh cluster that I've created, which doesn't have some of these images already, so it's probably going to take some time.
E: Meanwhile, while this is happening, I could also probably show you what probes are and why they are useful. For that, I'm just going to show you a sample workflow that we have for Kafka, where we're doing a simple pod-delete experiment, but we're going to kill the broker pod that's handling the IO, the message stream, and that's going to trigger a failover. In that process, we want to check that the message stream continues and doesn't break.
E: It's going to give me a container... okay, I think it finally started; it took a very long time to pull, and it doesn't usually take this much time. Okay, so now that it is actually running, you'll see some new experiment pods that get created; you have the pod-network helper. Let me go to the Bank of Anthos and refresh it. Let me sign in, and you will see that I will not be able to read the balance inside this particular application.
E: ...you can think of it like that. And I'm going to check at the edges, at the beginning of chaos and after the end of chaos, that I don't have any under-replicated partitions, which is basically a check to say that you leave your system in a sane state and that it has self-healed, auto-recovered. Then there are also checks you would like to do continuously, throughout the chaos: there is a check we're doing for offline partitions, which is again through a promProbe with a different metric.
E: You want to do this as you do the chaos. Sometimes there are negative checks you might want to do just for the chaos period and not throughout the experiment, which has all the pre-chaos and post-chaos phases, etc. So I'm going to check this consumer container, the one that's actually running this message stream that you saw, and I want it to always be available and ready, which means there has been no exception, the failover has been successful, and the message stream is continuing. So I'm just verifying...
E: ...that the container status is "true", and I'm doing this using a cmdProbe. There are also other probes, like the k8sProbe and the httpProbe, which you can use to do REST calls, GET or POST requests, etc. All of this can be put inside an experiment, and the chaos result that you get in such cases is more elaborate and gives more information about what's happening in your experiment.
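For the Kafka scenario just described, the probe stanza inside the engine's experiment spec could look like this. It's a sketch following the Litmus probe schema (promProbe and cmdProbe with Edge and Continuous modes); the endpoint, metric name, and command are illustrative assumptions:

```yaml
# Goes under spec.experiments[].spec in the ChaosEngine.
probe:
  - name: check-under-replicated-partitions
    type: promProbe
    mode: Edge                    # evaluated at the start and end of chaos
    promProbe/inputs:
      endpoint: http://prometheus.monitoring.svc:9090   # assumed Prometheus endpoint
      query: kafka_server_replicamanager_underreplicatedpartitions
      comparator:
        criteria: "=="
        value: "0"
    runProperties:
      probeTimeout: 5
      interval: 2
      retry: 3
  - name: check-consumer-ready
    type: cmdProbe
    mode: Continuous              # evaluated repeatedly during chaos
    cmdProbe/inputs:
      command: kubectl get pod -l app=kafka-consumer -o jsonpath='{.items[0].status.containerStatuses[0].ready}'
      comparator:
        type: string
        criteria: contains
        value: "true"
    runProperties:
      probeTimeout: 5
      interval: 5
      retry: 2
```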
E: These are very useful for automated runs, and this is what we meant by consuming observability information when we do the chaos experiment. Now, coming back: the Bank of Anthos, I assume, will have recovered, because the chaos duration is over; it was 60 seconds, so it's back, and the experiment has finished here. When you click this, you'll be able to see the logs of the experiment and also some results here. We didn't have any probes, so it just basically says the experiment passed and was successful.
E: You could run another iteration of this workflow to see whether the results were improving or decreasing, and you get your resilience score as 100 percent here. Going ahead, I will now show you the other aspect of chaos, and I'll probably end after this: it's about how you generate metrics as you do chaos. You can see there is a chaos exporter that we're running, which is exposing some metrics, and I'm going to create a ServiceMonitor in the litmus namespace.
E: I actually have ServiceMonitors defined for the other components; the blackbox exporter is already defined, which is why you see this Grafana dashboard. And I'm going to create one ServiceMonitor for the chaos exporter in the litmus namespace. Let me just check that my Litmus services actually have the label; yes, it has the label app=chaos-exporter.
E
Because that is what the exporter is. So I have this named app: chaos-exporter, and, sorry about that, I have already added it into my Prometheus CR. I'm using the kube-prometheus stack, so it's basically using the Prometheus operator. app: chaos-exporter is what I have, and that's what we are going to provide here as well. tcp is the port name, and that, I suppose, is the name we already have on my service as well; the whole manifest comes out roughly as below.
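Putting those pieces together, the ServiceMonitor being created here would look roughly like this. The release label is an assumption: whatever label your Prometheus CR's serviceMonitorSelector matches in your kube-prometheus installation is what belongs there.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: chaos-exporter
  namespace: litmus
  labels:
    release: kube-prometheus    # assumed: must match your Prometheus CR's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: chaos-exporter       # the label verified on the chaos-exporter service
  endpoints:
    - port: tcp                 # the port name mentioned in the demo
```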
E
And once I have Prometheus, let me go take a look at my targets; I'm going to check if there are litmus metrics. Yeah, you do get litmus metrics; you can basically check that in the targets as well. So now let me go ahead and do another chaos experiment, a very simple one this time: trying to delete the hello pod replica. For that I'm going to use a very similar method: select the self agent, and I'm going to select my hub.
E
You might be doing experiments with a defined context in mind, so you can provide that. We have the app labels, namespace, and deployment for the hello pod; we're doing it with litmus-admin, and I just want to do not too many iterations of chaos, just one iteration. But nevertheless, let me go ahead, I'm just going to finish this. Let me go ahead and run chaos; the resulting engine looks roughly like the sketch below.
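For readers following along, a pod-delete ChaosEngine equivalent to what the wizard generates here might look like this. The app namespace and label are assumptions standing in for the demo's hello app values.

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: hello-pod-delete
  namespace: litmus
spec:
  engineState: active
  chaosServiceAccount: litmus-admin
  appinfo:
    appns: hello                # assumed namespace of the demo app
    applabel: app=hello         # assumed label selected in the wizard
    appkind: deployment
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"       # seconds of chaos
            - name: CHAOS_INTERVAL
              value: "10"       # seconds between successive pod kills
            - name: FORCE
              value: "false"    # graceful pod deletion
```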
E
We just have a single replica; it's a deliberate attempt to make sure that we lose this service for a while. I have not put in multiple replicas, in keeping with breaking things on purpose. So it's still pulling the experiment here, and now it's going to run the pod-delete step; it's going to inject the pod-delete chaos.
E
So you can see that there is the success percentage that has dropped and the access duration that has spiked, right? And from Prometheus we're trying to get some metrics on what's happening to this. So there is an awaited-experiments metric, which is what we are now actually trying to show in Grafana. There should actually be a red area that starts up here; since that's not appearing, I need to take a look at what happened to my annotation, whether it is right or wrong.
E
So let me take a look: it was the awaited chaos experiments query, the chaos namespace is litmus, the result namespace is basically litmus, and the job that I've provided is chaos-monitor. But unfortunately that's not what we have; it's chaos-exporter. I think that's the difference. So let me go ahead and change this to chaos-exporter, and the rest remains the same.
E
So I'm just going to update this and save my dashboard, and let me go back and take a look at my dashboard; that should probably update it now. Let me take a look; something else is also needed: the namespace, litmus. Okay, there was the ServiceMonitor, that's again an oversight; I think the service is also called chaos-exporter here. All right, now I hope things will work.
E
Yeah, and now you can actually see the chaos happening. This period is when chaos actually occurred, and you could also use other metrics to give you the verdict on Grafana, if you are so inclined. But this is the period when the chaos actually happened, and you can see that the availability dropped, the access duration spiked, and then it recovered, right?
E
So the hello service, I'm sure, is going to be back anyway. But this is how you can use the observability data generated by the chaos framework, instrument your application dashboards with it, and basically get a closer look, a better idea of what's happening. You can even run this automated and look at it later, and that gives you an idea of how things have run.
E
So, we talked about being open source and community collaborated: without the community getting involved, you cannot have a rich library of experiments and scenarios, which is why we believe that cloud native chaos should be open source and community collaborated. Open API and lifecycle: you took a look at the chaos engine, and the experiment and result CRDs are themselves APIs, along with the BYOC, bring-your-own-chaos, approach that we enable with the ChaosExperiment CRs.
E
You can write your experiment in a standardized way and have a standard structure for orchestrating it as well, which makes it easy for people to come and contribute their own experiments, while still having a defined, expected way of going about it and of operationalizing it; a minimal skeleton of that structure follows below. GitOps is the event-tracker story.
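As a rough illustration of that standardized structure, a contributed fault is registered through a ChaosExperiment CR along these lines. The name, image, and entrypoint here are hypothetical placeholders standing in for your own business logic.

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: my-custom-fault                     # hypothetical experiment name
  labels:
    name: my-custom-fault
spec:
  definition:
    scope: Namespaced
    permissions: []                         # RBAC rules the experiment job needs
    image: example.io/my-chaos-lib:latest   # hypothetical image holding your logic
    command:
      - /bin/bash
    args:
      - -c
      - ./experiments -name my-custom-fault
    labels:
      name: my-custom-fault
    env:
      - name: TOTAL_CHAOS_DURATION
        value: "30"
```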
E
That is what we just discussed some time back. And then open observability is about being able to generate metrics, consume metrics, and have a standard for how you can speak to existing open source observability infrastructure and leverage it within your chaos experimentation process. So these are the principles that we thought we should inculcate into the platform as we were building it out, and that's what Litmus is all about. With that, I'm concluding the presentation.
E
Thank you so much for giving us so much time and for being patient through this process. I think there was a contribution aspect which we probably missed out on, but there's a link that I've shared with the SDK.
E
You could contribute new experiments, using that tool to just bootstrap them, fill in your business logic, and share them. You could also contribute to documentation; docs contributions are really welcome, that's one area we are really seeking help on, and so are the infrastructure components of the portal: the control plane, the operator, etcetera. These are the areas you can collaborate on, and Prithvi will share some information on where you can join the community and how you can sync up with us. There are monthly sync-up calls and meetups that happen.
A
So much, to be honest; I'm now flooded with ideas of what I want to try out immediately. I think one of the contributions could also be: you've talked a lot about GitOps workflows and Argo and Flux, and there are potentially other tools around that. So, combining it with GitLab CI/CD or any other specific integration, and not only writing a blog post or a tutorial around it, but also providing example repositories, providing the insights in a session with us, which everyone can benefit from.
A
It was basically a proposal to do it, because the more different tools and different environments we can gather, the better an idea we can get, and the more popular Litmus chaos will get. It's also a diverse setting of saying: hey, does it work in that scenario, or do we need to add a certain, I don't know, feature or code change to actually enable everyone to follow the chaos engineering workflow?
E
Let me just share my screen to show you what is already there. It's a little bit rudimentary, and it's more on the 1.x side of things, but we have what we call GitLab remote templates.
E
So
if
you
have
a
gitlab,
ci,
dot,
yaml
and
you
are
running
your
own
stuff
in
your
pipelines
and
want
to
introduce
like
your
stage
or
as
a
residency
test
into
it,
you
could
make
use
of
some
of
these
templates
that
we
have
already
you
could.
Just
gitlab
has
an
amazing
feature
where
you
could
import
templates,
remote
templates,
and
you
can
just
import
this
and
override
some
of
these
variables
from
the
gitlab
ci
dot
tml
and
get
the
experiment
done.
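A minimal sketch of that include, assuming a hypothetical template URL and variable names; the actual raw links and the overridable variables live in the Litmus templates repository and should be taken from there.

```yaml
# .gitlab-ci.yml (hedged sketch; the remote URL and variables are placeholders)
include:
  - remote: https://raw.githubusercontent.com/litmuschaos/litmus/master/gitlab-remote-templates/pod-delete.yml

variables:
  APP_NS: hello               # hypothetical: namespace of the app under test
  APP_LABEL: app=hello        # hypothetical: label of the app under test
  TOTAL_CHAOS_DURATION: "30"
```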
E
So this is going to actually pull Litmus in. There is also another flag for installing Litmus, true or false, so you can install Litmus, run the experiment against the app of your choice, and have the GitLab job pass or fail based on the chaos result: the success of the GitLab job depends upon the chaos result's verdict.
E
That can be done with the GitLab templates. We also have an integration with Keptn, where there is a control plane service, called the litmus-service, that runs in the Keptn control plane, and you can add chaos stages within a Keptn pipeline. You could combine this with the load generation that Keptn natively supports, running JMeter or Locust tests against your application, and you could also drive Keptn's quality gate feature using Litmus experiments.
E
So some of these integrations are going on, especially in the CI/CD space. Another space where we would really like to integrate and get better at is the observability platforms. Prometheus is great; it's very common and probably the most used one. Similarly, there are other cloud native platforms we would like to hook into, to enable people to get their experiment details onto their own app dashboards and APM environments.
E
There are also what I would call not just integrations but collaborations happening with teams like Pravega, which is another CNCF sandbox project, and Strimzi, the Kafka-focused CNCF sandbox project. Here we are trying to enable them to run chaos experiments for testing those projects as well; Pravega has in fact been using us for nearly a year now for their e2e tests.
E
So these are some of the things that we have been doing on the community front and on the integration front, within the larger context of the cloud native ecosystem. We'd like to continue doing that, be more engaged with the community, and make it a tool that helps in shipping other tools; that's the intention. We're also trying to dogfood some of our own templates and scripts to test Litmus components themselves: some of the faults are being injected against Litmus components.
E
These are some of the directions that we are going in. We are trying to provide some documentation around all these things, which is missing right now, to be honest; that's something we need to go ahead and add in the days to come, before we go ahead and do the GA of 2.0, and it's where we're also trying to seek some contributions from industry folks.
E
So yeah, that's what's going on.
A
This is really great, to be honest. I just pinged everyone on Twitter to try it out and shared all the things you shared, like reverse-engineering your screen share. I would say we will potentially try it out; well, I will definitely try it out, especially the observability parts: integrating it into the existing Kubernetes cluster, the monitoring stack, and the Prometheus operator.
A
I think this is a really great and nifty idea, and I also really admire your passion in presenting it; I just wanted to say that out loud. I also tweeted it. The other thing is, I really like the UX: you have a wizard, you have workflows, you have immediate feedback, you get the logs, you get the chaos results.
A
So I think it's a really great first impression when you start it, and you can keep going; you want to stay in the UI. I think what you have created, what you have designed, is really great. So thanks for your passion and also for your open source spirit. And, I don't know, Philip, do you have any other questions?
C
I guess, yes, sure, just two; one technical. When I use the UX and the portal, okay, how can I make it declarative? Because, for example, let's say I migrate my management cluster from A to B; I don't want to click through the UI again to create all those experiments.
E
Right, so there is a section called dashboards, in analytics, where you can compare experiments, take a look at how each was done, and then download things.
E
All of that is declarative; as in, if you look at it, it's a mixture of being declarative and imperative. Argo is reconciling the workflow and the chaos engine is reconciled, but both the Argo workflows and the Litmus chaos experiments here allow you to define what you want to run, and it can be imperative to a degree within those manifests, so it's a good mix of both. We also have some features coming in to manage the status of these tests.
E
When, let's say, the pods are evicted or jobs are lost and things like that, we will still be able to reconcile, and will be able to clean things up without needing further resources. All of that is there to a degree and is also being added, and we kept it this way to support both modes of usage.
E
As regards how you can run the portal without the dashboard, if that was the question: you could do that with the APIs, which we are documenting and will expose to the community shortly. You could start using those and write scripts around the day-0 and day-1 operations you want to perform on the portal, things like connecting agents, creating workflows, triggering schedules, and downloading reports. But most of what is defined as YAML, and hence the git-stored component of the portal, would be the workflows, along the lines of the skeleton below.
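To make that concrete: the git-stored artifact is an Argo Workflow wrapping the chaos resources, so re-applying it to a new management cluster recreates the experiment without clicking through the UI. A trimmed, hypothetical skeleton might look like this (the service account name is an assumption):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-chaos-
  namespace: litmus
spec:
  entrypoint: chaos
  serviceAccountName: argo-chaos          # assumed service account
  templates:
    - name: chaos
      steps:
        - - name: run-pod-delete
            template: pod-delete
    - name: pod-delete
      resource:
        action: create                    # declaratively creates the ChaosEngine
        manifest: |
          apiVersion: litmuschaos.io/v1alpha1
          kind: ChaosEngine
          metadata:
            name: hello-pod-delete
            namespace: litmus
          spec:
            engineState: active
            chaosServiceAccount: litmus-admin
            experiments:
              - name: pod-delete
```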
C
Yes, good, thanks. My second question is more of a general one. As you said, you started around 2018, when this became, let's say, a movement. How is it? I mean, chaos engineering, to be honest, is like exposing your weaknesses. For example: okay, let's try this, let's start differently; when do I have enough trust to run chaos engineering in production? Basically, it can feel like showing a weakness in my work to my boss.
C
You know, how can you pick up those people to get engaged in chaos engineering? Because, yeah, we are humans; we are not people who want to show weaknesses in the team and to our boss. It's like saying: hey, your work is bad, because when your pod is restarted, then your work is bad, you know. So how can we, all together, pick up those people to engage more in chaos engineering and in more resilience?
E
It's a great question; in fact, one of the initial big challenges that we faced was exactly on this account. It's a culture thing, also related to what we discussed earlier: people don't want to do chaos because, like you said, as humans we don't want to show weaknesses, and we don't want to disturb what's already running well. There are also other inhibitors to the adoption of chaos; one is the lack of a proper observability infrastructure.
E
Sometimes teams are in the initial stages of setting up microservices environments and all that, and they do not have the right tools in place to see what's happening when they do chaos. So they feel a lack of control, and they are basically very reluctant to go and do chaos, because they don't know how to remediate, when to step in, things like that. So observability goes hand in hand with chaos and is a real prerequisite.
E
And the other thing is: doing chaos in production is really the pinnacle of chaos engineering practice, the pinnacle of the chaos discipline within an organization. There are not so many people who do it in production; they want to do it in staging environments, pre-prod environments, dev clusters, etc. And one of the things, like you say, about the people to pick: which teams have the lowest entry barrier to take up chaos? The observability teams are the ones that will probably jump onto this first. Why? Because you don't want to disturb your main business applications.
E
You don't want to basically go ahead and disturb what's running well, but you will still be okay to kill some observability systems, or some of the services you are using for monitoring, to see if you are getting the right reports, notifications, and alerts for your main application, and whether your observability is highly available or not. Let's say you have Prometheus with just a single replica in the deployment, and you went and did a pod kill or a network failure there.
E
You would still want to know how your end service is behaving; you would want to get the right alerts and notifications; you would want to see the graphs there. So is that still happening when your monitoring framework is being subjected to some kind of fault? That makes it a good service to target, and the team that works on observability is a good early adopter of chaos, which is what we have seen in our community as well, and then they slowly gain confidence from it.
E
They then pitch it to the other teams, the service teams, and those teams go ahead and do chaos, and it sort of becomes a movement with which everyone becomes comfortable at some point. So that is one easy way of pushing things. Another way is to start in dev clusters, even in dev testing; and when I say testing, I'm not even talking about the tests that the QA teams are doing, because we also need to sell it to the QA teams.
E
Sometimes they have their own test plans and their own schedules, and this comes in as sort of an additional thing. So developers themselves could do it in the sandbox environments they have, where they do their initial tests before they push their code into the CD pipelines or over to the QA teams. That's another place where they can do chaos and show the value.
E
Then it can catch on with a few more teams, and they can do a lot. One of the huge consumers of chaos in the last few months, as we see in the Litmus community, has been the QA teams, not only the traditional ops and SRE teams whom we generally associate with chaos engineering. Of course, that persona is there, but what we see are a lot of QA people trying to do failure testing with the chaos mindset, with the exploratory mindset that is associated with chaos experimentation.
C
Yeah, I think, I mean, if you have pre-prod, staging, and tests, it's all good; you definitely need this. But, I mean, at the end, in staging and pre-prod you never have the production workload, okay? You never have one-to-one the situation which you have in production, and I'm the kind of person who would rather have it planned.
C
I don't know, have a maintenance window announced: hey, we do maintenance, and then run our chaos experiment in production, rather than have, I don't know, some unplanned outage which nobody knows of. And then, you know how it is as an SRE: in the night the Kubernetes cluster goes down, and then you have the unplanned work. I mean, if we refer to the DevOps Handbook, unplanned work is the worst work. Yeah, so hopefully it will grow as a movement.
E
Have all remediation plans ready to go before chaos is injected, take notes on observations, share the findings, and then repeat it. Because, like you said, nothing can simulate production; the dynamic nature of the production environment is real. I think it should slowly build up to that point where people are comfortable running chaos in production; they should have tested it enough, and the framework they're using should enable them to do that, give them that feeling of control.
E
For example, one of the features in Litmus that we've added in recent times is called the deadman's switch, or the automated rollback, where, let's say, you have a prom probe defined to say you are exceeding a certain metric.
E
Which means things are getting out of hand and you need to stop: you can immediately patch the engine to abort, and it automatically stops the experiment. And then you can specify the blast radius in a very controlled way. Apart from just the labels and namespaces, we give you annotations to say: among the many deployments in this namespace with this particular label, choose, zero in on, just one particular deployment, or maybe one specific pod, or something like that. So you can do this blast radius control, as sketched below.
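Both mechanisms just mentioned are declarative; a hedged sketch, reusing the hypothetical hello-pod-delete engine from earlier:

```yaml
# Abort a running experiment (the deadman's-switch style stop described above):
#   kubectl patch chaosengine hello-pod-delete -n litmus \
#     --type merge -p '{"spec":{"engineState":"stop"}}'

# Blast-radius control: with the annotation check enabled on the engine,
# only workloads carrying the opt-in annotation are eligible targets.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: hello-pod-delete
  namespace: litmus
spec:
  annotationCheck: "true"       # engine ignores apps without the annotation
  appinfo:
    appns: hello
    applabel: app=hello
    appkind: deployment
---
# The single deployment opted in for chaos:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
  namespace: hello
  annotations:
    litmuschaos.io/chaos: "true"
```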
E
Having the right hypothesis, the ability to abort, the ability to schedule it at a specific time when you know the traffic is low, things like that: these are all enablers that you can add from the framework side to get people to run it in production in a more confident way. That's been our endeavor as well; we've been trying to think along those lines too.
A
You mentioned before that Kubernetes is not a dependency; did I understand this correctly? So you're also targeting AWS or something else?
E
Yes. So you do have experiments that can terminate EC2 instances or detach EBS volumes, whether they are just attached disks or marked as PVs. We have also recently introduced experiments that can do not only these out-of-band kinds of things, using the AWS APIs to kill instances or detach disks, but can also go inside an instance and simulate some resource faults, or latency, network latency, and things like that; a sketch follows below.
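A hedged sketch of what such an AWS-targeted engine can look like, using the ec2-terminate-by-id fault; the instance ID, region, and service account are placeholders. Credentials typically come from a mounted cloud secret or an IAM role mapped to the experiment pod, as mentioned next.

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: ec2-chaos
  namespace: litmus
spec:
  engineState: active
  chaosServiceAccount: ec2-terminate-sa   # assumed SA with the needed permissions
  experiments:
    - name: ec2-terminate-by-id
      spec:
        components:
          env:
            - name: EC2_INSTANCE_ID
              value: i-0123456789abcdef0  # placeholder instance id
            - name: REGION
              value: us-east-1            # placeholder region
            - name: TOTAL_CHAOS_DURATION
              value: "60"
```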
E
So the experiment, with Litmus running on Kubernetes, the chaos engine, is going to look very similar to the pod-delete one you saw defined, but you will have some secrets created, or IAM roles mapped, within the experiment, and you will use those to go ahead and do the chaos against the cloud resources over the network. The control plane and the experiment plane, the execution plane, will still be on Kubernetes, while you are targeting entities which are not really Kubernetes; they may be infra components, be they EC2 instances or GCP instances.
E
You could do that as well, because a lot of people run hybrid environments and have not completely migrated to Kubernetes: they have some services here and some services living in legacy infrastructure, and they want the same tool set or platform to be able to orchestrate chaos across all of it. So that was also a requirement that we got, and we enabled it in recent times; the same goes for bare metal systems as well.
E
In fact, there are experiments that we have, or are going to add, or have been working on, where you can use IPMI APIs to do out-of-band power-off and power-on of machines. For example, some Dell machines provide iDRAC capability and give you APIs, so you can do things there. These are things you can do while retaining a common, homogeneous experience: on Kubernetes, with declarative definitions stored in git, etc., while still covering a wide variety of infrastructure around you.
A
Thanks, that's really great to hear. Also, from my own experience: I often hear that Kubernetes is hard to get into, and I experienced it myself. The learning curve is like, wow; but once you're into it, it's like: oh, it's amazing, and I'm totally into it. So, when I want to start as a developer, or maybe as a DevOps engineer.
A
Whatever the role title is, you potentially want to hide the Kubernetes component and just say: hey, here, execute something, or these are the steps; and maybe you have a QA cluster or something else, and the entry barrier is not: oh, this is Kubernetes, I'm not using it. Because I think right now the messaging on the website is also: chaos engineering for your Kubernetes, or your Kubernetes cluster.
A
This could also be opened up when you have a directional roadmap, where you say: okay, our potential target group is 90 percent Kubernetes, but we're also moving into, like you said, hardware, bare metal, or specific hyperscale clouds, the smaller cloud providers, anything which can be tested, maybe even going into IoT, or, I don't really know what edge is, but all the new stuff which potentially has some, yeah.
A
You don't have any competitors in there, because there is no chaos engineering there yet; there's potentially some unit testing or extreme testing somehow, but, like, people finding, or putting, chaos in production is something where they sit there saying: how should I do that?
A
Everyone is using that, we can rely on that, and we can invest in our product and in our infrastructure, and adopt tools that scale, for example Litmus, in that specific sense. I think that's a really great idea, and I will talk to our product managers and also to our teams around this and see how far we can get, or maybe even work in a similar fashion.
A
As we tried with the Keptn project: just to collaborate even more, see where we are and where we are going, and build a great cloud native, or even broader, chaos engineering idea or collaboration around that. That would be amazing.
E
Awesome, that would be just fantastic. Thank you for that, Michael; I think you summarized it very well. Yeah, we have seen some interesting use cases in the community around chaos.
E
Some of the end-user stories are very fascinating, and there have been some good discoveries made as a result of chaos. We are also trying to get those users to speak on some forums like this, or go to the CNCF and talk about some of the very interesting use cases that they have, so that the community at large benefits from those insights.
E
So hopefully we will see more such information and more awareness. Chaos engineering has been there for over a decade; it has really gathered pace, accelerated, in the last couple of years, I would say, in no small part because of this whole cloud native paradigm and the digital transformation journey people are making, migrating to Kubernetes and microservices.
A
And I also submitted some CFPs for later this year with chaos engineering inside; I knew that I needed to look into Litmus, so I'm really grateful that you took the time today to educate us, to show us what is hot and what is new. I would especially love to have you back, I don't know, in half a year or something, when 2.0 is released, or 2.6, and we have some more use cases, and maybe we have a deeper collaboration or something around that. Other than that.
A
I really appreciate you taking the time; I know it's late for you. We kind of extended it to 125 minutes now, but it's totally fine; as I said, I was very much enjoying it. And yeah, thanks to you, Karthik and Prithvi, for joining today, and I would just say: bye bye on YouTube. Thanks for listening; the blog post will be published later on.