From YouTube: Cloud Native Live: Introducing LitmusChaos 2.0
Mario: Hello everybody, welcome to another Cloud Native Live. My name is Mario Lauria — you've seen me before, I'm sure. I've been hosting Cloud Native Live on and off since we started the Cloud Native TV network this year. Thank you so much for joining us for another wonderful episode, where we are going to dive into chaos engineering.

Today I have Karthik with me from ChaosNative. They're a company that is working on resilience engineering in our world of cloud native, and leveraging chaos engineering to achieve that with their product, LitmusChaos. I'm not going to introduce Karthik myself; what I'm going to do is try to get all of your questions answered. Please leave your questions in whichever platform you're using to watch this right now. We thank you for tuning in and spending time with us today.

This is going to be a really, really fun session. I have a lot to learn. I know of chaos from the Chaos Monkey project on GitHub, and from talks by Netflix engineers like Adrian Cockcroft and others, who have been pushing for chaos to increase the resiliency and reliability of your platform. It's very, very difficult to hone in and get just right, and it's also very scary for a lot of organizations. Knowing the companies I've been in SRE at — what they've been able to do and what they've been comfortable with doing — introducing chaos is actually really scary, and it's hard to take that first step.

So I think Karthik is going to be teaching us a ton of great stuff. I'll leave him to get into his background, but again, I thank everybody for joining. Please leave your questions, comments and thoughts in the chat; I am monitoring those, so I will be sure to get those questions asked to Karthik. We'll be going through a wide gamut of different areas in chaos engineering, the ecosystem, and cloud native's role. So thank you so much.

If you want to chat with more people, myself, Karthik and others that have been on the show are definitely hanging out there. Check out ChaosNative while you're watching: @ChaosNative and @LitmusChaos on Twitter, chaosnative.com. I'm scrolling through here, and I even see Viktor Farcic — from, I think, a DevOps podcast — has even done an episode on integration with Argo Workflows, which is super exciting. I didn't know it existed, and I use Argo.

So I have a lot of work to do this week to share this with my team. I think, without further ado: Karthik, thank you so much for joining us. I'll let you take it away — go ahead.
Karthik: Thank you, Mario. Really excited to be part of Cloud Native Live and to discuss chaos engineering and LitmusChaos. Thank you for the introduction. Like you said, my name is Karthik. I work for a company called ChaosNative, and I'm one of the maintainers of the open source CNCF sandbox project called LitmusChaos. Today we just want to discuss what cloud native chaos engineering is and talk about the Litmus project.
Litmus has been around as 1.x, and in the process of talking about the project I'll introduce you to what Litmus 1.x did, the feedback we received from the community, and how 2.0 was brought up — how it was created. We will go through a couple of demonstrations to discuss what this platform is about, and I hope this encourages you to start on your chaos engineering journey in your organizations.

So please feel free to ask us any questions; we'll be happy to answer. With that said, I just want to introduce what chaos engineering is. I'm sure a lot of folks already know about chaos engineering — you might be practicing it, you might be practitioners, you might have heard about it or read about it.
Netflix, along with a few other organizations that were early adopters of chaos, created the basic tenets of chaos engineering. You can look at the website principlesofchaos.org, which carries a lot of information about what it is, why it is important, what its principles are, and so on.

Okay, I am hoping the screen is visible. Let me just quickly run through this. One of the basic reasons chaos engineering is so important is that downtime is very expensive. We have had past incidents at organizations that are generally very resilient where downtime has cost a lot of money, and that is something we want to avoid.
A lot of services that we consume on a day-to-day basis have publicly available SLAs — for example, a Google Cloud service might commit to being 99.95% available. Chaos engineering is a practice that actually helps you verify whether you can provide that kind of availability all the time.
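For a sense of scale, that SLA figure translates directly into a small downtime budget (simple arithmetic, assuming a 30-day month):

$$(1 - 0.9995) \times 30 \times 24 \times 60\ \text{min} \approx 21.6\ \text{minutes of allowed downtime per month}$$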
By definition — there are a lot of definitions on the internet; here are a couple that I've picked.
One basically says it is a method of testing a distributed computing system such that it can withstand unexpected disruptions. You're testing whether some of the components — or some of the assumptions you had when you built the code — hold up: for example, you always assumed the network would be alive, or that you had unlimited compute, storage and bandwidth. That's not often the case.

There are failures that happen all the time, and we need to check whether we can withstand the disruption, recover quickly enough, and continue to provide the service at an acceptable level to the users. And chaos engineering is not about reckless fault injection. It's a scientific process: you identify a control group and an experimental group, you inject faults in a very controlled way, you ensure there is minimal blast radius when you're injecting faults, and then you see what happens. You try to learn about the system. Sometimes you go in with a hypothesis.
The hypothesis is sometimes proved; sometimes it is disproved. If it is disproved, that's actually better — it means there is something you've newly learned. You can go back and fix your application, or fix your deployment practices; maybe you can improve the underlying infrastructure and make it more resilient. There are a lot of things you can do: repeat your experiments, gain confidence, and so on. That, generally, is the practice of chaos engineering.

You might call your system resilient to the fault you injected and move on to the next one; or, if there is a weakness you found and your hypothesis was disproved, then you go back to the drawing board, check what went wrong and what needs to be better, make those fixes, and repeat. Chaos engineering has traditionally been done in production — for a long time, that was the philosophy: chaos engineering is most effective and useful when you do it in production.
The Principles of Chaos says as much. But with the recent proliferation of Kubernetes and the evolution of the cloud native paradigm — wherein a lot of organizations are re-architecting their applications, moving away from the monolith, creating everything as microservices, containerizing it, and running it in new deployment environments, mostly Kubernetes —

there's a lot of apprehension about how things are going to work, and folks are probably not ready to do chaos engineering in production from the get-go. There's a lot of chaos experimentation being done in pre-production environments to gain confidence before it is really done in production.

As at the beginning of this discussion, there is a principle that we at Litmus are big advocates of, called the chaos-first principle. It is about doing chaos engineering in a more ubiquitous and democratic way: you start doing it in development environments, then in staging, then in production.
Maybe you add failure tests as part of your CI, applying them in CI/CD pipelines, and you do SLO validations during chaos experimentation — basically, you validate that your system continues to stay alive and your service-level objectives are met under duress,

before, let's say, your application is moved into production. And then, once you're confident, you start doing the actual game days — chaos experiments in your production environment — and see whether the system holds up there. That is what we're seeing happen in recent times. I mentioned how Kubernetes and cloud native are a factor in getting people to do chaos engineering earlier and more often. That's because in a Kubernetes-based deployment environment there are so many variables, so many factors — Kubernetes itself is quite dense.
Then you have your direct application dependencies — your databases, message queues — and then your app with all its services: Kubernetes resources, your middleware, your user-facing services, and so on. A lot of things can go wrong, and it is important that all these components work well in sync to provide the best user experience — the one you have guaranteed to the users of your service.

So it is important to test out varied scenarios, and to test often. One of the pillars of the cloud native way of doing things is to release fast, to keep everything as microservices, and to ensure everything is declarative, with Git as the source of truth; you have controllers that ensure your infra and the code in your source are always in sync. Changes are happening at a fast pace, so you need chaos engineering to borrow
some philosophies from that same model: ensure your chaos intent can be declarative; ensure you automate steady-state hypothesis validation as part of the experiment; ensure it lends itself to GitOps; and ensure you have the same homogeneous experience you've had while doing other things on Kubernetes. Whether you're defining application lifecycles, security or resource policies, everything is defined as resources and you have controllers in Kubernetes to manage them — and you would like to bring that into chaos engineering as well.

So that is an introduction to what chaos engineering is generally, and what this category of cloud native chaos engineering is. Let me introduce you to the Litmus project. This is an open source project which has been around for about three years or so now, and it provides an end-to-end platform for doing chaos engineering on Kubernetes. We've also started expanding it, providing capabilities to do chaos against non-Kubernetes infrastructure as well: EC2 instances on AWS, GCP VMs, VMware VMs, and so on.
The Litmus platform runs as a set of microservices on Kubernetes — it uses Kubernetes as the substrate to run the chaos services, so to say. You can pull ready-made, off-the-shelf experiments that are available in what we call the ChaosHub. The ChaosHub is an open marketplace with a lot of the common scenarios you would like to execute. You can pull the fault templates, install them on your cluster, and define a custom resource that maps the fault to an object on your cluster.
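A minimal sketch of that flow, assuming the hub URL format used in the Litmus docs (the namespace and the trimmed template contents here are illustrative):

```yaml
# Pull a ready-made fault template (a ChaosExperiment CR) from the public hub, e.g.:
#   kubectl apply -n demo -f \
#     "https://hub.litmuschaos.io/api/chaos/2.0.0?file=charts/generic/pod-delete/experiment.yaml"
# A trimmed view of what the pulled template roughly contains:
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: pod-delete
  namespace: demo
spec:
  definition:
    image: litmuschaos/go-runner:2.0.0    # container carrying the fault's business logic
    args: ["-c", "./experiments -name pod-delete"]
    env:
      - name: TOTAL_CHAOS_DURATION        # defaults that a ChaosEngine can later override
        value: "15"
      - name: CHAOS_INTERVAL
        value: "5"
```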
It could be a node resource or a pod resource, and you go from there. That's what Litmus is about, very simply speaking: it has a set of custom resources to define chaos and steady-state validation at its heart, it has a controller that reconciles these resources and carries out the chaos experiment — the fault-injection business logic — and there is a way for you to look at the results of the experiment so performed and glean some information about your application's behavior under chaos.

Litmus as a project was started, in fact, to serve the resilience-testing needs of another CNCF project called OpenEBS, and over a period of time it acquired a roadmap of its own and became more popular in the community. Requirements started coming in, and we went from being a platform that can do chaos in a cloud native way — that is, you define intent in a CR and there is a controller that carries out the experiment for you — from there,
we went on to make it an end-to-end platform, because chaos engineering has a lot of other requirements: in terms of observability needs; in terms of defining blast radius in a very controlled way; in terms of ensuring your chaos results are analyzed over a period of time to give you useful information about your system. There are some general KPIs associated with the chaos engineering practice, so it helps to surface how your KPIs are doing as far as your practice goes.

You want to provide the steady-state validation intent as part of an experiment run, and sometimes there is a very diverse set of ways of verifying steady state, so you want to pack all of them in. All of that is roughly what we did to get Litmus from its initial stage — the 1.x release — to what 2.0 is now.
Mario: Yeah, for sure — thank you so much. Oh my gosh, so much to chew on; this is great. Okay, so I just have a couple of lightweight questions that you'll be able to smash pretty quickly. I think when most people think about testing an environment — to test as close to a real-world example as possible — what they often consider is something like emulating a DDoS attack: something outside is causing harm, coming in right at the ingress layer, hitting certain API endpoints or sending malformed requests. They're doing something at maybe a higher level.

It sounds like what Litmus does is actually run in-cluster, and this is kind of going back to chaos engineering: you're actually inflicting chaos on yourself — internally, not externally. You mentioned Litmus has a few different types of, maybe, an attack library. I'm actually looking at Gremlin as well, which is another chaos engineering platform, and I'm interested in some of the differences there.

So: what is the go-to MO — the default patterns — that you find people using Litmus for? And can you expand a little on what you're doing? It sounds like you're taking Litmus to the next level, building a platform. How do you intend to leverage that platform to help provide continued reliability insights, SLAs, things like that — not just for your Kubernetes cluster, or an API for your application, but for the entirety of a platform?
Karthik: That's a great question. You're right about the first part: when Litmus started, a lot of members of the community started using it — and a large set of users still do, predominantly — for inflicting chaos within the Kubernetes cluster and the services inside it.

Litmus has a feature to do some asset discovery, which we will show very shortly during a demonstration, where you can identify the different services living inside your cluster, and you can access them and target them with different faults. For example, the generic category consists of most of the pod-level and Kubernetes node-level faults: you can kill pods, or send termination signals to containers.
You can do chaos on the pod network — introducing latencies — or eat up resources and slow down the applications (your PID 1, essentially, running inside your containers). Similarly for nodes: you can take a node into maintenance by draining it, or cause an eviction by tainting it and pushing out all the pods — things that can be done within Kubernetes. And Litmus has this model, especially in 2.0, where you run the control-plane services within one cluster, and you can register several other target environments — Kubernetes clusters — into the control plane.
You have label and annotation selectors and namespace filters that you can use, and you can also set up affinity policies, node selectors and things like that, to ensure the Litmus application services themselves are not impacted by the fault being injected — and you can point at the specific resources against which you want to do chaos.
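A sketch of how that scoping shows up in a ChaosEngine spec — the field names follow the v1alpha1 schema, but the values here are illustrative:

```yaml
spec:
  annotationCheck: "true"          # only workloads annotated litmuschaos.io/chaos="true" are eligible targets
  appinfo:
    appns: payments                # namespace filter
    applabel: app=balancereader    # label selector picking the target
  experiments:
    - name: pod-network-loss
      spec:
        components:
          nodeSelector:            # pin the chaos runner pod to chosen nodes
            kubernetes.io/hostname: worker-2
```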
But when you think about things happening outside of your cluster — for example, application services receiving requests from an ingress, requests from outside — Litmus also has the capability to ensure that traffic toward certain destination addresses is inhibited. Let's say there is a service talking to another service inside the cluster: you can ensure that the traffic between those two services alone is disrupted, for example. That's something we developed recently. And a lot of organizations still have a hybrid model.
The chaos business logic runs from inside Kubernetes — it drops a chaos pod there — but we make use of the APIs provided by the cloud provider, for example AWS, GCP or Azure, and we can go ahead and invoke those APIs. They all provide very good SDKs, so you can have access secrets created on the Kubernetes cluster where the experiment business logic runs, and then you target some non-Kubernetes resource that lives completely elsewhere and still invoke the failures there.

You can cause instance failures and disk detaches, and also cause, let's say, resource burn or network degradation within those VMs, depending on the capabilities the APIs offer. That way you still run Litmus from within the Kubernetes control plane, and you are still able to target resources that exist outside of Kubernetes.
That is the model we are working towards. There is a limited set of experiments for AWS, GCP and VMware today, and we're in the process of expanding those experiments — the faults you can run on non-Kubernetes targets — so that you get a wholesome platform: you're able to use the same centralized platform for doing chaos against different kinds of targets.

You brought up Gremlin, which is another great tool that has been around for quite some time; they have added capabilities to do Kubernetes-based chaos as well, but they primarily started out doing chaos against VMs.
I think the chaos engineering community and the tools — in the open source as well as the closed source space — are really growing today. Litmus is differentiated in terms of its architecture, in how it runs as a Kubernetes app, as well as in the way it treats an experiment. What we're trying to do is align with the Principles of Chaos and provide an end-to-end experiment.

The notion of a complete experiment has fault injection at its core, but also blast-radius control, the ability to do steady-state validation, and the ability to simulate complex scenarios by stitching experiments together. You could actually go ahead and run more than one fault. Let's say you have a node which is almost exhausted of resources — there's not much that can be scheduled there.
Then you have another node on which, let's say, an eviction happened, or a pod got deleted there for some reason, and it cannot get scheduled anywhere else, because the other node is already running at full capacity. This is a condition that is sometimes seen in production. You might want to bring up this scenario; it's a complex one. You might have to do two faults and tie the right validation along with them.

Litmus enables that through what we call chaos workflows. So, to summarize, we're trying to build an end-to-end chaos platform for doing complex experiments and also visualizing the progress of the experiments. And you asked: how can I get information — how can organizations take a look at how their chaos engineering practice is going, whether there is an overall resilience view? That's what we're trying to build with Litmus as well.
There is an analytics section here which goes through all the past workflows that you have run against your services. There are comparisons you can do: you might have run these workflows or experiments against different environments — maybe dev or staging and production — or maybe across releases. You're trying to compare how your experiments went and see whether you're improving or regressing. That is something we are adding.

There are other views, too — other viewpoints we're trying to add here based on community feedback, on what people are most interested in when they're running experiments. For example, people would like to see how their application behaves — we'll see that in one of the demos: okay, I'm peering into my application dashboard; now I want to see when chaos is actually running. When did it start? When did it end? How did the application behave during this process? So that's some amount of observability we're trying to add into the platform as well.
Mario: Yeah, that was amazing. Thank you so much for getting into that. This platform — this UI — really helps seal the deal in terms of what am I actually getting from an end-to-end perspective. With the analytics, you need to be able to measure progress, right? Where am I at now? What is my desired state, and what are the incremental changes or pieces to getting to my desired state — whether that is being able to support so many requests per second, or being able to sustain failures of database connections, or whatever it might be. So, principlesofchaos.org — you've heard Karthik mention it a couple of times — I think that's a good starting point. That's actually a GitHub project as well. Literally just search chaos engineering on Google; you can find tons of great resources that look like what Karthik and I have been talking about here, and why this is so important.

The other thing I wanted to mention, too, Karthik, is that a lot of people don't really understand: why do I need chaos engineering? No one on our SRE team is going to go in and just delete pods; no one's going to go mess with ExternalName objects in Kubernetes, or screw with our CNI DaemonSet. No one's going to do that, right? But it's not about the humans as much as it's the natural elements of a cluster — the churn, what's going on. There's maintenance, there's updates, there's autoscaling. There are devs constantly deploying things; there are people hopping into pods to look at things and test things. There are objects coming in and out; there are many different namespaces.

So I'm kind of talking about some of the SRE core principles here: assuming failure, and using strong measurements — SLAs, SLIs, SLOs — to track your services and your endpoints. And once you think about that — let's say, because I have experience with an e-commerce platform — at any given time you might have a marketing event and have millions of people coming into your platform in the scope of five minutes.

How are your applications going to work? I've seen so many different sorts of problems where you can throw compute at something, but if the things it depends on break — if it can't get out to the internet, if the NAT gateway is broken, or if other services it depends on in the chain of doing its operations, of doing some processing and producing an output, are broken — there isn't scaling there as you'd expect, and you're not going to know about that until it actually happens.

I fundamentally think there is no way to anticipate problems until you have actually experienced them. So I think chaos engineering is basically saying: you have to commit to being okay with this — and again, it's going to be scary — but things are never going to be perfect. You're never always going to have five instances of your application 100% working perfectly, hitting health checks, responding in under one second. You're never going to have things at perfect capacity. So what actually happens then? What does that mean for your end users — the people using your platform day in and day out, who might be buying something from it or depend on it for whatever reason? This is all together making Kubernetes a better platform for you to continually use and ship applications on, and really get that feedback loop of analytics, metrics and other data, so you know what's actually going on — having that intelligence. It's not the older world, where we just throw it on there, systemctl says the service is running, and hopefully everything's good. I think this is the new model for thinking about how we do things, especially in the cloud native way.

So with that, I'm going to give it back to you, Karthik. You've already shown us a little bit of the platform; you probably want to dive into some of the differences between 2.0 and 1.0. I'd love to hear more about what the intent was with 1.0, and what key learnings you and the team leveraged to figure out what 2.0 should be. Take us through that a little bit, and I'm sure we'll have some questions from there — I know I have a few other questions as well.
Karthik: Sure. I think when we built 1.0, it was, like I said, driven by the need to create something cloud native for doing chaos. One of the things we felt was that in Kubernetes everything happens to be a resource, whether native or custom, and then you have a controller that reconciles things.

We wanted to bring that experience to Kubernetes chaos engineering as well, and that's when we created some custom resources — the ChaosExperiment, the ChaosEngine and the ChaosResult — along with the operator that carries out the chaos process. This is a very brief summary of what is on the hub I showed you a few minutes back: there are a lot of prebuilt templates that each define a particular fault.
Then there is the ChaosEngine — the user-defined one, the one users deal with on a day-to-day basis — where you provide run characteristics for the experiment and map a fault to some component living on your cluster right now, which is either a workload or a service, or maybe, in the case of non-Kubernetes chaos, some instance living somewhere in the cloud.

That's what you create and run. The ChaosEngine is the one that actually triggers injection — its creation kicks off the injection process — and the results of the experiment are stored in a ChaosResult. We keep these as separate resources because there is huge scope for expanding the schema and what each can hold.
In the ChaosResult you can store the experiment status and the verdict of the experiment upon completion, based on certain steady-state validation constraints. You would like to know how each of those constraints fared when you ran the experiment — we use something called probes to define those constraints. And then, of course, you can repeat experiments with different scheduling options: you might want to randomize your experiment runs over a prolonged period of time, either strictly scheduled or with some randomness thrown in.
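A sketch of what the operator writes into a ChaosResult — the field names are assumed from the v1alpha1 schema, and the values are illustrative:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosResult
metadata:
  name: hello-chaos-pod-delete     # conventionally <engine-name>-<experiment-name>
  namespace: demo-space
status:
  experimentStatus:
    phase: Completed
    verdict: Pass                  # Pass / Fail / Stopped
    probeSuccessPercentage: "100"
  probeStatus:                     # how each steady-state constraint fared
    - name: check-downstream-svc
      type: httpProbe
      status:
        verdict: Passed
```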
So these are the chaos CRDs. When we started the chaos engineering project Litmus, we had just this one deployment — called the chaos operator; the rest that you're seeing here came later. Initially you had this operator, and you could create a ChaosEngine manifest, something like this: you just have an application that you're identifying by namespace, label and kind — these are the identifiers for a given application — and you can go and run this experiment with a specific service account.
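Roughly what such a manifest looks like — a sketch against the v1alpha1 schema; the hello-service names and the run tunables mirror the demo, but the exact values are illustrative:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: hello-chaos
  namespace: demo-space
spec:
  appinfo:
    appns: demo-space              # namespace of the target application
    applabel: app=hello            # label identifying it
    appkind: deployment            # kind of the target
  chaosServiceAccount: pod-delete-sa   # the persona running the experiment; its RBAC bounds what chaos can do
  engineState: active
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"          # total fault window, in seconds
            - name: CHAOS_INTERVAL
              value: "10"          # kill a pod every 10 seconds
```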
Litmus allows you to run your experiments with defined service accounts, so you can choose who is doing the experiment and what permissions that persona has, and therefore you can control what can be done as part of the experiment. We might provide permissions, say, just to delete pods and nothing more — so you cannot do a node-level experiment. That way Litmus lends itself to a self-service model of experiments: maybe there is a shared cluster where different service owners and developers are running their own experiments.
You can do that by adopting different service accounts. And then you have something like this: a total duration for which the chaos runs, and some tunables to control the specifics of the experiment. We're going to run this experiment against a very simple service called the hello service — it's basically a hello-world application that lives in a namespace called demo-space. Let me just take you to that. It's just a pod.
What I'm going to show is like the hello world of chaos engineering — we're just deleting a pod. With 1.x we provided this capability to create the resources and run the faults, so you can actually see these run. People found it very convenient to use this in their scripts and automation pipelines, CI/CD, things like that. You basically run it, and the fault happens a few times.
We just set a 30-second duration with a 10-second interval, so it's going to do the kill two or three times; you're going to get a ChaosResult, and you'll get some events on the ChaosEngine resource that might be of interest, so you can actually see what's happening as part of the experiment.
Each experiment in Litmus does a pre-chaos check to see whether the application we're doing chaos on is in a good state, because we don't want to degrade an already degraded system. So we make some checks, then we carry out the fault, then we do some post-checks, and then we finish the experiment. That is essentially what was available. You can see the chaos injection is in progress, like I showed you. This is going to complete the experiment and then allow you to
take a look at the results and see what happened to your applications — you can draw your own inferences from what is happening. So that is what we had. You get a summary based on the constraints. There are essentially no constraints in this experiment, so it says the experiment passed, because the only checks used to verify that it passed were whether the app was good before, and whether it recovered afterwards within a specific period of time.
But as we went along — and I think I just missed showing you the ChaosResult; the ChaosResult is quite simplistic in this case.
You can see that the experiment ran once, it passed, and this was the target. So this is something very simple, but real-world scenarios are more varied and real-world requirements are more demanding, and we got that feedback as we went on building the project in the open. People asked: how can I visualize the impact of chaos?
I have an application dashboard and I want to see exactly when chaos starts and when it ends — that's how they'd like to visualize the impact of chaos. They want to see what stage it is in. Of course, the events are helpful, but events are not for everybody. There are a lot of ways
Kubernetes applications are operated, and different personas are involved. There are some folks who are very deeply involved, know the Kubernetes API, and are very happy to navigate things like logs and events; there are others who are looking for more graphical representations — dashboards. So: how do I visualize the impact of chaos, and how do I validate application behavior? The verdict you're giving me is too simplistic.

How can I go ahead and do more faults, maybe as part of a larger scenario — like the case you mentioned some minutes back, where one node is run to exhaustion and an eviction happens elsewhere, and a pod is stuck in pending state? How do I simulate these kinds of scenarios?
How do I do benchmarking? You mentioned cases where thousands of users are using your platform at a single point in time: how do I simulate that load and then inject a fault? How does the system respond to faults under such loaded conditions? Doing chaos under utopian conditions is not great — I don't want to do it in the idle case; I want it while the traffic is at full tilt. How do you do that — basically do multiple things at once?
With particular experiments there can be a lot of parallel processes. And how do you express resilience — what is the metric you can show that says how your fault, your experiment, and your application or service are correlated? What is the metric that says your application is resilient to this fault, and by how much? And then there are other operational challenges: how do I get different team members to come and collaborate on my chaos artifacts and visualize them?
How do I ensure there is a single source of truth for chaos experiments? YAMLs are great — you can store them in Git — and GitOps is really the in thing: I want to ensure that when a change is made in my Git repo, the experiment definition gets changed on my cluster when it is run. So how do you ensure that? And how do you use a single platform to target different environments?
We touched on this a little bit: doing chaos on Kubernetes, but also doing chaos against other components while still running as an app on Kubernetes — how do you do all this? These are the requirements we got. We spoke to the community, there were several meetups where we went and presented, we got talking to people, and this is what we brought back into the project and built 2.0 with.
The result is an architecture which basically gives you a single, centralized, cross-cloud control plane — we like to call it a centralized management platform — where you can connect one or more target environments for chaos, depending on where you want your chaos to be done. I might have a fleet of clusters, but I can use a single management platform to manage chaos across them. You have the self agent: this is the cluster on which the portal — the ChaosCenter — is installed.
It automatically registers itself as a candidate target environment for chaos; then you can add other clusters as well for doing chaos. Essentially, this is an execution plane. You can run your chaos business logic on the same cluster where the ChaosCenter — the control plane for your chaos — resides.
You can also use a different cluster as an execution plane. By doing that, you are able to discover the microservices sitting on that cluster, and you'll be making use of it to run your chaos pods, targeting resources that live there. And we have this ability to create workflows.
Now, instead of the single ChaosEngine that you saw getting created, you can create a workflow that stitches together more than one ChaosEngine in a different order — in parallel or in sequence — and you can also have load tests embedded within the workflow. You could use Locust or Vegeta or k6.io, or a lot of other tools that run as a job — things like that, that we know the community is using today — and that can be done.
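To make the shape concrete, here is a heavily trimmed sketch of such a workflow. Litmus generates these as Argo Workflows, but the template names, the load-test image, and the litmus-checker invocation below are illustrative assumptions, not the exact generated output:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: bank-resilience-
spec:
  entrypoint: chaos-scenario
  serviceAccountName: argo-chaos
  templates:
    - name: chaos-scenario
      steps:
        - - name: start-load           # load generator and first fault run in parallel
            template: run-load
          - name: memory-hog
            template: run-memory-hog
        - - name: pod-delete           # second fault runs afterwards, in sequence
            template: run-pod-delete
    - name: run-load
      container:
        image: locustio/locust         # any containerized load tool can be a workflow step
        args: ["-f", "/scripts/load.py", "--headless"]
    - name: run-memory-hog
      container:                       # these steps apply ChaosEngine CRs and wait on their verdicts
        image: litmuschaos/litmus-checker:latest
        args: ["-file=/tmp/engine-memory-hog.yaml"]
    - name: run-pod-delete
      container:
        image: litmuschaos/litmus-checker:latest
        args: ["-file=/tmp/engine-pod-delete.yaml"]
```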
If you can containerize it and run it as a pod, you can run it as one of the processes in the workflow, along with the experiment, so you get a better scenario to test. We also have the analytics, which compares workflows using what is called a resiliency score. The resilience score is essentially a metric that connects your experiment and your service; it is calculated from the importance, or weightage, that you give to an experiment.
You take a product, and you take a summation of that product across the experiments listed within a workflow, divided by the total points possible, and you get a resilience score. That resilience score is something you can use to compare over a period of time, and you can see whether your setup is improving in resilience or regressing.
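Roughly formalized, as described here (the symbols are our own, not official notation):

$$\text{resilience score} \;=\; \frac{\sum_{i=1}^{n} w_i \, p_i}{\sum_{i=1}^{n} w_i \, P_{\max}} \times 100$$

where $w_i$ is the weightage given to experiment $i$ in the workflow, $p_i$ is the points that experiment earned from its probe and verdict outcomes, and $P_{\max}$ is the maximum points possible per experiment.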
So that is the analytics. The platform also gives you the option to pick workflows — chaos artifacts — from a Git repository: you can commit workflows that you construct with the wizard provided in the ChaosCenter into your Git repositories, and any changes made there are reflected back.
We also have the embedded ChaosHub here, so when you construct your workflows you can pick experiments from the hub and stitch them together. These are all the capabilities we created in response to the requirements we gathered. I'll do a very quick set of demonstrations to show you how you can leverage this. Mario was talking about e-commerce applications; we've taken the example of a sample e-banking app called Bank of Anthos, which is comprised of several microservices.
You can see that I have Bank of Anthos deployed on this cluster. I have links here just to make visualization easy. There is a balancereader service, which enables me to read the balance here, make payments, do all sorts of things. I set up this application without much resilience, and what I'm going to do is inject a black-hole attack — something very similar to the DDoS attack we talked about.
We're going to black-hole the balancereader service, and that's going to give us a semi- or quasi-operational e-banking application — which is something you generally want to avoid. So let's take a look at how we do it. We're just scheduling the workflow and selecting an execution environment — I'm selecting the same cluster where the Litmus ChaosCenter resides — and I go ahead and select the ChaosHub from which I can pick my experiments. You can define your own hub here as well.
If you're in a private environment, you can create your own ChaosHub and pick experiments from there. Then you go ahead and give it a name — I'm going to call it bank-of-anthos-blackhole — and we're just going to pick an experiment: in this case, pod network loss is the instrument I'm going to use to create this attack. Once I've selected the experiment, I can tune it the way I need. I am interested in the balancereader service residing in the default namespace.
So I'm just going to select that. I have the option of validating some behavior as I run the fault, but to keep it simple for this first round I'm just going to say next, and I'm going to keep the fault active for 60 seconds. This is going to be 100% network loss that we're going to inject.
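These are the tunables behind this attack, as they land in the generated ChaosEngine — the env names come from the pod-network-loss experiment, with the values as chosen in the demo:

```yaml
experiments:
  - name: pod-network-loss
    spec:
      components:
        env:
          - name: TOTAL_CHAOS_DURATION
            value: "60"              # keep the fault active for 60 seconds
          - name: NETWORK_PACKET_LOSS_PERCENTAGE
            value: "100"             # 100% loss: a black hole
          - name: NETWORK_INTERFACE
            value: eth0              # interface inside the target pod
```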
So I'm going to say finish, and I'm going to leave the revert-chaos option off — this is a feature that cleans up the chaos resources, the custom resources created to run the fault, afterwards; I'm going to keep them just to show you the logs. And this is the step I mentioned where you can provide the weightage, or criticality, of your experiment.
I'm going to give it full points. You can run this once, now, or on a recurring schedule; what we're going to do is just run it once now. This is going to give us an Argo workflow underneath — it constructs an Argo workflow with a couple of steps: pull the fault template for network loss, and then actually run the experiment. You can see the ChaosEngine is auto-generated; you don't have to create it by hand.
At this point Bank of Anthos looks healthy and good. Once the fault starts running, you will be able to see that we cannot read the balance or make payments. Like I said some time back, Litmus does pre-chaos and post-chaos checks: the pre-chaos check verifies that the pods carrying the label we just specified are healthy and alive before it actually starts the fault. You can see that the step that triggers the fault has started.
The ChaosEngine schema is quite rich — you can do a lot of things with it. The documentation for that is available here; you can take a look at the concepts section and see all the specifications it contains, and all of that can be tuned. You can set resources, you can define affinity policies for where the pod has to run, and you can inject annotations into it.
You can define, say, the amount of time spent trying to validate whether an application is alive. A lot of things can be tuned, but our setup here is something very simple. The fault is active at this point, so if I refresh the application you can see it cannot read the balance, and if I try to make a payment, that shouldn't go through, because I cannot read the balance to see how much I have in my wallet to actually make the payment.
These are the kinds of things we would really like to avoid in our applications. We need to have the right middleware behavior to direct us to a different replica that is working and ensure things keep working. It is always risky when you have semi-operational applications — for example, I'm able to make deposits, but I can't read how much I have deposited, things like that. So this was a very quick demonstration of how you can inject a fault and how it runs, and you will see that this experiment —
Mario: Yeah, this was wonderful — sorry, I didn't mean to cut you off. I think a lot of people think of down and up; they think in binary terms about how a platform is operating — it's just black and white — and that's not actually how most outages work. Most outages are actually kind of like a brownout: some things are working, some things are not working.

The problem is the totality of maybe some of the critical flows for using an application. This example you used is fantastic: some things might stop working, and things might seem okay, but when a user actually goes to do something, that's when dependency services, other API endpoints that are called, and certain microservices that make up the overall platform are not doing their job as expected. I think that's one of the major use cases for this. And I'm loving this workflow dashboard.

Obviously the schedule interface you have here, and some of the agent and hub stuff, is fantastic — it's all tied together in kind of one unified interface. I think this is the next evolution of what it looks like to be comfortable — feeling like you have the control to test your environment in the way you need to, to really get the correct signal instead of just noise about, oh well, you've got this one pod using lots of resources. That's one of many little things going on, but there's a lot more signal that you need around the flows and what's actually happening in these certain scenarios. So this is fantastic to see, Karthik.

While you click around a little bit, I'm going to ask a couple of lightweight questions, because we have just a few minutes left. I think one of the big ones here that I'm thinking about is: what are some of the key things — what is the next step? You did a great job of talking about what 2.0 is delivering versus 1.0, so what's next on the roadmap? What can we see here over the next few months, as we approach the end of the year? What is Litmus looking to implement that has been something huge that a lot of users have been talking about, or things that you've been asked about — people saying, hey, I'm using this, and if it did X, Y and Z I would be so much more efficient, I'd be able to really nail down certain things? What do those things look like for you and the team?
Karthik: That's a great question. One of the things we're being asked to improve is exactly what you mentioned: you want to see what's happening to your applications. It's not binary; it's generally a brownout, and you want to validate a lot of things, and get insights into a lot of things happening on the cluster even as you run the fault. So probes are one thing we introduced to help do that.
For example, in this case I'm just repeating the pod-delete fault from the 1.x experiment, where we're trying to kill a pod, but along the way we are doing some checks: we are checking whether a downstream application is alive throughout — basically checking that I get a 200 OK against it, at a polling interval of every one second. If it is not, then I would probably like to abort the experiment — you can see stopOnFailure is true. So these are the kinds of checks you can attach.
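A sketch of what that probe looks like in the experiment spec — the schema follows the Litmus probe docs, while the URL and timings here are illustrative:

```yaml
probe:
  - name: check-downstream-svc
    type: httpProbe
    mode: Continuous                 # evaluated throughout the chaos window
    httpProbe/inputs:
      url: http://downstream.demo.svc:8080/health
      method:
        get:
          criteria: ==               # expect HTTP 200 on every poll
          responseCode: "200"
    runProperties:
      probeTimeout: 5
      interval: 1                    # poll every second
      retry: 1
      stopOnFailure: true            # abort the experiment on a failed check
```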
These are the probes I can run today. There are also probes that use Prometheus metrics to check for deviation in your steady state — is it within your SLOs or not? So there is support for different kinds of probes that work with the different observability tools in the CNCF landscape today.
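And a sketch of the Prometheus flavor — the promProbe fields follow the docs; the query and threshold are an illustrative SLO-style check:

```yaml
probe:
  - name: p99-latency-under-slo
    type: promProbe
    mode: EOT                        # evaluated at the end of the test
    promProbe/inputs:
      endpoint: http://prometheus.monitoring.svc:9090
      query: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[1m]))
      comparator:
        criteria: "<"                # steady state holds if p99 stays under 0.5s
        value: "0.5"
    runProperties:
      probeTimeout: 5
      interval: 2
      retry: 1
```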
That's one of the things we're being asked for and are working towards, along with more experiments for non-Kubernetes environments — AWS, GCP and so on.
And the VMware environments — these are environments where people are still running a lot of their services, and we have an initial set of experiments there, but there are requests to add more. Folks who are trying to use this in enterprise environments also want support for different kinds of authentication and authorization mechanisms for how people can access the ChaosCenter and run it.
That is one of the things we're looking to add to the roadmap as well. And there are other fault types within Kubernetes that we can add — using a base set of faults, you can actually create thousands of scenarios.
Litmus provides you a good set in the ChaosHub, but we are committed to increasing that set of faults for Kubernetes and non-Kubernetes targets, and also to providing a better resilience view that people can make use of. Analytics and resilience scores are great, but there are other things people would like to know about when there are experiments running — some probes are a step in that direction. People also want to know things about how their recovery went: how much time did it really take?
Mario: That is all fantastic. We just have one more minute, and I want to end on a strong note for people that are looking to dip their toes in and get started — and maybe even evangelize this a little bit in their organization, or play around in a kind of lower-end environment like a dev environment or a playground. What would you say is the best way for them to get started understanding chaos engineering and starting to leverage it in their day-to-day, on their laptops, whatever they're trying to do? What are some resources?
Karthik: Yeah, I think we have done a fair bit of refactoring on the docs as part of the move from 1.x to 2.0. This is the docs site, docs.litmuschaos.io; there are a lot of resources here that you can use to learn. Some of it is still coming in, but there is a good set of concept docs
that you can use here to learn about Litmus, and you also have the experiment documentation: a lot of information about how you can run each experiment, the different tunables it provides, and how you can run it with different options. For example, when we talk about a given fault,
there are different ways in which you can run it, a lot of which is explained here. So I would recommend a couple of good resources: the docs, and the pages of the repository itself — that's where you can find information about Litmus. When it comes to general information about chaos engineering, you can take a look at the Principles of Chaos,
and CNCF has just gotten started with a Chaos Engineering Working Group, which is actively trying to put together information about chaos engineering for beginners, and for practitioners who've been doing chaos engineering for a long time and are jumping into the cloud native world, looking at doing chaos engineering the cloud native way. We meet once every two weeks, and we are trying to put together a white paper that talks about chaos engineering.
There is some information there, for example, if you are looking at common terminologies associated with chaos engineering: this dictionary that we're creating is not really an alphabetically sorted glossary; it's more about chaos engineering as you learn it. You start with the principles, and then you talk about what an experiment is and how to understand each part of it — blast radius, hypothesis, SLIs, SLOs — and how you can practice it as an SRE: how you can conduct game days. That's information we are trying to expand upon. So this is probably a good space to look at.
Mario: For sure — yeah, this is perfect. I did not know there was a working group; that is amazing to hear. Thank you very much, Karthik — so much great content today. You really did some amazing demos as well, and the demo gods were clearly with you. litmuschaos.io — and ChaosNative is the company. Thank you to Karthik today, and thank you to the people working behind the scenes. My name is Mario Lauria; it's been a great pleasure to host today's session. Talk to everybody later — have a great rest of your day.