A: Hey everyone, welcome to this webinar on the project updates of open source Litmus and the new set of features that we have been working on and added to the open source product. My name is Shayan, I'm a senior software engineer at Harness, and I've been closely working with the Litmus open source project for the past two years, mainly on Litmus 2.0 and the features associated with it.
A: We'll cover the developer-centric features that we have added to the open source platform, and we are also going to take a look at two different use cases to help understand how Litmus is actually being used in the world right now and how Litmus has been adopted by different clients. We are also going to finish with a small demo, where we'll see how the new, enhanced features of Litmus can be put to use in a commercial application.
A: So before we jump into the new developer-centric features that we have added, here's a quick refresher on what Litmus is, for all the new users out there. Litmus is a cross-cloud, cloud-native chaos engineering toolset, or a framework you could say, which helps not only SREs but also developers, and really any persona, try out chaos, with seamless integration and automation that will help ease your chaos engineering journey. So you can choose different experiments.
A: You can create scenarios out of them, and then you can run your workflows and do chaos in a much simpler way, with the help of Litmus. Now, assuming some of the community users present here have already used Litmus, we are going to take a look at the developer-centric and developer-focused features that we have added to help both the developers as well as the community users and contributors that have been working closely with us.
A: So, previously we already had the ability to add probes, but we have also worked on that and improved the probe addition capability. Now, once you've added a probe, you can also go ahead and edit the same probe. Previously we did not have that capability, but now you can edit it.
A: Previously, you had to delete the entire probe and then create a new one, which was quite a bit of a pain when you are constructing new probes, thinking about the hypothesis, and doing the steady-state validation with probes. So it's better that you get an edit option. Now you do, and you can change the probe type in the same edit feature, so you can completely change the probe altogether with this new probe-editing feature.
A: You can also update the same steps, and that should be visually reflected in the experiment sequence as well. There's also the option to go to the configuration wizard, which is the pencil icon that you see on the different experiments in the table. There you will have the option to tune your environment variables and give them certain values, which will be reflected in the ChaosEngine. And at the bottom of the table you also get advanced options, where you can select the pod GC strategy, that is, whether you want to clean up all the different pods post-chaos. And if you want to add tolerations to your particular chaos experiments, you can do that kind of advanced configuration there as well.
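As a rough sketch of where those knobs land: the values chosen in the configuration wizard ultimately populate the ChaosEngine manifest. The field names below follow the Litmus 2.x CRD, but the target app, labels, and toleration are placeholder assumptions:

    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosEngine
    metadata:
      name: cart-chaos                 # hypothetical engine name
      namespace: litmus
    spec:
      appinfo:
        appns: litmus                  # namespace of the target app (assumption)
        applabel: app=cartservice      # label selector for the target deployment
        appkind: deployment
      engineState: active
      chaosServiceAccount: litmus-admin
      jobCleanUpPolicy: delete         # "delete" cleans up experiment pods post-chaos; "retain" keeps them
      experiments:
        - name: pod-delete
          spec:
            components:
              env:                     # values tuned in the wizard land here
                - name: TOTAL_CHAOS_DURATION
                  value: "30"
              tolerations:             # advanced option: tolerations for the experiment pod
                - key: dedicated
                  operator: Equal
                  value: chaos
                  effect: NoSchedule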
A: Third, we have the ability to upgrade a chaos delegate. Let's say you are running your Litmus workflows on a delegate deployed with the latest version of Litmus, but you're a bit behind on the upgrade, and the community has pushed a new feature, a new upgrade of the chaos delegate. With the new versions of Litmus, what you would notice is that there's an upgrade option on the chaos delegate side.
A: So if you're on the latest deployed version of Litmus and there's an upgrade available for your chaos delegate, then Litmus would suggest that you go ahead and update your chaos delegate to the latest version. If you're already on the latest version, the button would be disabled; but if you are lagging a little behind on the update, then Litmus will show you the option to upgrade your chaos delegate to the latest version.
A: Next, we have added a more secure RBAC update to the APIs. These RBAC security updates have been added both to the API as well as to the UI. This was mostly a bug reported from the community side and addressed by us: there are two RBAC permissions, the editor and the viewer.
A
So
as
a
viewer,
you
shouldn't
be
able
to
access
certain
pages
certain
apis,
certain
you
shouldn't
be
able
to
do
certain
perform
operations
as
a
viewer
which,
where
there
was
a
few
leaked
cases
where
you
would
be
able
to
create
a
scenario
or
view
and
like
go
to
the
editing
section
of
a
particular
workflow
which
you
shouldn't
be
able
to
do
as
a
viewer.
So
those
things
are
addressed
and
now
we
have
added
a
more
secure
hardback.
So
now
let's
say
a
viewer
is
given
the
specific
screen
through
different
api
calls.
A
Even
you
wouldn't
be
able
to
do
so,
because
the
our
back
checks
are
also
added
to
the
api.
So
now
the
apis
are
hardened,
as
well
as
the
ui.
So
now
a
viewer
should
not
be
able
to
view
screens
which
are
only
accessible
via
the
admin
and
the
editor
also.
There
have
been
requests
of
adding
the
support
of
running
or
scheduling
a
basic
cargo
workflows,
rather
than
just
chaos
workflow.
A: So previously we had support for running different kinds of chaos workflows, which were also Argo workflows, but we didn't have support for running a very stripped-down, simple version of a basic Argo workflow. So now we have modified our back end and added support for scheduling a basic Argo workflow as well. So if you directly take an Argo workflow from Argo and try to run it on our platform, that should work as of now.
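For reference, the kind of stripped-down Argo workflow being described is roughly the stock hello-world example from the Argo documentation, nothing Litmus-specific:

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: hello-world-       # Argo appends a random suffix to the name
    spec:
      entrypoint: whalesay
      templates:
        - name: whalesay
          container:
            image: docker/whalesay     # prints a message and exits
            command: [cowsay]
            args: ["hello litmus"]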
A: Similarly, once Kubernetes updated to 1.22.0 and above, there were a few APIs and a few manifest deployments that users coming from the community side were trying, and they were unable to execute them, because we didn't have support for 1.22 and above.
A: We have also addressed that, and currently we do support all Kubernetes versions 1.22 and above. We have also added support for IPv6. And to ensure that we have better end-to-end coverage and strictness across all the different builds, and continually do these kinds of deployments via nightly builds or via regular CI checks, we have ensured that we do better testing, and we have added more e2e test suites to cover all aspects of this development.
A: So when we push or complete our work by the end of the day, there are always multiple nightly builds happening, and they also give you a report, so that you know the status of your deployments. We have also added the ability to skip SSL verification. Let's say, in the case of applying a manifest, you are trying to connect an agent: if you want to skip the SSL verification, there is very much an option to do so now; we have added that feature.
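As a sketch of what that toggle looks like on the agent side: the subscriber deployment in the agent manifest carries an environment variable for it. The variable name below is an assumption, so verify it against the manifests shipped with your Litmus version:

    # excerpt from the chaos agent (subscriber) Deployment
    env:
      - name: SKIP_SSL_VERIFY          # assumed flag name; check your agent manifest
        value: "true"                  # skip TLS certificate verification when connecting to the control plane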
A: You can just provide the flag, and you'll have the ability to skip your SSL verification if your use case demands it. Also, for developers, both the community developers as well as the developers in the core team, we have improved the logging functionality of the different servers that we use. That way, whenever you encounter a bug, or you're looking for a specific log, let's say for an agent connection or a subscriber connection, those log metrics have been enhanced.
A: So now you'll be getting better events and better results, so that you know what exactly went wrong or what exactly is happening. We have enhanced that part of the logs both for the development as well as the production setup. For production, when you visit the chaos engine logs, that is, when you open the workflow and check the pod logs there, they have better highlighting. Let's say a probe resulted in success or failure, or the experiment verdict resulted in success or failure: you'd have that individual line highlighted in either green or red, or a warning color. So we have taken measures to enhance the logs both on the production and the development side.
A: Now, on the internal side, we have also migrated the project collection that we use. The project collection was used mainly to store metadata of your Litmus projects. Previously it was in the litmus database; now we have shifted it to the authentication database. And apart from just this shift, we have also done an internal code refactor of the authentication server to enhance and improve security. We have also added enhancements in the cmd probe, specifically in its source mode.
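For readers unfamiliar with it, the cmd probe's source mode lets the probe command run inside a user-supplied image rather than inline in the experiment pod. A minimal sketch, in which the probe name, command, and image are all placeholder assumptions:

    probe:
      - name: db-check                   # hypothetical probe name
        type: cmdProbe
        mode: Edge                       # run before and after chaos
        cmdProbe/inputs:
          command: "mysql -e 'select 1'" # assumed command
          source:
            image: mysql:8.0             # source mode: run the command from this image
          comparator:
            type: string
            criteria: contains
            value: "1"                   # expected output for a healthy database
        runProperties:
          probeTimeout: 5
          interval: 2
          retry: 1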
A: We have also hardened the Litmus Alpine images that were used in the different LitmusChaos tools, and the e2e pipelines to monitor all the pipeline builds have been created. So there are nightly builds and a whole e2e dashboard that you can explore, where you can see the different builds for individual workflows, for each of the experiments, etc. And lastly, we are of course working on new experiments, like the AWS AZ experiment, Azure disk loss, etc. Those different kinds of new experiments have also been added to the public ChaosHub.
A: Our first use case is iFood. They deliver more than 60 million orders each month. iFood was founded in 2011, and it aimed to provide a better and quicker solution for online food delivery with innovative systems, so that users can order deliveries on the internet with no hassle and with ease of use. It has over 80 percent of the market share, and geographically iFood covers most cities and regions in Brazil, especially Brazil's financial center of São Paulo.
A: This came with its perks, but it also came with much more complexity and additional costs, so that was one reason why they were looking for solutions to handle these kinds of complexities and to deal with resiliency. The more and more they scaled, faults like database services going out of service, message brokers crashing, entire regions of cloud providers going down due to different kinds of outages, and network bandwidth dropping without notice were definitely some of their challenges; these are the kinds of outages that might happen to your systems and your applications.
A: So these were some of the major problems that iFood was facing in Latin America, and they wanted to mitigate these challenges, because the user base was growing and they wanted to give users seamless access to their delivery platform. Now, they decided to tighten up the reliability by continuously doing load tests and bare-minimum chaos experiments, but the solutions that they were using at the time lacked specific use-case-driven functionality.
A: So if they had certain scenarios in mind, they wouldn't be able to do them; they were only able to target and do basic chaos. Based on the problems mentioned above, they wanted a solution which could be tailored to their specific use cases, along with the ability to customize their own kinds of scenarios based on their experience, and to use them in an automated fashion.
A: They also had the requirement to know which users performed what kind of chaos, to enable better RBAC control in production. Chaos testing requires a certain amount of responsibility when you're doing it, so they wanted to know which user is performing what kind of chaos. If they're doing it on production, or on a specific environment, and something goes horribly wrong because of a specific chaos experiment, then they would like to know which user performed what experiment, and what the scale of it was.
A: What exactly did it target? Those are the kinds of things they wanted to know, which user performed what, to enable better RBAC controls. And the current chaos engineering solution they were using was not really automated, and it also had a limited number of experiments. With the amount of ideas that iFood had regarding the scenarios they wanted to create, they wanted to customize the experiments and eliminate manual effort as much as possible: take these ideas, create custom scenarios, automate them, and have everything running by itself rather than doing it manually.
A: So these are some of the challenges that they faced, and one of the main challenges was downtime, which is why the thought came into their mind to switch to an automated chaos tool. So what exactly goes wrong if you have downtime? Right off the bat, there's a loss of customer confidence, which is the biggest letdown.
A: If you have an application with a huge base of customers and there's even a slight amount of downtime, you would have a loss of customer confidence, not to mention the amount of costs that you might incur in that time frame. The average duration of an outage is reported to be about 79 minutes, and the average cost of these downtimes is about $84,000, which is huge considering that period of time.
A: Let's say it's not even 79 minutes, even if it's five minutes: in those five minutes, millions of users could have ideally clicked on the app, wanted to get food, wanted to have something delivered, or just wanted to check out your platform.
A: Now, Litmus came in with the idea of providing a lot of chaos experiments which suited their requirements, because Litmus has the ability to add your own private hubs, as well as the public hub, which is also filled with over 50-plus experiments covering a range of different types of experiments. So they can usually pick up one of those experiments and then add on top of it.
A: They really liked that idea of being able to customize something for their specific requirements, because there were multiple options, multiple areas that Litmus was touching. So they went with a declarative approach, which helped them customize these chaos experiments and then tune the ChaosEngine further, adding their own ENVs and attuning it specifically to their requirements. Litmus also gave them the ability to fine-grain it.
A: We also gave them the ability to construct a workflow as a cron, because they wanted to automate and save manual labor. We also have the option to save the different scenarios as templates for later use, so that aided easier automation and automated chaos at specific intervals. So that is one feature that they considered handy, and that is something that is helping them automate this entire process and remove manual labor as much as possible.
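As a rough illustration of the cron idea: in Litmus 2.x, a scheduled scenario is ultimately an Argo CronWorkflow wrapping the chaos steps. A minimal sketch, where the schedule, names, and checker arguments are assumptions modeled on the workflows the Chaos Center generates:

    apiVersion: argoproj.io/v1alpha1
    kind: CronWorkflow
    metadata:
      name: cart-chaos-cron            # hypothetical scenario name
    spec:
      schedule: "0 2 * * *"            # run the scenario every day at 02:00
      workflowSpec:
        entrypoint: chaos
        templates:
          - name: chaos
            container:
              image: litmuschaos/litmus-checker:latest   # applies the ChaosEngine and waits on its result
              args: ["-file=/tmp/chaosengine.yaml", "-saveName=/tmp/engine-name"]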
A: That was one of the stories of what iFood is currently using Litmus for. To continue with the next use case and also the demo, I would like to hand it over to Nilanjan, and he can guide you through the rest.
B: Our second use case is Halodoc. It's not an isolated incident where newly added services go down and eventually get mitigated after much effort, affecting the team and the end users. In a system with the kind of dependencies that Halodoc had, it was prudent to test and measure service availability across a host of failure scenarios. This needed to be done before going live and occasionally after it, albeit in a more controlled manner. Hence, chaos engineering was found suitable to supplement the existing QA, with comprehensive automated test suites and periodic performance testing analysis, to make the platform more robust.
B: Considering that the microservices span several frameworks and languages, such as Java, Python, C++ and Golang, it was vital to subject them to varied service-level faults. Add to it the hybrid nature of the infrastructure and the varied AWS services, and the need for the ability to target non-Kubernetes entities like cloud instances, disks, etc. becomes clear.
B: Furthermore, the application developers were required to be able to build their own faults, integrate them into the suite, and have them orchestrated in a fashion similar to the cloud-native faults. As for chaos scenario definition, there was a need for full-fledged chaos scenarios that combined faults with some custom validation, depending on the use case, as the chaos tests were expected to run in an automated fashion after the initial experimentation, or after establishing a good fit.
B: Halodoc needed a tool with the ability to isolate the chaos view for the respective teams, with admin controls in place so the possible blast radius stays contained. This, allied with the standard security considerations around running third-party containers, was required. As for observations, Halodoc relies heavily on observability, both for monitoring application and infrastructure behavior (the stack includes New Relic, Prometheus, Grafana, Elasticsearch, etc.) as well as for reporting and analysis.
B: They use Allure for test reports and Lighthouse for service analytics. It was only judicious to choose a chaos framework that can provide enough data to ingest in terms of logs, metrics and events. Lastly, community support: Halodoc saw value in an open source project that has a strong community around the tool, with approachable maintainers who could see reason in the issues raised and the proposed enhancements, while keeping a welcoming environment for users who can contribute back.
B: Hence, Halodoc chose LitmusChaos, which met the requirement criteria to a great extent, while having a roadmap and release cadence that aligned well with their needs and pace. Another reason for choosing LitmusChaos is the GitOps support, which allowed for the automation of chaos experiments. Halodoc has also contributed towards better user experience in the Chaos Center and towards improving the security of the platform from their side.
Halodoc's
initial
efforts
with
litmus
involved
manually,
creating
chaos,
engine
custom
resources
targeting
the
application
ports
to
verify
their
behavior.
This
in
itself
proved
beneficial
with
some
interesting
application.
Behavior
unheard
in
the
development
process.
Eventually,
the
experiments
were
crafted
with
right
validations
using
litmus
probes
to
form
chaos,
workflow
resources
that
can
be
invoked,
programmatically
and
automate.
The
process
of
hypothesis
validation
during
the
chaos
today.
B: While the chaos experiments on staging are used as a gating mechanism for deployment into production, the team at Halodoc believes firmly in the merits of testing in production. Scheduled chaos experiments are used to conduct automated game days in the production environment, with a mapping between the fault type and the load conditions, devised based on the usage and traffic patterns.
B: The upgrades of the chaos microservices on the clusters are carried out in much the same fashion as any other tooling, with the application undergoing standard scans and checks in the GitLab pipelines. With that, we are all set for a demonstration of Litmus, where we'll see how we can inject chaos into a Kubernetes application to assess its resiliency. See you in the demo.

B: Hello there, and welcome to the demo on LitmusChaos. But before we actually jump into creating some chaos: as you can see, I'm here in my Chaos Center.
B: I would also like to show you this dashboard, which is a Grafana dashboard for our boutique application. As you can see right now, the metrics that we can observe here in the dashboards are indicative of normal system behavior. We have a blackbox exporter, which indicates the service endpoint is quite healthy, and the probe success percentage for the same is 100.
B: We can also see that the queries per second for the cart lie somewhere in the range of 40 to 60 QPS, or OPS basically, which is indicative of normal system behavior, and the access duration, or you could say the latency, is also quite low right now, in the vicinity of somewhere between 2.4 and 2.8 seconds, which is quite normal.

B: So with that, we can actually go ahead and target chaos at our application, using an HTTP chaos experiment. To do so, in my Chaos Center I'll first of all schedule a chaos scenario. I'll choose the self-agent that I have and go next. Then I'll choose a ChaosHub, because that's where my pod HTTP latency experiment is situated. Then I'll go next, and we can name this the cart chaos scenario, since we are targeting the cart.
B: Now that we have added our chaos experiment, we just need to simply fill it in, specifying the exact details of our chaos, so that the experiment can target the requisite pod and the resource that we want to target.
B: So for that, I'll first of all go next over here, and here in the application namespace we need to choose the namespace in which our boutique app lives. That happens to be litmus for now, since we have installed it in the same namespace as LitmusChaos. And for the application kind, well, it is a deployment.
B: So, to do so, I'll first of all add a new probe. I'll go for, let's say, a cart probe over here (that's the name of the probe), and it will be of the type of an HTTP probe, which will be running in Continuous mode, that is, throughout the experiment in a continuous fashion. Before we fill in any of the probe properties, I'll first of all try to bring your attention to what we are going to do as part of this HTTP probe.
B: So we are just going to provide the URL over here for the cart, and the condition that we are going to enforce is that we are expecting a response code of 200 whenever we perform a GET request. So what will happen is that we'll be performing an HTTP GET request at this particular endpoint in a continuous fashion throughout the duration of the experiment.
B: Now we can go ahead and specify a few probe properties. First, the timeout, after which the probe attempt would time out and basically fail: let us give this as three seconds. Then how many times we shall retry in the event that our probe is actually failing: we can set this to one, we can retry once just to be sure. And then, what is the interval that we want to have between successive probe iterations?
B: We can set that to one. So with that, we are pretty much done with expressing our probe in a declarative fashion, and that's all you basically need to initialize a probe and check your application's steady-state conditions during the chaos. With that, I'll add the probe and go next.
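Put together, the probe built in the wizard comes out looking roughly like this in the scenario manifest; the URL is a placeholder assumption, and the field names follow the Litmus 2.x httpProbe schema, with the run properties in seconds as in the wizard:

    probe:
      - name: cart-probe
        type: httpProbe
        mode: Continuous               # run throughout the chaos duration
        httpProbe/inputs:
          url: http://cartservice.litmus.svc.cluster.local:7070   # assumed cart endpoint
          method:
            get:
              criteria: ==             # compare the observed response code...
              responseCode: "200"      # ...against the expected 200
        runProperties:
          probeTimeout: 3              # fail an attempt after 3 seconds
          retry: 1                     # retry once on failure
          interval: 1                  # wait 1 second between iterations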
B: Lastly, in the last step, we just need to specify a few environment parameters for the experiment; these are the parameters with which the experiment will run. First is the total chaos duration: we'll be running this for a duration of 60 seconds, which seems plausible. And for the latency, what I'll do is add a very big latency, which would essentially go ahead and block our HTTP requests for that latency value, and this is in milliseconds. So I'm adding an HTTP latency of 80,000 milliseconds, which appears to be very large, and we'll see what happens when we apply this large a latency in this experiment.
B: Also, we need to provide our target service port; this is the port that we are targeting on that deployment's service. So let us try to see what this target service port looks like in our Kubernetes terminal, that is, using kubectl. What I'll do is list all the services that we have over here. You can see that we have a cart service, and the cart service has a port of 7070, so we'll be using this as our target port.
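For context, the Service being inspected looks roughly like the cartservice definition in the boutique demo application; treat the selector and names as assumptions:

    apiVersion: v1
    kind: Service
    metadata:
      name: cartservice
      namespace: litmus                # deployed alongside Litmus in this demo
    spec:
      selector:
        app: cartservice               # matches the cart deployment's pods
      ports:
        - name: grpc
          port: 7070                   # the target service port used by the experiment
          targetPort: 7070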
B: With that, we are ready to go ahead, but not before we actually specify the pods affected percentage. This is the percentage of the pods that we mean to target. The minimum number of pods that this experiment will target is one, and above that, whatever percentage we specify over here is the percentage that it will go ahead and target. So I'll mention 50 over here.
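Taken together, the tunables chosen in this step map onto the standard pod-http-latency environment variables; a sketch of the resulting env block with the values from this demo:

    env:
      - name: TOTAL_CHAOS_DURATION
        value: "60"                    # run the fault for 60 seconds
      - name: LATENCY
        value: "80000"                 # injected HTTP latency, in milliseconds
      - name: TARGET_SERVICE_PORT
        value: "7070"                  # the cartservice port found via kubectl
      - name: PODS_AFFECTED_PERC
        value: "50"                    # target 50% of matching pods (minimum one)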
B: So with that, we can finish up over here, and I'll set revert schedule to false. What this will do is that, basically, it won't delete any of the experiment metadata that gets created during the experiment execution; this includes all the pods and the workflow resources that we have as part of the experiment.
B: This will allow us to retain the logs so that we can view them. With that, I'll go next, and we can specify a weight for the calculation of the resiliency score at the end of the test. We can keep it at 10; since we have only one chaos experiment, it doesn't really matter what weight we provide here. We can go next now, and I would like to schedule it now, then go next.
B: So our chaos scenario has been successfully created over here; as you can see, it's running. So let us actually wait for a while for the experiment to get initialized. You can see right now that the chaos experiment is getting installed: the pod HTTP latency experiment is being installed as part of this step.
B: All right, so with that, our installation of the chaos experiment is over, and now we can see that the pod HTTP latency experiment has in fact started. So what we can do is verify the effects of this experiment in real time using our observability dashboard, that is, the Grafana dashboard.
B: We are essentially applying a very large latency to our cart service application, and therefore we can see that the access duration, or the latency, is increasing exponentially, while the QPS is also taking a hit. You can see that the mean QPS, indicated in yellow, is going up, while the 99th-percentile, the immediate QPS, is in fact going down.
B: The response is still pending, and we can see that it says something has failed; there are some logs and basically some information for debugging. It says 500 internal server error, which makes sense, because we have essentially added a very large latency, and right now the front end is not getting any information from the cart, and hence we are observing this error.
B: If we go back to our application dashboard right now, we can see that the chaos duration has in fact passed, and at this stage we can see that the experiment is effectively getting over. We are right now in the post-chaos stage, where the fault's effect has been reverted, hopefully, and what we would like to understand right now is what this did to our application. What we saw in real time is that our application was unavailable during the chaos, but it's very much important to validate what is happening in an automated fashion.
B: As we wait for our experiment to conclude, we can see that the service metrics are again regaining normal system behavior: the access duration is going down, the cart QPS is returning to its normal state somewhat, and the blackbox exporter, that is, the probe success percentage that we have, is also getting back to a normal 100 percent.
B: We can see that there's no remnant of the 500 internal error that we were getting, since the effects of the chaos have been removed for now. So if we go back to our Chaos Center, let us try to analyze what went wrong in this experimentation and what Litmus has to say about it.
B: If we go inside the table view and try to view the logs and results, we can see that we have all the experiment logs over here. As part of our experiment logs, of course, we are first of all getting all the different experiment metrics, for example the running pod over here; this is the name of the pod that we are targeting, the cart service. Then we are also seeing the run properties of the probe over here, that is, the timeout, the interval, and the retries.
B: So, over the course of time, you can see that initially we were getting an actual value of 200 when the probe was running, which makes sense, since before the chaos the service was working correctly, and hence we were getting a 200 response code as expected. But during the chaos, well, we didn't quite get a 200 response code, which can be seen over here in this log.
B: It says that the actual value is 500, which does not match the expected value of 200, and this is in sync with what we saw earlier in our application in the browser as well. Basically, we saw that we were getting a 500 internal error, and therefore this has been the cause of the failure of this experiment.
B: As you can see, the probe status has failed, and therefore the experiment has failed. This shows how Litmus probes can be leveraged to automate the process of hypothesis validation during the chaos, and how you can use the logs for verifying the precise cause of the failure, or the passing, of an experiment using LitmusChaos. We can also get a quick summary of the entire experiment using the chaos result, where we can see that the experiment status is completed but the verdict is failed, and the probe success percentage is zero for the probe that we defined, that is, the cart probe. We can see that it says 'better luck next time' for the continuous mode, which means that, well, it has failed.
B: So with that, we saw how we can validate the experiment, how we can use Litmus probes in order to validate our chaos experiment's resiliency. We got the information and the validation that, well, something is not quite right with our application, and some component of it, at least, is weak. So how can we make it more resilient in this case? The most plausible fix could be to just bump up the number of pods that we have as part of our application deployment.
B: So let us actually try to do that. We have one pod right now; let's scale it up to maybe two pods, re-run this experiment, and see how the experimentation goes. I'll go back to my terminal, and what I'll do is try to scale up the cart service deployment, which is the deployment backing the cart, to two replicas.
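Declaratively, that scale-up is just a bump of the Deployment's replica count; a minimal excerpt, assuming the boutique demo's cartservice names:

    # excerpt of the cartservice Deployment; equivalent to
    # `kubectl scale deployment cartservice -n litmus --replicas=2`
    spec:
      replicas: 2                      # was 1; the second pod keeps serving while the other is under chaos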
B: All right, with that, our pod HTTP latency experiment has actually started. So let us head back to our application dashboard, the boutique dashboard. We can see that slowly our chaos experiment is taking effect over here; the chaos annotation is quite prominent. But what we are observing here in this case is that, so far, our steady state seems to be maintained: the probe success percentage for the endpoint of the cart service seems to be stable.
B: The access duration of the service is also spiking up a bit; it's in the vicinity of 2.5 seconds right now, which is not too bad. So it seems that the chaos is doing something to our application: the QPS is steadily increasing, and the access duration, the latency, is kind of flattening out. But the most important question remains: is the application still available? For that, I'll simply refresh.
B: And as you can see, this time around it's not going down; we are not waiting on any response code or anything as such. It's still accessible, no matter what the application dashboard is showing. To compare with the result on the application dashboard, I'd actually like to compare them side by side, maybe over a 15-minute window.
B: So this time around, you can see that although we observed a spike in the access duration for the cart service, it's much less compared to our earlier run; it's almost half, which makes sense, because we have added one more pod, and that is mitigating the effects of the chaos and therefore helping the application sustain it.
B: So with that, we can see that our chaos duration has essentially passed. We can go back to our application right now, and we can see that, well, even after the removal of the chaos, everything is fine and working in order. We can wait for our chaos experiment to complete to observe its effect, and, as you can see, this time around the chaos experiment has completed over here.
B: So let us try to observe the logs this time. Although we can already see that it has passed, let us still validate using the logs in the chaos result. If we take a look at the logs this time around, we can see that, of course, before the experiment as well, we were getting an expected value of 200 as well as an actual value of 200, that is, the response code, which makes sense.
B: And this time around, every time we performed this check, we always got a 200 response code, and there was no response timeout. That is, we are right on track with what we observed in the browser as well: the website was available throughout the experiment duration, and as a result, our probe has in fact passed. As you can see, the cart probe has passed, and this in turn ensured that our experiment is passing.
B: We can observe the same from our chaos result. As you can see over here, the experiment status is completed, while the verdict is passed, with a probe success percentage of 100: since we had only one probe and it passed, the probe success percentage is 100, and the continuous probe that we had defined, by the name of cart probe, has passed. So with that, we conclude the demonstration of LitmusChaos.