Description
Adrian Gonciarz is sharing their journey on how they implemented application resiliency evaluation with LitmusChaos, Locust, GitLab, and Keptn.
Meeting notes:
https://docs.google.com/document/d/1Om9pj16hGKP_w2vUaH-7Cp0ffEIj-Oe3IezeVCpFYAM/edit#heading=h.49tbq0kx1jf9
Learn more: https://keptn.sh
Get started with tutorials: https://tutorials.keptn.sh
Join us in Slack: https://slack.keptn.sh
Star us on GitHub: https://github.com/keptn/keptn
Follow us on Twitter: https://twitter.com/keptnProject
A
Hi, and welcome everyone to this edition of the Keptn user group. Today we already have a couple of folks joining, and I'm very honored and pleased that we also have a great presentation from Adrian coming up. I've already seen parts of it, and I really love the use case and how Adrian actually solved the challenge of evaluating application resiliency with a couple of different tools.
A
He brought everything together, and he will be explaining both the challenge and the solution to us in the bigger part of this Keptn user group. For today, the Keptn user group is also following the CNCF code of conduct. That basically means: please be nice to each other, please respect each other. If there are any questions, please put them in the chat, and after Adrian's presentation I will open up all the microphones, or allow you to unmute yourself.
A
So we can have a discussion, and please feel free to ask your questions directly to Adrian, or also questions around Keptn and the Keptn project in general. If there are use cases you're interested in, please feel free to share your questions and your thoughts and give feedback. I will again put the link to this document that I'm sharing right now in the chat, so please feel free to also add your name in here, so we can keep track a little bit of everyone in the sessions.
A
Just write your name and your affiliation. That would be great, actually.
A
Yeah, with this, thanks everyone for joining, and please also put this on your calendar. We are doing these kinds of meetings every third Tuesday of the month, and as you already know from the first edition of the Keptn user group, it's more focused on Keptn users: how users are actually taking advantage of Keptn and what main use cases they want to implement. It's therefore not so much focused on Keptn development, like the Keptn developer meetings are.
A
We still do those on Thursdays, at the same time as the Keptn user groups, but we just have these two different kinds of meetings. So with this, I think I'm already handing over to Adrian. I know that you have a short introduction for yourself in your slides, so I will just give the stage to you. Adrian and we agreed that he will also share the slides afterwards, so we will put the slides in here, and we will also put the recordings up on YouTube.
A
We will also put the recording link in here. So thanks already for putting in your names, and I will be handing over now to Adrian. Really, thank you very much; over to you.
A
All good, we can see your screen. And again, if you have any questions, please put them in the chat or in the Q&A, and I will either throw them over to Adrian or we will just discuss them afterwards, after the presentation.
B
Perfect, thank you very much, Jürgen. First of all, thank you for having me here. Thank you for your continuous support, Jürgen and Andy and everyone at Keptn and Dynatrace. That's really a blessing, to have such support when you take on a difficult task you have no idea about. Basically, this is the first story and the first chapter of the journey.
B
I had no idea where it was going to lead me and how it was going to go along the way, but hopefully I was able to reach the first milestone. First of all, when you listen to my presentation, don't treat it as any kind of expert's story, or as huge expertise in this kind of implementation, because for me personally this is the first try at this kind of chaos engineering and resiliency testing. It was a hard but joyful time. So please be aware that this is nowhere near perfection.
B
This is just the first proof of concept, and hopefully a great start to something really important for us. So I will tell you a bit of a story, but first let me turn on my timer, so I don't talk all day.
B
So, what I would like to show you today. This is also a challenge, because I don't really like online meetups, but we are in these difficult times where we cannot meet each other, have a beer, and talk about this. So let me tell a short story about the idea, about the way that led us to a prototype, where we are now, and where we want to go with the idea of evaluating application resiliency, or basically implementing some kind of site reliability engineering at Kitopi. But first, two words about me.
B
Also, as I said, right now I'm a bit more on the SRE side, and who knows where I'm going to go. I'm absolutely obsessed with Python. I never fell in love with Java, but when I learned Python it totally got me, so most of the things I write, I write in Python, whether it's test automation, support code, some SDKs, and so on.
B
I'm also quite obsessed with Kubernetes and cloud. Actually, that's quite funny, because when I was working at a company called Impost, part of whose engineers are now working at Kitopi, that was the first time I heard about Kubernetes, and I was lucky enough to meet the people who basically taught me about Kubernetes, like Sebastian at Kitopi, and Martin.
B
Now we are back on the same track for another journey. I'm also teaching people, mostly in areas related to software testing; I teach at universities in postgraduate studies, and I also do some trainings. Most of all, I'm quite focused on automating stuff, not only test automation, but basically how to automate all kinds of things. We have some guests from GitLab, so I'll tell you that a lot of my problems are solved using GitLab pipelines.
B
Kitopi is a kind of software platform where you can put your restaurant, where you can put your shop, and which serves as a meeting point between people ordering food and the restaurants and companies that deliver food, mostly in the Middle East, but I'm quite sure they have much more to come. So basically you are dealing with quite big traffic related to putting orders into the system, processing those orders, and serving them to different kinds of entities, such as kitchens, restaurants, and so on.
B
And that's where the idea started. We met for a quick coffee with the guys at Kitopi a few months ago, I think it was November or something like that, and we had this idea, or they had this idea, which was sparked by, you know, incidents.
B
So we had this little production issue, and that's how it always starts. As an action point from this little production issue, they said that they would like to, or we would like to, test and improve the resiliency of our system. So we needed to find a way to test it.
B
So we thought it could be done in a modern but quite natural way, and I think for many cloud-native platforms the natural way today means chaos testing. If you are not familiar with the idea of chaos testing, let's just say that this is the kind of engineering that is supposed to introduce some kind of malicious behavior into your system. Malicious may not be the best word; some kind of unexpected behavior, such as network problems, or maybe problems with some other services.
B
Basically, you are introducing some kind of chaos into the services you depend on, so you are simulating, more or less, a real situation you would sometimes face in production. Sometimes one of your services is not available. Sometimes you have a problem with the connection to the database. Sometimes maybe you have problems with restarting your application, and so on. Chaos testing has been around for some time, but I had never been able to put my hands on it, and this time we decided maybe it's going to be a good way to choose for our testing.
B
So we wanted to introduce the unpredictable factor by using chaos testing, but on the other hand, we wanted to deal with, or simulate, production-like traffic. For me as a QA engineer, and maybe for some of you, the natural or intuitive way of simulating production-like traffic is load testing.
B
This is also part of what we call non-functional testing. Maybe most of you will be familiar with JMeter, or maybe Gatling; some of you may be familiar with Locust, which I'm using. But basically, on the client side, or on the production side, we have some load generated via some tool.
B
So the initial idea was very simple: let's generate some load, let's introduce some kind of chaos into our system, and let's see what happens; let's see if we are able to survive this kind of natural state of bigger or smaller disaster. First we sat down and thought, okay, what building blocks would we need for this idea? Of course we had the very rough idea, but we needed to put together the necessary tools, and basically at the very first meeting we established a set of building blocks, the building tools.
B
These are what we would need to achieve this goal, and the list pretty much went like this. We need a source of traffic, so we need to generate load, or stimulate our system the way our users would. Then, since we have the traffic and a working system, we will need some kind of chaos generation. At this point I'm talking about broad abstractions.
B
So, since we have traffic and we also want to introduce some chaos, we need some kind of chaos generator. Also, a lot of applications are storing metrics. So, in order to see how our system behaves, how we can track its behavior, and also how we perform in terms of traffic and in terms of recovering from the chaos, we would like to store those metrics somewhere, and possibly use them after the testing.
B
We would like to draw some graphs, see if we can narrow down our bottlenecks, and see what was basically happening to the system. And then we need some sort of evaluation. So we did the job: we generated traffic, we generated chaos, we are able to graph everything that was happening to the system; but are we able to evaluate it? Was the test green or red? Was the behavior positive or unexpected?
B
And last but not least, we need some kind of runner; we need some kind of tool that will lead the orchestration here. So, going step by step: the source of traffic. For me there was a natural choice, and luckily the QA team there also includes some of my friends, so I'm very happy to have my friends everywhere.
B
We decided to go with Locust, which is a very easy, Python-based load testing tool, very similar to JMeter or Gatling. So for our purposes you can think of Locust as doing basically the same load testing and performance testing as JMeter, pretty much the same in a black-box view. Then, for the chaos engine, the solution suggested by our DevOps team was Litmus.
B
After some research they did, they suggested Litmus and implemented the very first version of the chaos engine using it, and after some time I learned that Litmus is also a tool that the Keptn team is cooperating with, so that was a very natural and good choice. For metrics and graphs, I think most of you know Prometheus as a source of metrics and Grafana as the graphing engine.
B
This is a very popular stack, so I think most of you should be familiar with it. And for results evaluation, I said, okay, we will run these tests and I will find a way to script the evaluation: I will just basically pull some metrics, put them together, and see what the failure rate was and what the requests per second were.
B
I will do it; I mean, it should not be that hard, I will probably just script it, so I will use Python. And for a runner, as in nearly all of our projects, we use GitLab, which would basically be the orchestrating platform in these matters.
B
So we went for a beautiful prototype; I hope your prototypes are better, but the first idea was like this. We thought about it at the very first meeting, and we came up with the idea of three stages of chaos. And by three stages of chaos I don't mean the three lockdowns, or however many lockdowns you had in 2020.
B
It was a bit less heavy than that. These are the three stages we imagined, and once again, don't treat this as super expert knowledge in chaos engineering; treat it as my attitude, or our attitude, the idea we had, and maybe some inspiration, and if you have any comments, go ahead. But the initial idea, and the idea we still have now, was defined by three stages. The first stage would be no chaos: basically, we would like to see how our system performs.
B
What are the results of our load tests? By metrics from Locust, you can think about requests per second, or maybe the error rate for a particular endpoint. These are important metrics, so we'd like to have a baseline: we'd like to run our performance test without any chaos for some given period of time and treat it as our baseline. Then we would like to introduce some light chaos. Now, what does light chaos mean?
B
For example, as I told Jürgen today, you can imagine using AWS in the morning, and then when the USA wakes up and you are using AWS at the same time as the folks from the USA, you see that the response times of AWS are different. So, for example, if you are able to do something within seconds or minutes in the morning, it can take 15 minutes or half an hour in the evening. Also, when you have some kind of natural delay between applications and the database, you have this kind of unpredictability.
B
This randomization, where you don't really know if you're going to connect within 50 milliseconds, 500 milliseconds, or five seconds: maybe this is the light chaos. This is not bad behavior, but some kind of natural randomization, and our expectation for the states of no chaos and light chaos is that, for the users, our applications behave the same. Maybe we will have slightly less throughput, slightly fewer requests per second, but we should not observe any errors.
B
So, whether we have the perfect state of no chaos, or very gentle or light chaos, the natural production conditions, we should not observe errors and we should not observe any drop in our performance. On the other hand, we have heavy chaos. We introduced this concept of heavy chaos, which is the bad behavior: something goes really wrong. Maybe we have really bad network latency. Maybe our database has dropped.
B
Maybe we cannot connect between pods. Then, of course, we will observe failures, but at the same time we should be able to recover from those failures pretty quickly. This is a very important idea for the development of the whole project, or the whole concept: just keep in mind that no chaos and light chaos, for you as an end user, for example people ordering at a restaurant or people using the app in the kitchen, should look the same.
B
For us, for now, as a proof of concept, light chaos is a network packet drop: introducing some latency, some unpredictability in the network. For light chaos it's around 25 percent, light enough not to destroy our system but enough to introduce some randomization. On the other hand, for heavy chaos we selected a quite significant packet drop of 75 percent.
B
That would definitely cause us problems, but we should be able to recover. For the tests and metrics we've chosen Locust, and as an addition to Locust, the so-called locust exporter. We built our Locust architecture in a very easy way that is natural for Locust: we have one master node that orchestrates workers, and the workers are executing the requests. This is done in order to be able to scale up, whether we want to run 10 users, 100 users, 1,000 users, or 10,000 users.
B
We are easily able to scale our workers, and the master node is used to communicate with the exporter. The exporter is the layer between Locust, or the test results, and Prometheus, so our test results are basically stored as metrics in Prometheus. Then I put together a very simple GitLab pipeline in which I would store the artifacts from the tests.
B
So in the first iteration my test results would basically be stored as files containing the results of the tests, and the pipeline looked pretty much like this: I would run some load tests without introducing any chaos and store the results as GitLab artifacts. Then I would run the same load test, but introduce the light chaos, and also store the GitLab artifacts. And then, as the last round, I would run the load tests but introduce heavy chaos, at the very end of my pipeline.
B
A few years ago in Krakow I met Andy from Dynatrace and Keptn, and he was telling us the story about Keptn, the very early stages of Keptn; I think it was around three years ago. Then he came again and told us the story again, a bit more cheerful this time, but I still didn't even move a finger. But now, a few years later, there came the time when I thought to myself: hey, it's time to use this Keptn and see how it goes.
B
I only remembered that there was something called Pitometer, which I think was replaced by some other components, but nevertheless I synced up with the folks at Dynatrace, with Andy, and we thought that a good solution for the evaluation of these metrics would be Keptn. And here is a big shout-out to Jürgen: he's the guy that spent many hours with me, helping me out. That's very, very kind of you.
B
Thank you very much. So after, I would say, a month of R&D, different configuration issues, different stuff, and a learning curve for me, of course, we were able to come up with a new configuration.
B
We defined a Keptn project with three stages. Those of you who are familiar with Keptn know that we have projects, and within the projects we have stages. In a full version of Keptn those stages can relate to deployments to your test stage or production environment, but my use case was more towards the quality gates evaluations.
B
Therefore, I defined a project with three stages that reflect no chaos, light chaos, and heavy chaos. And as you might know if you are familiar with Keptn, or if you are not, at some point you will get familiar with SLIs and SLOs. SLIs are basically the metrics: these are our indicators, how to measure, how to calculate, and what to calculate in order to tell whether our system behaved properly or improperly. And for me, as a QA engineer, in this kind of situation there are two metrics that are important.
B
Those are the two important metrics I use as my SLIs: I defined requests per second for each endpoint I'm querying, or basically testing, and I calculate the total error rate for the period of time I was testing. And here you have an example; excuse the very bad code presentation on the slide, but this is just an example. This is the Prometheus query I use to get the requests per second for one of the endpoints.
B
This is basically the rate; rate is a Prometheus query operator, here applied to the number-of-requests metric exported from Locust. So you can say this is basically the average rate of requests for a particular endpoint; the endpoint is hidden here under those three dots. And of course Keptn provides you with some parameterization: using this duration-seconds variable, you can pass in, during the evaluation of the Keptn metrics, what the duration of your test was.
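To make the shape of that query concrete, here is a small sketch of building and running such a rate() query against the standard Prometheus HTTP API. The metric name `locust_requests_num_requests` and the `name` label follow the locust_exporter conventions and are assumptions here, as is the placeholder base URL.

```python
# Sketch: per-endpoint requests/second SLI via the Prometheus query API.
# Metric name and label are assumed locust_exporter conventions.
import json
import urllib.parse
import urllib.request


def build_rps_query(endpoint: str, duration_seconds: int) -> str:
    # rate() over the test window gives average requests per second.
    return (f'rate(locust_requests_num_requests'
            f'{{name="{endpoint}"}}[{duration_seconds}s])')


def query_prometheus(base_url: str, promql: str) -> float:
    # Standard Prometheus instant-query endpoint: /api/v1/query?query=...
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    # First sample value of the first result series.
    return float(payload["data"]["result"][0]["value"][1])
```

This mirrors what a Prometheus SLI provider does under the hood: one PromQL expression per SLI, evaluated over the test duration.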
B
So for me, the metric is parameterized with some duration in seconds. When I evaluate my test results, I tell Keptn: hey, please evaluate this particular metric for the time of the last five minutes or ten minutes, so the time when the tests were running. Basically, at the very end, this gives me one single number for each SLI. As I said, in my case I have three endpoints, so those are three SLIs, and the error rate is the fourth one, so I will basically get four numbers. And the SLOs are the important part.
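For reference, triggering such a time-boxed evaluation is typically done with the Keptn CLI's `trigger evaluation` command. Below is a hedged sketch that assembles that command for a pipeline job; the project, stage, and service names are placeholders, not the real ones from the talk.

```python
# Hedged sketch: build the Keptn CLI call a pipeline step could run to
# evaluate SLOs over the just-finished load-test window.
import shlex


def keptn_evaluation_cmd(project: str, stage: str, service: str,
                         timeframe: str = "5m") -> str:
    # --timeframe tells Keptn how far back to evaluate the SLIs.
    args = ["keptn", "trigger", "evaluation",
            f"--project={project}", f"--stage={stage}",
            f"--service={service}", f"--timeframe={timeframe}"]
    return shlex.join(args)
```

A pipeline job would run the resulting string once per stage (no chaos, light chaos, heavy chaos).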
B
I have two kinds of SLOs. SLOs are the objectives, the service level objectives. Basically, each stage of your pipeline, each stage of your test, can have different SLI results, and then you want to check, or ask Keptn to check, whether the numbers in the metrics are okay or not. And as I told you previously, I treat my stages of no chaos and light chaos as pretty much the same for the end user.
B
The number of requests per second should be more or less the same, very close, and the total error rate should be near zero. So in those two situations, where we have no chaos and light chaos, there should be barely any difference; I want those states to be similar. Why is that? Because our application should be resilient; it should withstand small variations.
B
If we have some delay on the connection between the app and the database, it should be okay for the application: it should reconnect, it should wait longer, it should retry. It should not immediately return errors to the user. So the light chaos is absorbed. Oh, I'm running almost half an hour, I'm taking a lot of time, so I'll get to the point: no chaos and light chaos are pretty much the same situation, so I expect the conditions to be the same, and then there's heavy chaos.
B
If you want to know more, I sincerely recommend the documentation, but for now let's just say these are our criteria for passing, or for a warning, for this SLI. For the average fail ratio, in an ideal situation I want it to be less than one percent, but if it's less than five percent, it's still not so bad. This is for the stages of no chaos and light chaos; and you can see that the numbers change for the situation of heavy chaos.
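The pass/warn logic described here can be written down as a tiny worked example. The stage names are invented, but the thresholds are the ones from the talk: under 1% fail ratio passes and under 5% warns in the no-chaos and light-chaos stages, while heavy chaos widens the budgets to 30% and 50%. (Keptn itself evaluates these from an slo.yaml file rather than code like this.)

```python
# Worked example of the per-stage pass/warning/fail thresholds.
def evaluate_fail_ratio(stage: str, fail_ratio: float) -> str:
    # (pass limit, warning limit) per stage; heavy chaos is more lenient.
    pass_limit, warn_limit = {
        "no-chaos": (0.01, 0.05),
        "light-chaos": (0.01, 0.05),
        "heavy-chaos": (0.30, 0.50),
    }[stage]
    if fail_ratio < pass_limit:
        return "pass"
    if fail_ratio < warn_limit:
        return "warning"
    return "fail"
```

So a 20% fail ratio fails the no-chaos stage but would only warn under heavy chaos.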
B
It should ideally be under 30 percent, but it's not that bad if it's under 50 percent. So this number changes, but the SLOs are pretty much the same. So, how does this final pipeline look? We have GitLab as our runner, and GitLab is able to communicate with Locust via its API, so I'm triggering my Locust tests via an API call.
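A running Locust master exposes a small web API that such a pipeline call can hit. The sketch below shows one way this trigger could look; the `/swarm` and `/stop` endpoints exist in recent Locust versions (older releases used `hatch_rate` instead of `spawn_rate`), and the host and user counts are placeholders.

```python
# Sketch: starting/stopping a running Locust master via its web API,
# the "API call" trigger approach. Host and numbers are placeholders.
import urllib.parse
import urllib.request


def swarm_payload(users: int, spawn_rate: int) -> str:
    # Form body that Locust's /swarm endpoint expects.
    return urllib.parse.urlencode({"user_count": users,
                                   "spawn_rate": spawn_rate})


def start_load(base_url: str, users: int, spawn_rate: int):
    # POST /swarm starts (or re-targets) the load generation.
    req = urllib.request.Request(base_url + "/swarm",
                                 data=swarm_payload(users, spawn_rate).encode())
    return urllib.request.urlopen(req)


def stop_load(base_url: str):
    # GET /stop halts all running simulated users.
    return urllib.request.urlopen(base_url + "/stop")
```

A pipeline job can call `start_load`, sleep for the test duration, then `stop_load` before triggering the evaluation.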
B
I also have my Litmus, and I am able to run the Litmus chaos engines, so the high latency or low latency, basically introducing light or heavy chaos, via a kubectl command. So basically I'm modifying the deployment in Kubernetes, and I'm activating the chaos this way. I think this is a very similar idea to what Jürgen and the team have for the integration with Litmus, but I'm yet to explore that integration a bit more; I think we more or less agree on that part. And then we have Prometheus as the source of metrics.
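One common Litmus pattern matching this description is to annotate the target deployment so chaos is permitted on it, then apply a ChaosEngine manifest. This is a hedged sketch of those kubectl calls assembled in Python; the deployment, namespace, and manifest file names are invented for illustration.

```python
# Hedged sketch of the "one kubectl command" chaos-activation step.
# Resource names below are placeholders, not the real Kitopi resources.
import shlex


def chaos_commands(deployment: str, namespace: str, engine_yaml: str):
    # Litmus's annotation check only injects chaos into annotated targets.
    annotate = ["kubectl", "annotate", f"deploy/{deployment}",
                "litmuschaos.io/chaos=true", "-n", namespace]
    # Applying the ChaosEngine manifest kicks off the experiment.
    apply = ["kubectl", "apply", "-f", engine_yaml, "-n", namespace]
    return shlex.join(annotate), shlex.join(apply)
```

A pipeline stage would run the first command once, then apply the light-chaos or heavy-chaos engine manifest as appropriate.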
B
So this is not a demo, because we don't have too much time for a demo and I really want to show you the essence, but let me show you how this looks in some screenshots. This is my GitLab pipeline, where I have the three stages defined as separate stages. It's not a very complex pipeline: in each stage I'm only using some curl calls, one kubectl command to activate chaos, and a final evaluation by Keptn.
B
So this is a pretty simple pipeline, but it corresponds to something like this. This is a Grafana graph, and it requires some explanation. As you can see, at some point at night we have three bursts, or three runs. The green line represents the total RPS, let's say the summed RPS of all endpoints, and the red line represents errors seen by Locust. As you can see, for our first burst without chaos we have pretty much no errors, and we have a nice average of around 2.5 RPS.
B
The RPS don't drop too much. Why? Because the endpoints are probably responding within a similar time, but they are throwing errors: they are throwing timeouts, they are throwing server errors. A very important, and yet very concerning, part of this graph is this part here. What happens here? Let's say our tests take five minutes and we are introducing the chaos only for 30 seconds or one minute; the chaos engine, the latency, starts here and then it drops.
B
So we should have the errors only here, and we should not have errors here, and still we see that there are some errors seen by Locust after this. My guess is that after the experiment ends, our application is not able to recover on its own. That's why Litmus is restarting the pods, and when it's restarting the pods, I'm seeing the errors.
B
They are not able to respond to me; they are showing timeouts, and this is wrong behavior. I mean, the result of this experiment should be only these errors here, and here I should go back to the green area, no errors. This is very concerning to us, because in our first try, in our first experiment, we didn't see that; now we are seeing it, and there is something clearly wrong.
B
So Keptn is starting to pay off. And just to show you how it looks in a real situation, I don't know if you are familiar with the Keptn UI, but here, do you see my marker? This is the stage of no chaos, and here we have the SLIs, and as you can see, the average RPS per endpoint is around four.
B
That's why I see a big average fail ratio. And one thing that's concerning here is this: the requests per second are only slightly lower than in the situation of no chaos. It means that we are throwing errors very fast, and this is concerning; it should not be like this. In heavy chaos we should have low RPS.
B
Okay, to the final result; sorry for taking so long. The initial goal was achieved. It was achieved with strong dedication from our team and with great help from the Keptn team, and I'm happy with the results for now. I was also able to submit two requests to the Keptn community.
B
One was, I think, about some scripting thing, so that was quite a funny lesson for me, but the other one was related to something we were discovering at some point. The question was whether you can use an external MongoDB instead of the one provided with the Keptn deployment, and the answer is yes, you can do it very easily.
A
I'm not sure if the network hiccup is on Adrian's side or on my side; please let us know in the chat.
A
Okay, thanks Michael. So I assume that Adrian will hopefully be back again soon; he also had some network issues earlier today. Anyway, thanks Adrian, maybe you can't hear this right now, but thank you so much for this presentation. We were already in the part where you were presenting the results, and I know there was also some outlook for the future, what you wanted to improve. I'll try to pull up the slides on my screen and show you the slides.
B
It's not the perfect way; in the Keptn world we could develop a Locust service, so we could develop a middleware for Keptn to connect directly to Locust and to orchestrate everything via Keptn. If we dedicate a bit of time, we can develop an integration between Keptn and Locust, which I think is yet to be done. But there is already an existing integration for Litmus, so we can use the Litmus integration and then shift to Keptn as the orchestrator, so we don't need to run the calls manually.
B
We can pass the task to Keptn, and then on our side we need to improve monitoring and graphing, and possibly we can integrate with Argo CD. So the future version of this diagram would look like this: GitLab will only pass some commands to Keptn, and Keptn would be the direct orchestrator for Locust and for Litmus, and also the evaluation layer.
B
So the diagram would look like this: we would not have to make all the API calls and kubectl requests ourselves, but we would use Keptn as the first point of contact. And that's pretty much all. Thank you.
A
Thanks so much, Adrian, for sharing all of this. I know that we have the Litmus community here, that we have someone from GitLab here, and we have a couple of folks from the Keptn community here, so please feel free to share your thoughts.
A
I will allow everyone to unmute, so please feel free to just jump in and ask your questions directly to Adrian. One thing that I noticed while everyone is joining in here: you mentioned that you were quite surprised that during the heavy-chaos stage the requests were so fast, or actually that the requests per second did not drop.
A
So I thought, if you want, you could also add an upper and a lower bound in the Keptn quality gate, to, for example, get a warning if the requests do not drop when you expect them to drop. You could do this so you get the evaluation result; you're not using it for the complete CI/CD part, but just for quality gates. This is something I was just thinking about during the presentation.
A
Hey Jürgen, hi Adrian, this is Karthik from the Litmus team. That was an awesome presentation; thanks for taking the time to describe the use case and explain how you went about this. I'm really interested in, I think, the Litmus service that we are working on. Are there any suggestions you have, Adrian, on how you would like to see Litmus more integrated into Keptn? We were having some discussions around using Litmus for some more use cases than are actually being used already.
B
Don't get me wrong: my development right now was focused on implementing the solution, and now that we have a working solution, it is time to do the actual R&D. For now we have, as I said, very simple chaos definitions, these packet drops, and we should improve them. I think now is the time to get into Litmus a bit more, learn about the experiments, learn about the possibilities, and see how our system behaves. Also, I'm yet to see the Keptn integration with Litmus.
B
Unfortunately, for now I don't have any advice, but if we can keep in contact, then, as Jürgen knows, I like to give suggestions, so I'm definitely up for it. I'm really happy to see what's going to come in the future.
A
Great, yeah, looking forward to that, thanks. Thank you for sharing. I think that's a great use case where the Litmus service is used in a real-world scenario, where we can really learn what is needed.
A
So with the current implementation we can already basically fulfill the last part, or the last image you had, where Keptn is doing this orchestration. What's still needed is the Locust service. So if anyone here in this group is working with Locust, has experience with Locust, and wants to join us in building this Locust integration into Keptn, please step up and get in touch with us. We will be starting on this implementation quite soon.
A
As we've seen, it might be helpful if there are issues going on: having an integration between the test execution and the orchestration means we directly know what was actually going on, and we have kind of a bridge, a central hub, where we see the information, whether tests have been executed, and what the result was.
B
Yeah, I think that needs one explanation here, for those of you who are familiar with Locust: Locust has basically two modes of execution.
B
So the first mode is the UI mode, where we deploy our test script, or load the script, and basically run everything from the UI; the other mode is when we run it from the command line. This UI mode is similar to the JMeter UI mode, but with the JMeter UI you are using an application, a client, whatever, and here you have a website, a standalone website. The other way to run it is just from the command line, a run for a particular time. So these modes are slightly different. So with the UI mode:
B
You are always running the website somewhere in your system, and then you can just start a run whenever you want and manually stop it: manually run it, manually stop it. With the command-line execution you say, okay, let's run this with 100 users for five minutes. We are using the UI mode, so we have Locust deployed somewhere, and whenever we want to test performance we can do it manually, but we are also querying, or starting, the test in a bit of a hacky way.
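For reference, the two modes described above map onto Locust's command line roughly as follows (flags as of Locust 1.x); the locustfile name and target host are placeholders:

```shell
# Headless / command-line mode: fixed user count, spawn rate, and duration,
# no web UI involved.
locust -f locustfile.py --headless \
       --users 100 --spawn-rate 10 --run-time 5m \
       --host https://my-service.example.com

# UI mode: starts the web interface (http://localhost:8089 by default);
# runs are then started and stopped manually from the browser.
locust -f locustfile.py --host https://my-service.example.com
```

The headless form is what a one-shot CI job would use; the UI form is the always-on deployment described in the talk.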
B
So that's the low-hanging fruit. Why? Because if you have this running instance, you are using the same instance that has the exporter that is exporting metrics, whereas if you run from the command line, you would need to start pods in Kubernetes ad hoc and connect them to the exporter, and that is more troublesome.
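The "hacky way" of starting a test on an always-running UI-mode instance usually means POSTing to the web UI's `/swarm` endpoint (and hitting `/stop` to end a run). A small Python sketch under that assumption; the Locust URL is hypothetical, and the parameter name `spawn_rate` applies to Locust 1.x (older versions called it `hatch_rate`):

```python
import urllib.parse
import urllib.request

LOCUST_URL = "http://locust.example.com:8089"  # hypothetical UI-mode instance


def swarm_payload(user_count: int, spawn_rate: int, host: str) -> str:
    """Build the form body Locust's web UI expects on POST /swarm."""
    return urllib.parse.urlencode({
        "user_count": user_count,
        "spawn_rate": spawn_rate,  # hatch_rate in Locust < 1.0
        "host": host,
    })


def start_test(user_count: int, spawn_rate: int, host: str) -> None:
    """Kick off a load test on the running UI-mode instance."""
    data = swarm_payload(user_count, spawn_rate, host).encode()
    urllib.request.urlopen(f"{LOCUST_URL}/swarm", data=data)


def stop_test() -> None:
    """Stop the current run via the UI's /stop endpoint."""
    urllib.request.urlopen(f"{LOCUST_URL}/stop")
```

Because the instance stays up, the same process keeps serving its metrics exporter, which is exactly the advantage described above.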
A
The details, so it's very interesting to see that there are different approaches to it. Also with the approach we took for integrating the litmus-service into Keptn, or developing it, we decided to go for one approach. We've implemented it, it's already out there for everyone in the community to use, and we will also be improving it over the next weeks.
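For context, wiring a service like this into a Keptn project comes down to declaring a stage whose events the integration reacts to. A minimal pre-0.8 shipyard sketch; the stage name and strategies are illustrative, not the exact setup from the talk:

```yaml
stages:
  - name: "chaos"                   # illustrative stage name
    deployment_strategy: "direct"   # deploy straight into the stage
    test_strategy: "performance"    # performance tests run in this stage
```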
A
So it's always great to see some more use cases that might be needed, and it's great to see that there's already some demand out there. If there are more questions, please just ask them away, and if not, then I would like to thank everyone.
A
Maybe I'll just share my screen again to give a little bit of background on what I'm talking about. I've put together a list of attendees, the ones that I could see; if you're not fine with having your name out there, please go ahead and remove it again, that's totally fine. Thanks, Adrian, for doing this presentation on how you did the implementation; it was really great to see all these things coming together. If it's fine for you,
A
I would also love to put the link to the slides in here, so that if someone wants to take a look again, maybe at some of the details, they can do so. Once the recording is online, I will put the link to it in here as well.
A
If there are no more questions, then thanks everyone for joining. We will have the next Keptn community meeting again on the third Tuesday of the month, which will be, and I actually have to take a look at my mobile to check when the next third Tuesday of the month is, February 16th. So February 16th is the next edition of our Keptn user group.
A
In the meantime, we are hosting our weekly developer calls on Thursdays, same time as today. If you're working with Keptn, if you're more on the developer side, or if you want to build integrations on Keptn, please join us on Thursdays.
A
As a last update, we just released Keptn 0.8 as an alpha release. Please give it a try, give us feedback on how you like it, give us feedback on the new features, and give us feedback if something is not working as expected. It's an alpha release, so we do not recommend overriding your stable Keptn installation and using it in production yet.
A
Please do not do that, but if you have some instances that you're developing or experimenting with, we would really appreciate it if you give the new version of Keptn a try and give us feedback on it. You can reach us on Slack, on Twitter, in the Google group, on all the channels that you already know. So thanks everyone, thanks Adrian, and have a great rest of your day. Usually after these kinds of events we would go for a beer.
A
Great, so thanks everyone, hopefully see you all again next time.