From YouTube: Cloud-Native Chaos Engineering WG Weekly Sync Up - October 29 2021 | CNCF TAG App Delivery
Description
Check out the recording from the Cloud Native Chaos Engineering Working Group Weekly Sync Up from October 29th, 2021.
For more information check out the WG Charter: https://docs.google.com/document/d/1scr9uuvG1g1xpIHPs3314FqeFufE31ustTVnRMrX3gI/edit?usp=sharing
B: Hi, good morning. Thanks for joining today's Chaos Engineering Working Group call. We'll just wait a few minutes to see if other members join in; we'll probably give it five minutes and then we can get started.
B: I know we've chatted earlier, and you also expressed interest in being part of this work group, sharing your views, and contributing to the chaos engineering community in general. So thank you for agreeing to participate and for taking an interest. I've shared the meeting notes in the Zoom chat. We also have a Git repository that we just created for artifacts related to this work group; it still has only some initial content, so please feel free to take a look at it, add yourself as an interested party in the charter, etc.
B: We do most of these user discussions in a question-and-answer style, so if there's anything you would like to demonstrate or talk about, you're very welcome to do that. If you'd rather do that in a subsequent session, where you've had more time to prepare, showcase, and demo things, that is completely understandable as well. We could treat this as the context-setting discussion, and you could do that in a later one.
B: Let me give you a little bit of background as we wait for the other members to join in. This chaos engineering work group started quite recently, a few months back. It was an initiative from CNCF to foster interactions between the end users and the chaos projects in the CNCF landscape.
B: The expectation is for end users to benefit from the content we deliver. The white paper gets them started on their chaos engineering journey, helps them understand chaos engineering better, and covers the paradigm shift taking place in the cloud native world in terms of how people practice chaos engineering in a cloud-native setting.
B: So that's what this work group is focused on; I just wanted to share a little background. I see Sergio has joined the call. Hi Sergio, would you like to introduce yourself?
C: I was a software engineer at Arena World, I contributed to the Brave browser, and I'm planning to be an ambassador for the Academy Software Foundation.
B: Awesome, great to have you here. Please feel free to add yourself as an attendee in the meeting notes; I've just shared them in the Zoom chat.
B: We have had other end users appear in this working group before. Michael came in last; he spoke about the chaos philosophy and practice at IAG and TPG Telecom, where he has served as a DevOps consultant. We hope to invite other interested end users to come and share their learnings with us, and we'd like to take that back into the white paper.
B: So with that context, I think we can get started; about six minutes have passed. Mani, maybe you could begin with an introduction of yourself. Please feel free to talk about your background, where you're working, what you do, and how you basically got started on your chaos journey.
A: Yeah, sure, thanks for having me. I'm working as an SRE engineering manager at Halodoc. I have 17 years of overall experience in the IT industry, and I've played different roles at different points in time. At Halodoc, my primary responsibility is to take care of infrastructure provisioning, bring in all the best practices required at Halodoc, and identify and analyze the tool and tech stacks.
A: Those are the primary responsibilities. Beyond that, as an SRE team we have a set of deliverables: we support all our tech verticals, which come with sets of requirements, helping them provision infrastructure or configurations at different levels. Apart from that, we analyze different sets of tools quarter on quarter and see which is the right fit for Halodoc.
A: During this journey we started evaluating chaos tools. Basically, we considered a set of tools and evaluated them on certain parameters.
A: What we did: we ran an analysis in, I think, the month of January, looking at how we could implement chaos engineering at Halodoc. We identified around four tools: one is Chaos Toolkit, the second is Pumba, Litmus is the third, and the last one is Chaos Mesh. The key parameters we looked at while exploring these tools we categorized into six areas.
A: One is how easy installation and management are. The second is the set of experiments and definitions each tool offers. The third is security, and the fourth is observability. The fifth, one of the primary points on which we wanted to make the decision, is whether the tool is Kubernetes-native; we basically wanted any tool we picked to be a Kubernetes-native tool. The last one is the running behavior.
A: Those are the six key categories we wanted to analyze before taking a decision. We felt all six parameters were met very well by Litmus, so based on that we decided to go with Litmus. A few factors drove this decision. First, it's a Kubernetes-native tool; that's the primary point.
A: Second, it's very easy to install, lightweight, and stateless, and it has a variety of experiments covering almost every area we wanted to cover.
A: The third thing is the security aspects: it has experiment-specific service accounts, roles, and role bindings. Since all these experiments talk to Kubernetes, we can control them at the namespace level or at the cluster level, so that was an additional benefit we got. And by default it gives a very detailed report of how each of these experiments ran, in a very simple format.
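To make the namespace scoping Mani describes concrete, here is a minimal sketch using the official Kubernetes Python client that confines a chaos service account to one namespace. The namespace, account name, and the exact resource/verb list are illustrative assumptions, not Halodoc's actual setup; Litmus ships its own recommended RBAC manifests per experiment.

```python
# Sketch: scope a chaos service account to a single namespace so experiments
# can only touch workloads there. All names here are illustrative.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster
rbac = client.RbacAuthorizationV1Api()
ns = "payments-staging"  # hypothetical namespace to confine chaos to

role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"name": "chaos-runner", "namespace": ns},
    "rules": [{
        "apiGroups": ["", "apps", "litmuschaos.io"],
        "resources": ["pods", "events", "deployments",
                      "chaosengines", "chaosexperiments", "chaosresults"],
        "verbs": ["get", "list", "create", "delete", "patch"],
    }],
}
binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {"name": "chaos-runner", "namespace": ns},
    "roleRef": {"apiGroup": "rbac.authorization.k8s.io",
                "kind": "Role", "name": "chaos-runner"},
    "subjects": [{"kind": "ServiceAccount",
                  "name": "chaos-runner", "namespace": ns}],
}
rbac.create_namespaced_role(ns, role)
rbac.create_namespaced_role_binding(ns, binding)
```

Using a Role rather than a ClusterRole is what gives the namespace-level blast-radius control mentioned above; swapping in a ClusterRole/ClusterRoleBinding widens it to the cluster.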
A: Optionally, it can also be integrated with Prometheus, so we can see more and more metrics from the endpoints. The last point is the running behavior: since this comes as a CRD, we can go and define as many experiments as we want, as an entire workflow, drawing on the full list of built-in experiments. I think that's what made us decide to go with Litmus.
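As a rough illustration of the CRD-driven model Mani describes, the sketch below creates a Litmus ChaosEngine (the litmuschaos.io/v1alpha1 custom resource) for a pod-delete experiment via the Kubernetes Python client. The target labels, namespace, and tunables are hypothetical; treat this as a sketch of the mechanism, not Halodoc's configuration.

```python
# Sketch: declare a pod-delete chaos run as a Litmus ChaosEngine custom
# resource. Labels, namespace, and env tunables are illustrative.
from kubernetes import client, config

config.load_kube_config()
crds = client.CustomObjectsApi()

engine = {
    "apiVersion": "litmuschaos.io/v1alpha1",
    "kind": "ChaosEngine",
    "metadata": {"name": "checkout-pod-delete", "namespace": "staging"},
    "spec": {
        "appinfo": {"appns": "staging",
                    "applabel": "app=checkout",   # hypothetical target app
                    "appkind": "deployment"},
        "chaosServiceAccount": "chaos-runner",    # the scoped account above
        "engineState": "active",
        "experiments": [{
            "name": "pod-delete",
            "spec": {"components": {"env": [
                {"name": "TOTAL_CHAOS_DURATION", "value": "60"},
                {"name": "CHAOS_INTERVAL", "value": "10"},
            ]}},
        }],
    },
}

crds.create_namespaced_custom_object(
    group="litmuschaos.io", version="v1alpha1",
    namespace="staging", plural="chaosengines", body=engine)
```

Because the whole run is just a Kubernetes object, it can be stored in Git and applied by any GitOps tool, which is the workflow described next.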
B: As an SRE engineering manager, as you just introduced yourself, I'm sure there were already some resiliency efforts in place, and you said you started looking at chaos tooling in January. So what was really the trigger point for you to consider doing chaos engineering? Was it an outage that happened in the organization, or was it a conscious decision that you took as a technical leadership group to adopt chaos engineering practices, because the rest of the world is doing it and it's a beneficial thing to do?
A: Yeah, that's an interesting question, what made us incorporate it. We often onboard new services and new utilities for end users, and when we get such a requirement we have to make sure the service is resilient and, when there is a failover, see how quickly it recovers. Those are some of the points we wanted to consider when onboarding new services.
A: Initially we came across a few challenges: we weren't checking the resiliency or reliability of the utilities we were onboarding, they went live directly, and we then saw some hassles. To overcome that, we decided we wanted all of this checked up front, meaning resiliency testing for every service:
A: what is the service availability, and if there is a failover, how quickly does it recover? Those are some of the pain points we wanted to capture at an earlier stage. That's what made us pick up a chaos engineering tool to create some chaos within our infrastructure and see how resilient we are. We started this basically in the stage environment.
A: Initially it was more about experiments there; we ran them through CRDs directly. At a later stage, when we started showing more and more interest in incorporating this tool, we moved to a GitOps approach where we maintain all the sources. Litmus helped us a lot in setting up the tool through Helm charts, so we keep everything in our source code management tool, and from there
A: we deploy and enable all the workflows we want. We started creating workflows in a portal for all the services, and we wanted this integrated with Litmus. That's where the Litmus team, the chaos engineering team, really helped us set up a kind of portal through which we created the whole set of workflows. We took a control-plane approach, where the control plane lives in one cluster, all the other clusters run the infra part, and we can target our experiments at them. That really motivated us to set this up in stage, and then we propagated it to production at a later stage.
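A sketch of the kind of automation implied here: watch for newly deployed services in the stage namespace and kick off the corresponding chaos run. This is not Halodoc's actual pipeline (they drive everything from Git); it just illustrates the "deploy event triggers chaos workflow" coupling, with illustrative names throughout.

```python
# Sketch: naive trigger that starts a chaos run whenever a new deployment
# appears in the stage namespace. Halodoc drives this from GitOps instead;
# this only illustrates the deploy-event -> chaos-workflow coupling.
from kubernetes import client, config, watch

config.load_kube_config()
apps = client.AppsV1Api()

def trigger_chaos(deploy_name: str, namespace: str) -> None:
    """Hypothetical wrapper: would create a ChaosEngine as in the earlier sketch."""
    print(f"would launch pod-delete against {namespace}/{deploy_name}")

for event in watch.Watch().stream(
        apps.list_namespaced_deployment, namespace="staging"):
    if event["type"] == "ADDED":  # a new service was just deployed
        trigger_chaos(event["object"].metadata.name, "staging")
```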
B: Great insight, Mani. This is consistent with what we heard from Michael in the previous calls as well: the chaos experiments were introduced in a non-production environment at first, and the reason chaos was considered in the first place was to test their systems for resilience, and to test the resiliency of new tools before onboarding them into the production environment. Then they gradually started introducing it at some level in production.
B: The original chaos engineering philosophy, the principles of chaos that you see on the website principlesofchaos.org, which was put together initially by the Netflix team, talks about the necessity to do chaos in production. For a long time people believed that chaos engineering is practiced only by SREs, and only in production, and that it was taken care of using game days and things like that.
B: But there has been a gradual shift, especially with the introduction of the cloud native landscape, where there are a lot of tools in your environment that you picked from different sources and probably had little control over developing yourself. They're all needed to provide a seamless experience to the user, and they're also needed by the SREs to maintain the sanity of your systems.
B: Was there a challenge, some kind of apprehension or hesitation from your SRE team, when you said you needed to run these chaos experiments in production, or was it something that was actually welcomed by them? Was there mutual agreement right from the beginning, with people invested in the process, or were there cultural challenges you had to face in convincing people that this needed to be done in production? And on a similar note, you mentioned your staging and pre-production environments.
B: I'd be interested in understanding who actually runs these chaos experiments. In production we can understand that it's the cluster admins, the SREs, and the service owners who do it, but in the staging and pre-production environments, is it again the same group of people, or is it somebody different, say a performance engineer, a QA engineer, or a developer running the experiments themselves? Has the philosophy and thought process of chaos propagated that far?
A: To give more info about that: we actually execute all the experiments automatically. As I mentioned earlier, it's a GitOps approach; we have created all the workflows the way we want them, and they are stored in the repositories. As and when any new service gets deployed in stage, it automatically triggers the workflow and starts executing. So no individual goes and manually triggers the workflows.
A: If there are failures, then yes, we call them out and analyze why the failure occurred, and accordingly the team fixes it and comes back. This is something we have done a couple of times in a quarter. Also, every month we have a day we call chaos day, where we run the experiments against our production environment and share reports with the leaders. That's something we do.
B: Awesome, yeah. By automating this you've taken the human hesitation out of it, one could say, so that is nice to hear.
B: The other question I had was about how you determine what scenarios you run. For example, you mentioned you have experiments scheduled on staging and on production, and experiments triggered upon deployment as part of the GitOps flow. So how does the team come up with scenarios? Is it based on past experience and your analysis of your code base and the application weaknesses you think might be manifested, or is it something you determined some other way?
A: We looked at what happens if certain services go down. As an example, take an OTP service: say a person is doing an enrollment and needs to receive an OTP; if this service goes down, they may not get the OTP to subscribe, so that is a showstopper. So we identified a list of services and a set of experiments we wanted to run against them, sequentially.
A: The sequence of these experiments is basically: first we wanted to see, if a pod gets terminated, how quickly the pod recovers; that is the first thing. Then we wanted to play around with the memory and CPU hogs, so that was the next set of experiments we arranged. The other area we see most of the time is network loss, and often we have DNS errors.
A: And when we do a health check on the pods, sometimes a pod doesn't come up: it keeps spinning but never gets terminated. These are some of the past challenges we have seen, so considering them, we wanted to keep the scenarios relevant.
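The escalating sequence Mani describes maps onto standard Litmus experiment names. A sketch follows; run_experiment is a hypothetical wrapper that would create a ChaosEngine like the earlier sketch, and the ordering and comments are paraphrased from the discussion above.

```python
# Sketch: the escalating experiment sequence described above, as an ordered
# list of Litmus experiment names. run_experiment is a hypothetical wrapper
# over the ChaosEngine creation from the earlier sketch; here it just logs.
SEQUENCE = [
    "pod-delete",        # does the pod recover quickly after termination?
    "pod-memory-hog",    # behavior under memory pressure
    "pod-cpu-hog",       # behavior under CPU pressure
    "pod-network-loss",  # injected packet loss between services
    "pod-dns-error",     # the DNS failures seen in past incidents
]

def run_experiment(name: str) -> None:
    print(f"running {name} and waiting for the chaos result...")

for name in SEQUENCE:
    run_experiment(name)
```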
B: You brought up some very valid points about checking the health of applications even as you inject the faults, so this brings up the question of how you track that.
A: What we have done is, while running these experiments, probe certain endpoints; that is supported within a chaos experiment anyway. For whatever experiments we design, we can probe different endpoints: it can be HTTP, or on the Kubernetes side as well, so a kubectl-style probe can be done.
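For reference, steady-state checks like these are declared as probes on the experiment spec in Litmus. Below is a sketch of an HTTP probe fragment; the field names follow the Litmus probe schema as commonly documented, but the endpoint, thresholds, and exact key spellings should be verified against your Litmus version.

```python
# Sketch: an httpProbe attached to the pod-delete experiment from the earlier
# ChaosEngine example. Endpoint and thresholds are illustrative.
http_probe = {
    "name": "checkout-health",   # hypothetical probe name
    "type": "httpProbe",
    "mode": "Continuous",        # keep checking while the chaos is running
    "httpProbe/inputs": {
        "url": "http://checkout.staging.svc:8080/healthz",
        "method": {"get": {"criteria": "==", "responseCode": "200"}},
    },
    "runProperties": {"probeTimeout": 5, "interval": 2, "retry": 1},
}

# Attach it to the experiment spec from the ChaosEngine sketch:
# engine["spec"]["experiments"][0]["spec"]["probe"] = [http_probe]
```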
A: So that is the list of probes we have enabled. And as I said, when there is a failure in these experiments we capture that too, and currently our team is working on integrating this with Prometheus: we want to enable all the services with service monitors and then scrape the metrics about the executions. Litmus by default already has a very good dashboard under analytics.
A: We can see the level of experiments, how many times each was run, when the last failure was, the success rate, and the resiliency score for each of these services. Multiple things come by default, but beyond that we are also working on integrating this with Prometheus.
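Once chaos metrics land in Prometheus, pulling a signal out is a single call against its HTTP API. A sketch, assuming a reachable Prometheus and a Litmus-chaos-exporter-style metric name; the metric name varies across exporter releases, so treat it as an assumption to verify.

```python
# Sketch: query Prometheus for chaos-run results via its HTTP query API.
# The metric name is assumed from the Litmus chaos-exporter conventions;
# check your exporter's /metrics endpoint for the exact names.
import requests

PROM = "http://prometheus.monitoring.svc:9090"  # hypothetical address

resp = requests.get(f"{PROM}/api/v1/query",
                    params={"query": "litmuschaos_passed_experiments"})
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    print(result["metric"], result["value"])
```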
B: The other question we had: a lot of the time, people doing chaos engineering in pre-production environments tend to use synthetic workload generators to simulate the traffic they see in production. Is that a practice you follow, and is that an area where you interact with the performance group? Do they have any inputs into your chaos engineering practice? Are they also stakeholders in it?
B: Do they generally participate in analyzing the results, root cause analysis, things like that? The reason for asking this question is that we're trying to establish, or understand, the interoperation and the communication channels that a chaos engineering team would have in an organization. So how frequent are your conversations with the performance teams, and do they participate in constructing scripts or help you with load generation for chaos?
A: What we often do is give them a perf environment set up for a specific set of services, which they want to generate load against and see how it performs. What we recommended to that perf team: within our organization we have software developers in test, who have the right set of test cases and run them against the environment based on the use cases, so we tag them in and discuss with them.
A: We discuss how the setup can be made on top of that, and we have also explained to them how the Litmus chaos experiments can contribute. In that respect, on the load side, to be honest, we haven't done much yet.
B: Great, thank you, that helps a lot. Just a couple more questions from my side, and then I'll probably open the floor for anything you would like to ask or share, Mani, and Sergio can put in his thoughts or questions. Another question I had was about the importance of multi-tenancy in chaos engineering frameworks, and whether that is a requirement you had. For example, when you're doing chaos experiments in a shared cluster
B: and you have different people doing their own experiments, sometimes it can mean stepping on each other's toes. You mentioned that you've automated it to a large extent, so there's the possibility that multiple apps get upgraded at the same time, triggering their own respective workflows in their own spaces. Is this something you've given thought to? Did you do some kind of intelligent scheduling, or is there an agreement between the teams that they're okay if experiments run, or do you basically just inform people?
B: That's one way to interpret multi-tenancy, for sure. The other way to interpret it: you have a large staging cluster, like you mentioned, where you're carrying out experiments, and staging environments are typically owned by different people; when I say owned, I mean used by different people. You could have namespace A being used by, let's say, one application services team and namespace B being used by someone else, and they're all trying to test their own code under review in the staging environment.
A: At the stage where we run the Litmus experiments, what we have done, through automation, is try to replicate what we have in production: the RDS data points remain the same, and the replica counts match how the pods are running in production. That is what we want to simulate exactly in our stage environment; then we run the experiments, and once that is complete,
A: teams can still go ahead and deploy the changes they want, but subject to the experiments we wanted to run. So we are exactly replicating what we have in production.
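A minimal sketch of the "mirror production into stage" step described here, assuming two kubeconfig contexts named prod and stage and a shared namespace name; all of these are illustrative, and a real setup would also copy resource requests, config, and data shapes.

```python
# Sketch: mirror production replica counts into staging before a chaos run.
# Context and namespace names are hypothetical.
from kubernetes import client, config

prod = client.AppsV1Api(config.new_client_from_config(context="prod"))
stage = client.AppsV1Api(config.new_client_from_config(context="stage"))

for dep in prod.list_namespaced_deployment("services").items:
    stage.patch_namespaced_deployment_scale(
        name=dep.metadata.name, namespace="services",
        body={"spec": {"replicas": dep.spec.replicas}})
```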
A: I think if it doesn't work in stage, it will not work in production.
B: Do you have a hybrid kind of setup, where some services run out of a private data center or on cloud instances that are not part of Kubernetes clusters? Do you extend chaos engineering to those systems as well, or is everything completely on Kubernetes?
A: Right, we have a plan to cover the non-Kubernetes-specific parts, but for now our major focus is only on the Kubernetes-specific side, and the nodes outside Kubernetes are more vanilla-flavored instances rather than the equivalent of these nodes.
B: As an SRE engineering manager, you might have some KPIs for yourself and your team around improving resiliency through chaos. So how do you evaluate the efficiency of your chaos engineering process? Is it the number of tickets that has gone down, the number of outages that has gone down, or increased uptime? Do you actively have some SLOs that
B: you look at to say you've performed better this month compared to last month? How do you generally evaluate the efficacy of your chaos practice, how is it communicated to the larger group in the organization, and how is it bubbled up as a report that everyone can consume and understand?
A: Yeah, basically we have error budgeting, and we have defined SLOs for all the services. With respect to our chaos executions, the experiment executions, we make sure availability is not going down in terms of the percentage, and we also have an alerting system in place. If we create chaos against any service and it goes down for some time, the error percentage gets incremented, which indirectly impacts our reported percentage as well.
A: Considering this, we always run these experiments on stage, and we have a scheduled run on production. As I said earlier, if there is any discrepancy in these data points, we recommend the development team take a look at it, and on our side, provisioning the infra, if the service goes down for certain reasons we also analyze and fix it.
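To make the error-budget bookkeeping concrete, a small worked sketch follows; the 99.9% SLO and the downtime figures are made up for illustration, not Halodoc's numbers.

```python
# Sketch: how chaos-induced downtime eats into a monthly error budget.
# SLO and downtime numbers are illustrative only.
SLO = 0.999                   # 99.9% availability target
MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day month

budget = (1 - SLO) * MONTH_MINUTES   # 43.2 minutes of allowed downtime
chaos_downtime = 12.0                # minutes lost to chaos runs
incident_downtime = 18.5             # minutes lost to real incidents

remaining = budget - chaos_downtime - incident_downtime
print(f"error budget: {budget:.1f} min, remaining: {remaining:.1f} min")
# If `remaining` goes negative, alerts fire and further chaos runs pause.
```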
B: Some of these questions are part of the standard template in the meeting notes, questions that help us learn how the end-user community is practicing chaos, and a few others just came to my mind as you explained the use cases. Thank you so much for being patient and giving us all this information.
B: I'm sure the community is going to make good use of all the answers you provided. We will try to crystallize this information as part of the user experience and user story sections of the white paper. I'd now like to bring in Sergio for his views on what was just explained. Are there any questions you'd like to add?
B: Cool, I see Brad has joined the call. Hello Brad, would you like to introduce yourself?
C: Yeah, my name is Brad, I'm from New Zealand. Normally I'm living in Australia, but I just found that this meeting suits my time zone, so yeah, I'm happy to be here.
B: Awesome, thank you for joining, Brad. This is the chaos engineering work group meeting, as you know. It's a cross-TAG, cross-project working group in which we're trying to build knowledge around cloud native chaos engineering and crystallize all these learnings into a white paper. As part of this effort we're inviting end users to come and talk about how they practice chaos engineering, what challenges they face, and what tools they use.
B: Also what plans they have for the future, things like that, so we can take their learnings back to the team and put them into the white paper, so that beginners or enthusiasts who are planning to do chaos engineering can derive some learning from it. Today we had Mani from Halodoc ID, a med-tech organization based out of Indonesia, come and talk about their use cases, and we captured some interesting answers and perspectives on this call, which will be made available in the recording.
B: So please feel free to take a look at that, Brad. We also had an insurance group speaking in the previous meeting, and there's a recording available for that as well. So that's some context for you. And Mani, I just want to leave the floor open for any questions from your side. For example, what is your ask of the CNCF community in terms of chaos engineering? What kind of resources would you like to see here? What kind of standards might you like to see? Anything you would like to see from this community?
C: Yeah, sure. So I haven't done chaos engineering before; I guess from this working group it would be really nice to see some sort of introduction to it. A lot of people coming to these webinars are beginners as well, so it would be cool to see some sort of hands-on labs. And do you have a specific tool that you'd like to use, Litmus or one of the other few?
C: Do you use a specific project, or is this sort of about chaos engineering in general?
B: Mani today explained how he has been using Litmus, why they chose Litmus among the tools they had, what the basis of their comparison was, and things like that.
C: Chaos engineering, yep, yep, perfect. And if you ever want to do webinars, I'm happy to help; I'm a CNCF ambassador as well, so I can help get you webinars and things. Just let me know how I can help.
B: Sure, that would be awesome; thank you for offering to contribute, Brad. I've just posted the meeting notes and the charter link in the Zoom chat; let me go ahead and paste that once again. Please feel free to add yourself as an interested party.
A: To answer your questions: we, being a startup, wanted to evaluate tools which are open source.
A: We're also using other tools: Argo CD is one tool we're using, we're using the Prometheus and Grafana stack, and we're using OPA Gatekeeper. These are some of the tools we have already evaluated and adopted and are using. Our primary point is that, since we have all our hundred-plus services running in Kubernetes, we wanted the tools we bring in to be cloud native,
A: I mean Kubernetes-native tools; that was one of the things. And if you ask me about the roadmap: yes, in this quarter we have planned to cover all the node-specific areas. And based on the discussions we've had in the past, one of the key things I would like to see integrated is user roles and user groups in the user management.
A: That is something we're looking forward to working closely with you on, because currently only the SRE team has access. We have to manually go and add users to the user management app and then provide access; it doesn't have G Suite or any other integration, whereas all the other tools we have do have G Suite integration, so any user who wants to view the dashboards or the metrics can do so. For this one, though, we have to add users manually.
A: So that is something we're looking forward to in the coming quarter, and, as I said, the node-specific and storage-specific chaos experiments need to be a focus in this period.
B: Great, that sounds like a very exciting roadmap. We'll be very happy to extend any assistance you might need in terms of approaches you would like to validate, and if there's anything you would like to share about your chaos journey in other public forums, such as the CNCF end user forum, or if you would like to do a webinar, we have Brad here who is basically going to help us with that aspect.
B: If you'd like to present anything about your chaos journey, if you want to talk about specific use cases and incidents and how you've constructed scenarios around them, you are most welcome. As a CNCF group we'd also like to thank you for using the other tools in the ecosystem for continuous deployment and policy enforcement.
B: I'm sure the chaos ecosystem is going to integrate tightly with all these platforms as well, giving you a wholesome, integrated experience in terms of maintaining and running your applications in production. And as you rightly pointed out at the beginning of this call, as a medical tech startup, reliability is very critical: you need to be able to connect your patients to doctors.
B: You probably need on-time reports, and your websites and OTP mechanisms have to work. So I think this is a very good use case for chaos engineering, similar to the criticality it has in domains like financial groups. Thank you for joining today and sharing all the thoughts, perspectives, and insights.
B: The notes are at this point still under construction, so we will share them on the CNCF chaos engineering work group Slack channel, probably in a week's time or so. Feel free to add to them, and you're most welcome to join this work group any time in the future, perhaps do a couple of quick demos, or bring up any other questions or challenges you might have around chaos engineering.
B: Right, thank you everyone; hope you enjoyed this, and we'll meet at the next meeting.
C: Thank you, yeah. It was very nice to meet everyone, and I'm looking forward to the next one.