From YouTube: Argo Workflows and Events Community Meeting 20 Oct 2021
Description
02:50 Hera Workflows Python SDK demo
32:30 How LitmusChaos uses Argo Workflows
A
Okay, good morning, thank you everybody for coming along today. Just a quick overview of what we're going to be doing today: we're going to have a demo of a new Python SDK for Argo Workflows from Flaviu Vadan shortly. I think that's pretty interesting.
A
We know we've got a lot of Python users. And then we're going to have a presentation and demo on LitmusChaos from Shubham and Karthik, and then an opportunity to ask any questions or discuss any other topics that you would like to discuss today.
A
If you can add yourself to the attendees list, that would be awesome. For people who are new to this meeting: we tend to talk about Argo Workflows and Argo Events, and we tend to have people come along and demonstrate those two products. I won't tell you any more about those because I'm sure you're familiar with what they are. Who are we, who runs these meetings? I'm Alex. I am a principal software engineer on Argo Workflows and Argo Events.
A
I used to work on Argo CD as well. During this, if you want to ask any questions, feel free to ask. Typically we have a Q&A at the end of any demo or presentation; that's a great opportunity to ask them. And if you want to come back and ask any questions afterwards, something that slipped your mind at the time, then obviously you can come and ask those same questions on Slack, that's fine!
A
We are recording this, so if you want to share it with some colleagues, the recording typically goes up on YouTube the next day, sometimes two days, and then you can share that out.
A
Just a quick announcement: three weeks ago we started running a weekly meeting on Tuesdays at the same time, 10 a.m. Pacific Standard Time, which is an opportunity to learn what people are working on right now and learn a bit about what milestones and roadmaps we're doing. That's aimed at code contributors.
A
So
anybody
working
on
actual
co-contributions
at
the
moment,
if
you're,
if
you're
doing
that-
and
we
also
talk
about
things
like
roadmap
and
what
the
future
will
be
for
our
go
workflows,
events
and
also
for
data
flow
as
well.
Okay,
so
flavio,
are
you
ready?
C
Okay, there we go. Hi everyone, my name is Flaviu. I'm a software engineer at Dyno Therapeutics, and today I'm going to be talking to you about what I like to call the missing Argo Workflows Python SDK. It's called Hera, and its main intent is to make it easy for scientists in particular to adopt and use Argo Workflows, by providing them with a very easy-to-use interface for constructing and submitting Argo workflows. I'll talk a bit about Dyno.
C
Our computational group is composed of scientists and engineers, and our scientists work on numerous algorithms that are typically run in workflows, for running things like model training in this realm of machine learning or computational biology: workflows that take data from the laboratory and apply specific processes to that data. For organizing, scheduling, and submitting these workflows,
we've used multiple platforms in the past. The one that was predominantly used when I first joined was Kubeflow, and of course we used it for notebook hosting, but we also used the Kubeflow DSL. For the majority of our scientists, it was very, very challenging to debug the workflows that they were scheduling, mostly because of the way the cluster was set up; observability was pretty challenging for them to understand. In addition, the syntax and vocabulary adopted by the Kubeflow DSL are not very welcoming for numerous people, probably for the academic community, and that made it very challenging to set up things like parallelism, because of the way Kubeflow extracts the variables from the payload that it receives when scheduling a parallel workflow.
C
So once we moved away from Kubeflow, we adopted Argo, and at first we used the Argo Workflows DSL to restructure and schedule some of the workflows that we wanted to submit. It was great at the start; it allowed us to obtain almost instant value from Argo Workflows. But again we faced numerous challenges, and it's not most of the engineering group but our scientists who didn't feel empowered to build these workflows on their own, for numerous reasons.
C
But I think primarily it was because this DSL exposes a lot of the objects that come from the Argo Workflows Python client, which uses the OpenAPI schemas, if I'm correct. And you also have to write this very specific syntax to obtain input parameters and whatnot, and all of a sudden our scientists need to know a lot of elements about Argo in order to construct and submit a workflow.
C
In addition, it makes it challenging to easily skip steps, which is often something that our clients want to do during their experiments.
C
And lastly, it makes it very challenging to request specialized resources such as GPUs, because you need to essentially write your own internal library that wraps the workflow object that comes from this SDK in order to access specific fields on it, set up node selectors, set up whatever NVIDIA card you want, and things like that.
C
The second SDK we tried was Couler, and it solved some of the DSL problems we had before. It is certainly more Pythonic, but inputs are still a problem. I've in the past submitted an issue mentioning this, because we're using environment variables for the container, but again you still need that very specific syntax, and a lot of the workflow setup is quite confusing for a lot of our contributors.
C
In addition, the last time we used it, you couldn't submit to a custom domain. We have an Argo server deployed to Kubernetes, and it sits behind an IAP, so there's a specific domain we need to hit to reach the server, and for running Couler
you specifically need kubectl to access the Kubernetes deployment. This is a big no-go for Dyno, mainly for practical reasons but also from a security standpoint. We primarily don't want our scientists to have to worry about these engineering tools, such as kubectl, in order to port forward and things like that, which conceptually doesn't sound that challenging. But the value for our company is in the scientific pursuit, not in the use of kubectl for submitting workflows.
C
So we thought of building our own internal SDK with simplicity at its core, and we had specific requirements for that. We wanted to easily understand dependencies.
C
We wanted an easy way to set up parallelism over whatever jobs we want to submit. We wanted to submit it easily.
C
We wanted to maintain some high-level Argo vocabulary, to maintain consistency with the Argo UI, for example, because there's a lot of value in using the UI for our scientists for debugging purposes. And we wanted to support Pydantic schemas, because they're easily JSON serializable and that fits nicely with the built-in Python json package.
C
With that, I'll switch my screen and offer some demos. Tasks: a task and a workflow are primary citizens of Hera, and we have this concept of a workflow service. The workflow service is a wrapper around the workflow service API that simply sets the right configuration for you to submit your workload to your own domain, under the assumption that you're able to provide a bearer token to pass authentication for workflow submission. Then you inject your service into a workflow, which currently takes a name.
C
And all we're adding here is a function that's callable. This is a very simple toy example, of course, but you can imagine our scientists building something a bit more complex than this, as this atomic unit of execution that gets submitted to Argo for execution. So our clients, our scientists, now get the opportunity to exclusively focus on the implementation of the scientific code, with a very simple interface for workflow submission.
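A minimal sketch of that pattern, assuming the early Hera API shown in the demo; the import paths and the WorkflowService signature vary between Hera versions, and the host and token are placeholders:

```python
from hera.task import Task
from hera.workflow import Workflow
from hera.workflow_service import WorkflowService

def say_hello():
    print("Hello, Hera!")

# The service wraps the Argo workflow-service API; host and token are placeholders.
ws = WorkflowService(host="https://my-argo-server.my-domain.com", token="my-bearer-token")
w = Workflow("hello-hera", ws)

# A Task takes a name and a plain callable: the atomic unit of execution.
t = Task("say-hello", say_hello)
w.add_task(t)
w.submit()
```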
C
So again: they build the task, they add their tasks to the workflow, and they submit. I'm going to spare you from looking at the UI initialization and things like that; I just took a very simple screenshot for this example. The next one I wanted to show you is the classic diamond that is offered in the SDK examples. Again you set up your workflow service, you make a workflow, and then you get to focus on your tasks. You'll notice that in this case we now have a parameter that we pass in, and the way you perform this mapping is, you pass in this dictionary payload that says: map 'message' to this thing. Then it's just going to get it, it'll execute, it'll have that parameter for you, and it'll execute the print statement. The way you set dependencies is you say task dot next, this other task.
C
So in our case we have a.next(b), a.next(c), b.next(d), c.next(d), and that's going to give us the diamond. You can also use the right-shift operator, and I'll have an example of that as well. Again, you add your tasks, submit your workflow, and you're done, and I have a screenshot of that here. For the dynamic fan-out part, we've introduced this InputFrom concept that takes a task name.
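A sketch of the diamond under the same assumptions (a Task taking a callable plus a list of parameter dictionaries, `.next()` or `>>` for dependencies):

```python
from hera.task import Task
from hera.workflow import Workflow
from hera.workflow_service import WorkflowService

def say(message: str):
    print(message)

ws = WorkflowService(host="https://my-argo-server.my-domain.com", token="my-bearer-token")
w = Workflow("diamond", ws)

# Each task maps the function's `message` parameter to a concrete value.
a = Task("a", say, [{"message": "This is task A!"}])
b = Task("b", say, [{"message": "This is task B!"}])
c = Task("c", say, [{"message": "This is task C!"}])
d = Task("d", say, [{"message": "This is task D!"}])

a.next(b)  # equivalently: a >> b
a.next(c)  # a >> c
b.next(d)  # b >> d
c.next(d)  # c >> d

w.add_tasks(a, b, c, d)
w.submit()
```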
C
So in our case we have a generate and a consume task. The consume task depends on the generate task, and it takes the value parameter that's extracted from the JSON payload that comes from the generate task. We've resorted to a simple json.dumps to standard out for now; perhaps in the future we're going to have something that looks a bit more friendly, but this is sufficiently easy to understand, and we've added a lot of documentation around this.
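A sketch of that generate/consume pattern; the `InputFrom` name and its signature follow the demo and may differ between Hera versions:

```python
import json

from hera.input import InputFrom
from hera.task import Task
from hera.workflow import Workflow
from hera.workflow_service import WorkflowService

def generate():
    # The payload is written to stdout as JSON; Argo captures it and
    # fans it out to the consuming task.
    print(json.dumps([{"value": i} for i in range(5)]))

def consume(value: int):
    print(f"Received value: {value}")

ws = WorkflowService(host="https://my-argo-server.my-domain.com", token="my-bearer-token")
w = Workflow("generate-consume", ws)

g = Task("generate", generate)
c = Task("consume", consume, input_from=InputFrom(name="generate", parameters=["value"]))
g.next(c)

w.add_tasks(g, c)
w.submit()
```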
C
So again: generate.next(consume), add your tasks, submit your workflow, and then it looks like that, essentially. We've also added retries, which take a duration and a max duration, and that's just a simple function that says: generate a number between 0 and 1; if it's less than 0.5, raise an exception; otherwise it's a success. In this case it tried twice when I ran it. And as a last example that's a bit more complex, I have a task A that takes multiple messages now, and this is how you set parallelism.
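A sketch of the retry example; the `Retry` class with `duration`/`max_duration` follows the demo's description, so treat the exact name as an assumption:

```python
import random

from hera.retry import Retry
from hera.task import Task
from hera.workflow import Workflow
from hera.workflow_service import WorkflowService

def flaky():
    # Fails roughly half the time so the retry policy kicks in.
    n = random.uniform(0, 1)
    if n < 0.5:
        raise Exception(f"Unlucky draw: {n}")
    print(f"Success: {n}")

ws = WorkflowService(host="https://my-argo-server.my-domain.com", token="my-bearer-token")
w = Workflow("retry-example", ws)

# Retry with a backoff duration and an overall cap, per the demo.
t = Task("flaky", flaky, retry=Retry(duration=1, max_duration=60))
w.add_task(t)
w.submit()
```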
C
You say: process this input, and this input, and this other input, and then the task will just execute in a parallel fashion. It'll make a task group for you, and it'll execute the function that you submit using each of those specific inputs. I made some linear tasks as well, just for the purposes of illustration, and again you make your dependency chain.
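A sketch of that fan-out: passing a list of parameter dictionaries to a single Task makes it run once per entry, in parallel, again assuming the demo's API:

```python
from hera.task import Task
from hera.workflow import Workflow
from hera.workflow_service import WorkflowService

def say(message: str):
    print(message)

ws = WorkflowService(host="https://my-argo-server.my-domain.com", token="my-bearer-token")
w = Workflow("parallel-example", ws)

# One Task, three inputs: Hera turns this into a task group that runs
# the function once per dictionary, in parallel.
a = Task("a", say, [
    {"message": "first"},
    {"message": "second"},
    {"message": "third"},
])
w.add_task(a)
w.submit()
```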
C
You say a, b, b1, whatever, all the way to d. You add all of your tasks; you could have put them in a list and then star the list, of course, but this is just the simple way. Then you submit your workflow, and you end up with something like this. It's overall quite easy to understand conceptually what a task does, especially when you're using things like default parameters and whatnot. So, currently, it has a lot of options that come with defaults. It uses the python:3.7 image, for instance, so if you don't have internal dependencies you can just use that one. But at Dyno, for instance, we have a lot of custom code, so we inject our own images into this image field, and then we execute whatever code we want to submit to Argo using that specific image.
C
It has a command field for supporting things like running the Argo script in parallel using some product like Horovod, for instance. We use Horovod for distributed GPU training, and we use a horovodrun command to launch the Argo script whenever the pod starts up. You can pass in environment variables, and there's this concept of resources, where you can say: I want whatever min CPU, max CPU, GPUs, a volume, existing volumes, and empty-dir volumes. The reason we introduced Pydantic is because you can build schema validations, which are quite nice, especially if you apply any business logic in your workflow submission process.
C
It also has tolerations and node selectors, and the reason we resorted to node selectors is because of differences between GKE, EKS, and Azure AKS, I think: to provide the opportunity to use Hera irrespective of the cloud provider that runs your Kubernetes cluster, so that you can do things like request specific GPUs using the very particular node selectors that you're using.
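A sketch of requesting specialized resources under these assumptions: the `Resources` fields and the `node_selectors` parameter approximate the demo; the GKE accelerator label is the real one, but treat the exact Hera field names as illustrative:

```python
from hera.resources import Resources
from hera.task import Task
from hera.workflow import Workflow
from hera.workflow_service import WorkflowService

def train():
    print("Training on a GPU node...")

ws = WorkflowService(host="https://my-argo-server.my-domain.com", token="my-bearer-token")
w = Workflow("gpu-training", ws)

t = Task(
    "train",
    train,
    image="my-registry/my-training-image:latest",       # custom image with internal deps
    resources=Resources(min_cpu=4, max_cpu=8, gpus=1),  # illustrative field names
    # On GKE, accelerator nodes carry this label; other clouds differ, which is
    # why Hera exposes raw node selectors rather than a GPU abstraction.
    node_selectors={"cloud.google.com/gke-accelerator": "nvidia-tesla-k80"},
)
w.add_task(t)
w.submit()
```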
C
However, I also wrote a very simple example that showcases how you can write your own wrapper to simplify submission even more, so you don't really need the service, because your company will probably have a single way of generating the token, and your company will probably have a single domain for your Argo server. So you can imagine building your own MyWorkflowService that's a wrapper around the workflow service, with a consistent domain and a simple way to generate your token.
C
I just have a mock here, plus whatever parallelism you want to put in. Similarly, you can have very specific tasks for a specific domain at your company, that use maybe very specific images that are only allowed to take in specific functions, or have default retries and whatnot. In our case, we use Google Kubernetes Engine, so we know that there's this label on the nodes that have NVIDIA K80s on them, for example, and we just have that as the default node selector for accessing K80 GPUs. Or if you have specific commands, specific environment variables that you want to inject, or volumes or something like that, you can write your own, and that just simplifies your workflow submission interface even more. It saves a single line, but at least you're saving your colleagues from the need to write these details themselves.
C
So we currently have a repository under argoproj-labs named hera, and I'm thinking about using hera-workflows as the PyPI name, perhaps. I have a note here to remove the contextual elements, but they have already been removed; what I mean by contextual elements is anything specific to Dyno, because it is ultimately an internal SDK. And again, I'm very open to feedback around the execution model, the scheduling, and things like that.
D
Okay, and could you elaborate a little bit on why and what your data science team needs from kubectl? You mentioned you still need kubectl to work with workflows submitted via Couler, right? Since Couler already provides things like status checking, I wonder what else your team needs.
C
And that's why you need kubectl: you need your Kubernetes configuration to be set in such a way that Couler can use it for submitting the workflow, and that's where the problem comes in, because we do not want our scientists to have to worry about these types of things, about having the right context, the right namespace, the right whatever, for submitting their workflows.
C
So, for observability, I just wanted to clarify that they were definitely going to the UI; it's just the submission part that was problematic for us. Okay.
C
Exactly, yeah. So all of this stuff is inside a script template, and whatever types your parameters have, it'll import, for example, json, and it'll serialize your value. So this import json: there are actually two jsons in this script if you look at it, because there's an import json that's automatically added in order for the script to add something like message, you know, inputs.parameters.x.
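To illustrate what is being described, here is a hypothetical rendering of a script-template body for a task whose function takes `message`; this is not Hera's exact output, just the shape of it:

```python
# Hypothetical generated script body (not Hera's exact output). The first
# `import json` is added automatically to deserialize the input parameter;
# a second would appear if the user's own function also imported json.
# Argo substitutes the {{inputs.parameters.message}} placeholder before
# this script runs inside the container.
import json

message = json.loads('''{{inputs.parameters.message}}''')

print(message)
```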
C
Jesse asked: could the workflow service also work with the k8s API? Probably, yeah, I think so, I don't see why not. I take your question, Jesse, as: can we use the deployment the same way Couler does? And yes, absolutely. Your second question, Jesse, about how the code becomes what's executed inside the container: that's similar to what Terry just asked, and yes, it is via script templates.
C
I
mentioned
that
it
looks
the
hero.
Workflows
repository
looks
empty
right
now.
Yes,
that
is
correct.
I'm
currently
working
with
bala
and
and
alex
to
to
gain
access
to
that
repository,
and
I
do
have
a
community
an
argo
project
like
labs
community
issue,
submitted
to
to
to
gain
access
to
that
repository.
Essentially.
E
Do
you
anticipate
you'll
run
into
like
the
size
limit
of
stuffing
everything
into
the
script
template,
especially
if
you
have
like
a
lot
of
functions
or
long
functions,
and
then
what
do
you
think
the
mitigation
for
that.
C
Yeah, this is certainly a concern, and I can share from personal experience that we have some scripts, specifically for training machine learning models, that are quite long, and if we schedule things that are in the thousands, then we start encountering problems. The easiest way we found around it right now is to submit multiple workflows. It's quite easy to do that, because Hera presents a very accessible interface for programmatic submission, and we just wrap tasks in different workflows.
C
Yeah, the biggest reason we like using it this way is because it helps us iterate very, very quickly. The function that you submit is added to the script template as-is, and as long as your container provides the right dependencies, you don't have to rebuild your image when you change your ML code. By contrast, with the code in a library, you have to rebuild your Docker image, because otherwise you're going to use one that's out of date.
C
Yeah, of course. So there's a part that looks through the arguments that are set; these are just your arguments, and these are structured from, obtained from, parameters. So yes, you're right, there's a lot of inspect going on. I'll show you the script itself: it looks for the start token, it grabs every single line, and then dedents. It does another dedent that I don't remember the reason for, but I can probably look into it.
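The mechanism being described can be sketched with the standard library; this illustrates the technique, not Hera's exact code:

```python
import inspect
import textwrap

def say(message: str):
    print(message)

# Grab the function's source, drop the `def` header (the "start token"),
# and dedent the body so it can be pasted into a script template verbatim.
lines = inspect.getsource(say).splitlines()
body = "\n".join(lines[1:])   # everything after the signature line
script = textwrap.dedent(body)

print(script)  # -> print(message)
```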
D
So usually I don't think image building is a pain point for them, but the more user-friendly, the better.
A
Yeah
I've
been
wondering
if
there's
like
an
intermediate
between
pre-baked
images
and
one
that
kind
of
would
bring
in
your
requirements.
But
I
think
that's
probably
not
the
case,
because
either
you've
got
pre-baked
image
that
you
you're
happy
to
use.
Or
it's
been
a
you
know,
product
relationalized
by
your
ops
team
and
it's
not
really
a
problem.
B
I
have
another
question
regarding
the
secret
because
I'm
trying
to
compare
the
product
with
airflow
a
lot
of
times,
I'm
doing
that
airflow
is
a
big
thing,
the
secret
management
basically
injected.
So
how
we
handle
here
like
do
you
creating
the
secrets
on
the
fly
and
reference
it,
or
do
you
just
like
parsing
as
a
string
as
an
environment,
variable.
C
So
there
are
multiple
ways
we
can
go
about
this
there's
a
base
environment
specification
that
will
make
an
mvar
for
you
and
there's
also
a
secret
and
specification
that
will
take
in
as
a
secret
name
and
a
secret
key
from
a
very
specific
secret.
That's
currently
available
in
a
kubernetes
cluster,
and
I
believe
there
are
different.
C
You
know
other
kinds
of
secrets
we
could
include
like
the
ones
that
pull
metadata
from
the
pod
definition,
but
we
haven't
found
a
a
clear
use
case
for
that
and
it's
you
can
think
of
her
as
the
stripped-down
version
of
argo
that
focuses
on
the
things
that
most
data
science
practitioners
will
need
and
in
time
we're
probably
going
to
be
expanding
it.
B
Oh okay, so basically you don't handle it; the data scientists assume the secret is there, so they can connect to the data store or whatever they need. In this way, I'm kind of trying to connect the dots: how do data scientists in your team do local development, and then transition? Can this run locally without any problems?
C
One of the big advantages of being able to write these functions independently of your task is that now you can write tests for them. You can do whatever you want with that function; I can literally import it and write tests for it. You can write your function in a different file; you can have a file just with tasks and one with workflows, and then you can share these tasks, perhaps, and things like that, so yeah.
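A sketch of that separation: the function lives in a plain module, gets unit-tested directly, and is only wrapped in a Task at submission time; the file names are illustrative:

```python
# steps.py: plain Python, no Hera imports needed.
def double(x: int) -> int:
    return x * 2

# test_steps.py: the same function is unit-testable in isolation.
from steps import double

def test_double():
    assert double(21) == 42

# workflows.py: only here does the function become a workflow task.
from hera.task import Task
from steps import double

t = Task("double", double, [{"x": 21}])
```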
B
I want to ask maybe just a general question; I don't know much about data scientists, that's why I'm asking some dummy questions. What's the common development process? I know data scientists are normally more focused on algorithms, so, not all of them, but they normally don't care about the runtime.
B
So
do
you
see
that
the
data
scientist
will
hand
over
the
jupiter
notebook
or
something
to
you
or
to
your
team,
but
like
then,
you
help
to
reorganize
the
their
script
and
make
it
like,
like
production,
ready
like
turn
this
in
format?
How
do
you
see
they
actually
feel
comfortable
to
do
them?
Do
this
themselves
right
now,.
C
I
see
use
cases
all
over
the
map,
the
idea
of
heroes,
to
empower
scientists
to
run
these
workflows
themselves,
and
you
know
I
have
colleagues
who
build
software
in
a
notebook,
then
use
cara
to
submit
whatever
software
they
created
to
argo,
and
you
know
they
measure
metrics
performance,
whatever
they
need
and
then
based
on
the
experiments
they
will
adjust
their
code
and
ultimately
productionize
themselves
into
a
workflow,
okay.
Okay,
thank
you.
This
is
great.
F
Yes, I hope so.
F
So hi everyone, this is Karthik, and I'm joined by Shubham. We're maintainers of the CNCF sandbox project called LitmusChaos, which is a cloud native chaos engineering project, and we've been using Argo Workflows underneath the Litmus platform, as part of the Litmus platform, to run complex chaos scenarios.
F
We just want to talk about how that came about, what's going on right now in the community, and what we're planning to do going ahead in the next few months, near term. We'll do a very quick background on what chaos engineering is, what LitmusChaos is about, and a bit of history on how and why we embraced Argo Workflows, how Argo Workflows are being used today in the platform, and what changes have been made over standard vanilla Argo Workflows.
F
So
we
will
be
doing
a
very
quick
demonstration
of
our
workflows
being
executed
as
part
of
litmus
and
we'll
then
pick
some
questions.
I'm
sure
all
of
you
have
heard
of
chaos.
Engineering
most
of
you
might
even
be
practicing
it.
It's
basically
popularized
by
netflix
about
a
decade
ago,
and
you
can
see
some
standard
definitions
here.
It's
about
injecting
faults,
controlled
faults
and
identifying
weaknesses
in
your
system
through
fault
injection
and
the
idea
of
chaos.
F
Engineering
is
to
inject
these
faults
in
a
random
way
and
direct
unpredictable
way
and
there's
also
a
lot
of
paradigm
shift
in
the
recent
times
around
how
chaos
engineering
is
being
practiced.
Originally,
it
was
always
being
done
in
production,
but
the
practice
nowadays
is
to
do
it
a
lot
in
deep
broad
environments,
especially
with
the
advent
of
kubernetes.
A
lot
of
people
are
re-architecting,
their
applications
to
microservices
model
and
they're,
not
really
ready
to
go
and
do
chaos,
engineering,
their
fraud,
it's
being
done
as
part
of
staging
environments
or
cicd
pipelines,
etc.
F
So
this
is
a
simpler
definition
of
chaos,
engineering,
something
that
we
can
all
connect
to
in
the
current
times.
Just
like
a
vaccine-
and
we
inject
harm
willfully
inject
failures,
it
could
be
node
failures,
maybe
go
fill
your
disks.
Do
packet
drops
cause
cpu
or
memory
exhaustion
on
your
parts
and
nodes,
etc.
So
this
is
basically
something
that
you
try
to
do
as
part
of
fault,
injection
and
chaos.
F
Engineering
is
a
lot
about
injecting
fault
with
a
hypothesis
around
how
your
system
should
behave
when
the
fault
is
injected,
so
you
might
want
to
know
how
the
deviation
is
same
in
steady
state.
So
you
have
an
idea
of
how
the
system
should
behave,
what
we
call
as
a
steady
state
and
then
you
inject
the
fault,
and
then
you
see
some
deviation
there.
Maybe
you
expect
some
deviation
to
happen
and
that
is
under
limit.
Sometimes
you
don't
expect
any
kind
of
deviation.
F
So chaos engineering is basically like a vaccine. One of the reasons why we got this cloud native tag attached to chaos engineering: chaos engineering has been there for a decade now, and we added this subcategory especially because of the way the community was looking at chaos in the Kubernetes world. This pyramid here basically talks about the way your services are deployed in a typical Kubernetes environment. You have your platform services, you have the Kubernetes control plane services, and you're pulling a lot of tooling from the CNCF landscape: service discovery, storage, observability, your deployment environment. You would really want to check what's happening at your cluster if one of these components is failing, and you might want to repeat those tests over a period of time, regularly. So this is about why chaos engineering is needed and how it is really more important in the cloud native world. And the other aspect,
F
The
practical
aspect
to
cloud
native
chaos.
Engineering
is
a
lot
of
people
today
that
are
dealing
with
communities
are
used
to
a
certain
way
of
describing
their
applications
or
carrying
out
the
regular
tasks.
Everything
is
declarative,
it
is
all
in
a
yaml
file.
It
is
all
stored
in
a
git
repository,
its
guitar
is
controlled.
You
have
resources
and
resource
controllers.
That
is
the
way
you
basically
go
about
doing
things
in
day-to-day
work
day-to-day
communities
world.
So
we
wanted
to
basically
do
chaos.
F
Engineering,
the
same
way,
chaos
engineering
when
I
say
the
same
way
when
you're
describing
the
chaos
intent.
This
is
the
fault
you
want
to
do.
This
is
how
you
want
to
do
it
and
when
you're
trying
to
add
some
steady
state
validation
intent,
you
wanted
to
be
able
to
do
it
in
a
declarative
fashion
and
do
it
in
a
kubernetes
native
way,
keep
it
homogeneous.
F
Let
the
developers
and
salaries
have
the
same
experience
with
resilience,
testing
and
chaos
engineering
than
they
have
with
other
things
that
they
do
on
the
clusters.
That's
when
the
latest
project
was
born,
so,
typically
the
chaos
experimentation
process
has
this
flow.
You
identify
steady
state
conditions.
Reading
for
your
services,
you
introduce
a
fault.
F
You
check
whether
the
slos
continue
to
be
met.
If
they
are,
then
it's
resilient
to
this
fault.
You
go
on
to
the
next
scenario.
If
not,
you
found
a
weakness.
You
go
back
and
fix
either
the
application
business
logic,
or
maybe
you
fix
the
deployment
or
something
on
the
platform.
You
basically
go
adopt
some
better
practices
in
terms
of
deployment
or
tuning
your
infrastructure
to
ensure
that
you
are
going
to
be
more
tolerant
to
this
one
and
for
these
particular
aspects
in
this
flow,
these
blocks
we
tried,
came
up
with
some
custom
resources.
F
So
there's
a
custom
resource
that
describes
the
fault
and
there's
a
custom
resource
that
basically
applies
this
fault
on
some
service
running
on
your
system.
There's
an
operator
that
watches
these
crs
and
launches
some
runners
to
carry
out
the
fault,
business
logic
and
the
result
of
this
experiment
is
stored
in
another
cr.
They
have
been
kept
in
separate
resources
because
there
is
a
lot
of
scope,
rich
scope
for
improvement
in
terms
of
definition
of
these
processes.
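To make the CR split concrete, here is a rough sketch of applying a ChaosEngine (the CR that binds a fault to a target) with the Kubernetes Python client; the field values are illustrative placeholders, and the full schema lives in the Litmus docs:

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Minimal illustrative ChaosEngine: the ChaosExperiment CR ("pod-delete")
# describes the fault, this engine applies it to a labeled deployment,
# and the verdict lands in a ChaosResult CR written by the runner.
chaos_engine = {
    "apiVersion": "litmuschaos.io/v1alpha1",
    "kind": "ChaosEngine",
    "metadata": {"name": "demo-chaos", "namespace": "demo"},
    "spec": {
        "appinfo": {"appns": "demo", "applabel": "app=hello-service", "appkind": "deployment"},
        "chaosServiceAccount": "litmus-admin",
        "experiments": [{"name": "pod-delete"}],
    },
}

api.create_namespaced_custom_object(
    group="litmuschaos.io", version="v1alpha1",
    namespace="demo", plural="chaosengines", body=chaos_engine,
)
```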
F
So
that's
about
chaos,
engineering
and
cloud
native
chaos,
engineering.
Why
we
wanted
to
introduce
the
ritmas
project
and
how
we
did
it
using
some
custom
resources
and
operators,
that's
the
core
of
the
project,
and
that
was
what
the
project
was
about
quite
some
time.
These
are
some
details
about
the
resources
and
what
they
define
probably
skip
that
in
the
interest
of
time.
F
Now,
when
we
started
thinking
about
argo
workflows
is
when
we
started
getting
feedback
from
people
using
litmus
in
the
community
for
doing
chaos
engineering.
So
there
are
complex
scenarios
where
you
would
probably
want
to
do
more
than
one
fault,
and
you
want
to
stitch
it
together
in
a
specific
way.
Maybe
multiple
faults
occurring
in
parallel
things
like
your.
F
You
already
have
a
degraded
system,
and
then
you
suddenly
have
a
part
that
got
evicted.
Let's
say
all
your
nodes
are
running
to
capacity.
How
would
you
simulate
that
kind
of
a
scenario
right,
so
you
would
probably
need
to
create
a
chained
failure.
You
had
a
fault
and
then
that
injected
something
into
your
system
and
it
carried
on
for
some
time.
F
You
have
another
fault
happening.
On
top
of
that,
which
probably
has
disastrous
results
like
you
might
want
to
really
check
how
your
tolerant
to
those
kind
of
scenarios
so
creating
complex
scenarios
was
one
need,
and
then
there
was
this
need
to
validate
application.
Behavior
during
the
automated
experiment
runs.
Sometimes
you
are
doing
the
fault
injection
and
you
are
also
peering
into
your
observability
systems.
You
looking
at
the
dashboards
and
you
know
exactly
what's
happening,
but
a
lot
of
times.
The
chaos
runs
as
a
background
service.
F
You
just
let
the
system
go
and
inject
failures
at
random
times
and
random
components,
of
course,
within
a
controlled
blast,
radius
that
you've
predefined,
but
still
in
a
randomized
way,
and
you
want
to
validate.
What's
happened
during
the
fault
injection
you
want
to
factor
in
the
steady
state,
validation
as
part
of
the
experiment
run
so
being
able
to
do
that
so
sometimes
steady,
state
validation
is
not
a
simple
step.
F
It
could
be
a
set
of
tasks
that
you
do
along
with
the
fault
itself,
so
that
necessitates
some
kind
of
a
workflow
logic
to
be
available
and
many
times
fault
injections
on
your
pre-product
systems
need
to
be
done
with
some
kind
of
load
generation,
some
kind
of
stressful
scenarios.
You
want
to
simulate
production
traffic,
you
want
to
run
low
cost,
widget
or
k6
io
or
any
such
load
generation,
just
as
your
fault
is
happening.
F
So
that
also
probably
needs
us
to
run
separate
tasks,
and
there
was
another
requirement
where
sometimes
you
need
some
pre-configuration
before
you
run
an
experiment
of
course
clean
up
after
the
experiment.
So
it's
these
are
all
better
structured
spot
of
workflow,
and
always
you
would
like
observability.
You
want
to
visualize
these
steps
running
in
your
system,
so
these
things
sort
of
came
in
as
the
requirements
and
necessitated
creation
of
workflows
and
instead
of
building
it
out.
We
didn't
reinvent
the
wheel,
we
wanted
to
look
at
existing
options
and
which
was
argo.
F
Some
of
the
reasons
here,
it's
a
kubernetes
native
solution,
which
is
also
the
philosophy
of
litmus,
and
we
had
several
executed
types.
We
settled
on
kts
api
because
we
had
people
running
these
workflows
on
different
clusters
having
different
runtimes
and
the
ability
to
define
source
artifacts
output.
Artifacts.
You
get
lot
of
reports,
the
experiments
that
you
run,
ability
to
visualize
and
a
lot
other
technical
reasons,
and
the
other
important
factor
was.
F
It
is
extremely
well
documented-
has
a
great
community
as
a
cnc
project,
and
there
were
initial
adopters
of
witness
who
were
already
using
the
crs
as
part
of
an
argo
workflow.
So
it
happened
to
be
an
organic
extension.
We
we
decided
to
bring
it
into
the
project
and
make
it
a
formal
part
of
the
religious
project
and,
as
we
started
doing,
that
there
were
some
parallel
requirements
that
were
emerging,
such
as
the
need
for
a
dashboard
or
a
control
center.
F
That
can
manage
chaos
across
a
fleet
of
clusters,
so
we
needed
the
workflows
to
be
created
and
managed
as
part
of
this
control
center
and
some
of
the
practical
challenges
that
we
had
or
some
of
the
additional
things
that
we
needed
to
do
while
adopting
our
workflows
was
the
ease
of
generating
the
workflows
themselves.
F
Sometimes
experiments
can
look
pretty
long
like
this.
You
have
a
chaos
engine
with
all
steady
state
steps
and
details
on
the
services
that
you
are
trying
to
target,
and
there
are
different
steps
that
you
would
do
in
terms
of
dependencies
installing
some
templates
and
running
the
tests.
Cleaning
up
things
like
this,
so
for
users,
just
just
as
fly
view
was
explaining
in
the
previous
talk,
wanted
to
make
it
a
little
simple.
So
how
do
you
simplify
the
creation
of
these
workflows
by
stitching
together
several
faults?
F
Each
of
these
faults
need
some
tuning
to
be
done,
so
we
had
to
come
up
with
a
proper
ux
and
a
proper
series
of
steps
to
be
able
to
do
that
and
make
it
simple
for
users
and
typically
with
argo
workflows.
You
have
containers
that
run
a
task
and
you
you
get
to
see
the
logs
of
the
task
that
has
been
executed,
but
in
case
of
litmus
the
chaos
engine
creation.
F
The
cr
creation
is
one
of
the
tasks
that
would
spawn
a
set
of
other
paths
which
actually
carried
out
the
old
business
logic,
something
that
we
have
referred
to
as
secondary
or
generated
paths,
and
we
would
like
to
see
the
logs
of
those
as
well.
So
that's
something
we
needed
to
achieve
and
because
of
workflow
is
now
like
a
scenario.
A
proper
scenario,
test
scenario,
test
cases
and
test
scenarios
are
maintained
in
some
kind
of
test
management
system.
F
Mostly
today,
people
are
using
github
as
well,
so
you
need
to
be
able
to
pull
the
workflow
from
source,
ensure
that
things
that
are
changed
in
your
source
are
also
reflected
on
your
control
center
of
the
care
center.
So
some
kind
of
sync,
with
the
gate
repositories
for
the
workflows
needed
to
be
maintained
and
the
workflow
status
is
actually
the
scenario
status,
and
that
needs
to
be
factoring
in
the
verdict
of
the
experiment
that
we
perform.
F
And
sometimes
when
you
switch
look
at
the
several
parts,
you
want
to
provide
some
criticality
or
weights
associated
with
experiment
depending
upon
how
mature
your
inferior
applications
are
to
the
workflows
might
want
to
execute
the
experiment.
All
the
same
probably
gave
it
a
lower
priority
than
some
other
experiment.
F
So
how
do
you
define
this
weights
and
those
are
then
used
as
part
of
some
resilience
score
calculation
in
the
platform,
so
we
added
some
extra
labels
and
annotations
into
the
workflow,
and
basically
we
were
going
to
track
metrics
of
experiments
and
we
wanted
to
know
the
parent
workflow
that
actually
ran
the
experiment.
So
there
was
some
level
of
lineage
that
was
added
into
the
workforce
to
track
the
metrics,
and
so
this
was
the
instrumentation
that
we
added
to
standard
x,
argo
workflows
and
that
resulted
in
what
we
call
now.
F
Endlessness
is
the
chaos
workflow
and
the
litmus
chaos.
Workflow
is
essentially
an
arc
workflow,
with
some
images
being
used
as
part
of
the
steps
to
influence
the
result
of
the
workflow
based
on
experiment,
status
and
experiment
result
and
they
are
being
stored
in
gate
repositories
and
picked
or
sourced
during
the
runs.
So
this
is
a
summarization.
F
This
is
an
architectural
overview
of
the
litmus
platform
and
probably
not
spent
too
much
time
on
it.
This
is
basically
the
control
center
which
helps
you
to
construct
workflows,
and
you
can
run
it
in
an
execution
environment.
We
call
it
as
execution.
Here's
execution
plane
which
happens
to
be
the
same
cluster
where
you
have
the
control
center
installed
or
it
could
be
a
different
cluster
or
different
name
space,
and
there
is
a
subscriber
here
which
takes
instructions
from
the
control
center
to
apply
a
chaos
workflow.
F
Then
you
have
the
workflow
controller
that
carries
out
individual
steps
in
the
workflow,
such
as
setting
up
the
dependencies,
creating
the
chaos,
resources
and
cleaning
up,
etc,
and
the
litmus
operator
picks
up
the
chaos
resource
and
carries
out
the
business
logic.
So
this
is
the
structure
that
we
are
using
this,
how
argo
workflows
are
being
used
in
litmus
I'll.
Probably
stop
at
this
point
and
like
show
me
to
the
demo.
F
Some
of
what
we
just
discussed
now
will
probably
be
reinforced
like
much
become
much
more
clear
if,
when
we
see
the
demo,
I
just
wanted
to
set
some
context
here
and
then
take
some
questions.
G
So yeah, before I go into the demo, I'll quickly give an explanation of the setup that I have here. I'm running my local minikube setup here and have installed Litmus: the Litmus portal and all the other accompanying deployments, like the subscribers, the event tracker, and the other operators that are required, including the workflow controller, in the litmus namespace.
G
So
this
is
my
control
thing
for
the
litmus
itself
and
then
I
have
a
demo
application
called
the
power
to
head
service,
that's
running
in
my
demoning
space,
and
this
is
basically
a
http
web
server.
So
that's
all
that
this
is.
It
has
a
service
with
it,
comparing
it,
and
I
have
a
monitoring
setup
running
on
the
my
training
space,
the
trauma
system
and
the
black
box
exporter.
G
So
if
I
open
up
the
litmus
portal
ui
I'll
quickly,
log
in
and
show
the
changes
that
I've
made
or
any
setup
that
I've
gone
through.
G
So
this
is
the
portal
that
the
homepage
that
opens
up
when
you
log
in
for
the
first
time
and
I'll,
give
a
quick
run
throughout
the
options,
and
you
know
menus
that
we
have
here
so,
on
the
left
hand,
side,
we
see
a
navigation
bar
where
we
can
see
the
different
tabs
and
see
the
different.
You
know
options
that
we
have
here.
G
So
this
is
the
home
page
that
comes
in
at
the
very
beginning,
which
gives
you
an
overview
of
the
project
that
you're
currently
in
you
have
the
option
to
switch
projects.
If
you
are
part
of
any
other
project,
but
currently
I'm
just
a
part
of
the
single
project
and
in
the
home
screen,
you
should
see
the
project,
the
workflows
itself,
the
agents
and
other
project
credit
details
itself.
So
this
is
just
an
overview
on
the
summary
of
the
project.
G
Then
comes
the
network's
workflows
or
the
chaos
workflows,
as
as
we
call
it
so
here,
we'll
see
all
the
workflows
that
we
have
run
for
the
project,
the
schedules
that
we
have
in
this
tab,
so
these
are
this-
can
be
crown
workflows
or
single
runways.
G
In
this
case,
I
have
only
one
single
workflow:
that's
you
know
just
a
single,
as
you
can
see
here,
and
I've
run
it
only
once
and
if
I
click
on
this
I'll
get
a
beautiful
visualization
inspired
from
the
workflow
visualization
itself,
so
we
can
click
on
on
these
items
and
see
the
vlogs
and
stuff,
but
we'll
do
it
a
little
bit
later
on.
So,
besides
that,
we
have
the
chaos
agents
tab
where
we
have
the
where
we
get
all
the
agents.
G
So
karthik
was
talking
about
a
subscriber
that
basically
gets
all
the
requests
for
your
you
know.
Workflow
runs
and
sends
all
the
data
back
to
the
main
control
thing.
So
we
see
all
the
agents
that
are
the
subscribers
that
they
have
connected
here.
So
currently
I
have
just
a
single
subscriber
in
my
setup.
You
can
go
ahead
and
connect
more
setup
more
subscribers.
G
If
you
want
through
the
correct
agent
option,
then
we
have
the
hub
where
we
basically
collect
all
different
types
of
experiments
and
three
different
workflows
that
we
have
so
currently
there's
only
a
single
hub.
As
you
can
see,
this
is
the
public
hub.
That's
available
in
the
litmus
repo,
but
if
you
have
your
private
hubs-
and
you
know,
if
you
have
your
custom,
experiments
or
workflows
that
you
want
to
import
directly
into
the
portal,
you
can
go
and
create
a
new
hub
and
go
ahead
with
the
flow.
G
Then
we
have
the
opportunity
tab,
which
is
basically
somewhat
similar
to
grafana
itself.
You
can
add
in
dashboards-
and
you
know,
have
an
integrated
observability
feature
directly
into
the
portal
itself,
but
currently
I
have
not
set
up
I'll,
definitely
use
rafana
for
our
demo,
but
yeah.
Besides
the
dashboards,
you
can
also
see
inbuilt,
you
know,
analysis
or
analytics
for
the
workflows
itself,
so
this
is
per
workflow
basis.
Analysis,
it's
more
useful
for
cron
workflows.
Over
a
period
of
time.
G
You
can
see
how
your
application-
or
you
know,
cluster-
is
performing
for
the
same
workflow
itself.
So
currently,
there's
only
one
run,
so
there's
not
much
to
see
here
and
finally,
we
have
the
settings
tab,
which
gives
you
a
lot
of
options
depending
on
whether
you're
admin
or
not.
But
in
general
you
have
your
own
personal
options
to
change
like
the
details.
You
know
there's
a
name
and
stuff,
then
you
have
team
management
options
for
the
project
itself.
G
If
you're
an
admin
of
the
project
to
get
those
options,
then
you
have
user
management
options
again.
If
you're
an
admin,
then
you
can
add
new
users
and
stuff,
then.
Finally,
the
main
important
things
that
we
are
going
to
be
talking
about
is
the
detoxing.
So
for
each
project
we
allow
a
particular
repository
to
be
configured
as
a
detox
repository.
G
So
in
this
case
I
have
already
configured
my
private
repository,
that's
here
as
the
source
dip
source
and
just
show
how
you
can
do
it
I'll
just
edit
it
so
in
this
to
actually
configure
the
positive.
G
What
we
have
to
do
is
basically
have
to
provide
the
url
and
the
branch
that
you
want
to
use
as
your
source
and
also
add
in
the
access
token
or
ssh
key,
so
that
you
know
the
portal
can
write
to
your
repository,
also
because
in
our
system
we
not
just
sync
from
the
repository
itself,
but
we
also
write
back
to
the
repository.
So
so,
in
cases
where
you
are
creating
a
workflow
directly
from
the
portal,
the
portal
will
automatically
sync
that
workflow
into
your
report.
Configured
get
repository.
G
So
that's
how
we
need
to
access
key,
so
I'll
just
leave
it
as
it
is
for
now.
Also
we
have
the
image
repository
option
where
you
can
specify.
You
know
if
you
have
a
custom
repository
where
you
have
all
those
runner
images,
the
different
helper
images
that
we
use
in
our
workflows,
chaos
workflows.
G
So
you
can
do
that
by
specifying
custom
values
for
this,
but
I
already
I'm
going
to
use
the
open
source
ones
that
are
already
there
publicly
available,
but
you
can
do
that
and
you
can
use
that
in
your
workflows
automatically.
So
that's
also
available
for
you.
So
going
back
to
the
main
thing:
that's
a
work
to
itself.
G
So
in
the
litmus
workflows
tab
you
get
the
option
to
scale
the
move
through
and
when
you
click
on
that,
you
are
brought
to
this
wizard,
where
you
are
taken
through
a
few
steps
where
you
can
choose
which
target
you
want
to
run
the
workflow
on
and
how
you
want
to
tune
your
workflow.
So
I'll
just
go
through
the
steps
and
that'll
probably
be
easier.
G
So,
for
example,
I
have
only
one
agent
or
one
subscriber
here
so
I'll,
select
that
as
a
target
and
move
on
and
then
you
can
create
your
workflow,
so
we
provide
a
few
different
options
to
you
know
generate
your
workflow,
so
the
first
thing
that
we
have
are
the
predefined
workflows
that
are
present
in
the
hub
chaos
hub
that
we
had
configured.
So
in
this
case
I
just
have
the
single
hub.
That
is
a
public
sub.
G
So
if
I
click
on
one
of
these
and
I'll
just
show
you
how
it
looks
so,
if
I
click
on
here
and
continue
to
the
next
focus
settings,
you
get
the
option
to
change
the
name,
and
you
know
the
description
and
basically
a
metadata
about
the
whole
flip
sense,
and
I
click
on
next
and
in
this
page,
it's
basically
where
you
tune
your
workflow
here
we
provide
a
pretty
cool,
visualization
or
kind
of
a
simulation
of
the
workflow,
how
it
will
look
after
its
run.
G
So
here
you
can
see
how
your
workflow
is
going
to
perform
like
performance,
and
you
know
how
the
steps
are
going
to
be
executed
and
stuff.
So
here
we
can
see
that
there's
a
series
of
nodes
that
are
there
series
of
steps
under
there,
but
at
the
end
of
the
workflow
we
have
two
parallel
experiments
or
two
parallel
steps
that
are
being
run.
G
So
if
I
want
to
change
that
sequence,
I
can
click
on
edit
and
you
know
make
some
changes
here
like
I
can
drag
and
drop
in
between
these,
and
you
know
get
this.
You
know
visualization
updated
and
all
these
actual
workflow
manifests
itself
updated,
also
without
actually
having
to
edit
the
code
directly.
So
that's
mostly
it
for
the
edit
part,
but
besides
that
we
also
have
options
to
toggle
or
edit
our
experiments
itself.
So
let's
say
you
have
in
this
case,
I
have
selected
a
part,
delete
experiment.
G
I
can
edit
it,
and
this
will
provide
me
some
tunables
for
the
experiment
specifically.
So
in
this
case,
first
of
all,
I
get
the
metadata
details
like
extreme
name,
the
context
and
stuff,
but
if
I
click
on
next,
I
get
the
options
for
the
target
application.
G
So
here
we'll
here,
the
wizard
provides
you
with
the
settings
to
mention
which
particular
application
or
particular
deployment,
the
particular
resource.
You
are
trying
to
run
the
experiment
on
so,
for
example,
the
first
one
is
the
app
and
is,
if
I
click
on
that,
I
will
get
a
list
of
all
the
namespaces
that
are
there
in
my
current
cluster.
So
the
portal
is
right
now
the
subscript.
The
agent
is
right
now
running
and
cluster
scopes,
for
I
have
access
to
all
the
namespaces.
G
But
if
you're
running
in,
let's
say
namespace
scope,
then
you
will
only
be
getting
the
access
for
that
particular
interface.
But
in
this
case
I
in
cluster
scope.
So
I
have
access
to
all
the
new
spaces
here.
So
I
can,
let's
say:
click
on
demo,
which
is
the
link
is
that
we
were
using.
We
are
using
for
the
demo,
then
I
can
select
on
the
app
kind,
which
is
basically
the
resource
that
I
want
to
target.
G
So
in
this
case
I'll
just
select
deployment,
but
you
have
other
options
like
stateful
sets,
rollouts
and
stuff,
and
once
it's
like
that,
you
will
have
to
choose
which
deployment
so
in
so
how
we
do
that
is
using
the
app
labels
so
clicking
on.
That
will
also
give
you
a
suggested
list
of
app
labels
that
you
can
use.
So
here
what
we
are
doing
is
we
are
fetching
all
the
app
labels
that
are
there
in
the
particular
name
space
for
the
particular
resource
type
that
was
mentioned.
G
Sorry,
the
app
link
right
so
hello
services,
app
label
and
finally,
we
have
the
option
to
do
some
cleanup
stuff.
So
in
this
case,
we'll
just
leave
it
as
retained,
so
retain
is
basically
like.
After
the
experiments
is
run,
it
will
keep
all
the
resources
like
leave,
all
the
resources
on
the
cluster
itself
and
not
clean
up,
but
we
also
have
a
global
cleanup
called
revertkiovs.
That's
talk
that
you
can
toggle
on
the
workflow
itself,
so
it
will
clean
up
all
the
resources
at
the
end
of
the
workflow
itself.
G
So
I'll
just
keep
this
retaining
in
this
case.
For
now,
and
then
click
on
next
will
take
you
to
options
of
probes,
which
is
basically
how
you
define
your
steady
state
hypothesis.
In
this
case,
we
have
a
predefined
http
probe,
which
we
can
add
also,
like
you
can
add
new
probes
if
you
want
and
there's
a
proper
documentation
available
for
the
probes
which
we
can
share
in
the
chat
if
you're
interested,
but
basically
we
have
a
various
like
a
few
different
types
of
probes.
G
So
if
I
click
on
this,
you
see
the
different
types
of
probes
that
are
available.
Http
cmd
k8s
and
from
geographers
so,
for
example,
the
http
pro
would
be
basically
making
http
request
or
to
a
particular
endpoint
that
is
specified
in
this
case.
Let's
say
the
endpoint
is
already
mentioned
as
this
one
and
this
will
allow
and
then
we
can
just
check
on
different
criteria.
So
let's
say
I
have
a
criteria
to
check
whether
I'm
getting
a
response
code
of
200.
G
Then
I
can
just
set
it
as
this
and
and
I
need
to
set
some
other
timeouts
and
we
try
values
also.
But
the
just
is
that
it'll,
the
probe
will
basically
make
sure
that
that
particular
service
is
returning.
This
particular
response
code,
that's
expected
from
it
during
the
experiment
or
after
the
experiment
is
runners.
Whenever
we
want
to
actually
run
the
experiment,
run
this
probe
so
to
select
that
like
when
the
probe
is
actually
executed,
you
can
choose
on
the
probe
mode,
so
in
this
case
set
to
continuous.
G
So
it
will
go
through
throughout
the
experiment
itself,
but
you
can
also
set
on
a
start
of
the
experiment
or
end
of
the
experiment,
and
things
like
that.
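For reference, the probe being described corresponds roughly to this fragment of a ChaosEngine's experiment spec; the field names follow the Litmus probe documentation, while the URL and timings are placeholders. Shown here as a Python dict rather than the YAML the portal generates:

```python
# Illustrative httpProbe definition, as it would appear (in YAML form) under
# spec.experiments[].spec.probe in a ChaosEngine.
http_probe = {
    "name": "check-hello-service",
    "type": "httpProbe",
    "httpProbe/inputs": {
        "url": "http://hello-service.demo.svc.cluster.local:80",  # placeholder endpoint
        "method": {
            "get": {
                "criteria": "==",       # compare the response code...
                "responseCode": "200",  # ...against the expected 200
            }
        },
    },
    "mode": "Continuous",               # run throughout the chaos injection
    "runProperties": {
        "probeTimeout": 5,              # placeholder timings
        "interval": 2,
        "retry": 1,
    },
}
```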
G
So
that
was
mostly
it
for
the
tuning
part,
oh
yeah,
so
I
I
missed
the
tune
experiment
section
where
we
specify
how
long
the
experiment
goes
on
for
so
this
section
is
specific
to
experiments
itself.
G
The
previous
sections
were
generic
for
all
experiments,
but
this
one
will
provide
you
specific
options
that
you
would
be
able
to
tune
for
specific
experiments,
so
we
have
different
options
for
different
types
of
experiments,
so,
like
karthik
was
talking
about
node
level,
experiments
and
the
network
network
experiments
and
stuff
like
so,
each
of
these
experiments
will
have
different
configurations
that
you
would
like
to
tune.
So
you
can
add
those
chainables
here
and
finish
your
work,
work
for
tuning
and
then
continue
to
the
next
step,
and
this
is
where
we
finally
select
our
weightages.
G
This
is
a
instrumentation
that
we
have
done
over
the
workflow.
So
to
get
our
residency
score,
we
need
to
mention,
or
do
we
need
to
specify
what
weights
or
weightages
each
of
the
experiments
that
we're
running
in
the
workflow
hold.
G
So,
for
example,
in
this
case,
I
just
have
a
single
experiment,
but
if
we
had
two
different
experiments
I
could
have
different
weightages
for
each
of
them
and
depending
on
the
success
or
failure
of
each
of
those
experiments,
we
generate
a
resiliency
score
which
mentions
how
resilient
your
complete
your
target,
application
or
this
environment
is
right,
and
that
is
that
depends
on
the
weightages.
G
So
if
I
have
a
weight
for
pretty,
if
I
set
up
at
a
low
weight,
then
if
this
experiment
fails,
that
does
not
affect
your
residency
score
too
much.
But
whereas,
if
I'm
setting
it
to
10
and
this
experiment
fails,
then
the
resiliency
score
of
your
experiment
or
the
workflow
will
reduce
to
zero.
Basically,
because
I
said
there's
only
one
experiment
here
so
moving
on
to
next,
I
can
I
have
the
option
of
scheduling
how
I
want
to
schedule
it.
So
we
have
two
options
here.
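As an illustration of the idea (the exact formula Litmus uses lives in its source; this sketch just shows a weighted pass rate, which matches the behavior described: one failed experiment carrying all the weight drives the score to zero):

```python
def resiliency_score(experiments: list[tuple[int, bool]]) -> float:
    """Weighted percentage of passed experiments.

    `experiments` is a list of (weight, passed) pairs; weights are the
    per-experiment weightages chosen in the portal (e.g. 1-10).
    """
    total = sum(weight for weight, _ in experiments)
    passed = sum(weight for weight, ok in experiments if ok)
    return 100.0 * passed / total if total else 0.0

# One experiment, weight 10, failed -> score drops to 0, as in the demo.
print(resiliency_score([(10, False)]))             # 0.0
# Two experiments: a failing low-weight one barely moves the score.
print(resiliency_score([(10, True), (2, False)]))  # ~83.3
```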
G
We
can
either
go
for
a
schedule
now,
which
will,
which
is
basically
just
going
to
immediately
schedule
the
workflow
as
it
is
on
the
on
the
target
cluster,
but
we
can
also
do
a
recurring
schedule
which
is
basically
going
to
schedule
it
with
the
front
to
crunch
up.
So
we
are
in
this
case
we
are
internally
using
the
current
workflows.
So
for
the
schedule.
G
Every
week,
every
month
and
depending
on
which
option
you
choose,
it
can
also
specify
the
minutes
the
time
if
it's
or
every
day
or
something,
if
it's
every
month,
then
you
can
specify
the
particular
date
of
the
month
and
then
the
time
you
want
to
run
it
on
and
then
finally,
you
can,
let's
see
like
look
at
the
overview
of
the
workflow
and
then
run
it
right,
but
I'll
not
run
this
one.
G
I
already
have
a
workflow
that
I've
constructed
specifically
for
this
demo,
so
I'll
just
quickly
go
and
take
that
up
and
I'll
run
that
and
finally,
we'll
look
at
the
visualization
part,
so
yeah
so
going
back
to
the
creation
part
again,
we
have
the
import
option
besides
the
workflow
creation
from
the
predefined
workflows-
and
I
can
click
on
this
and
create
a
workflow
here
itself,
so
I'll
just
click
on
recent
and
upload
my
m.
G
So
this
is
the
workflow
that
I've
already
generated
for
the
demo
and
this
experiment
is
this:
workflow
is
going
to
basically
be
running
a
quad
lead
experiment
on
my
demo
application,
so
the
application
that
I
had
previously
shown
so
here's
my
application
so
I'll
be
running
a
particular
experiment
on
this
application,
but
at
the
same
time
I'll
also
be
running
a
load
test
parallelly
on
it.
And
finally,
we
clean
up
the
whole
thing.
G
So
I'll
just
go
ahead
and
schedule
the
expand
and
it
takes
some
time
to
schedule
it
because
it's
now
pushing
the
workflow
to
your
git
repository
first
and
then
it's
going
to
execute
on
the
cluster.
So
now
it's
already
executed,
as
you
can
see
it's
running
while
it's
running,
I
can
quickly
show
you
that
there
was
a
decent
commit
to
my
github's
repo
that
I'd
configured.
So
here
we
can
see
that
there's
a
comment
made
20
seconds
ago
from
admin
at
the
latest
chaos.
G
So
if
I
click
on
it
and
I'll
see
that
there
are
a
few
other
offers
also,
these
are
the
workflows
that
I
ran
previously,
but
the
most
recent
one
is
20
seconds
ago.
So
this
is
the
workflow
that
I
executed
just
now
and
we
can
see
the
whole
workflow
itself,
the
manifest
of
the
workflow
and
the
steps
that
are
there
and
just
to
show
that
we
are
what
we
are
running
so
in
the
steps
in
the
template
section.
We
have
the
steps
that
you
have
mentioned,
so
the
workflow
is
gonna.
first install the experiment, then it's going to... did I run... one second, I think... oh yeah, I think I ran the wrong workflow, but it's still fine. So basically the workflow will show up here, but what we needed was a slightly different workflow. So what I'll do is terminate the workflow; we also have the option to terminate workflows directly from the portal itself. So I'll quickly terminate the workflow and run the correct one.
G
So
yeah
demos
don't
go
well
without
glitches.
I
guess
so
I'll
just
quickly
select
the
proper
workflow,
which
I
think
this
is
the
one
yeah.
I
think
this
is
the
one.
G
Yep
yeah
so
now
I'm
running
the
correct
flow
flow,
which
is
actually
going
to
run
the
quadratic
experiment
on
the
application
and
run
the
loop
test
pattern
with
it.
So
again,
going
back
to
the
repository,
I
if
I
refresh
the
page
I'll,
see
another
file
being
created
here
so
which
is
the
correct
one
this
time?
No,
it's
not
yeah
this
one,
so
yeah
yeah.
So
this
one
is
the
correct
one.
As
you
can
see,
the
steps
are
installed.
Chaos
experiment.
G
Then
we
do
a
pod,
build
experiment
and
finally,
we
do
the
load
test,
which
is
running
parallely
actually
with
the
power
build
as
we
can
see.
So
it's
running
fairly
with
the
port
delete
and
then
we
are
just
going
to
revert
the
cameras
and
clean
up
the
whole
cluster,
so
yeah
so
going
back
to
the
visualization
here.
So
I
can
see
the
experiment
running
right
now
and
the
load
test
and
the
experiment
has
both
started.
And
if
I
go
back
here,
I
should
probably
see
yeah.
G
The
pod
has
been
deleted
and
starting
up
again-
and
here
we
can
see
the
custom
yeah,
the
workflow
odds
that
have
been
generated
for
the
experiment
itself
and
clicking
on
the
pod
itself
will
get
a
few
details
of
the
experiment.
So
these
are
the
instrumentations
that
we've
added
apart
from
this
tunic
of
the
tuning
part.
So
we
allow
users
to
not
only
get
the
logs
of
the
experiment,
the
aquapod
that's
generating
the
experiment,
but
also
the
experiment
for
itself.
G
So
in
this
case
we
can
see
the
details
of
the
logs
of
the
arc
report.
But
then,
if
we
scroll
down,
we
will
see
logs
from
the
experiment
itself,
so
it
would
say
that
you
know
started
chaos,
experiment
for
delete
and
all
the
configuration
details
are
also
in
there.
So
that's
there
and
in
the
load
test
one.
If
I
go
here,
we
see
that
the
load
test
is
running
and
currently
it's
failing,
because
there's
only
one
instance
of
the
application
running
so,
okay,
so
notice
has
actually
completed,
I
think
yeah.
G
So
we
can
see
that
the
html
request
failure
is
more
than
25
percent.
So
all
our
requests
that
we're
making
is
pretty
much
going
to
be
vested
because
the
application
is
down
due
to
the
quadratic
experiment,
that's
being
run
here.
So
this
is
a
weak
scenario
where
our
application
isn't
resilient
enough
to
the
experiment.
G
So
we'll
see
that
if
our
hypothesis
is
correct,
so
we
also
have
a
http
probe
here
that
checks
for
the
liveness
of
the
application,
while
the
experiment
is
running.
So,
if
the
hypothesis
is
correct,
then
the
experiment
should
fail
and
we
should
have
a
field
node
here
once
the
experiment
finishes.
G
So
let's
just
wait
for
a
few
seconds
yeah.
So
the
experiment
has
failed.
As
you
can
see,
and
going
to
the
logs.
We
can
also
see
the
you
know
where
why
it
failed.
So
we
can
see
the
probe
has
failed
for
that
particular
experiment
and
that's
why
the
particular
experiment
completely
failed,
and
besides
that
we
also
have
the
kiosk
results.
G
So
this
is
also
being
fetched
from
the
target
where
we
are
running
the
experiment,
and
here
we
can
see
the
final
experiment
results
that
we
are
generating,
though
this
is
the
actual
litmus
experiment
result
resource
that
is
returned.
So
here
also,
we
can
see
the
result.
The
probe
status,
which
is
mentioning
that
the
particular
probe
has
failed
and
the
verdict
has
mentioned,
has
failed
so
yeah.
So
we
have
completed
the
expand
and,
besides
that,
what
we
can
do
now
is.
F
Yeah, sorry, I think we're over time, and I hope the audience has got an understanding of how the chaos workflows are being used in Litmus. So we can probably stop at this point and take questions, if any.
A
I have a question; this has been bugging me for a while, actually. I want to perform a chaos test where a service becomes unavailable. Does Litmus support that?
A
Yeah, that's right, yeah. Basically, my HTTP requests to that service fail, but I've actually got persistent TCP connections to it as well, so I've actually got open sockets which data is coming over, so it's kind of become unavailable midway during that TCP dialogue.
F
Yeah,
I
think
we've
had
scenarios
where
that
has
occurred
as
part
of
these
experiments.
There
are
also
some
chaos,
experiments
that
are
being
built
specifically
to
cause
service
and
availability,
either
by
a
hundred
percent,
god
loss
or
by
probably
corrupting
this
service
object,
taking
a
backup
and
then
just
removing
it
from
the
system
kind.
H
I think there's also another question about historical data. How do you do that?
F
Sorry, Alex, would you please repeat?
A
Say I'm running an experiment regularly, every day: how do you keep the history of that? Is it stored in a database?
F
Right, there is a database stateful set replica that's running as part of the control plane, so all these runs that you're seeing here on the dashboard are actually recorded there, and then there is an ability to run some analytics over it, to see how your runs compare over a period of time.
F
Maybe
those
runs
have
been
executed
against
different
builds
or
lasers,
or
maybe
the
same
experiment
with
the
similar
set
of
tables
have
been
run
across
environments.
Qa
is
staging
one
production,
for
example,
and
you
get
to
compare
them.
It
is
stored
in
a
database.
You're
right
for
the
chaos
results,
as
well
as
the
workflow
details.
A
Okay,
that's
cool,
so
sean
mccarthy.
Thank
you
very
much
for
doing
the
presentation.
F
The Slack is a self-invite workspace, so I'm pretty sure a lot of us are on there.
A
Great
stuff,
okay,
so
thank
you
very
much
for
the
presentation.
A
What
I'll
do
is,
if
you
want
to
ask
me
more
questions
about
this,
I
know
we've
had
a
lot
of
people
had
to
leave
for
their
elect
for
11
o'clock
meetings.
You
can
obviously
ask
about
that
and
I'll
be
putting
a
recording
up
on
to
youtube
today
or
tomorrow,
for
people
to
be
able
to
watch
the
rest
of
this.
A
Thank
you
all
very
much.
I
hope
you
have
a
wonderful
day.