From YouTube: Keptn Auto-remediation Working Group - May 12th, 2021
Description
Meeting notes: https://docs.google.com/document/d/1y7a6uaN8fwFJ7IRnvtxSfgz-OGFq6u7bKN6F7NDxKPg/edit
Learn more: https://keptn.sh
Get started with tutorials: https://tutorials.keptn.sh
Join us in Slack: https://slack.keptn.sh
Star us on GitHub: https://github.com/keptn/keptn
Follow us on Twitter: https://twitter.com/keptnProject
A: Hi everyone, and welcome to the next episode of the auto-remediation working group. Today's topic is the tool-agnostic remediation process: why it's needed, what the benefit of setting it up is, and why it's important. Last time we talked about the remediation template that we want to work on.
A: I will open it in a couple of minutes, but during that discussion we figured out that it's not really obvious why we should build a remediation workflow that is tool-agnostic. I mean, there are solutions out there like Ansible where you can build remediations and then automate your operations.
A: So why does it make sense to have a separation between the process and the tooling?
A: This is what I want to demo today, and then I also want to drive the discussion. But before doing that, let's jump to the template, to the scenario we agreed to work on: it's a template for defining a process around JVM memory exhaustion. When we detect this problem, we want to kick off a remediation process that fixes it for us. And what did we define during the last two meetings?
A: It's the process as laid out right here. First there is the data gathering that we want to kick off. Then the change process that needs to be tackled, in other words, who is allowed to trigger the remediation and who should be involved in fixing the problem. And then the recovery process itself, where Kevin gave us really great input on how they are solving this problem.
A: They use GitLab to spin up an Ansible Tower and execute the playbook. Then we want to do the validation step to make sure that the JVM is running again and accepting traffic, and finally, in case we could not resolve the problem, we escalate it to inform other parties that should be involved. This is the process we talked about last time, and we also defined action items on it.
A: Kevin, you texted me, or sent me an email, about the action item of working on the sample app. Can you share with the group what you accomplished or what you can contribute here?
B: Sure. A member of our company had created a sample app that basically just had two buttons: one increments a counter to show that the application is working, and the second button kicks off a loop that exhausts the memory.
B: So I reached out to him to see if we could turn that over to this group, since it was already done and working. He did remind me that we have a company policy about sharing any intellectual property created in-house, so even though it's very basic, we didn't want to share it directly. But he did give me the looping code that runs it out of memory.
B: Then we talked about possibly putting it into the sample app used for Keptn, the carts service in the Sock Shop app, and you had asked me for a good location. Oh, my computer wants to restart, all right.
B: You had asked me for a good location to put it, and I was looking at that sample app and realized that they already had a similar scenario built into it, based on CPU exhaustion, triggered by sending it a certain ID for an item in the cart. So I suggested: wait, why don't we do that same thing? We'll create a unique ID for JVM memory exhaustion, insert that looping code, and then we should be good to test it that way.
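The looping code itself is tiny; the real snippet is Java (the carts app is Spring Boot) and wasn't shared directly, but the idea is just an unbounded allocation loop that holds references. A hedged Python sketch of the same idea, with an optional cap added so it can be run safely (the cap is my addition, not part of the app described):

```python
def exhaust_memory(chunk_mb=1, limit_mb=None):
    """Allocate memory in fixed-size chunks, holding references so nothing is
    garbage-collected. With limit_mb=None this runs until the runtime raises
    MemoryError (the JVM equivalent throws OutOfMemoryError)."""
    hog = []
    allocated = 0
    while limit_mb is None or allocated < limit_mb:
        hog.append(bytearray(chunk_mb * 1024 * 1024))  # allocate chunk_mb MiB
        allocated += chunk_mb
    return allocated

# Safe demo run: stop after 16 MiB instead of actually exhausting memory.
exhaust_memory(chunk_mb=1, limit_mb=16)
```

In the carts app this loop would sit behind the special item ID mentioned above, so hitting that ID triggers the exhaustion.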
A: All right, yep, thanks for the update. I think doing this and going with the carts app would make sense, since it's a Java Spring Boot app. Adding this additional scenario into the app makes sense and should work out, all right.
C: Since Alex is joining here as well: I was talking to Johannes today about whether it would make sense to use the pod-tato-head application for this. Do you know, Alex, are there any plans to have a Java service also in the pod-tato-head application? Because right now everything is written in Go, and what the working group has agreed here is to first take a JVM process.
D: There's the distributed version that's currently in a branch; there is a PR open. It's not yet merged because the Litmus people have worked on it, and they will keep working on it. So you could easily provide a leg or a hat service that is written in Java. You just have to follow the same HTTP REST interface, so that wouldn't really matter. You wouldn't have to replace all of them; we'd just have to provide one of those services as a Java service, if you wanted to.
C: Okay, because if we have to build it into one service, then we can also think about reusing the pod-tato-head services.
D: But you would have to install it inside the image.
A: All right, okay, this could be option two if we don't want to go with carts. And as we also discussed last time, I want to demo today how Keptn can implement this process: the data gathering, change process, recovery and so on. This is what I actually want to do today, and then also drive the discussion of why to have a tool-agnostic approach. Let me get started by going to Keptn first, and here is what I did.
A: The stage is called production-customer-a, and what I basically did is add this remediation sequence, this one. Then I added the tasks: the data gathering, get-remediation, action (meaning really doing the execution), and then the evaluation task.
A: Then there is a sequence that takes an action when a remediation failed; this one I just called escalate. I configured a trigger, and this trigger is configured in a way that it starts this sequence whenever a remediation failed. It has just one task in it, the escalation task. This is what I added to my shipyard, to my process definition, and I'm now able to trigger that particular sequence.
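A shipyard like the one described might look roughly like this. This is a hedged sketch, not the exact file from the demo: the stage, sequence, and task names are taken from the discussion, and the trigger syntax follows the Keptn 0.8-era shipyard format as I understand it, so treat the details as assumptions:

```yaml
apiVersion: "spec.keptn.sh/0.2.0"
kind: "Shipyard"
metadata:
  name: "shipyard-remediation-demo"
spec:
  stages:
    - name: "production-customer-a"
      sequences:
        - name: "remediation"
          tasks:
            - name: "datagathering"      # collect context about the problem
            - name: "get-remediation"    # look up the matching remediation
            - name: "action"             # execute it
            - name: "evaluation"         # validate the fix
        - name: "escalation"
          # start whenever the remediation sequence finishes with a warning
          triggeredOn:
            - event: "production-customer-a.remediation.finished"
              selector:
                match:
                  result: "warning"
          tasks:
            - name: "escalate"
```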
A: Today I will use Postman to do this manually, but just imagine that this can be automated by an integration with a monitoring tool that monitors your environment, identifies a problem, and, when the problem appears, sends out an event to Keptn to kick off the remediation process. This is what I'm doing now manually.
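What Postman sends here is a CloudEvent posted to the Keptn API. A minimal sketch of constructing such a payload follows; the event type pattern and field names follow the Keptn CloudEvents spec as I recall it for this kind of sequence, and the project/service names and the endpoint in the comment are illustrative assumptions:

```python
import uuid
from datetime import datetime, timezone

def build_remediation_trigger(project, service, stage, sequence="remediation"):
    """Build a Keptn-style CloudEvent that triggers a stage's sequence."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "time": datetime.now(timezone.utc).isoformat(),
        "source": "postman",  # in production: the monitoring tool integration
        # Keptn derives which sequence to start from the event type:
        "type": f"sh.keptn.event.{stage}.{sequence}.triggered",
        "contenttype": "application/json",
        "data": {
            "project": project,
            "service": service,
            "stage": stage,
            "problem": {"problemTitle": "JVM memory exhaustion"},  # illustrative
        },
    }

event = build_remediation_trigger("demo", "carts", "production-customer-a")
# This dict would be serialized to JSON and POSTed to the Keptn API
# (e.g. POST {KEPTN_API}/v1/event with an x-token auth header).
```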
A: I see that the remediation has started, and when I click on it here, I already see the remediation is running and the data gathering step is executed. In a couple of seconds... let me just refresh.
A: For some reason the next task is currently not triggered, but let me just show you an execution I did a couple of minutes earlier, where the data gathering was done, the remediation was retrieved, the action was executed and the evaluation happened.
A: I have to be clear that this is currently not actually implemented in the background; I'm just showing that the process can be modeled and executed as laid out here, because the services that should do this job are not available yet. Instead, I'm using this echo service. It's a very simple dummy service that just receives the event and sends back that it will take care of doing the job.
A: It waits a minute, or a second.
A: Then it finishes the action and returns back to Keptn. For this particular task you can see that two services were actually listening, the echo service and the data-gatherer service; both were running in parallel doing their job, and after a couple of seconds they were done, and Keptn moved on to the next task in the process, which was get-remediation, also mimicked and simulated here by the echo service. As I said, I also configured my remediation process in a way that the escalation will always happen, because the trigger I'm listening to here has a filter on warning, and whenever the warning appears, the escalation will be triggered.
A: Kevin, do you have feedback on that? What's your take?
B: Yeah, just to recap what I was speaking of last time; as we talked about it, I kind of had an aha moment. In our remediation process, as we go through each of these steps, every one was done by Ansible, because it needed logic, programming logic, that Keptn itself wasn't really built to go do.
B: So, for instance, the data gathering: okay, we can send the problem event into Keptn from Dynatrace, but then some smarts needed to go and retrieve more information that we could act upon. So we were using Ansible for everything, and my statement, I guess, was: if I'm only using Keptn for auto-remediation...
B: ...should I install another product to do this, or should we just continue to use Ansible, since that was doing 100% of the work? Why put some other tool in the middle and have a dependency? Then we further discussed the fact that Keptn is also available for quality gates, and that kind of spurred the discussion of: oh yeah, right now we're talking with an operational mindset of "we have this thing in production."
B: Something happens, we want to fix it. But it can do more than that: hey, we just deployed new code out of this pipeline, we want to run tests against it, then validate, oh nope, it's bad, now we want to do a rollback. That process is also remediation. So I think the value grows the more Keptn is in your process of quality gates and in the pipeline.
B: So I do understand that value now, but I just don't know how you would sell this to people that are only interested in auto-remediation.
B: They kind of have a tool of choice right now, Ansible, Puppet, Chef, whatever their tool is, and that's where I guess I would need help seeing: where's the strict value-add where we could say, hey, you really need this product? And something I was thinking about as we were walking through this, too: now that we have this template built, maybe we should go back to the mind map that we did and start...
B: ...linking the things that we built in our template and see how many things we've created out of our mind map. Because one of the things we talked about is: hey, we want a tool that gives us some audit capabilities, or approval capabilities. I think we've built some of that into our template, but just as a side note, maybe we should step back and see how well we're following what we theorized in the beginning.
A: Okay, to your first argument, or the first note, why should I use Keptn now instead of Ansible: with Ansible, what you get is that you combine all the actions, or all the tasks you want to do as part of your remediation, into one playbook, or runbook.
A: Just imagine how difficult it is to exchange certain components within this runbook: you have to go through the runbook, modify it, and take care of updating all the parts that need to be changed. With Keptn in place, you add a tool on top, that's true, but it abstracts away the tooling underneath.
A: Because, in the end, you can easily exchange the components that are doing the individual tasks of the process.
B: It makes a lot of sense, but just for auto-remediation only?
E: Hey Kevin, this is Andy. I'm sorry that I'm joining late, and maybe this has been discussed before, but I just tuned into the conversation. One question on your auto-remediation scripts in Ansible: you said everything is in Ansible. Does that mean you have built libraries upon libraries, I don't know how many hundreds of thousands of lines of automation, in Ansible? Is that correct?
B: No, we're just starting down this journey, so we only have, I think, three or four playbooks. Each of these steps is kind of broken out into playbooks. And what we did, to sidestep Ansible Tower licensing, is actually run it through GitLab: we have Dynatrace call a GitLab pipeline via API, and that kicks off Ansible runners out of GitLab, so that the node licensing is actually just for that runner...
B: ...out of GitLab, instead of for each of the endpoints. That wasn't my choice, and I don't know if we're going to stick with it or not, but supposedly the licenses were going to cost us millions of dollars, so we came up with this workaround. So the GitLab pipeline has a few steps in it that do these different steps. Okay, so.
C: The main part, and I also mentioned this last time.
C: But if you want to get started, or if you want to build new integrations or new runbooks: if you have everything in Ansible, then you might end up writing every new playbook in Ansible too, but maybe there is another way to do it. Maybe for some action you want to execute, a webhook is fine and you just call it; or maybe the best action sometimes is just executing a bash script.
C: Sometimes it's toggling a feature flag. For all these kinds of smaller actions, with Keptn I guess it's quite easy to have these actions in whatever language fits best, and Keptn will call them, instead of having everything in large playbooks with conditions for whether they get called or not.
B: Yeah, I think the other thing to consider: one of the things we do with Ansible is update the ServiceNow incident that we created, basically updating it the same way Keptn is updating. So we say: hey, we just did our data gathering and we discovered the PID, and here it is, so that somebody can go look at the incident and see what the remediation process did. So as a developer of this: okay, I know Ansible.
B: I know how to connect to ServiceNow; in fact, I just copied somebody else's playbook that already did this function. If I'm using Keptn now, that's a different service, so now I have to learn: all right, if I wanted to use the ServiceNow service from Keptn, I have to understand how that one works.
B: What do I need to plug in? So you're kind of multiplying the knowledge that you need for the different services that connect to different systems, versus doing it with one tool like Ansible. Not saying it's bad or good, just an observation for discussion.
D: Yeah, I posted a link to a blog post where we describe this concept of micro-operations, where I outline a number of reasons why these traditional operations workflows usually do not work. And I think Johannes should make this into the demo, because one key thing usually is that you have a remediation file that you ship along with the release.
D: This is usually a big issue, because otherwise it's a multi-system update. So you can ship a new version of your application together with your remediation instructions. And that's a point about the demo: you currently have the entire instructions encoded in the sequence and not in the remediation file. If you, for example, wanted to change some of your procedures for the next release, so that it's not, say, a JVM restart but something different; for an out-of-memory it can obviously be something different.
D: Depending on what the remediation action is, you can ship it along with the code artifacts; that's one thing. Also, for microservice applications, these scripts tend to get super complicated. I think that was a good question from Andy: how many of those scripts do you have? If it's two or three, it's fine, but what if there's an individual remediation action for each service you're running, and you're running hundreds of interdependent services?
D: Writing these end-to-end remediation flows at some point gets super hard. Another point: obviously you have those scripts, and by the way, the way you could also run them in Keptn is that you simply have a service that does the Ansible run as well, because we could run the Ansible script directly.
D: So right now the tool integrations and the actual process that you're running are in the same file. We talk to a lot of companies, especially in restricted industries, that want to separate this out, and for them pipelines are a massive problem, because policies on how certain issues should be handled are directly intertwined with the rest of the workflows, and it often is a massive issue to audit whether those processes still work the way they were supposed to. But from the pure "can you do it" perspective, I totally agree.
D: If the only thing I really want to do is restart a JVM, then just calling an Ansible script directly is definitely the simplest solution, if you wanted to do it.
E: To what you were saying, I think this is the important piece, right? Ansible is what I call a do-it-yourself Swiss Army knife: you can do everything. But the question is, if you now start out with three teams and they're building an Ansible script, and you already mentioned...
E: ...if the next team comes on board, they probably copy and paste something from another team and just add a little piece to it. And maybe what they copy and paste includes the integration with ServiceNow, the call to ServiceNow, which means all of a sudden you have this piece of tool integration as part of 3, 4, 5, 10, 20 different Ansible scripts. In the end this is all technical debt that you're building up, because what if ServiceNow changes the API?
E: What if you don't want to do a simple status update in ServiceNow but something else, and you need to go through all of your different scripts and change them? The reason is that you have everything intertwined in one Ansible script, all the logic. And I think this is what we're trying to solve here, right? We're trying to avoid 80 to 90 percent of the automation code that you have to write...
E: ...because we are providing this integration in an event-driven way. We provide these tool integrations so that whoever defines your auto-remediation process can say: this is the process, and here is one step where our developers can specify what should happen to remediate a specific issue in their application. Oh yeah, they have an Ansible script that restarts it, but the rest can nicely and independently be defined by your, I don't know, SRE teams or whoever is taking care of it.
B: Yeah, I think it's maybe how we were approaching it, considering the remediation process more of a global project instead of having it packaged within each application, doing your monitoring-as-code and keeping it with the app. We were more or less doing it as a global type thing: any JVM memory problem that existed in Dynatrace would all hit the same remediation process, no matter what the app was.
B: And I think the other thing that compounds the confusion, maybe, is that when we started this working group, we didn't want to isolate it to cloud-native services. We said there are companies that are still dealing with monoliths, including ours, and of course those are the ones that our leadership wants to have solved immediately: hey...
B: ...why can't we auto-remediate these monoliths that we have, so we can use the extra help to start working towards microservices and cloud-native? So by starting with those monoliths, I think we're starting with some more antiquated ideas and a different design than what this was intended for, and maybe that just complicates it.
D: So you want to move away from a wiki page, or a traditional human-centered runbook, to an automated process, and everything you move in that direction obviously is great, because it helps you automate and replay this process, and, as you also mentioned, it can really be tested automatically as well. The next question is how many tools you still have to touch to implement it, and I think that's where you can gradually find the right balance for what you want to use.
B: Yeah, I mean, I personally want to use Keptn. I tried to get it instituted for our POC and kind of got outvoted by others that wanted to keep it simpler.
B: We want to start doing the quality gates and everything, and if you've already got it there for that, why not continue down the path with auto-remediation. But I think the one thing that kind of stopped us from going down the Keptn track was the fact that Dynatrace was going to build it in. So we were caught between: okay, we've got it stood up, but it's open source, and they're getting ready to build it directly into the Dynatrace product.
D: But I have one conceptual question here about running your Ansible script. I mean, nothing against Ansible scripts; my first ever demo was running an Ansible Tower instance and an Ansible script based on a Dynatrace problem.
B: Yeah, so at the end of the Ansible playbook it goes through a loop and checks: it hits the Dynatrace API and says, hey, has this problem closed or not?
B: Yeah, so the very first step is the data gathering, because the only thing that comes across is the problem ID. Then we go and get what entity was affected, so that we can find the host to go to, and the command-line arguments, so that we find the right JVM in case there are multiples.
B: So we gather all that, and then we do the kill and validate the kill, because this is just a very simple Java app run with a java -jar command line. After it's running again, it goes back and checks, I think it pokes it maybe every 10 or 30 seconds, to see if the problem has closed, and once it's closed, it'll update ServiceNow.
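That validation loop, polling the monitoring tool's problem API until the triggering problem reports closed and then updating the ticket, can be sketched tool-agnostically. The status fetcher is injected here because the real call (Dynatrace's problem-status endpoint, auth token, and status values) is environment-specific; the interval is just the one Kevin mentioned, and the timeout is an assumed default:

```python
import time

def wait_until_closed(fetch_status, poll_seconds=30, timeout_seconds=600):
    """Poll fetch_status() until it returns "CLOSED" or the timeout elapses.
    fetch_status stands in for a GET against the problem-status API."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        if fetch_status() == "CLOSED":
            return True   # remediation confirmed; next step: update ServiceNow
        if time.monotonic() >= deadline:
            return False  # still open after the timeout: escalate instead
        time.sleep(poll_seconds)

# Example with a fake fetcher that reports CLOSED on the third poll:
states = iter(["OPEN", "OPEN", "CLOSED"])
wait_until_closed(lambda: next(states), poll_seconds=0, timeout_seconds=5)
```

The same shape works whether the loop lives at the end of an Ansible playbook, a GitLab job, or a Keptn evaluation task.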
E: And so I think this is exactly the logic that we were talking about. You can build all of this into the Ansible scripts, but now somebody has to maintain it, and this is the logic that we built into the core automation workflow engine of Keptn, right? Keptn will trigger your remediation, and then the next thing it does is validate back with Dynatrace whether it solved the problem. It's doing it in a similar way: it looks at the problem that initially opened up the remediation workflow.
B: Our own workflow documentation too, because after we did this, we discovered: well, okay, Ansible worked, but we can't see anything. So we had to build in updating ServiceNow, and then I created a business rule in ServiceNow to go update Dynatrace, to show it on the entity itself.
B: We have not yet. We did the Ansible creation and a POC to our CTO, I guess about a month ago, and then we went through and created that template where we said: okay, beyond the very simple case, we now need to find startup scripts and shutdown scripts, and, you know, capture logs or config files, or, for IBM, run...
A: All right, yep. We were actually thinking about the challenge of collecting all this data, and then: what do you actually want to do with this data? Because when you already know the problem and the root cause of what is happening, and you have the mapping between the root cause and the action, why is there a need to collect more data up front?
B: Oh sure, that's because we're a bank and we deal with vendors all the time, and IBM is notorious for it. If you have never had to open a support ticket with them: they are basically going to tell you, "without running these diagnostic scripts there's not much we can do for you." You can have Dynatrace and you can say, "but I see it in Dynatrace, here's your problem." Well, they don't rely on that.
B: "We need to look at our own data, our own logs." So they'll have you run heap dumps, or they'll tell you, "well, the next time it happens..." Maybe it's a delay tactic, I don't know, but you often have to gather those things for the vendors, and different vendors want different things. So it's maybe less for ourselves and more for the vendors.
D: But right now you actually made a good argument for a more modular approach, because you say that depending on which application, which type of JVM, needs to be restarted, this process is going to be different. That means you will need different versions of different steps in there. Obviously, the actual restart...
D: ...might be the same, I mean, you know whether the JVM is up or not, most likely, but the data gathering might be different in this case. In some cases even the restart scripts might be different, depending on which application you're restarting.
D: So while the overall process flow always stays the same, gather data, then trigger the remediation, check whether it worked, or escalate, what these tasks actually do might at some point turn out to be different. Which means you need a library where, depending on which process you're running, you're more or less compiling the individual steps on the fly.
B: Again, we're kind of still in the infancy, learning as we go and discovering: hey, this might be good, this might not be good. So it's ever-evolving for us, but yeah, I like the modular approach to this. It was just...
B: ...we were keeping it simple, and now that we're expanding it, as you said, I definitely see the need to have it modularized like this, for the different systems and the different scenarios that are going to come into play.
B: So we came up with an idea of tagging the services based on whether they were ready for this type of service, but we hadn't gone down that path too far, because we didn't want just any JVM restarting; we had to have that information about it first. Startup, shutdown, each one potentially could be unique. I think we need to do a canvass; let's say: okay, WebSphere.
B: Do we have all of our startup and shutdown scripts in a templatized location? Hopefully we do. Probably, as we canvass, we'll find that there are all kinds of edge cases, and if we do it globally, we would have to account for that. So I think, as we go down this journey, that's also going to be a good reason for us to start pushing back to each support team to model their systems in a fashion that can be automated.
A: All right, these are really great discussions that are going on, but let me just jump back to the agenda for today.
A: We do have a couple of minutes left, and what I want to touch on today is how we continue with our scenario and our template for getting the JVM memory exhaustion problem remediated.
A: First, the sample app and the setup to simulate the JVM problem.
A: Should we go with the carts app to get this simulation done, or should we give it a try and test it out with the pod-tato-head implementation?
C: I think the quick win is to do it in the carts app, since the PR in pod-tato-head is not yet merged. But for the long run, it just depends on when we want to have the sample app ready. For the long run I would go for pod-tato-head, as it's a more mature app and maybe a little bit more...
C: ...it also has some dependencies that we can then take into account. But if we need it really fast, then the carts app is very easy, because we already have the build process, we have everything; we just have to put in the loop that Kevin can provide.
B: For you and your team? There's a chance, yeah. I can ask; I can see what we can do generically that won't get us into hot water.
A: Actually, let's go back to the first item: is someone up for implementing this change into the carts service, I mean the JVM problem?
A: But let me just... I will take care of getting that one implemented and done.
A: The process orchestration via Keptn we discussed today, and I already showed you how it could look. I would propose to go this route and use Keptn for doing our remediation scenario. And the data gathering part: this is the first action where it's maybe not that obvious or clear what should be done. Can we use the next two weeks to figure out what we have to do there and how this should look?
C: Yeah, maybe I can also comment on this, because, Johannes, we were talking about this today too. I would be really interested in Kevin's example: what are you currently collecting, and where are you storing it? Is it only for archiving purposes, or is it also for analysis purposes and finding the right action? Thinking about it as a modular approach, the data gathering can be done in different parts.
C: There can maybe be an analysis part to find out what might be the right action. But what would be interesting is which data needs to be collected, and where the data is currently stored. Is it already in some kind of, I don't know, Prometheus or Dynatrace, or is there any other data hub that holds some of the logs, or do you have to go directly to the box and fetch more data there? So where is this located right now?
B: Yeah, in our case we were kind of just walking through what a support person would do. They get called and are told: hey, this JVM is out of memory. One of the typical things they'll do for the vendor is capture a heap dump so that the vendor can review it. Let's say Dynatrace, which would have the memory dump for you, wasn't the tool of choice that discovered this; or hey...
B: ...the ActiveGate wasn't configured for the memory dump analysis, somebody messed up and the data's not there. Again, you want to make sure there are no oopsies, so go ahead and kick off that JVM memory dump and capture it so that it can be analyzed later. Some vendors obviously have scripts that they like to run that go and collect different pieces of data: OS data, JVM data, it runs the gamut. So again, I think it's a plug-and-play of options depending on your app.
B: I would say yes, it's stored somewhere for someone to deal with, yeah.
D
Really
want
to
do
is
execute
a
generically
like
one
action,
it's
always
the
same,
but
we
are
ignoring
that.
The
developer
might
specify
that
for
their
service,
they
can
restore
it
or
cannot
restart
it,
and
we
could
show
the
different
behavior
where
we
say
well,
this
service
can
be
restarted
or
the
service
cannot
be
restarted
and
then
have
the
process
behave
differently
and
give
the
developer
control
what
is
allowed
on
their
service.
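A per-service declaration like this is what Keptn's remediation file is for. The sketch below is illustrative only: the `problemType` string, metadata name, and action name are assumptions made up for this JVM scenario, and the actual handling depends on which action provider is subscribed in the project.

```yaml
# remediation.yaml - per-service remediation config (Keptn 0.8.x spec format).
# Omitting the restart action entirely is how a developer would declare
# "this service must not be restarted".
apiVersion: spec.keptn.sh/0.1.4
kind: Remediation
metadata:
  name: carts-remediation
spec:
  remediations:
    - problemType: JVM memory exhaustion
      actionsOnOpen:
        - action: restart
          name: restart-jvm
          description: Restart the affected JVM after a heap dump was captured
```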
D
I know it's not exactly this example, but I think the example is to some extent so straightforward that you don't need any powerful solution such as Ansible to run it. But Kevin obviously also brought up the case that it's not like that for all of their JVMs right now. It was also what he brought up around tagging: you also have the possibility that a certain service, or a certain service version, is not allowed to be restarted, because it maybe doesn't support it.
B
Yeah, I wonder if that ties into step two, the change process there. If you have this documented in your CMDB, it might say: this can be restarted, but maybe only during the maintenance window, so you can't do it right now; or: yes, it can be restarted, but you need to get an approval first, whether that be a Slack bot approval or some other process. I don't know, maybe that ties into that change process of allowing or not.
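The CMDB-driven gate just described can be sketched as a single decision function. Everything here is hypothetical: the `ChangePolicy` record and its fields stand in for whatever the CMDB actually stores, and the approval flag stands in for a Slack bot or ServiceNow check. It assumes a maintenance window that does not cross midnight.

```python
from dataclasses import dataclass
from datetime import time, datetime

@dataclass
class ChangePolicy:
    """Hypothetical per-service change record, e.g. pulled from a CMDB."""
    restart_allowed: bool
    needs_approval: bool
    window_start: time  # maintenance window, local time
    window_end: time

def may_restart(policy: ChangePolicy, now: datetime, approved: bool = False) -> bool:
    """Gate a restart on the change process: the service must allow restarts,
    'now' must fall inside the maintenance window, and an approval (e.g. via
    a Slack bot) must exist if the policy requires one."""
    if not policy.restart_allowed:
        return False
    # Assumes window_start <= window_end (window does not span midnight).
    if not (policy.window_start <= now.time() <= policy.window_end):
        return False
    return approved or not policy.needs_approval
```

In a real setup, the remediation orchestrator would call a gate like this before step two fires, rather than encoding the rules in each runbook.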
A
I mean, you mentioned last time that you are using ServiceNow for checking the approvals and making sure that the person who is allowed to do the action gets asked. Is this still the case, that you're using ServiceNow for that?
A
Yeah
till
till
the
next
meeting
I
mean
I
will
take
care
of
the
cards
app
and
then
kevin
you're
about
the
ansible
script.
And
what
else
should
we
tackle
on
to
get
the
scenario
running
for
us.
C
Yeah, I think it's important to define what the parts of the whole process are that have to be there to go beyond this hello-world example. I think we agreed that there has to be a validation part, because that's one of the parts we identified early on. Data gathering, I think, is also an important part that we already identified, as is the approval process.
C
In
my
perspective,
it
makes
a
lot
of
sense
to
think
about
how
our
approval
is
currently
done,
and
I
believe
they
are
not
right.
Now
they
are
not
done
inside
character,
so
they
are
done
somewhere
else,
so
it
has
to
be
kind
of.
How
can
this?
How
can
we
process
this,
and
how
can
we
tools.
C
How
can
we
still
use
like
captain
as
the
orchestrator
in
which
tools
are
currently
doing
the
process,
and
I
think
we
also
have
to
take
a
look
here
on
tools
like
page
of
duty
or
other
tools
that
are
like
incident
response
tools,
how
they
are
currently
doing
this
and
what
is
their
strengths?
And
what
is
the
process,
what
they
are
lacking,
and
what
can
we
bring
to
the
table.
D
What I'm kind of seeing here: I think it would be useful to still model the entire process end-to-end, just to draw it, because in the last hour alone we talked about at least five or six different tools to run a single process. It would be good to lay out all the tool interdependencies we have here right now, and who's controlling which parts of those steps. So far, I think, we had Dynatrace as the monitoring tool in this case, and we had GitLab for execution and modeling.
A
That's a very good point. We already have identified those tools in our working document on the template; we have added a couple of notes here. But it would make perfect sense to now bring the picture together and make it clear to everyone how the tools are connected and how everything then works together.
D
I'm always a bit hesitant about the change process.

D
It would even be interesting on the service level whether it can be done automatically or not automatically, if I know it ahead of time. I'm just thinking that right now the restart procedure, the remediation procedure, is no different, but it's happening at 2 a.m. in the morning or at 10 a.m., depending on how fast people can reply. Obviously you have a 24/7 NOC, so somebody will eventually look at it.
A
All right then, I would summarize the action items as follows. As I said, I will take care of the app and Kevin of the Ansible script, but all of us should continue modeling the process and the tool integrations. Therefore, let's continue in our working document, where the proof of concept is already laid out, and add all the tools that are involved and the responsibilities they have within this process.