From YouTube: Keptn Auto-remediation Working Group - April 21st, 2021
Description
Meeting notes: https://docs.google.com/document/d/1_WlLP6oLcHe0yyC7kXH2hB3i9bOPvIArp83NohE78FU/edit#heading=h.wqdeglxri66j
Learn more: https://keptn.sh
Get started with tutorials: https://tutorials.keptn.sh
Join us in Slack: https://slack.keptn.sh
Star us on Github: https://github.com/keptn/keptn
Follow us on Twitter: https://twitter.com/keptnProject
A
Cool, all right, let's get started again. Welcome to the next round of the auto-remediation working group. Last time we agreed to work on a template for an auto-remediation scenario: remediating a Java app that is running into memory exhaustion. We therefore created the document, and some progress has already been made on it. I propose that we go over the document, see what we have, and decide how to proceed.
A
This should be the first part of the meeting. Then I would break the scenario down into more concrete items and artifacts, so to say, and we should come up with a way to work on these artifacts, for example the app. We should also think about the orchestration of the scenario and the different components that are involved: data gathering, the change process, recovery, and validation.
A
But that's for later. Let's first focus on what we have in our document right now. Let me just quickly close that one for a moment.
A
Okay. As we said, we want to work on a template, and I added a very short abstract here on what this working document is about: it's for defining a proof of concept from which we can derive our template of how remediating problems should work. Now I have to ask: Kevin, was it you who started working on this document?
A
Cool, thanks. I didn't change the structure; I did not add, change, or remove anything from how it was. I just added a couple of things and a couple of comments. Can I ask you to go over the document now and explain what you added first?
B
Sure. Yep, all right. Using this as an example came about because we at our organization, Truist, were going through this exact scenario as our first try at auto-remediation. So I thought it would be good to bring this group into the fold, working on that same problem.
B
In our working group we went through the process together and came up with some of these steps. Let's start with the assumptions. Obviously it's going to run in a non-container environment, because if it were a container we could shortcut a lot of this stuff.
Bullet point number two is a sample application that runs out of memory. One of the people on our team created one. It's a little Spring Boot app with two buttons on it: you press one button and it increments a counter, and you press the second button and it calls a loop that runs it out of memory immediately. We have Dynatrace on the server where that was running, and it did open a problem. Then we kicked off auto-remediation through some Ansible by way of GitLab; we can talk about that later. So we do have a sample app and a way to have Dynatrace detect it.
A
Awesome, great. I added this bullet point in the sense that we need to think about an app to take as a sample. If you already have one, that's really great.
B
Yep, I can talk to the author of that. It's so simple, and there's no private data or anything in it, so we should be able to share it. I'll just get his okay, then I can share it with the group.
A
Great.
B
Then we moved on to: when this happens in the real world, in an enterprise-type environment, what process do people go through to resolve it manually? The first step is data gathering. A lot of times these are supported by another company, say WebSphere and IBM, and any time you have a chronic issue they're going to ask: did you capture all the logs? Did you run this script? Did you take any heap dumps for us to analyze? So there's a process that the vendor, as well as the support people, will usually require. We've got the logs, heap dumps, and config files, and we could continue adding to that.
B
Of the JVM process, yeah. If it's tightly controlled, you could look at your source control and say, this is the last thing pushed to it, but did it get changed afterwards? I think it's always good to capture the running config before you restart something, just to validate that it was running with the configuration you thought it was running with. Step number two is the change process.
B
In our environment you really can't touch anything without documenting it in a change, and in our case that is through ServiceNow. There are often debates on whether something as simple as a restart could just be an incident record or has to be a full-fledged change record. Regardless, if we want to keep this templatized and general, let's say it's more than a restart; let's say we're doing something complex.
Somehow we need to be able to tie into wherever this change is recorded and have it match up with the CIs that we're changing. In this case we're restarting that JVM, so what's the service we're affecting? We need to be able to document that. Along with that, there are the approvals. Again, company by company you might have different processes, but you might have to look up the support groups to see: does someone have to approve this? If so, who are they? How are you going to notify them? Is there a maintenance window right now? Could you do it without an approval if it's in a maintenance window? So that opens up a few different ideas.
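The maintenance-window question raised above can be sketched as a small decision function. This is only an illustration: the policy (approval may be skipped only inside a window), the MaintenanceWindow type, and the service name are assumptions, not something the group agreed on, and a real setup would read the windows from the change-management tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class MaintenanceWindow:
    """A hypothetical maintenance window for one service."""
    service: str
    start: datetime
    end: datetime


def in_maintenance_window(service, now, windows):
    """Return True if `now` falls inside any window defined for `service`."""
    return any(
        w.service == service and w.start <= now < w.end
        for w in windows
    )


def approval_required(service, now, windows):
    """Assumed policy: approval can be skipped only inside a maintenance window."""
    return not in_maintenance_window(service, now, windows)


# Illustrative data only.
windows = [
    MaintenanceWindow("sample-app",
                      datetime(2021, 4, 21, 22, 0),
                      datetime(2021, 4, 22, 2, 0)),
]
print(approval_required("sample-app", datetime(2021, 4, 21, 23, 30), windows))  # False
print(approval_required("sample-app", datetime(2021, 4, 21, 12, 0), windows))   # True
```

Keeping the policy in one function means an organization that always requires approval can replace approval_required without touching the rest of the flow.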
B
Then step three. We now have the okay to do a remediation, so we need to stop the JVM and do any cleanup. Oftentimes, if it's running out of memory, the JVM might dump on its own and fill up some file systems, which may prevent you from starting it back up because the file systems are full. So there could be some cleanup work that you should check for, and then we would start the JVM back up.
I also threw in the idea that we could potentially make changes, especially if it's a chronic issue. It doesn't help to keep restarting a JVM that just keeps running out of memory. Maybe we need to increase the heap size, change the garbage collection, or tweak it in some other way.
A
Before doing a restart, yes. And how is the recovery phase implemented? Is it an Ansible Tower script, or how is it done right now?
B
Sure. In the scenario that we demoed for our CTO, we had Dynatrace open the problem. We were going to use Ansible Tower, but Ansible Tower can be pricey per endpoint. For our demo we could have used Ansible Tower directly, but we wanted to come up with a better solution for what we would use in the end, so it was decided that we would go through GitLab.
B
We looked at the command line of the Java process to see which Java process was running on the host, so that the playbook knew which one to restart. Then we had a kill, because this is just a standalone JVM: there are no shutdown or startup scripts, really, it's just a java -jar command. So we just killed it and then started it back up with java -jar.
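Identifying the right JVM from its command line, as described above, can be sketched as follows. The team did this in an Ansible playbook; this Python version is only an illustration, and the jar name, paths, and sample process list are made up. It assumes input shaped like the output of ps -eo pid,args.

```python
def find_java_pids(ps_output, jar_name):
    """Return PIDs of processes whose command line runs `java` with `jar_name`.

    `ps_output` is expected to look like the output of `ps -eo pid,args`:
    one process per line, PID first, then the full command line.
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skips the header line and blank lines
        pid, args = int(parts[0]), parts[1]
        tokens = args.split()
        is_java = tokens[0].endswith("java")
        if is_java and any(t.endswith(jar_name) for t in tokens[1:]):
            pids.append(pid)
    return pids


# Made-up process list for illustration.
SAMPLE = """\
  PID COMMAND
  812 /usr/sbin/sshd -D
 4711 /usr/bin/java -Xmx256m -jar /opt/demo/sample-app.jar
 4802 java -jar /opt/other/other-app.jar
"""
print(find_java_pids(SAMPLE, "sample-app.jar"))  # [4711]
```

With the PID known, a standalone JVM like the one described could be stopped with os.kill and started again with subprocess.Popen running java -jar; a more complex service would call its own shutdown and startup scripts instead.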
A
All right, cool, great. I had no time to write it down, so I'll have to take a couple of notes from here. It's really great that you already have this playbook in place that does the job of remediating your problem.
A
I think we should discuss later which artifacts we need, or should work on, to get this scenario running. Perfect. And then the last part of the scenario is the validation itself. Is this also automated right now, or is it just an idea?
B
Yes, this was automated too, in the Ansible. After it starts the JVM, the Ansible playbook validates that the JVM is running. We did not do step b: because it's such a simple app, we didn't actually generate any traffic against it, so that one's not done. We did not put in c either, checking for any defunct processes, but that should be simple enough as well. Then it hits the Dynatrace API and looks to see that the problem goes into the closed status.
B
Yeah, either or. We did add a few things to this and I didn't come back and update the document, but we do have another step, escalation, which plays into this. You have to check that the remediation did what it was supposed to do and, if not, move on to an escalation step.
A
Okay, okay. No, this one is not part of the escalation. No problem originally, correct?
B
Yeah, that would be what would cause you to escalate: if whatever originated it does not accept that the problem is fixed.
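The validate-then-escalate logic described above can be sketched like this. The three callables are placeholders for real integrations (a process check on the host, a query against the monitoring tool, a paging or ticketing call); no specific Dynatrace or ServiceNow API is assumed here, and the retry counts are arbitrary.

```python
import time


def validate_remediation(jvm_is_running, problem_is_closed, escalate,
                         attempts=5, wait_seconds=30.0, sleep=time.sleep):
    """Poll both checks; escalate if they do not pass within `attempts` polls.

    The three callables are hypothetical stand-ins for real integrations.
    """
    for attempt in range(attempts):
        if jvm_is_running() and problem_is_closed():
            return "resolved"
        if attempt < attempts - 1:
            sleep(wait_seconds)  # give the originating tool time to close the problem
    escalate("remediation did not close the originating problem")
    return "escalated"


# Usage with stand-in checks: the problem closes on the third poll.
poll_results = iter([False, False, True])
escalations = []
outcome = validate_remediation(
    jvm_is_running=lambda: True,
    problem_is_closed=lambda: next(poll_results),
    escalate=escalations.append,
    wait_seconds=0,
)
print(outcome, escalations)  # resolved []
```

The point of polling the originating tool, rather than only checking that the process is up, is the one made in the discussion: the remediation counts as done only when whatever opened the problem accepts that it is fixed.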
A
Fair enough. Good, all right. Very cool that there are already parts we can potentially reuse when we want to get this up and running. You also worked on the artifacts that are required. I added a couple of notes here, which I can explain when we go through them. Kevin, can you start with the first ones, as you defined them?
B
Yeah. Under data gathering, we need some location information. First, where does the JVM live? On which host? So that we can jump into that host and run our remediation.
B
Yeah, we don't want to accidentally start something as root and then have some other thing not have access because the process is running as root, or have a security issue of running as root when it should not. Your Ansible playbooks might jump in as a different user than the one that JVM process needs to run as.
A
Okay, got it. All right, the next one: capture a JVM heap dump. I did some research on that one and found different ways to gather information about the Java heap dump.
A
Yeah, here I referenced a blog post or article. Mainly, these two options are quite convenient. You can use the native Java option, heap dump on out-of-memory error, which generates the files for us, and then there is also the tool called jmap, which also allows us to generate the heap dump and get more information about the problem. It's very similar, just another way of gathering more information. What I found interesting is a blog post from Netflix.
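The two JVM-side options mentioned above can be sketched as command lines. The speaker does not spell out the exact flags; the sketch assumes he means the HotSpot options -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath, and the JDK's jmap tool with its -dump option. The jar path, dump directory, and PID are placeholders.

```python
def java_command_with_heap_dump(jar_path, dump_dir):
    """Start command that makes the JVM write a heap dump when it hits an OutOfMemoryError."""
    return [
        "java",
        "-XX:+HeapDumpOnOutOfMemoryError",
        f"-XX:HeapDumpPath={dump_dir}",
        "-jar", jar_path,
    ]


def jmap_dump_command(pid, dump_file):
    """Command that asks an already running JVM for a heap dump via jmap."""
    return ["jmap", f"-dump:live,format=b,file={dump_file}", str(pid)]


# Placeholder paths and PID, for illustration only.
print(" ".join(java_command_with_heap_dump("/opt/demo/sample-app.jar", "/var/dumps")))
print(" ".join(jmap_dump_command(4711, "/var/dumps/sample-app.hprof")))
```

The first variant has to be in place before the failure happens, because it is part of the start command; the second can be run on demand against a JVM that is still alive, which fits a data gathering step that runs before the restart.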
A
They described the same problem they were facing. What they did is use Linux to create a core dump, because they kill the JVM by sending a SIGABRT instead of a SIGKILL. They then had to look into the core dump and not the heap dump, because you would not get a heap dump when you send a SIGABRT. It was just a problem that they faced, and they fixed it already.
They went the way of using the Linux mechanism for getting more information about the memory usage. Interestingly, they also have a little script that uploads this archive into an S3 bucket, which is then used for investigating the problem and identifying the root cause. Maybe just some fruitful thoughts as we continue working on that issue. All right. Then, regarding the change process, what do you mean by that?
B
Yeah, this was the change documentation location. As I mentioned earlier, in our case it's ServiceNow. So where exactly do you have to document that? Is it a change? Is it an incident?
Then, is a change approval needed, and has it been acquired? In that case, who's the support owner for that service, and if they need to give approval, how are you detecting that? Are we going to tie into a ChatOps type of thing where you notify them, "Hey, a remediation is sitting here: approve or not approve?", they answer yes or no, and then it moves on? Or is it a ServiceNow approval that they would have to hit?
B
And then the recovery or remediation. In our case it was very simple. We were just doing a kill, but in more complex scenarios there should hopefully be a JVM shutdown script that will shut it down more gracefully. Then, are there any cleanup scripts that could be called that already have some predefined information to do the cleanup you would like, pointing them towards log files or anything else you're looking to clean up before restarting? And then we have to know where the startup script is, to be able to start it back up.
A
What I also figured out, let me just open that one for a second, is that Java already provides a feature where you can define a script like that one: for an out-of-memory error, you give it the path to a script, and that script kicks in when out-of-memory is detected. I think it's supported with Java 8 as well and would already help, but this would then be automated by Java itself and not by a remediation process.
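The feature described above is presumably the HotSpot option -XX:OnOutOfMemoryError, which runs a user-defined command when the JVM first throws an OutOfMemoryError; the transcript does not name the flag, so that mapping is an assumption. A minimal sketch of a start command using it, with placeholder paths:

```python
def java_command_with_oom_hook(jar_path, hook_script):
    """Start command that runs `hook_script` when the JVM first throws OutOfMemoryError."""
    return [
        "java",
        f"-XX:OnOutOfMemoryError={hook_script}",
        "-jar", jar_path,
    ]


# Placeholder paths, for illustration only.
print(" ".join(java_command_with_oom_hook("/opt/demo/sample-app.jar", "/opt/demo/on-oom.sh")))
```

As noted in the discussion, a hook like this runs inside the JVM's own lifecycle, so it bypasses any change recording, approval, or validation that an external remediation process would add.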
A
Of course, no problem. All right, and then finally the validation step. Here I added that we could use service level objectives and finally check whether they are fulfilled or not.
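Checking whether service level objectives are fulfilled can be sketched as comparing measured indicator values against thresholds. This is a simplified illustration, not Keptn's SLO file format or its evaluation logic; the Objective type, the indicator names, and the thresholds are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    """One objective: the measured indicator must stay at or below `max_value`."""
    indicator: str
    max_value: float


def evaluate(objectives, measurements):
    """Return (passed, failures) for the measured indicator values.

    A missing measurement counts as a failure, so the validation never
    passes silently because data was unavailable.
    """
    failures = []
    for obj in objectives:
        value = measurements.get(obj.indicator)
        if value is None or value > obj.max_value:
            failures.append(obj.indicator)
    return (not failures, failures)


# Hypothetical objectives for the restarted JVM.
objectives = [
    Objective("heap_used_percent", 80.0),
    Objective("response_time_p95_ms", 500.0),
]
print(evaluate(objectives, {"heap_used_percent": 42.0, "response_time_p95_ms": 310.0}))
print(evaluate(objectives, {"heap_used_percent": 93.0}))
```

The first call passes with no failures; the second fails on both indicators, one because it exceeds its threshold and one because no measurement was supplied.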
B
In the case of Dynatrace, validating your problem closure, or, if you had a tie-in to ServiceNow, validating that the incident was resolved.
A
All right, very cool what we already have. What I then added to the document has the title "proof of concept". I mean, we now have a very detailed explanation of what a remediation process for fixing a JVM problem could look like, and I think that's already very good.
We know the artifacts that we should look at and consider. Now I would bring that into a PoC, a running prototype, so to say. From a Keptn point of view, I would propose using Keptn for the orchestration of this remediation process, where we have the different tasks of this sequence or scenario.
B
I'm trying to think of the different services in Keptn that may help us do that. In our scenario, as I mentioned, we pretty much did it all with...
B
I don't know. I mean, are we thinking just the generic service offered by Keptn? I know there's not a whole lot of services. I don't think so, no.
C
If I can just add my thoughts here. What I've seen in this document is that the data gathering part is very Java-specific. The important part for me is, once we want to move from a Java-specific use case to a more generic one: how do we get this data? If it's a Java process, in my opinion it means you have to SSH into the box, find the Java process, and find the dump.
Basically, you have to go to this instance and retrieve the data, and you have to know very specifically where to find it. Maybe the name of the process is important; the file path is important. I assume there won't be a service sitting in the container or in the process that pushes the data out if there is a crash, so we need to know exactly where to look. Although this can be done for this specific use case, my thought is more: how can we make this generic, so that it works for, let's say, most Java processes in most organizations? SSH-ing into a box is probably not something that will be accepted in a lot of organizations; maybe there are restrictions.
So the data gathering part alone is already quite tricky, in my opinion. How do we define where we get this data, and whether it's a pull or a push approach? And if we have to pull this data, if we have to go there and find it, how can we do it? I think that's not an...
A
Yeah, that's totally true. In this scenario it would be, as you said, going to the server and then working locally on the machine.
C
Exactly, and maybe this is true for a lot of different process crashes: if you need the data, you have to go there. Maybe you can also find the data in the monitoring. Maybe with Dynatrace you can take a look there. Maybe you have something like Prometheus or Jaeger where you can take a look at traces, logs, whatever. But this is already a huge variety of options for finding this data, and I think we have to find a way that is a little bit generic, because if we assume we can just go there and get it first-hand, directly from where it happened, I think that will be really difficult.
C
...a way of basically fetching this data: we have to assume that this data is also available in Dynatrace, or that it is available somewhere in an accessible data store or log store or whatever. But Kevin, you are the expert here and you can share the details. This was just my thought: it already sounds like a very big challenge to actually get this data.
B
I like your idea of getting that information from somewhere else, and in most cases that's probably true. When we're talking about config files or log files, I'm sure we could do that. The one thing we probably can't get around as much might be special diagnostics that a vendor requires, IBM for example. I don't know if you guys are familiar with their MustGather script.
You can google it and it'll take you to their documentation. They have a MustGather script that you've got to run, and it does all kinds of things: it grabs OS information and process information. It's a big script, and when you open a support case with them they're going to say, "But did you run the MustGather?"
Obviously, having been a Dynatrace customer for a long time, the mantra in my head is: let's catch it the first time. Let's not have to wait for another event to be able to react to it. So it'd be nice if we could run those diagnostic scripts the first time we see it, instead of restarting and then going to the vendor and having them say, "You've got to wait for another event and then, when it happens, run the script."
C
Yeah, I just wanted to add that we often think of, let's say, the JVM restart as something that some other party is responsible for doing. Johannes already mentioned that we have this Ansible Tower integration. So it's not Keptn that is doing the restart and going into the machine; Keptn is triggering the runbooks that you already have. It can be Ansible Tower, but it can also be the bash scripts that you already have. Maybe these bash scripts are started from, let's say, a bastion host; they are not living inside the machine that has the issue, but part of their automation is already to connect to the machine that has the problem and do something there.
I was thinking that we could use Keptn in this way, orchestrating the scripts, actions, or remediations that are already available. That would also mean that for the data gathering part we would rely on some other tool that provides us the data. Otherwise we become another MustGather script, and that would put Keptn into the bucket of log analysis or data gathering tools, which I think we are not. But we can utilize these tools and make sure we bring together everything that can be leveraged for the process of automated remediation.
A
Correct, thanks. This is exactly what I wanted to express with this part: we have the orchestration layer, and then the tooling underneath that takes care of executing the jobs. I would try to get an understanding of which tools we could reuse, or how we could do, for example, the data gathering part, the change process, or the recovery part.
B
No, I believe the only thing we did with that was a little cleanup, just looking for dumps and getting rid of those, but that was Ansible as well, and I would probably move forward with Ansible on that too.
A
Would it now make sense to you to continue in a way that we try to provide an Ansible script, integrate with ServiceNow, and build up this end-to-end scenario with the tooling underneath?
B
One other thing: we did not really do a change process per se, but we went in and updated the ServiceNow incident. We tied Dynatrace to the developer instance of ServiceNow and had the Dynatrace app in that ServiceNow instance, so it automatically opened and resolved the incident, and as Ansible was going through, it would update that incident, saying: yep, I found the JVM, I killed the JVM, I restarted the JVM. Then Dynatrace came in and resolved it, and Ansible said: yep, I validated that I see it resolved. So we did a lot of Ansible stuff in there as well. We didn't do an escalation, but we could have, and we would have done it in Ansible. So, long-winded.
A
Yeah, you used Ansible and an Ansible playbook to orchestrate everything, yeah.
A
It's a very good question. When you have a playbook like an Ansible playbook, you always connect to the tools that you know today. But in the future, maybe you want to integrate with another service; maybe you want to exchange ServiceNow for another tool. Then you have to go back to your playbook and adapt everything that is ServiceNow-related. With the concept of Keptn you get more flexibility, and you're ready for future changes, as you can easily plug in the tools as you need them. This is the main benefit you would get with Keptn.
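The separation argued for above, a fixed sequence of tasks with exchangeable tools behind each task, can be sketched in a few lines. This is not Keptn's API or event model; the RemediationSequence class, the task names, and the handler results are all hypothetical and only illustrate the idea that swapping a tool should touch one registration rather than the whole process.

```python
from typing import Callable, Dict, List

Handler = Callable[[dict], dict]


class RemediationSequence:
    """Runs named tasks in order; each task is served by a replaceable handler."""

    def __init__(self, tasks: List[str]):
        self.tasks = tasks
        self.handlers: Dict[str, Handler] = {}

    def register(self, task: str, handler: Handler) -> None:
        """Plug in (or swap) the tool that serves `task`."""
        if task not in self.tasks:
            raise ValueError(f"unknown task: {task}")
        self.handlers[task] = handler

    def run(self, context: dict) -> dict:
        """Call each task's handler in order, passing the accumulated context along."""
        for task in self.tasks:
            handler = self.handlers.get(task)
            if handler is None:
                raise RuntimeError(f"no handler registered for task: {task}")
            context = {**context, **handler(context)}
        return context


# Stand-in handlers; real ones would call a playbook runner, a ticketing tool, etc.
sequence = RemediationSequence(["gather-data", "record-change", "recover", "validate"])
sequence.register("gather-data", lambda ctx: {"heap_dump": "collected"})
sequence.register("record-change", lambda ctx: {"change_record": "tool-a"})
sequence.register("recover", lambda ctx: {"restarted": True})
sequence.register("validate", lambda ctx: {"validated": ctx["restarted"]})
result = sequence.run({"service": "sample-app"})

# Swapping the change-management tool touches one registration only.
sequence.register("record-change", lambda ctx: {"change_record": "tool-b"})
swapped = sequence.run({"service": "sample-app"})
print(result["change_record"], swapped["change_record"])  # tool-a tool-b
```

Whether this extra layer is worth having when every step is already an Ansible task is the open question the discussion turns to next.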
B
And I do understand that concept, but in this specific case we're saying that everything's Ansible.
B
I just wanted to bring it up as food for thought, and have somebody convince me about putting Keptn in the middle. We went through a process of figuring out: okay, we need somebody to orchestrate it, and then we need somebody to do the work. We looked at possibly using ServiceNow, GitLab, Ansible; Keptn was on the list. As we went through each scenario, it was: we don't want to throw a tool in just to orchestrate if it's just adding another failure point. So when we started working through the process, it came down to Ansible being the orchestrator. It seemed to be the simplest option.
A
Okay, I see.
B
I know it's a buzzkill. I just want to bring it up to get discussion around it, and maybe even have you guys expand on it and bring it up with other people to get more people thinking about it. But why put Keptn in the middle if we could potentially go from Dynatrace or Prometheus or something right to Ansible and have it orchestrate, if each of your steps is already Ansible?
B
There may be a better way to do it, too. Maybe we'll say: for data gathering it's actually better to use this tool, for change it's better to use this, and for recovery it's better to use this. This is just our first scenario, but, you know, food for thought.
C
Am I back? Can you hear me?
C
...automation language. You can use Java for building whatever application you want to build, but for some parts there are programming languages that fit the purpose better, although you can do a lot of it with Java as well. I think Ansible is also one of those automation languages, or platforms, where you can do a lot of things. But especially when it comes to validating the quality of the services, in Keptn we already have the quality gates baked in. They're based on service level objectives and service level indicators, so we're using a couple of those best practices from the SRE community, and you can just use it out of the box; you don't have to write it yourself.
Then there are also the tool integrations, for example the Slack or Teams integrations. Whenever you want to switch those tools, it's basically, as Johannes mentioned, just the one tool integration that you change. You don't have to change the process; you don't have to change all the automation scripts. It's just a new approach and a new take on this, but for sure it can be done with Ansible as well. I would even argue it can be done with bash as well, but we just moved on to other concepts that maybe fit the purpose better.
If you already have everything in Ansible, then I think it's not good to just throw it away, but rather to find a way to bring those things together, and Keptn can be one of those orchestrators. Also, when we are building Keptn, it's not only about the remediation part but also a lot about the deployment and delivery part. You can also do this with Ansible, but it's a lot of custom scripting if you want to do that. These are just my thoughts on this.
B
Yeah, I really like the quality gate stuff. I demoed that for some other people in our organization and they were all very impressed and anxious to go down that route too, so I think Keptn will obviously be in our organization, definitely for that use case. I guess if you already have it, it makes sense to use it for auto-remediation. My only thought was: okay, say I don't have Keptn...
C
...everything for his or her organization. This is also one of the problems Keptn looks at: everyone is building an automation, maybe not a platform, but everyone is building CI/CD automation because nothing out there really works for them. Everyone is, let's say, starting with Jenkins but building everything a little bit differently. After all, everyone is building CI/CD automation.
What we are trying with Keptn is to bring this together, not by just bringing another tool to the table, but more: okay, you have now built your custom way of deploying your applications, but you still need a way to validate the application. You can either use your own thing or go for Keptn quality gates, and the deployment itself you can connect to the process that Keptn orchestrates. You can use your own testing scripts. You are not tied to one specific tool or one specific way of doing it.
Basically, you can interact with the Keptn control plane, or the Keptn control plane will interact with your tool. We're really trying to build a platform where you can connect everything you already have. All this glue code, what happens when, when a rollback is needed and triggered, how to do the evaluation: this burden of integrating all the tools with each other is what we wanted to take away.
B
Yeah, you got my brain going here now. I think that is the big difference. We are showing an operational auto-remediation: something, a monolith, is sitting out there and has been running; we're monitoring it, something goes bad, and we want to correct it. The other scenario that you brought up, which makes more sense, is the CI/CD pipeline.
B
We've now done a deployment, something went bad, and we need to auto-remediate that. That makes a ton of sense in Keptn because, as you say, you've got all the deployment information in there. It's process-flow driven, we're going to insert some auto-remediation pieces into that, and it's all going to be documented in one place. So maybe we started with a scenario that's not as Keptn-friendly, per se, but auto-remediation on a CI/CD pipeline makes a lot more sense and has a lot more vision in my head.
A
But the scenario we have in front of us right now is also a use case that can be covered using Keptn. We skip the entire deployment part and start working when a problem occurs, and this can occur in a deployment scenario as well as in normal operation.
A
Okay. I can propose to show this workflow next time: just a skeleton of how Keptn would execute these five steps. Then, in a follow-up meeting, we discuss how to bring in the different components, for example Ansible to do the first step, then ServiceNow to do the second one, and so on. Would this make sense, to make progress here?
A
Okay. And Kevin, can I ask you a favor? As you mentioned, you have the sample app that allows you to simulate the JVM problem.
B
Yes.
A
Can you take care of getting this example?
A
But with tooling.
A
We discussed the tooling, but I think we should continue next time and get more concrete on how we will do certain things like data gathering, the recovery process, and finally the validation. Is that okay for you? Yep, perfect. Then next time: a short demo of the remediation process.
A
Thanks, you too. Have a nice day and see you next time. All right.