Introduction to ARCANNA - Automated Root Cause Analysis Neural Network Assisted – Bogdan Sass (Siscale)
Recorded at the OpenShift Commons AIOps SIG meeting on March 25, 2019.
A
So
hello,
everyone,
my
name,
is
Bob
dances.
I
am
a
principal
solution,
architect
with
Sai,
Stella
and
I'm.
Here,
to
talk
to
you
about
a
solution
that
we
sigl
have
been
developing.
It's
called
Arcana,
it's
a
short
name
for
a
very
long
duration
for
a
very
long
name.
Actually,
automated
the
root
cause
analysis.
Neural
network
existed
before
we
discussed
what
I
cannot
does
I
want
to
tell
you
why
we
started
work
on
this
project
and
just
to
give
you
a
little
bit
of
a
background.
When something breaks, nobody knows where the issue is; maybe hours later, nobody even knows where to get started on fixing that issue. And the problem here was very well pointed out by Marcel earlier. I wanted to use this image, I wanted to talk about searching for a needle in a haystack, but I think Marcel put it much better: it's like a cat-and-mouse game, and the mice are multiplying like crazy. Just a few years ago, you had your physical server and you had your application.
Today you have many more places in which something can go wrong, and identifying the true culprit when something does go wrong is becoming a more and more difficult task. But we also have some very nice, very useful technology that can help us. Since I don't know if everybody here is familiar with Elasticsearch and the Elastic Stack, I will just do a very quick presentation of them.
First of all, Elasticsearch started as a tool for searching through huge amounts of text. It is also a very powerful way of dealing with time-series data, and nowadays we are seeing Elasticsearch being used more and more for monitoring, because it works very well as a kind of NoSQL database. You can just populate it with time-series data, the metrics that you want to collect, and then aggregate, correlate, and work with those metrics. Also, around Elasticsearch we have an entire ecosystem: the Elastic Stack.
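[Editor's sketch] To make the metrics pattern he describes concrete, here is a minimal sketch, my own illustration rather than anything shown in the talk, of indexing a metric document and aggregating over it. It assumes the elasticsearch-py 8.x client and a cluster at localhost:9200; the index and field names are hypothetical.

```python
# Minimal sketch: Elasticsearch as a time-series store for metrics.
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Populate the index with one metric document.
es.index(index="metrics-cpu", document={
    "@timestamp": datetime.now(timezone.utc).isoformat(),
    "host": {"name": "web-01"},
    "system": {"cpu": {"total": {"pct": 0.42}}},
})

# Aggregate: average CPU per minute, the kind of correlation work
# the talk describes doing on collected metrics.
resp = es.search(index="metrics-cpu", size=0, aggs={
    "per_minute": {
        "date_histogram": {"field": "@timestamp", "fixed_interval": "1m"},
        "aggs": {"avg_cpu": {"avg": {"field": "system.cpu.total.pct"}}},
    },
})
for bucket in resp["aggregations"]["per_minute"]["buckets"]:
    print(bucket["key_as_string"], bucket["avg_cpu"]["value"])
```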
If you've ever had to collect information from multiple devices belonging to multiple vendors, you already know this issue. I want to know which user has performed a specific action, and all the actions are logged, but what is the field for the user? Is it user, username, user.name, nginx.access.user_name? It's very difficult to correlate data when the fields that are being used differ between tools and vendors, and this is where Elastic has come up with a very nice idea. It's called the Elastic Common Schema.
It's an open-source specification that defines a common set of document fields for data. Once you apply this Elastic Common Schema, once all your data is indexed in the same way, it becomes easy to correlate data from different data sources. So that's one problem that, I won't say is solved, but is in the process of being solved.
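[Editor's sketch] One way that normalization can be done in practice, again my own illustration rather than part of the talk, is an ingest pipeline that renames vendor-specific fields onto the ECS field user.name. It assumes the elasticsearch-py 8.x client; the source field names are just examples of vendor variation.

```python
# Minimal sketch: normalize vendor user fields to ECS with an ingest pipeline.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.ingest.put_pipeline(
    id="normalize-to-ecs",
    description="Map vendor user fields onto the ECS field user.name",
    processors=[
        {"rename": {"field": "username", "target_field": "user.name",
                    "ignore_missing": True}},
        {"rename": {"field": "nginx.access.user_name",
                    "target_field": "user.name", "ignore_missing": True}},
    ],
)

# Documents indexed through the pipeline end up with the same field,
# so queries and correlations work across data sources.
es.index(index="logs-ecs", pipeline="normalize-to-ecs",
         document={"username": "alice", "message": "login ok"})
```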
The second problem when collecting data is this one. This is an actual demo created by the people at Elastic. It shows a real-life troubleshooting scenario using Elasticsearch: a problem that occurred in an application. It's a basic application with multiple processes, multiple micro-services making up the application, and at some point we get an alert. As usual, the alert doesn't say too much: poor performance on the server. From there we go to the dashboards, and without going into the details, we start to dig. We start to look at what has happened: when did the problem start?
We see that there seems to be a problem with one of the containers running on one of the nodes. We go into that container and we see some spikes in the CPU usage. We go there, and finally we look at the processes. We see that there is a backup process that actually runs at a certain interval, and everything becomes slow while the backup is running.
Sorry, I went very quickly through all of this, but the problem here is that there are many sources of data, many places where something could go wrong, and many times we do not know where to start. We start digging: look at the servers, look at the network, look at the application. In the end we will manage to isolate the problem, but it takes a lot of work, and it takes a lot of time. And the question was: can we do things better? Can we improve the time it takes to identify the actual root cause?
Then we try to identify the probable root cause for those events, and with that we can engage the appropriate team. Once the problem has been solved, the feedback actually goes back into ARCANNA. We tell the system what has happened, whether the determination was correct or not, and the system learns from our feedback.
Again, we have our system: we have Elasticsearch with all the data, and we have ARCANNA, which is basically a plugin for Kibana, the data visualization console for Elasticsearch. Inside, we are adding a TensorFlow machine learning model that actually gets access to all the data. So the machine learning system looks at the data and tries to identify what the root cause might be.
Is that enough? Is the determination correct? We don't know; right now, maybe it is not. But we provide feedback. After the troubleshooting steps have been completed, after the root cause has been positively identified, the user provides feedback for ARCANNA. The user tells the system: yes, you're right, this was the actual root cause; or: no, that was not correct, the actual root cause was something else. And the system learns, and all the data also goes back into the Elastic Stack, into Elasticsearch, and with this information the system continually improves over time.
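[Editor's sketch] As an illustration of that feedback loop, here is a minimal Python sketch. It is my own reconstruction, not ARCANNA's code: the index name, feature layout, and model shape are all assumptions. The point is only that analyst verdicts are stored back into Elasticsearch and the classifier is periodically retrained on them.

```python
# Minimal sketch: supervised feedback loop over Elasticsearch + TensorFlow.
import numpy as np
import tensorflow as tf
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def fetch_labeled_events():
    """Events whose root-cause verdict an analyst has already confirmed."""
    resp = es.search(index="arcanna-feedback", size=1000,
                     query={"exists": {"field": "label"}})
    hits = [h["_source"] for h in resp["hits"]["hits"]]
    X = np.array([h["features"] for h in hits], dtype="float32")
    y = np.array([h["label"] for h in hits], dtype="float32")  # 1 = root cause
    return X, y

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

X, y = fetch_labeled_events()
model.fit(X, y, epochs=10, verbose=0)  # learns from accumulated feedback

def record_feedback(event_id, was_root_cause):
    """The analyst's yes/no verdict goes back into the index."""
    es.update(index="arcanna-feedback", id=event_id,
              doc={"label": 1.0 if was_root_cause else 0.0})
```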
It learns to identify the root cause correctly, and this is what we have now. But what about the future? How could this system be used in the future? I need to specify that we are not there yet, but think about a future in which we can actually take action when we are reasonably confident that the root cause has been correctly identified.
What if we are more than eighty percent sure that the issue was a backup process running on the database server? Can we go in and automate the solution? We believe we can. If we have a certain confidence threshold, and we are above that threshold, we just go in: we have an Ansible script, the script goes to the server and takes corrective action.
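[Editor's sketch] What such confidence-gated remediation could look like, as a minimal sketch: the threshold matches the eighty percent he mentions, while the playbook name, host field, and prediction structure are hypothetical.

```python
# Minimal sketch: run an Ansible playbook only above a confidence threshold.
import subprocess

CONFIDENCE_THRESHOLD = 0.8

def remediate(prediction):
    if prediction["confidence"] < CONFIDENCE_THRESHOLD:
        return  # below threshold: leave it to a human investigator
    subprocess.run(
        ["ansible-playbook", "fix-backup-overload.yml",
         "--limit", prediction["host"]],
        check=True,
    )

remediate({"root_cause": "backup_process", "host": "db-01", "confidence": 0.87})
```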
We might be headed to a point where the problem is solved before the users even notice it. It will not apply to all problems, but if it applies to 50, 60, 70 percent of the problems, it will free up a lot of time and a lot of resources for the people actually doing the investigation. Now, just to show you what the interface looks like: this is the interface. If you have ever worked with Elasticsearch, it will look very familiar, because it is nothing more than another plugin for Kibana.
This is where you define the machine learning jobs. This is where you tell it what fields to take into consideration for the ML job, and, of course, you can also rename some of the fields; if you need to do so, you can rename them from the interface. The ML job starts running, and in the end we get an output like this.
These are the events that were identified, and ARCANNA believes that these three are part of the same set of symptoms: they have the same underlying root cause. We have a web server reporting an internal server error, a 500 error message; we have a SQL server saying that it's unable to write to disk; and we have a server that is out of memory. ARCANNA believes that this out-of-memory condition was the root cause, and that we should investigate this particular server first. Is it correct? Is it not?
We go in, we investigate, we perform our investigation and troubleshooting steps as usual. These are actually toggles: you can switch them between root cause and symptom. In the end, you can go in and tell the system: yes, good job; or: no, that was not correct, try to do better next time. And the system will improve.
Keep in mind that there already is a level of machine learning in the Elastic Stack. Elasticsearch already has unsupervised machine learning that can reduce some of the noise: it can detect anomalies, it can detect when something deviates from normal. We are adding on top of that. We are adding the supervised machine learning component and the automated root cause analysis. So the tools that we have today go up to step 3 that Marcel mentioned earlier.
Now we are adding step 4: automated RCA, automated root cause analysis. And, of course, on top of that you can add plays: you can notify the correct teams, you can add playbooks for automatic remediation if the root cause identification is reasonably confident, and you can always provide feedback.
So that's it for ARCANNA. Of course, if anybody has any questions about the system, I will be glad to answer them. Just please don't ask me too much about the machine learning part: I am not a developer, and a lot of that is magic to me. I would have to ask my colleagues who have actually written the code for that.
Bogdan Sass: But I do have some good news here, and I forgot to tell you about that in the presentation: this technology will be open source. And, as Colleen has said, everything will depend on the size of your network, the complexity of your network, the type of data you're collecting, the type of issues you're encountering, how many of them are repeated, how many of them are new, and so on. But everything, all the code, will be open sourced.
Marcel: That's very, very good news, and I saw that on your other talk. Actually, one of our team members has also prototyped a similar solution, and I already see some room for collaboration there. We also plug into Elastic and we train a model, not a neural network model, but a self-organizing map, to flag anomalies in log files. You're going one step further by actually pinning down some root causes; we're only looking at a stream of log file messages, and we want to detect when something abnormal is in the content of those messages.
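[Editor's sketch] An approach like the one Marcel describes could look roughly like this minimal sketch, assuming the MiniSom library and a hashed bag-of-words featurization; it is my own illustration, not the actual Red Hat prototype, and the sample log lines and threshold are made up.

```python
# Minimal sketch: flag anomalous log messages with a self-organizing map.
import numpy as np
from minisom import MiniSom
from sklearn.feature_extraction.text import HashingVectorizer

logs = ["connection accepted from 10.0.0.5",
        "connection accepted from 10.0.0.7",
        "disk write failed: no space left on device"]

vec = HashingVectorizer(n_features=64, alternate_sign=False)
X = vec.transform(logs).toarray()

som = MiniSom(5, 5, 64, sigma=1.0, learning_rate=0.5, random_seed=42)
som.train_random(X, 500)

# Quantization error: distance from each message to its best-matching
# codebook vector. Messages far from anything the map has learned on
# normal traffic look anomalous.
errors = np.linalg.norm(X - som.quantization(X), axis=1)
threshold = errors.mean() + 2 * errors.std()
for msg, err in zip(logs, errors):
    if err > threshold:
        print("anomaly?", msg)
```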
Ryan: So in 2016 we partnered with another big company, which actually presented at Red Hat Storage Day in Seattle, that wanted to do a petabyte Ceph cluster for an OpenStack cloud, and they found there were three major stability issues with the Ceph cluster that were sort of blocking their project. The first one was that every time a disk failed, or, you know, an OSD failed, the map would change, the CRUSH map, which would cause placement group peering and backfilling, where the cluster would rebalance to heal itself.
But it essentially did the same thing. We could predict disk failures six weeks in advance. And then they drew out all this architecture stuff, but the most important thing is this graph at the bottom right. You can see that there's a normal workload here of around 400 or so IOPS, and then, when they simulated a disk failure by just pulling a disk, they found that the cluster performance dropped below 200, so it dropped around 40 to 50 percent.
And it persisted that way for the whole duration of the test: 800 minutes, around 12 hours or so. Versus with our disk prediction, you can see that, by being able to know a disk is about to fail in advance, we can take pre-emptive measures: we can disable the cluster rebalancing, then remove the disk and replace it within an hour, and the performance goes back up in a fraction of the time. And then the same company tested our prediction engine against 20,000 drives over the course of 90 days, and they found that we had an accuracy rate of 96% and a recall rate of 97%. The recall rate is actually the more important statistic here: it's the number of correctly predicted failed disks over the total number of failed disks. So out of every 100 disks that failed, we would correctly predict 97 of them. And then this just shows that we're already integrated in the Ceph community: we're called the diskprediction plugin.
You can just enable us through the manager daemon, and then you can just use Ceph-native commands to access our predictions.
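[Editor's sketch] Based on the upstream Ceph documentation for the Nautilus-era module, not on Ryan's slides, enabling and querying the predictor looks roughly like this; it is wrapped in Python here for consistency with the other sketches, and the exact command names are worth verifying against your Ceph version.

```python
# Minimal sketch: enable the diskprediction module via the mgr daemon
# and read a prediction with native Ceph commands.
import subprocess

def ceph(*args):
    return subprocess.run(["ceph", *args], check=True,
                          capture_output=True, text=True).stdout

# Enable local disk-failure prediction on the manager daemon.
ceph("mgr", "module", "enable", "diskprediction_local")

# List tracked devices, then ask for one device's life expectancy
# ("<devid>" is a placeholder for an id from the device list).
print(ceph("device", "ls"))
print(ceph("device", "predict-life-expectancy", "<devid>"))
```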
So we released with Nautilus; for older versions of Ceph, you would use this one-line installation, and you can use that with Ansible, Chef, Puppet, any kind of automation software, to make it simple for a mass deployment. And our biggest account right now is actually in Michigan. There are three universities, Wayne State, Michigan State, and the University of Michigan, and what their setup is: all three of these campuses share a single giant Ceph cluster, and they put all their research data on this Ceph cluster. So they have to make this Ceph cluster as resilient as possible, and what we provide is just the disk predictions, allowing them to monitor the health of their disks before they fail. Right, and I'm just going to go through a quick live demo. I'm going to switch screens here. Can you guys see my web browser?
You can see how many disks are bad, how many are going to fail in less than two weeks or less than six weeks, and you can go to the disk health list here to get a list of every single disk that's being monitored. Then you have all the unique identifiers, which node it's on, the size, the serial number, the vendor, all over here. Sorry, all over here, so you can see that; you can easily identify the disks.
This would be where you would go for the disk details. And then, as we alluded to earlier, we also have prediction for capacity and performance. So over here we have the cluster capacity, but we also go down to the OSD level. I'll just use pools, because it's more interesting, and then we can predict future capacity for up to the next ninety days. But of course this depends on how much data you have, so the general rule of thumb is: for every cycle that we predict...
Ryan: Yeah, because they wanted a lightweight version of our predictor, and so we just gave them one with less baggage, one that would be only 70% accurate, that they could enable locally. But it wouldn't use all the metrics that are provided for the prediction. It was requested by them to have a local, lightweight package. Okay, yeah.
Moderator: We have almost ten minutes left, so I'd like to talk a little bit about some of the goals for this group. One is that we're just trying to reach out and build the community around AIOps, and make sure that we have some of the resources that people are looking at and requiring. So thank you both, Bogdan and Ryan, for sharing your insights and your tooling.
That's a great start, and if there are other topics that people want to talk about or present on, or questions you have, please reach out to us. Again, sign up through the Google Groups and ask for those. Is there anyone here, or in the chat, that has any questions? Not seeing any. I'm hoping that some of you will have some suggestions for upcoming topics, and we can move forward. We were planning on doing this on Mondays at 9 o'clock.
So if you're interested in getting together, then please reach out to Marcel or myself, and we'll start coordinating a face-to-face, sometime probably in September. So, Marcel, if you wanted to add a few words in here? I've added a few resources down at the end. If everybody could send me PDF versions of their slide decks, to dmueller at redhat.com, that would be great, and I'll add them in as well. Marcel? Yeah.
I'll post the video of this session to the Google Groups list, and I'll create a YouTube playlist for these topics, edit them, and get them up, hopefully in the next 24 hours or so. Is there anything else anyone would like to add while we're here? I'm just checking the chat again, and no. So hopefully I've gotten everybody's affiliation correct; if not, I posted the link already into the Google group, and we can correct it from there. Thanks again, everybody, for attending, and we'll be back again in another month.