Description
Log Anomaly Detector Service - Zak Hassan @RedHat
@OpenShift Commons AIOps Briefing
April 2019
A: Okay, so my name is Zak and I work in the AI Center of Excellence, under Marcel, in the AIOps group. I'm currently the lead for the log anomaly detection project, and I wanted to present some of the things that we've been building internally, starting with the use case. We're building this log anomaly detector to help our internal customer with root cause analysis.
A: We built it with unsupervised machine learning to identify the anomalies. What we found with machine learning is that it gives you a probabilistic answer rather than a deterministic answer, where probabilistic generally means there's a percentage chance of being correct and a possibility of having false positives, which we'll talk a little bit about.
A: We also generate a prediction ID so that we can reference it later, which I'll talk about more in depth in the next slide, about the fact store. And then we have some monitoring and dashboard tools: we're using Prometheus and Grafana to get a dashboard to see the feedback that we're getting back from the fact store, as well as the metrics that are being generated from the model training.
A: Also, in the inference step, we send out messages to Elasticsearch, which triggers an alert to be fired off and an email sent to the user. In that email we embed a link that uses the prediction ID that was generated. So when the user says, okay, this is not an anomaly, this is a false positive, they can report that and give us feedback on it.
A: The system itself is basically a Python program, and you can specify whether you just want to train, or whether you want to run the train and inference steps together. You can provide a ConfigMap or a YAML file to inject into the program, or you can specify environment variables for the options that are supported. You can either pull data from Elasticsearch or from a local file, and the sink that you write to can likewise be either Elasticsearch or a local file.
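The configuration surface described above might look something like the following sketch. The key names and layout here are illustrative assumptions for the talk, not the project's actual schema.

```yaml
# Hypothetical config sketch for the log anomaly detector.
# Key names are assumptions for illustration, not the real schema.
storage:
  source: elasticsearch        # or "local" to read from a local file
  source_path: /data/logs.json # used when source is "local"
  sink: elasticsearch          # or "local"
model:
  mode: train_and_infer        # or "train" to only run the training step
anomaly:
  threshold_stddevs: 3         # threshold = mean + 3 * stddev of scores
```

The same options could equally be supplied as environment variables, as mentioned above.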
A: I can't take the credit for this whole system myself. I had a great data scientist on the team, Michael Clifford, and an awesome architect, Václav Pavlín, who were also involved in the design of the model training of the SOM model and the overall system. I added the fact store, which was an additional component to this system. So I'll talk a little bit about the model training part.
A: So there are two models. There is a Word2Vec model, and the Word2Vec model does the machine learning: basically, it models the probability of words occurring with each other. We're using it in a somewhat unconventional way, essentially as a pre-processing step: it converts the raw log messages into vectorized representations. Once we're done with that in the training step, the next thing that happens within the training step is we generate another model.
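A real Word2Vec model needs a library such as gensim to train; as a self-contained stand-in, the toy vectorizer below only illustrates the idea from the step above: turning raw log lines into fixed-length numeric vectors. The bag-of-words counting here is an assumption for illustration, not the project's actual embedding.

```python
from collections import Counter

# Toy stand-in for the Word2Vec pre-processing step described above.
# The real project trains an actual Word2Vec model; this sketch only
# shows the idea of mapping raw log lines to fixed-length vectors.

def build_vocab(log_lines):
    """Map each token seen in the training logs to a vector index."""
    tokens = {tok for line in log_lines for tok in line.split()}
    return {tok: i for i, tok in enumerate(sorted(tokens))}

def vectorize(line, vocab):
    """Bag-of-words count vector; tokens not in the vocab are ignored."""
    vec = [0.0] * len(vocab)
    for tok, n in Counter(line.split()).items():
        if tok in vocab:
            vec[vocab[tok]] = float(n)
    return vec

logs = ["connection accepted", "connection refused", "disk error"]
vocab = build_vocab(logs)
print(vectorize("connection refused", vocab))
```

These vectors are what the second model, described next, consumes.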
A: Another model, called a SOM model, a self-organizing map. What that does is compare the vectorized representations of the words in the inference step: basically, we train the map, and then you measure the distance of each inference message to the different nodes on the map, and then determine how like or unlike they are to the nodes on the map.
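The distance measurement described above can be sketched as follows: each node on the trained map holds a weight vector, and an incoming log vector is scored by its distance to the closest node. The node values below are hand-set for illustration; a real SOM would learn them during training.

```python
import math

# Sketch of the SOM scoring step: score an inference vector by its
# distance to the best-matching node on the (here, hand-set) map.

som_nodes = [
    [0.0, 0.0],
    [1.0, 1.0],
    [0.5, 0.5],
]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def anomaly_score(vec, nodes):
    """Distance to the closest map node; larger = more unlike the map."""
    return min(euclidean(vec, node) for node in nodes)

print(anomaly_score([0.4, 0.6], som_nodes))  # near a node: small score
print(anomaly_score([5.0, 5.0], som_nodes))  # far from every node: large
```

A message that lands far from every node is "unlike" everything seen in training, which is what gets flagged.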
A: Basically, it generates a real number; if it's normalized, it would be between 0 and 1. We have a threshold value, which is configurable. For example, three times the standard deviation plus the mean is one way of calculating the threshold value, and this is a standard way to determine whether something is outside the bounds of normal behavior.
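The threshold rule from the talk, mean plus three standard deviations, can be written directly. Whether the population or sample standard deviation is used is a detail the talk doesn't specify; this sketch uses the population form.

```python
import statistics

# Threshold rule described above: a score counts as anomalous when it
# exceeds the mean of the training scores plus N standard deviations.

def threshold(train_scores, n_stddevs=3):
    """mean + n_stddevs * stddev over the training-time scores."""
    return statistics.mean(train_scores) + n_stddevs * statistics.pstdev(train_scores)

scores = [0.10, 0.12, 0.11, 0.09, 0.13]
t = threshold(scores)
print(0.50 > t)  # a score of 0.50 would be flagged as an anomaly
```

Making `n_stddevs` a parameter mirrors the configurable threshold mentioned above.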
A: And then we want to check: has this message been seen before? Has it been reported as an anomaly? That way we try to prevent the user from getting duplicate emails and messages, or, if they reported that it was a false positive, they should never get an email for it again. The next step after this is to pull in that data and retrain the model with the false positives; that's still in development, something we're working on to improve our model.
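The de-duplication check above can be sketched as a simple lookup before alerting. The in-memory sets here stand in for the fact store; the function and names are illustrative assumptions.

```python
# Sketch of the alert de-duplication described above: suppress the email
# if the message was already alerted on, or if the user already marked
# it as a false positive.

seen_anomalies = set()   # messages we have already alerted on
false_positives = set()  # messages the user said are not anomalies

def should_alert(message):
    """Return True only for a first-time, non-false-positive anomaly."""
    if message in false_positives:
        return False  # user flagged it: never email again
    if message in seen_anomalies:
        return False  # already alerted: avoid duplicate emails
    seen_anomalies.add(message)
    return True

print(should_alert("disk error on node-3"))   # first sighting: alert
print(should_alert("disk error on node-3"))   # duplicate: suppressed
false_positives.add("cache miss ratio high")
print(should_alert("cache miss ratio high"))  # known false positive
```

The retraining step mentioned above would then consume `false_positives` as extra training signal.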
A: That is the flow of how this system works. All right, the next thing I'm going to show you is a live demo, because I think slides are cool, but what's cooler than a live demo? So I've prepared an environment here for us, with everything deployed.
So, for the database that's being used: we're using SQLAlchemy, and the database itself is MySQL. Then we have the fact store, which is a Flask application that connects using SQLAlchemy.
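The real fact store is a Flask application backed by MySQL through SQLAlchemy, as described above; as a self-contained stand-in, this sketch records user feedback keyed by prediction ID using sqlite3. The table and column names are illustrative assumptions, not the project's actual schema.

```python
import sqlite3

# Minimal stand-in for the fact store: persist user feedback on a
# prediction, keyed by the prediction ID embedded in the alert email.

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE feedback (
        prediction_id TEXT PRIMARY KEY,
        is_anomaly    INTEGER,   -- 0 means the user says false positive
        notes         TEXT
    )
""")

def record_feedback(prediction_id, is_anomaly, notes=""):
    """Insert or update the user's verdict for one prediction."""
    conn.execute(
        "INSERT OR REPLACE INTO feedback VALUES (?, ?, ?)",
        (prediction_id, int(is_anomaly), notes),
    )

record_feedback("pred-42", False, "it is just wrong")
row = conn.execute(
    "SELECT is_anomaly, notes FROM feedback WHERE prediction_id = ?",
    ("pred-42",),
).fetchone()
print(row)  # (0, 'it is just wrong')
```

The dashboards mentioned earlier would read this table to chart how many predictions were marked wrong.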
A: We can see that it did record some anomalies, and it sent some "anomaly was found" messages; this is the score that it found. The data scientists can also look at a graph on a dashboard, which is a little bit easier to look at, to see how we're performing in terms of the system itself. You can also view some of the log messages that were recorded, the prediction ID, the score, and then the internal customer can give us feedback over here.
A: We use a query string, and then it auto-fills these values here. Basically I would say: is this an anomaly? False. And why is this not an anomaly? Because it is just wrong. Once I submit that, the feedback goes in, and it's now in the dashboard; we can see this anomaly marked wrong. And that's it: we iterate through this as we continue developing, and that's pretty much it for the demo.
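The query-string mechanism described above can be sketched with the standard library: the alert email embeds a URL whose query string carries the prediction ID, and the feedback form parses it back out to auto-fill its fields. The base URL and parameter names here are assumptions for illustration.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Sketch of the feedback link: embed the prediction ID in a query
# string so the fact store's form can auto-fill its fields.

def feedback_link(base_url, prediction_id):
    """Build the URL embedded in the alert email."""
    return base_url + "?" + urlencode(
        {"lad_id": prediction_id, "is_anomaly": "false"}
    )

link = feedback_link("http://factstore.example.com/feedback", "pred-42")
print(link)

# The fact store side parses the same query string back out:
params = parse_qs(urlparse(link).query)
print(params["lad_id"][0])  # pred-42
```

Keeping the ID in the URL is what lets a one-click email link land on a pre-filled form.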
A: Are there any questions? Anything that anybody wants me to dive into more in depth? Glad to take any questions.
A: So we have this internal machine learning platform where we have OpenShift, and we have deployed more components like Elasticsearch, Kibana, as well as Logstash and Kafka. Some systems are deployed up there, and their logs are streaming through Kafka and then into Elasticsearch. Some customers would have a requirement for the machine learning; for them, we deployed this as a service that they can utilize.
C: Yes, so the internal customer, and I think that was also your question, is a team within Red Hat that is maintaining build pipelines for containers, and they are using this tooling to identify anomalies in logs of their deployed system. They are running on an internally deployed OpenShift instance. So, basically, you could deploy the same setup that Zak just highlighted for any other team that also runs on top of OpenShift, and I think that's it.
C: Where we're trying to go: this started as a prototype that Michael Clifford did during his internship last year; then Zak took it and sort of productionized it, and now we're trying to actually apply it to a running customer. And it's, again, all open source: you can find it on GitHub and deploy it for yourself.
C: Obviously it has some prerequisites to set up: it needs to be on top of OpenShift, and it needs access to Elasticsearch, where the logs are stored. But other than that, it's an open source project, and we're also in the process of collaborating with a Czech university in Prague to extend at least the research part of it. It's coming out of the CTO office of Red Hat, so it's in no way a productionized system, but it's one of the internal research topics that we're working on. Zak?
B: Are there any other questions or thoughts from folks on the call, for Zak, or for Brian, or for Phil? Well, we have them here, and if there are other topics you folks would like to hear about, we will be meeting again in another month, post Summit and KubeCon, so there'll be lots of busyness in between. So if you've got topics that you learn about there that you want to talk about afterwards, just ping Marcel or myself and we'll get them added to the agenda for the next meeting.