Description
Logs typically record the source of truth during an incident, but their sheer volume and messiness make incident detection and root cause analysis extremely challenging. As a result, logs are usually searched reactively, relying on a mix of intuition and brute-force effort. But there's hope: machine learning can be used to automatically detect anomalous log patterns and correlate them with root cause.
In this webinar we’ll discuss and demonstrate an approach that utilizes unsupervised machine learning to structure and categorize streaming log events and then learn normal and anomalous log patterns. The end result is reliable auto-detection of incidents and their root cause.
Sanjeev Rampal: Okay, I'd like to thank everyone who's joining today. Welcome to today's CNCF webinar, Using Machine Learning for Autonomous Log Monitoring. I'm Sanjeev Rampal, a principal engineer at Cisco and a CNCF ambassador, and I'll be moderating today's webinar. We'd like to welcome our presenters: Larry Lancaster, the founder of Zebrium, and Gavin Cohen, Vice President of Marketing at Zebrium.

A few housekeeping items before we get started. During the webinar you are not able to talk as an attendee, but there is a Q&A box at the bottom of your screen. Note that this is different from the chat window. Please feel free to drop your questions into the Q&A box and we'll get to as many as we can at the end. This is an official webinar of the CNCF and, as such, is subject to the CNCF code of conduct. Please do not add anything to the chat or questions that would violate that code of conduct; basically, please be respectful of your fellow participants and presenters. With that, I'm going to hand it over to Larry to kick off today's presentation.
Larry Lancaster: Machine data is my life. I've spent most of my career taking telemetry from products in the field and turning it into tools, business intelligence, and deliverables back to end users. What I'm going to talk about today is motivated by that history: having delved in and built platforms on top of machine data so many times, I kept getting to the same point.
When I look at what comes out of that in terms of monitoring solutions, I see that, for one thing, it often ends up being slow to figure out what's happening when there is an incident, just because you end up having to go digging through the logs yourself anyway.

So that's annoying and suboptimal. The fragile part I touched on, but how many of you have ever had to build scripts or regexes and parsers or tools on top of log data? What will often happen is that you'll have something that's working, and then a developer does a really nice thing, like improving a log message, and your parsing breaks.
When that happens, it makes dealing with logs and using them a manual process. And finally there's alert fatigue, which I think stems from how little visibility the tools typically have into the actual semantics of the data. What you'll see is people setting up alerts along the lines of: okay, if I get more than a thousand errors something is wrong, and if it's less than a thousand I know everything is okay.

That's the state of things, and it annoys me. The reason is that, at least for me, logs have always been used for root cause. It's almost always the case that you're going to go to the logs at some point. In fact, if it's a new situation, a new problem, an incident of any kind, logs are very good; you'll end up in a log file.
So given that logs clearly have the information we're looking for, why aren't they better at helping us monitor things in the first place? I think it boils down to this manual quality. When I say logs are stuck in index-and-search, what I mean is that there's a person doing the searching. There's a problem with that, although it hasn't always been a problem, especially when it's your own app.

So let's look at that model. Twenty years ago, when I started in the valley, there was a shrink-wrap model. Of course it wasn't always delivered in a box, but you get the idea. You had an incident, and there was a user or a customer, maybe a few users at one customer, that had hit a bug, and there was a support department. Everything was completely different back then: you had an incident on a monolithic application.
There might be up to ten log files, and ten would be pushing it, that people would need to be familiar with and look through. You would have them indexed, you would search through them, you would figure things out, and that was great. But things have changed. Nowadays an incident can affect tens of thousands of users, you've got dozens of services at least, and you've got potentially a thousand log streams.

To me this is unacceptable, and I believe the future doesn't have to be like this. I feel we've gotten to the point now where, at least to a large degree, we can let systems help us pinpoint the root cause indicators in those swaths of log data, without us having to go through the same process we've been doing for decades. Think forward twenty years: do you really think we want to still be doing this twenty years from now?
My answer is no, so let's get started with that. What I wanted from a tool is something that will characterize incidents before I notice them. That's a really ambitious statement, so let's talk a little bit about what it means and what it's taken.

What it means is automatically detecting incidents when they're happening, when weird things start to occur. For example, if you were to hire someone new into a DevOps role, they would be checking out the system, monitoring alerts, and looking around, and when things start going haywire they will typically have a good sense that something is going wrong. Even if they don't know what the problem is, they may well be able to figure out that, hey, something's going on here.
So let's start with that: let's automatically detect incidents without a whole bunch of alerts and rules having to be configured, and then let's go find the material within those logs that's germane to root cause indication. What's interesting from this perspective is that we've accumulated probably 110 to 120 real-world incidents, across more than 30 stacks, that people have been generous enough to share with us.

We found that there are some fundamental ways software behaves when it's breaking that let a system go in and find root cause clues for you. I talked a little bit about why this is so hard to do from the perspective of monitoring software: you've got ambiguous parses, you've got formats changing, and you need experts to interpret the data because apps are bespoke. So what do I mean by all that?
What you'll find is that the connectors and pre-built integrations that tools give you for getting at log data are mostly concerned with the prefix, the stuff at the left of the log line. There you'll have things like tags or function names, timestamps, and severities. Some of these prefixes are really long and complicated, some are very short and simple, and that's all very important information.

So we need to grab all of that, but everything to the right also needs to be structured if you want a machine to be able to tell you things like: there's a number in here, it's going up, and it's not usually this high; or this very particular kind of event is happening a lot more often than usual. You need to get at the semantics of that data programmatically, and that's what has been hard.
So the first thing we do is structure the logs to a relational level, such that if I wanted to run queries about what was going on in my logs, I could. That's not how you use the product day to day; we have a database on the back end that holds that information, and customers can get access to it, but generally that's not the point. We build things on top of it.

The point is, imagine in your mind the table that gets created from a certain kind of log event. Here's a very simple example of a very specific kind of log line, and you can see that over time there are some numbers changing, and it's fairly clear to the eye what those columns should be called. That's what we want the system to work out.
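To make the idea concrete, here is a minimal sketch, in Python, of splitting a free-text log line into a constant template (the event type) and its variable parameters. The token patterns and the example line are invented for illustration and are far simpler than what the presenters describe.

```python
import re

# Illustrative token patterns: anything matching one of these is treated as a
# variable "column" of the event type rather than constant template text.
TOKEN_PATTERNS = [
    (re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}$"), "<TS>"),   # ISO-ish timestamp
    (re.compile(r"^\d+(\.\d+)?$"), "<NUM>"),                          # integers / floats
]

def structure(line):
    """Split one log line into (event_type_template, parameter_values)."""
    template, params = [], []
    for token in line.split():
        for pattern, placeholder in TOKEN_PATTERNS:
            if pattern.match(token):
                template.append(placeholder)
                params.append(token)
                break
        else:
            template.append(token)        # constant text stays in the template
    return " ".join(template), params

etype, params = structure("2020-03-01T04:17:02 postgres stopped after 512 ms")
print(etype)    # <TS> postgres stopped after <NUM> ms
print(params)   # ['2020-03-01T04:17:02', '512']
```

Repeated occurrences of the same template then land in the same virtual table, with one column per placeholder, which is what makes queries and anomaly detection over the variables possible.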
You don't want to have to go and check whether somebody has a connector for X, because for your own application logs nobody is going to have a connector. You want a system that can just get in and learn the grammar the way a person would, and then keep iterating on its understanding of that grammar in the background, without you having to worry about it. That way you can embrace free-text logs.

Structured logging is cool, and we can certainly handle it, but I think it's annoying, at least from a developer perspective: structured logs are really hard for a human to read, and there's no reason we should have to re-translate our entire infrastructure for the benefit of machines. Why can't they just translate it for themselves? A big chunk of your stack is always going to emit free-text logs, and we believe that's important and valid.
Once you've done that structuring, you can do anomaly detection on that data. As I was mentioning earlier, once we've got the data structured properly we have ways of looking at it that have yielded some amazing results for us, in terms of the things that just tend to happen when software breaks. People always ask what kind of models we use, so without getting too deep into it, here is what I'd like to say.

In this part of our software stack we're not doing any deep learning, because this all happens in real time. We start ingesting logs in line, we structure them in line, and everything just gets better in the background. There's no batch processing, so it's not going to cost you an arm and a leg to run the service.
The way we're actually able to pull these incidents out is by looking at point-process statistics on the event types. Think of each event type: in a given log file there may be a thousand unique event types, a thousand of those tables I mentioned, virtually speaking, a thousand kinds of things that can happen in the environment and that are expressed in the logs.
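Treating each event type's occurrences as a point process suggests simple statistics such as arrival rates. The following is a minimal sketch under that assumption; the window and threshold are invented, and this is not the presenters' actual model.

```python
from collections import defaultdict

def rate_anomalies(events, window=60.0, threshold=5.0):
    """events: iterable of (timestamp_seconds, event_type).
    Flag event types whose recent arrival rate far exceeds their baseline."""
    per_type = defaultdict(list)
    for ts, etype in events:
        per_type[etype].append(ts)

    flagged = []
    for etype, stamps in per_type.items():
        stamps.sort()
        span = max(stamps[-1] - stamps[0], window)
        baseline = len(stamps) / span                       # long-run events per second
        recent = [t for t in stamps if t > stamps[-1] - window]
        recent_rate = len(recent) / window                  # events per second lately
        if recent_rate > threshold * baseline:
            flagged.append((etype, baseline, recent_rate))
    return flagged
```

A real implementation would also model periodicity and inter-arrival distributions per event type, but the point-process framing is the same.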
There are other people approaching the same space, and I think it's a fascinating space; there are a lot of brilliant people out there looking at ways to structure this kind of data. There's what I'd call a community of academics that looks at it one way, and there are some folks in industry who look at it in different ways. The deep learning angle was interesting to me because I have a story about it.

There's a large tech company that you would all know the name of, and they have a CTO for their services division. When I went and spoke with him he said: yes, this is a problem for us. We've got all these logs from all these different products, we gather the logs together, and we decided we wanted to go do some learning on that data and try to understand what's happening in customers' environments.
So what we did, he said, was buy all of our senior engineers DGX-1 workstations, send them for training on deep learning, and set them loose on the data. What ended up happening was that after six months we abandoned it, because we found they were having to spend all their time structuring the data rather than actually building models on top of it.

So it's been my learning that if you structure the data right, you have a lot of options, and you don't have to jump to the most expensive, trendy one right away; you can try other methods as well. So we try to use a Swiss Army knife machine learning approach.
It's interesting when you look at log data, and you might find this interesting too. If I take a terabyte of some application stack's log files out of some environment and look at it, typically about half of the event types I see within that corpus will appear only once or twice. So what does that tell you?

It tells you that if I've got this vision in my head where I'm going to onboard someone, start looking at their data, and suddenly have this massive corpus from which I've learned exactly what every event type looks like, with all its permutations and distributions, then I've got another thing coming. It doesn't work that way, so you need to do something different.
Basically, we have a four-stage pipeline, and depending on how many times we've seen an event type, a different stage of that pipeline has the primary effect. There's a layer that says: if I've only seen this event type once and there are numbers in it, I'm going to assume those numbers are parameters until proven otherwise. The next step is basically reachability clustering: these lines look alike, so I'm going to assume they belong together until proven otherwise. Then, once I've had a few examples, a naive Bayes classifier kicks in with a global fitness function that basically says: okay, here's the blob of stuff that's related, and here are the columns, and now we're really sure of it. So on the back end we shuffle things around between these buckets, and eventually it hardens into a structure.
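A minimal sketch of that staged idea, routing an event type to a stage based on how many times it has been seen, might look like the following; the stage names come from the talk, but the thresholds and the glue code are invented for illustration.

```python
def choose_stage(occurrences_seen):
    """Pick the structuring stage that should dominate for this event type."""
    if occurrences_seen <= 1:
        return "assume_numbers_are_parameters"   # single example: treat numbers as variables
    if occurrences_seen < 10:
        return "reachability_clustering"         # few examples: merge near-identical lines
    return "naive_bayes_refinement"              # many examples: settle the columns

def ingest(line, assign_type, stages, seen_counts):
    """Route one incoming line through the stage appropriate to its maturity."""
    etype = assign_type(line)                    # provisional event-type id for this line
    seen_counts[etype] = seen_counts.get(etype, 0) + 1
    stage = choose_stage(seen_counts[etype])
    return stages[stage](etype, line)            # each stage refines the schema for etype
```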
The nice thing about statistics that cross-correlate among event streams is that the anomaly detection gets better the more complexity you have, because there are more cross-correlated streams to work with.
So with that, I'm going to hand it over to my colleague Gavin, who is an absolute whiz with the demo and can do it in a time-efficient manner, and then we're going to come back and answer some questions. I'm going to go ahead and stop sharing.
Gavin Cohen: So, assuming everyone can see this, what you're looking at here is the overview screen that appears immediately after logging in. Just to set some context for what I'm about to demonstrate: we've deliberately ingested just a tiny data set, 24 MB and about 180,000 events, and there was absolutely zero manual configuration. No one built rules, and there was no pre-learning.

There was zero knowledge of this data set until it came in. Essentially our ML went through the events, and what we uncovered, from an overview perspective, is a bunch of exceptions, some events with error or high severity, but the really interesting stuff is here: 155 anomalous events, meaning events that broke pattern compared to what we would expect based on that small data set, and then one incident. The incidents are the things we really care about.
Incidents are the things that bubble up as correlated sets of anomalies that we believe are not happening by chance; they're happening because something is changing in the behavior of the software. What I can do here is click on the incident, and you're taken to a root cause description. It seems reasonable: "Postgres stopped" is what it calls it, and that comes directly out of one of the events. If I click on it, you get the detail of what we found, so let me explain this a little bit.

The data was collected from a set of Kubernetes pods running the Atlassian stack in AWS. What we found is that in this pod, postgres-master, we see a couple of messages saying that Postgres stopped, and then in a different pod we see a correlated set of events coming from Jira, a different application, and in particular this one saying it's having Postgres issues connecting to the database. So it's a clearly related set of events.
Out of the 180,000 events we ingested, we detected just five that encompass this incident, and it turns out this was exactly the problem that occurred in this situation: something shut down Postgres, and immediately afterwards the rest of the applications running in the other pods started noticing. So that's what we see, and that's what it detected.

If you take this a little bit further, we make it really easy to confirm the diagnosis, and maybe troubleshoot more or find out what really happened. So with that, I click on the button to browse the incident, and I'm taken into what is essentially a log manager view, but filtered at the moment.
What you're seeing at the top of the screen is a set of visualizations of the data set; the entire space of this data set encompasses the 180,000 events, and we break it down into some visualizations that I'll come back to in a minute. Because I clicked on the incident, the view is filtered to just the five events that make up the incident, so we can see them all together, much as they would look if you extracted those lines out of the different log files.

I can turn off that incident filter, and now we stay in the same place but see the incident events surrounded by all the other events from all the other pieces of the environment. In this case you see other log lines from Confluence, and if you scrolled around you'd see all the other Atlassian components spinning out messages. Now you can see the incident in the context of everything, and this is another mechanism for quickly understanding and troubleshooting.
We call this view an x-ray. The x-axis is time, and the y-axis is the event space spanning all the different log sources. We draw a colored rectangle everywhere we find an anomaly, with its position on the y-axis corresponding to where it came from, that is, which particular event type it came from. The brighter the rectangle, the more anomalous the event, so the most anomalous are these very bright white marks you can see just above my mouse pointer.

What you see here is very typical: there are always anomalies. There will always be events that break pattern, and you don't want to alert on those or create incidents from them, because there isn't enough context to say whether they are or aren't problems. Sometimes they're problematic, and sometimes they just break pattern; new events occur for whatever reason.
But if you go to this section here, which is where we found the incident, you see this really tight band of correlated anomalies, and that's the trigger: when we see sets of very high-likelihood anomalies, the very bright rectangles, correlated across different parts of the application, or different log files, log types, or log streams, that's our indication that there's an incident. All of these things broke pattern, they're all anomalous, they're tightly correlated, and there are some very high-probability anomalies in there. That's our incident.
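A minimal sketch of that grouping logic, under assumptions of my own (a fixed time window, a minimum number of distinct streams, and made-up score thresholds), could look like this:

```python
def find_incidents(anomalies, window=30.0, min_streams=2, min_score=0.9):
    """anomalies: list of (timestamp, stream, score), sorted by timestamp.
    Group anomalies that arrive close together, then keep only groups whose
    strong anomalies span several independent log streams."""
    groups, current = [], []
    for ts, stream, score in anomalies:
        if current and ts - current[-1][0] > window:
            groups.append(current)               # gap too large: close this group
            current = []
        current.append((ts, stream, score))
    if current:
        groups.append(current)

    incidents = []
    for group in groups:
        strong = {s for _, s, sc in group if sc >= min_score}
        if len(strong) >= min_streams:           # the "tight band" across streams
            incidents.append(group)
    return incidents
```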
So that's how we picked these five events out of 180,000 to make up the incident: we saw this band, we pulled out what was most anomalous within it, and that became the incident. The root cause is identified roughly by the leading edge, the first anomalous event, the one that seemed to trigger everything else.

If you remember, in the incident we showed you the root cause, Postgres stopping, and then the symptoms, which in this case were Jira noticing that it couldn't talk to the SQL database. And actually, if you go further and look at some of the yellow, the other anomalies around it, you'll find that they relate to some of the other applications that also started having problems once Postgres stopped. So that gives you a good sense of what's happening under the covers.
The very last thing I'll show you speaks to what we did to get to this point. Larry, you spent a bit of the presentation talking about how we structure the data, and there are a few interesting things you can see here. Look at this log line as an example: wherever there's a blue piece of text, we've discerned that it was a variable part of the event. In this case all of these blue pieces are variables, and in particular, look at this Postgres "stopped" message.
The word "stopped" is actually a variable, meaning we've seen an event of this type elsewhere with a different value in that position. So I can chart it, and what you see nicely here is that at 4:17 we get the "stopped" for Postgres, and then two minutes later we get a "starting" and a "started": the same event type that we categorized for the machine learning, with different values for that variable. To get this I didn't have to parse anything myself.
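Once lines are grouped into an event type with a variable slot, charting that slot over time is just a filter and a plot. A small sketch follows; the records are invented to mirror the Postgres example from the demo.

```python
import matplotlib.pyplot as plt

# Invented records mirroring the demo: (time, event_type, value of the variable slot)
records = [
    ("04:17", "pg_state_change", "stopped"),
    ("04:19", "pg_state_change", "starting"),
    ("04:19", "pg_state_change", "started"),
]

times  = [ts for ts, etype, value in records if etype == "pg_state_change"]
values = [value for ts, etype, value in records if etype == "pg_state_change"]

plt.plot(times, values, marker="o")     # categorical values plotted over time
plt.xlabel("time")
plt.ylabel("state variable")
plt.title("Values of one variable slot for a single event type")
plt.show()
```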
You also need to be able to do things like search, and you have full regex searches. As an example, I'll go here and search for the text "milliseconds" (sorry, I mistyped something there), and I'm taken to an event where there's a match for the text in my search bar. Once again, here's an example of an event that it found in the search.

The blue is the variable text, and this time you can see there are a couple of metrics that it has found as variables. So I can pick one of them, display a chart, and get an interesting plot of that value across all the events of that type and how it changes. I might want to look at this one that looks like an outlier.
I can click on that and get taken there, look around, and so on. This is all about being able to learn the structure of the underlying events and then pull this data out without having to manually build any parsing rules. There's a whole lot more, but I'm going to stop here and hand it back to Larry. Thank you.
Larry Lancaster: So let's talk a little bit about where we are right now. We're picking out application incidents, Kubernetes incidents, and even some security types of incidents, and this seems to be working pretty well. Recently we've had some exciting validation: MayaData, who run managed Kubernetes clusters among other things, reproduced a slew of real-world incidents.

It's an amazing company. Using Litmus, a chaos engineering tool they're involved in, they recreated those incidents, and we were able to pull out 100% of them and put a root cause indicator onto the incident page without anyone having to tell the system anything. So this vision I've outlined for you, while it sounds ambitious, is actually coming true, and that's been very exciting for us.
It's also very easy to deploy, especially in Kubernetes. We deploy a metrics scraper, and then we look for anomalies in those time series and cross-correlate them with log events; that's coming in an upcoming release very soon. What we're saying there is that this should be a one-stop shop for incident and root cause detection for the unknown unknowns, the things for which you may not have created rules, or for which it isn't reasonable to do so.
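As a rough sketch of what cross-correlating metric anomalies with log events can mean in practice (with invented thresholds and a made-up pairing window, not the product's actual method):

```python
def metric_outlier_times(samples, z=3.0):
    """samples: list of (timestamp, value). Return timestamps of strong outliers."""
    values = [v for _, v in samples]
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5 or 1.0
    return [ts for ts, v in samples if abs(v - mean) / std > z]

def correlate(metric_samples, log_anomaly_times, window=30.0):
    """Pair each metric outlier with log anomalies seen within +/- window seconds."""
    pairs = []
    for mt in metric_outlier_times(metric_samples):
        nearby = [lt for lt in log_anomaly_times if abs(lt - mt) <= window]
        if nearby:
            pairs.append((mt, nearby))
    return pairs
```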
That's the kind of thing we feel it's time to do, so that machines can start helping people do their jobs, and people can level up and do more strategic work than dragging through log files. Thank you very much for your time. Here's my contact information, we'd love to hear from you, and we're going to open it up for Q&A now.
Sanjeev Rampal: That was great, thanks Larry and Gavin. We now have some time for questions; if you have a question, please drop it into the Q&A tab at the bottom of your screen. I see one question there from Nikhil. So Larry, the question is: what is the actual learning algorithm that is used? Are you using some kind of neural net?
Larry Lancaster: Sure. I touched on this a little during my presentation. There are really two separate sets of analytics, two sets of machine learning, that we do. The first is on the structure of the data, and I talked about the Swiss Army knife we use there: a continuum of approaches, applied more or less depending on the frequency of a given event type.

At the first stage, for an event type we've barely seen, there's a lot of conservative assumption-making like that. The next step is reachability clustering, which takes a more global view of the lines that have been seen, and then the next stage is a naive Bayes classifier with a global fitness function. Finally, when we go back and amend the structure learning we've already done, we like to use LCS, longest common subsequence, for that. LCS is actually kind of the state of the art for learning log structure.
LCS has some weaknesses with low-cardinality data, so for me it's like polish on a car: it's the last thing you do. With that palette of tools, we've found the approach to be very effective. And then there's the question about anomaly detection.

As you saw, there are a couple of phases there. The first phase is determining that something is anomalous in and of itself, and by that I mean that an event of a specific type happened anomalously in isolation. That has to do with, for example, whether it's an event type we have lots of examples of, so that we have a good sense of its periodicity and of the distribution of the values of the parameters within that event; then you can speak specifically to that. And there are other things as well.
Finally, you look at these events as independent point processes and develop statistics from those processes: their autocorrelations, their cross-correlations, and also the correlations of their activity as a sequence of events versus the values you see in their parameters. What you end up with is something that lets you really start to hone in on incidents. Hopefully that gives you a good explanation of what works for us.
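For a flavor of what scoring an event "in and of itself" might involve, here is a minimal sketch that compares a new occurrence's inter-arrival gap and one numeric parameter against the history of the same event type; the 6-sigma squash and the rest of the details are invented, not the presenters' model.

```python
def self_anomaly_score(history_gaps, history_values, gap, value):
    """Return a rough 0..1 surprise score for one new occurrence of an event type."""
    def z(x, xs):
        mean = sum(xs) / len(xs)
        std = (sum((v - mean) ** 2 for v in xs) / len(xs)) ** 0.5 or 1.0
        return abs(x - mean) / std

    gap_z = z(gap, history_gaps)         # arrived much earlier or later than usual?
    value_z = z(value, history_values)   # embedded number outside its usual range?
    return min(max(gap_z, value_z) / 6.0, 1.0)   # map roughly 6 sigma to a score of 1.0
```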
Sanjeev Rampal: Thanks Larry. Maybe I'll tee up a question of my own: how do you see this product used in combination with other tools? You mentioned you're going to be ingesting Prometheus metrics, so what would a target configuration look like? Does it coexist with a Prometheus-based stack, or an Elasticsearch or EFK kind of stack? Where do you see this coexisting with those technologies?
Larry Lancaster: You know, if you find that you don't need to buy a certain other tool that's costing you money, because now you've got something that will display the data for you, let you search the data and chart the data and all of that, and it's finding incidents for you, that's great; that would be a success for us. But I don't think that's really where we're focusing. We're more focused on building up the value of finding the root cause and giving it to you first.
Sanjeev Rampal: Excellent, thank you. Any more questions? Well, it looks like that's it. So thank you, Larry and Gavin, for a great presentation, and thank you to everyone for joining us. The webinar recording and the slides will be online later today. We look forward to seeing you again at a future CNCF webinar. Thanks, have a great day, and with that we are signing off.