From YouTube: Machine Learning for k8s Logs and Metrics
A
All right. Like I said, thank you for coming today, and welcome to Machine Learning for Kubernetes Logs and Metrics. I'm Libby Schultz and I'll be moderating today's webinar. We'd like to welcome our presenter today, Larry Lancaster, founder and CTO of Zebrium. A few housekeeping items before we get started: during the webinar you're not able to talk as an attendee, but there is a Q&A box at the bottom of your screen. Please feel free to drop your questions in there, and we'll get to as many as we can at the end.
B
Hey, thank you so much. Hey, everybody. Today we get to talk about something that's near and dear to my heart: machine learning on logs and metrics, in particular when we're deploying in Kubernetes. And so with that, let's get started.
B
So machine data is my life. I've wasted far too many years dealing with machine data, at a number of companies, in a number of roles.
B
At the end of the day, what I've found is that there's a lot of value that can be gotten out of telemetry, but it really is a long mile to walk to pull that value out. The vision that I thought became realistic maybe five or six years ago grew out of that.
B
So I'm going to frame the problem space, as I see it, in terms of how things are converging. Twenty years ago, if you think about it, there was shrink-wrap software. If you had an incident, you had one affected user and one monolithic application.
B
If you were stretching it, you might have had ten different log files to look at, and the way you would dig into root cause is that you would index those. If you were lucky you might not even have to; you might just search them, because for any given incident they might be small enough that the volume of data wasn't even a problem.
B
It's always been true, as long as I've known it, that for detection of incidents, metrics are sometimes the best way to go. But if you want to get to the root cause of a new incident, something that's new, a new problem, chances are you're going to end up in a log file at some point. So this is an important piece of the incident management and triage workflow.
B
If you compare that to today, you have a SaaS world. Now you've got one operational incident, maybe a hundred thousand users affected, and maybe a hundred services that could potentially be taking part.
B
You have maybe a thousand log streams coming through with all this telemetry, so using the telemetry becomes inherently more difficult. And yet if you look at how people are using that telemetry today, oftentimes it's still the same approach. So the question is: what do we need? Do we need something better, and if so, what is it? To step back and look at it:
B
If you take a look at the state of DevOps, there's a report done annually that tries to survey the trends in the field, and one thing that got called out in 2019 is that, because of the complexity of a typical deployment today, MTTR has plateaued. In other words, even the elite shops,
B
with all their scripting, rules and so on, have hit that plateau. What's typically driving MTTR now is new problems, and the reason for that is the complexity of software today: something is going to break. This is reflected again in the same report: what you're seeing is that no matter whether it's a small shop or a large shop with an elite team,
B
the new incident, the new thing, the unknown problem, has become the driver of downtime today. So our vision is autonomous root cause.
B
So to me, that's why machine learning on telemetry is very important, and it's probably going to become a lot more important in the coming years.
B
It's an amazing ecosystem. Someone can come in and deploy our software and collectors, for example, with a couple of Helm charts, or with a couple of kubectl commands. It's just absolutely amazing how little configuration can be required to deploy an application. So while there's been a lot of complexity that's come along with the microservices environment and the decentralization of software,
B
there's also been a birth of new flexibility, coming mostly out of the metadata that these deployment systems contain.
B
So I'm going to want to be able to monitor an arbitrary application, and the truth is, it would be nice if all of our software were running in a JVM somewhere, but that's not always the case. We need arbitrary runtimes to be supported, and arbitrary infrastructure: it could be that I need to monitor a Linux instance in AWS, or some bare-metal server somewhere.
B
So there are a lot of combinations now that make up the complexity we talked about a minute ago, and taking these things into account becomes important if you're going to do effective automation of root cause. Another interesting thing comes up as well.
B
If you take one application stack and deploy it in one environment, say, for example, an Atlassian stack, which a lot of people use: there are still a lot of people running it on-prem, or at least in their own VPC, and there will be for a long time, because it holds source code and some people just aren't comfortable putting that in a cloud environment. And there are a number of applications like this, sensitive databases and so on.
B
So it doesn't really work to have the model where I go and, quote-unquote, learn about Postgres logs, what's normal and what's not, in environment A, and then take that learning and distribute it to a thousand users who are all using Postgres in completely different ways.
B
The complexity, from the perspective of doing machine learning in that case, is that there needs to be a lot of custom learning done in whatever environment the solution is going to be deployed in. I talked a little bit about zero required configuration for setup, and how useful Kubernetes has been for deployment in that respect.
B
There are lots of other things you don't want to require either. You don't want to require an end user to sit down and train a system: "yes, no, yes, no; here's a thousand log events, is this one interesting, yes or no?" We've found, at least, that people aren't willing to do that kind of work, so that also bounds your solution space quite a bit. There are a lot of things
B
you don't want to require of a user to get started, even though they may very well end up wanting to do some of those things; you don't want to have to require them if you're going to call your application autonomous. So in that sense, is it really too much to ask? I think
B
it's gotten to the point where it's not too much to ask to be able to see real value immediately. There's always going to be a period of adjustment, and a period of going in and giving some amount of coarse-grained feedback; you're always going to need that. You're going to want to let people bring their alert rules, and I'm not saying these things aren't valuable.
B
In fact, in some cases they're critical, but my contention is that your system can't require all of that just to show value. So why am I saying that we need to be so flexible? Because if I'm requiring a person to go in and do training, that's probably not going to scale indefinitely, and any assumptions I make about a stack may not hold. So, given that, where do we start?
B
Let's say we're doing a very general machine learning task on a set of telemetry. Where do you start? What set of telemetry would you start with? Everyone may have an opinion on that. My opinion is that if you're targeting root cause analysis, then you have to start with logs. The thing about logs is that they're difficult, they're difficult to work with, but there are reasons that logs are so valuable when you're root-causing a new incident type.
B
Oh yeah, thanks Gavin, I'm going to do that right after this slide, thank you very much. So a free-text log is going to tell a real story about what's happening, and if you look at these log lines I've put here as an example,
B
what you'll notice, at least if you have some experience, is that when you read them they kind of tell you what happened, without a lot of rules having to sit behind all that. We don't need a lot of metadata to tell that story out of the text.
B
Oh, actually you're right, Gavin, I think I skipped that spot; I was going to stop a couple of slides ago. So really quickly, everyone, I'm going to let Gavin jump in. Gavin is our head of product, and he's going to start up the Kubernetes demo; then at the end we're going to look at how things panned out. So hold on, let me go ahead and stop sharing. Gavin, apologies for that.
C
Thanks, Larry. As you just mentioned, I'm going to run this demo in two parts. In the first part, which I'll do now, I'll just show you the demo environment that I'm using and then I'm going to break it; in the second part I'll come back and show you what the Zebrium machine learning picks up. So if I share my screen now, I'm going to take you into Google Cloud's demo microservices app. It's a little web app.
C
It's running a shop. I'm going to purchase a barista kit as we speak, just to show you what it looks like running, and there we go. I've also got a whole bunch of services running: Istio, Prometheus and Kiali, and in Kiali we can see what the data flows look like and what's going on on the network, or rather the service mesh, here. You can see there's a lot of activity.
C
The reason I set that up prior to the webinar is that I wanted to give it a little bit of time to learn the basic patterns in these logs. The way you sign up for Zebrium is you go to the Zebrium site, click Get Started Free, fill in your name and so on, and then set a password, and that will take you into a screen that looks something like this.
C
Now I'm going to go ahead and break my application. If you look here, these are the pods that are running for the app, and I'm going to essentially kill the product catalog service pod by scaling it down to zero replicas. We should see it die in a moment; it's busy terminating now, and it should disappear.
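For anyone following along, this break step is easy to reproduce outside the demo. Below is a minimal sketch using the official Kubernetes Python client; the deployment name and namespace are assumptions based on how the Google microservices demo is commonly installed, so adjust them for your cluster (it is just the scripted equivalent of `kubectl scale deployment ... --replicas=0`).

```python
# Minimal sketch: scale the product catalog deployment down to zero replicas,
# reproducing the "break" step of the demo. Assumes a working kubeconfig and that
# the deployment name/namespace below match your install (both are assumptions).
from kubernetes import client, config

config.load_kube_config()                  # use load_incluster_config() inside a cluster
apps = client.AppsV1Api()

apps.patch_namespaced_deployment_scale(
    name="productcatalogservice",          # assumed name from the demo app
    namespace="default",                   # assumed namespace
    body={"spec": {"replicas": 0}},        # zero replicas effectively kills the service
)
# Patching replicas back to 1 restores the service afterwards.
```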
If I go back to my web app now and try to do something, it's not working particularly well, and in a moment you'll see Kiali react.
C
It should start to turn red as it detects things failing. So essentially my app is completely broken now. What Zebrium will see is a change in the patterns coming through; we have absolutely not built any rules to detect this kind of problem. I know it's a fairly trivial problem, but just bear that in mind. So let me hand back to Larry, and I'll come back towards the end and show you the incident that Zebrium should pick up from this. Larry, back to you.
B
All right, that was fantastic, Gavin, thanks. What's interesting about the demo Gavin just started: I think it was a week or two ago, a service provider went in, signed up, and decided they were just going to see what happened if they did X, and we thought it was so simple and cool to show that we ended up appropriating it.
B
So you have to wonder why, in general, people don't use logs much for monitoring. I should be careful: it's not true that people don't use them at all for monitoring; there are plenty of alert rules built on logs all over the place. It's just that, in general,
B
the direction for monitoring, at least for finding out when things are broken, is to use metric alerts, and part of the reason is that logs are generally higher volume, and there are some other conceptual problems with logs that make them difficult to work with. If you think about what log monitoring tools look like today, generally you're going to sit down and build all of this automation around them, so it's a very tedious and manual process. But still, we know that just under the surface, when the manual work is applied, there's a lot of value there. So when we think about what makes logs difficult to work with:
B
Let's say I'm trying to root-cause some issue that just happened. It's slow and painful when I'm having to search for keywords and I don't know what they should be. Maybe I'll look for a spike in log volume; maybe that's going to tell me where to look first. But then you find out,
B
oh yeah, a yum update ran in the background and that spike was from something else, and I still don't know where to look. So it becomes "let me type in things like fail, bad, abort, whatever," just trying to find a place to start looking. It's a real pain. Logs are also fragile: formats change. I don't know how many times I've experienced this,
B
but you'll set up a rule on some logs and go away happy with yourself; the next time that event happens, you're going to catch it and do something special with some value that's a parameter in that log event, or whatever. And then someone who doesn't owe you an explanation at all, someone upstream, decides
B
to do something really helpful and nice, actually, which is to fix a spelling mistake, and the next thing you know, your little rule silently breaks. These are the kinds of frustrations that log monitoring surfaces. And finally, it just gets annoying. Sometimes, if you try to set simple rules,
B
you'll blow up with "okay, I'm going to do this every time I get an error." But now I've deployed some new part of my stack, or a new version of something, or something completely irrelevant has happened, and something is spewing
B
hundreds or thousands of error events into the log that really don't matter. Now I have to write rules to suppress that, or buy an AIOps tool to try to group them into one thing without ringing my pager all night. It's a real pain.
B
So to me, logs have been stuck in this rut for a while, and I think it really boils down to being stuck in the index-and-search mentality. Index and search are just ways to speed up the manual work, and as long as you're doing the manual work, you're going to be required to maintain that work, and you're going to be subject to the limitations of your own process as you look around trying to find what's important.
B
So it's a self-limiting approach. We talked a little bit about this already, and I did mention that applications are bespoke; this is actually an important one. Let's say someone gives me some package that's going to look for errors in Postgres, and I decide to deploy it, since I have Postgres in my application.
B
The semantics of what those logs mean, and what's normal and abnormal, are completely specific to me, so that sense is going to have to be learned on my data, and there isn't going to be some giant multi-petabyte training data set to do it with. All of these things make it difficult to apply machine learning, in general, to actual monitoring and root cause problems.
B
So let me step back and think about the very simple essence of what I actually want to do with these logs, and whether it's possible. The way I think about it is the junior SRE problem. Let's say it's day one: you're a junior SRE and you walk into a shop. There are a few things in the stack that you're familiar with, and a giant wad of stuff
B
that you're completely unfamiliar with, having never worked with this application or stack before. Over time you start to learn what's normal, and you do that in very simple terms, at least I do. I should point out that this is my approach to the junior SRE problem. Really I'm looking for two things, and my experience starts to crystallize around two very important recognition tasks.
B
One is that I need to be able to recognize when something is bad. I know that sounds trivial and silly, but it isn't: there can be errors and warnings getting spewed that don't matter at all. As I get to know my application better, though, I'll start to realize
B
when something bad is actually happening that I'm supposed to care about, and part of that recognition ends up being how widespread the observed badness is. As an example,
B
let's say I'm looking at one log, one container type, one service, and it's got some regular cadence of errors; but then I see a few fatals, and at the same time I see errors cropping up somewhere else, in another service. That's going to be a cue that something bad is happening. So you get this correlation across log streams; that's important.
B
Something else I think is useful in finding root cause, in getting to the bottom of things, is having a sense of what's rare: "I've never seen that happen before." That, I think, is equally critical to root-causing a new issue. So just these two concepts, what's bad and what's rare: if you can start to get a handle on those and figure out when they're happening around the same time, you have a good chance of surfacing root cause.
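To make the "bad and rare, around the same time" idea concrete, here is a toy sketch (not Zebrium's algorithm) that flags rare event types and error-severity events per stream, then looks for minutes where several streams are flagged together. The event tuples, thresholds and severity labels are illustrative assumptions.

```python
# Toy sketch of the "bad + rare, co-occurring" idea; not a production algorithm.
# Input: events as (minute, stream, event_type, severity) tuples.
from collections import Counter, defaultdict

def find_incident_windows(events, rare_threshold=3, min_streams=2):
    type_counts = Counter(etype for _, _, etype, _ in events)  # global frequency per event type
    flagged = defaultdict(set)                                 # minute -> streams with a flag

    for minute, stream, etype, severity in events:
        is_rare = type_counts[etype] <= rare_threshold         # "I've rarely seen this before"
        is_bad = severity in {"ERROR", "FATAL"}                # "this looks bad"
        if is_rare or is_bad:
            flagged[minute].add(stream)

    # Incident candidates: minutes where several independent streams are anomalous together.
    return sorted(m for m, streams in flagged.items() if len(streams) >= min_streams)

events = [
    (10, "frontend", "conn_refused", "ERROR"),
    (10, "checkout", "rpc_timeout", "ERROR"),
    (11, "frontend", "request_ok", "INFO"),
]
print(find_incident_windows(events))  # -> [10]
```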
B
I'm going to talk a little bit about what we're doing, but there may be other approaches, and I'm going to discuss some of those after I get through a few things about how we approach this problem, because right now there really is no single approach to it. We happen to think we have the best one, but that's us; there are other people tackling this problem, and I want to talk about that too.
B
We do a complete relational structuring of logs, and we do it at ingest. It's not as if there's some batch process going on to figure out what the different event types are, and what the parameters in the events are, out of my piles of log data; it has to be done at ingest. There are a number of reasons why, but probably the most important is this:
B
when I see something new, that's when it's going to matter the most. So if I'm going to do something like structure these logs, that's all great, but it had better do something reasonable with the first or second occurrence of an event type. That's important for us as we go through this. The idea is very straightforward, and in this particular log I'm showing here, I'm just pulling fields out of a mocked-up
B
JSON log. I've never seen a JSON log this simple before, I'm sure they exist, but it's meant to get the concept across. In fact, the more free-text a log is, I've found that both our approaches and other common approaches have an easier time structuring it naively, and the reason, ironically, is that the text of the log message contains a lot of locality that gives you information.
B
But the most important thing coming out of this is that we know what kind of event this is. In other words, there will be one event type, meaning that when I see a log like this, it's this type of event, and that's probably the most critical thing to walk away understanding. If you're dealing with a structured log, one of the first things you have to find out is whether they've given you some context
B
that captures what kind of event this is; there's probably an attribute that embodies that, and that's what you need to hone in on. So, given that, let's apply the requirements from earlier. We don't want to have to assume that we know the prefix formats of the logs, and we don't want to assume we know the grammars.
B
We don't want to assume that we need to know keywords, because again, this could be your own bespoke application, and in that case there aren't going to be any known prefix formats or event grammars; the system has to learn all of that. And if you're able to do that, you can embrace free-text logs.
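As a toy illustration of structuring free-text lines without any known formats, the sketch below masks obviously variable tokens and hashes the remaining skeleton into an event-type id. This is only a naive first cut of the idea, not the multi-stage pipeline described later, and the masking patterns are assumptions.

```python
# Naive inline structuring sketch: map a free-text log line to an event-type id by
# masking variable-looking tokens. Real pipelines do far more; this only shows the idea.
import hashlib
import re

VARIABLE = [
    (re.compile(r"^\d{1,3}(\.\d{1,3}){3}(:\d+)?$"), "<IP>"),   # IPv4, optional port
    (re.compile(r"^0x[0-9a-fA-F]+$"), "<HEX>"),                # hex values
    (re.compile(r"^(/[\w.\-]+)+/?$"), "<PATH>"),               # filesystem paths
    (re.compile(r"^\d+(\.\d+)?$"), "<NUM>"),                   # integers / floats
]

def event_type(line):
    skeleton = []
    for tok in line.split():
        for pattern, placeholder in VARIABLE:
            if pattern.match(tok):
                tok = placeholder
                break
        skeleton.append(tok)
    template = " ".join(skeleton)
    return hashlib.md5(template.encode()).hexdigest()[:8], template  # stable id plus template

print(event_type("connection from 10.0.0.7:443 failed after 31 retries"))
print(event_type("connection from 10.1.2.3:80 failed after 2 retries"))
# Both lines yield the same template, hence the same event-type id.
```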
B
Then I can say: the rate of this kind of event in this stream is this, and the rate of that kind of event in that stream is that, and all of a sudden maybe I have upticks in the rate of those, or a very tight correlation in their times of occurrence that I wouldn't usually have. You can imagine expanding that beyond event types to severities and errors: all of a sudden the rate of errors here went up, and not only that, those events are tightly correlated with events I'm seeing in this other stream over here. Once you've taken that first fundamental step of structuring the events into event types, you can start to do this kind of correlation analysis, and to us this has proven to be a transformational step. If you do well enough at it, you can start to identify what I would call incidents, or at the very least clumps of stuff, and that's going to get you to a root cause indicator.
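A hedged sketch of what that kind of cross-stream correlation can look like is below: compare each event type's per-minute count in each stream against its own trailing baseline, and treat windows where several streams deviate together as incident candidates. The window size and threshold are illustrative only; this shows the shape of the computation, not the actual model.

```python
# Sketch: flag minutes where per-event-type rates jump well above a trailing baseline
# in more than one stream at once. All thresholds here are illustrative.
from collections import defaultdict
from statistics import mean

def rate_anomalies(counts, history=60, factor=5.0):
    """counts: dict[(stream, event_type)] -> list of per-minute counts, oldest first."""
    anomalies = defaultdict(set)                              # minute index -> deviating streams
    for (stream, _etype), series in counts.items():
        for t in range(history, len(series)):
            baseline = mean(series[t - history:t]) or 0.1     # avoid dividing by zero
            if series[t] > factor * baseline:                 # sudden uptick in this stream
                anomalies[t].add(stream)
    # Minutes where multiple streams deviate together are incident candidates.
    return {t: streams for t, streams in anomalies.items() if len(streams) >= 2}
```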
I think we already talked about this, but there's a bunch of stuff we just can't require in order to do that correlation up front. At least from our perspective, and maybe to a fault, you can't have any rules built into the system that know to look for a particular keyword, because if you do any of that, it may work in one instance, but it isn't going to generalize.
B
So that's been our approach, and I think that discipline is important, because otherwise the next thing you know, you have a database of a thousand alert rules, and if someone buys that, they're no better off than they would have been with their own database of alert rules. That's one thing it's important to avoid.
B
I want to talk a little bit about other attempts that have been made to structure logs for root cause, or just for detection and monitoring. One set of approaches is deep learning. There are a number of papers on this, and an academic community that's been really interested in it. There are a couple of things I would say about that.
B
One is that if someone's going to send their log data to a SaaS service, cost is going to be especially important to them. You're not going to be
B
racking some refrigerator-sized appliance to do actual deep learning on every data set you get. And at the same time, as I've touched on already, it's very difficult to take what's normal in one stack and environment and generalize it to another; depending on exactly what the stack is made of and exactly how it's used, normal can mean very different things. So there are a number of conceptual challenges like that which, I would say, are in the way.
B
I do think the time will come when deep learning approaches will be very successful at tackling telemetry from a naive standpoint; I just don't think we're quite there yet. In fact, some of the natural language models are probably closer than the deep learning models that have been more popular in the literature for the last few years.
B
In any case, another thing you'll see is the use of a particular algorithm, usually LCS (for those of you in the audience who care about such things, longest common substring), in different implementations, some online, some batch. Essentially the idea is that this algorithm decides what your catalog of event types is. There are a couple of weaknesses with that.
B
It doesn't really have an innate sense of types, so depending on your implementation, you have to build in something extra to be able to tell that these are different things, because this token here is always an integer and that token there is always a file. That matters. But I think a bigger, more conceptual barrier is this:
B
to get a good structuring out of LCS (and you see this in a number of machine-learning-for-logs packages on the market today), it takes a lot of examples of a given event type to do a good job with it; otherwise the event gets put into an "other" bucket. And because of the Pareto nature of logs,
B
you always end up with some massive swath of your log data, including the most important events, sitting in that "other" bucket, not yet effectively categorized. So I think, in practical reality, it's important to have a continuum of approaches that you bring to bear depending on the cardinality of each event type you're seeing.
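For readers unfamiliar with the LCS family of log parsers, the sketch below shows the basic move: match an incoming line against stored templates by longest-common-subsequence similarity, and wildcard the positions that differ. It also exhibits the weakness described above, since a line with too few look-alikes simply lands in its own bucket. The threshold and merge rule are simplifications, not any particular product's implementation.

```python
# Simplified LCS-style template matcher, to illustrate the approach and its limits.
def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length over token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

class TemplateStore:
    def __init__(self, threshold=0.7):
        self.templates = []            # each template is a token list; "*" marks a parameter
        self.threshold = threshold

    def add(self, line):
        tokens = line.split()
        for idx, tmpl in enumerate(self.templates):
            if lcs_len(tokens, tmpl) / max(len(tokens), 1) >= self.threshold:
                if len(tmpl) == len(tokens):
                    # Merge: keep matching tokens, wildcard the differing positions.
                    self.templates[idx] = [a if a == b else "*" for a, b in zip(tmpl, tokens)]
                return idx
        self.templates.append(tokens)  # unseen shape: a new, possibly one-off, bucket
        return len(self.templates) - 1

store = TemplateStore()
print(store.add("connection to db-1 timed out after 30 ms"))   # -> 0 (new template)
print(store.add("connection to db-2 timed out after 12 ms"))   # -> 0 (merged, params wildcarded)
print(store.add("disk sda1 is 91% full"))                       # -> 1 (rare line, its own bucket)
```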
B
So, summarizing: you have to structure first, and you have to do it inline at ingest time, otherwise you can't respond to an incident. You have to have a multi-stage structuring pipeline, at least in our view, to respect the Pareto distribution of event types in real-world logs. And the good thing about using a correlation model, once you get past the structuring and start doing this incident detection and root cause report generation, is that it benefits from more data sources, more streams of logs and/or metrics.
B
I haven't talked much about metrics, but you'll see a little of that in a minute. The more of these streams I have with detectable anomalies, the better job I can do of cross-correlating and picking out a point in time; I get better resolution the more data I have, and that's an important dimension
B
of an effective solution. And then finally, you'll see an example of this in a bit: a lot of you may have heard of GPT-3. It's a natural language model; there are competing models, and there are free downloads of similar models that you can use pre-trained or train yourself.
B
Here's an example where we had a stack and some incidents were detected. What you're seeing is that the color represents the severity of the event: green is debug, blue is info, and then you've got your warnings and your errors, which are the yellow and the red, across these different services.
B
What we have here is minute-by-minute time, and the size of the markers represents how new the events were, or I should say how rare: it's a representation of the inverse of the typical frequency of these events. And what you're seeing is this thing here.
B
It's interesting when you get an autonomous monitoring stack working on logs, because you can deploy something like Litmus, the chaos engineering tool, and it will deploy some tests and break some stuff. The interesting thing is that the tool itself logs what it's doing; it says, I'm initializing this, I'm selecting a pod to kill, and so on. Well, guess what: if your autonomous monitoring solution is doing a good job, it's going to realize that that is actually the root cause of the outcome.
B
This is pulled from a blog on our website, just some recipes to go and test this and see what happens. You can see what we're doing: we're pulling out the rare stuff here, we're pulling out the bad stuff here, and we're pulling features out of metrics that give you more flavor, to help confirm that you're on the right track to root-causing the problem.
B
Here's an example where we're doing the same thing, except we've grabbed a description of the problem from GPT-3. We pass in our root cause report, we get back a description, and we put it in here. Here you can see that the first thing that happened was the OOM killer being invoked, and then all hell broke loose. And yes, in fact, this was an OOM problem, and that's the description that got put in here. This is the kind of thing I think Gavin's going to show.
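The mechanics of that step are simple to sketch. The snippet below uses the legacy completion-style OpenAI Python client that GPT-3 was originally exposed through; the model name, prompt wording and report text are placeholders, not Zebrium's actual integration.

```python
# Sketch: ask a GPT-3-style completion model to describe a root-cause report in plain English.
# Model name, prompt and report contents are placeholders; treat as illustrative only.
import os
import openai  # legacy (pre-1.0) completion-style client

openai.api_key = os.environ["OPENAI_API_KEY"]

root_cause_report = """\
kernel: Out of memory: Killed process 2714 (java)
kubelet: Container payment-service was OOMKilled
prometheus: node_memory_SwapFree_bytes dropped to 0
"""

response = openai.Completion.create(
    model="text-davinci-003",  # placeholder model name
    prompt="Summarize the likely root cause of this incident in one sentence:\n" + root_cause_report,
    max_tokens=60,
    temperature=0,
)
print(response.choices[0].text.strip())
```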
B
This is another example of that same incident, in a more modern interface, where you can see the swap free bytes metric from Prometheus dropping, and we pull out that anomaly. We feed metric anomalies directly into the same correlation model; they just go in alongside the log anomalies, so you get it all stitched into one correlated report. With that, I'm going to pass it back to Gavin.
C
The first event in this case, and again it's kind of beautiful because it exactly nailed the root cause, is the Kubernetes message about the product catalog service being scaled down to zero replicas, and then we see the dead container over here. I'm going to drill into the incident, and this gives you a little more detail about what happened and what made up the incident from our machine learning's perspective.
C
So that's it. If anybody's interested, I think Larry has the link in his presentation: we've documented how to bring up this demo app in a Minikube environment, so anyone can test it, break it any way you choose, and see what Zebrium picks up. Thank you very much, and I'll pass it back to Larry.
B
Okay, thanks Gavin. Let me bring this back up. So we've had some recent validation in the market. MayaData, who make the Litmus toolset I mentioned, also have OpenEBS, which is essentially storage software. Basically, what they did was they went in.
B
They said, okay, we're going to replicate the outages that our real customers actually had. They picked six or seven of them, replicated them in their environment, and we picked those up with root cause indicators. So that was cool. DZone wrote a great article about us, and the author basically put in the charts, spun up the software, did something to break something, and then saw it show up.
B
That was cool too. Sweetwater, the music equipment retailer, is a customer of ours right now, and they concluded that we've dropped their root cause time for new incidents from three hours to 15 minutes, which has made a huge difference in their ability to deliver high-quality service to their users. Everyone's encouraged to join us on this journey, and I'm always happy to have discussions
B
with anyone about log machine learning in general, not only detection; any ideas people have, I'd love to talk through. We have a Slack community as well that you can participate in. I think that's about it. I want to thank everyone here for their time and their interest in log machine learning on Kubernetes.
A
All right, thanks everyone. We have about 10 minutes left and two questions so far in the Q&A box, so if there are any more, go ahead and pop them in there, and I will hand it back over to you, Larry, to get started.
B
Okay, so by default, when you come in and become a user, you go into our deployment in AWS, which is in the western US, but we have spun up instances in other geos on request, so don't be shy; we don't mind doing it.
B
GPUs are not required. And there really is no fixed training period: basically there's a bunch of parameters being estimated, and those estimates are poor ten minutes after you install the software, but after a day or two they're probably really good. There's a continuum over the first few hours where it gets more and more useful.
B
Sometimes you'll see spurious stuff popping up at the beginning. That's the approach we've taken, because a lot of people just want to spin it up, run some chaos tests, and be done, so we let them do that. But if you do that, you're going to get some noise, and some things may not get picked up exactly right.
B
Okay: do you have plans for running this on-prem as well? Yes, actually, that's a very good question. As it turns out, a lot of people really want this run on-prem when it has to do with logs; they're a little bit afraid of PII and that sort of thing. It depends on the user.
B
We have a project that we kicked off this week to build exactly that, so you should get in touch with us if you're interested in being one of the first pilot users of the on-prem version. We're packaging it as a kind of virtual appliance.
B
"What API do you need access to in order to collect the logs for our k8s cluster?" Hey, Roddy, do you understand this question? I'm not quite sure what's being asked.
D
I think the simple way to answer that is that you don't need to worry about it. Our collector deploys as a DaemonSet; it picks up container logs automatically from the container log output, and it does use Kubernetes APIs to pick up events. If you're interested, you can examine our documentation on GitHub, or you can ping us for more details. But the short answer is you shouldn't have to worry about it; you don't actually need to do any work. It's just a Helm chart, and the DaemonSet automatically collects everything.
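For the curious, the "uses Kubernetes APIs to pick up events" part can be approximated in a few lines with the official Python client. This is a generic sketch of streaming cluster events, not the actual collector code.

```python
# Generic sketch: stream Kubernetes events from inside a cluster, the kind of signal a
# DaemonSet-based collector would forward alongside container logs. Not vendor code.
from kubernetes import client, config, watch

config.load_incluster_config()   # use config.load_kube_config() when running outside the cluster
v1 = client.CoreV1Api()

w = watch.Watch()
for item in w.stream(v1.list_event_for_all_namespaces, timeout_seconds=60):
    ev = item["object"]
    print(ev.last_timestamp, ev.involved_object.kind, ev.involved_object.name,
          ev.reason, ev.message)
```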
B
Are there any competitors for log anomaly detection, and if so, how do you compare and differentiate from them? Yeah, there are. Ajay, do you want to address this one?
D
Sure. Log anomaly detection has been an area of interest for about a decade. Larry mentioned some of the academic research in the space, LCS and deep learning and so on, but there are also projects and even commercial products that have attempted anomaly detection to various degrees. They haven't gone as far as we go in terms of correlating anomalies and detecting incidents.
D
We have one such comparison, with the Elastic machine learning pack, on our website, and we even have a short video comparing them side by side.
D
That might be the best place to start; it's under the blogs page. Just look for Elastic and you'll see a side-by-side comparison of our machine learning with Elastic's.
B
Ah, now I understand Ann's question. So, right: I mentioned we have this multi-stage pipeline for structuring, and the first stage is actually heuristic.
B
I'll pick a couple of tokenizations and ask which tokenization gets me the most compatible-looking things. And by reachability clustering I mean: with a given tokenization I might have, say, 18 tokens; if I can build up a set by reaching from one example to the next with only two parametric tokens differing, then I'll use that set as a cluster. So it really is a kind of clustering.
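A toy version of that reachability idea, under the stated assumptions (group lines by token count, link two examples when they differ in at most two token positions, take connected components as clusters), might look like the sketch below. It is meant only to illustrate the clustering step, not the production heuristic.

```python
# Toy reachability clustering over tokenized log lines: lines with the same token count are
# linked when they differ in at most `max_varying` positions; connected components are clusters.
from collections import defaultdict

def reachability_clusters(lines, max_varying=2):
    groups = defaultdict(list)                        # token count -> list of token lists
    for line in lines:
        groups[len(line.split())].append(line.split())

    clusters = []
    for examples in groups.values():
        parent = list(range(len(examples)))           # simple union-find over the examples

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for i in range(len(examples)):
            for j in range(i + 1, len(examples)):
                differing = sum(a != b for a, b in zip(examples[i], examples[j]))
                if differing <= max_varying:          # "reachable" via few parametric tokens
                    parent[find(i)] = find(j)

        members = defaultdict(list)
        for i, tokens in enumerate(examples):
            members[find(i)].append(" ".join(tokens))
        clusters.extend(members.values())
    return clusters

print(reachability_clusters([
    "job 17 finished in 210 ms",
    "job 93 finished in 4 ms",
    "cache flush started on node a1",
]))
# -> [['job 17 finished in 210 ms', 'job 93 finished in 4 ms'], ['cache flush started on node a1']]
```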
B
Oh wait, there's one that just came in, and we've got literally 60 seconds, probably. So, right: this is an approach that's used in tracing. You might see logs show up with a trace ID, that sort of thing, and the question is, what's the difference? The difference is that I don't need support for it throughout my stack and I don't need to go trace anything. Remember, the whole idea was to create something that can work without alert rules.
B
If
I
then
require
tracing
to
go
through
every
relevant
code
path,
it
feels
like
it
felt
like
to
us
like
a
self-defeating
kind
of
prospect.
It's
just
trading
one
set
of
work
for
another
or
one
set
of
limitations
for
another,
and
that's
why
we
tried
to
avoid
requiring
that.
I
hope
that
made.
A
Okay, well, thanks everyone for coming and for participating. Thank you for the robust Q&A and all the back and forth, and thanks to the panelists for helping answer questions through the chat. Just a reminder: this will be up on the website later today, and we look forward to seeing everyone at a future CNCF webinar.