A
I've deployed Online Boutique, a microservices demo app, on a Kubernetes cluster on my laptop using minikube. Now, the catch is that I've deliberately broken it. Let's see what happens when I try to buy something. I'm going to buy the Home Barista Kit. Uh-oh, the dreaded and very generic 500 error, and what I see here really doesn't help explain what happened. Now, in real-life deployments, when something breaks there are two key things: detecting that it broke and finding the root cause. For detection, most companies use some kind of monitoring or APM tool.
A
In this case I was using a Kiali service graph, but it could have been any other monitoring dashboard. What we see here is a lot of red, confirming things are broken, but once again it doesn't shed much light on what happened. In many environments, monitoring is integrated with an incident management tool, with rules that automatically trigger incidents when things go wrong. Here I'm showing a PagerDuty incident that has been created and shows up in a Slack channel. But now it's time for the tough part: we've detected that there is a problem.
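As a rough illustration of the kind of rule-driven alerting described above (not Zebrium's or PagerDuty's actual integration code), a monitoring check might trigger a PagerDuty incident via the Events API v2 and mirror it into Slack. The routing key, webhook URL, and error-rate threshold below are placeholders.

```python
import requests

PAGERDUTY_ROUTING_KEY = "YOUR_ROUTING_KEY"  # placeholder integration key
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def alert_on_error_rate(service: str, error_rate: float, threshold: float = 0.05) -> None:
    """Fire a PagerDuty incident and a Slack message when the error rate crosses a threshold."""
    if error_rate < threshold:
        return
    summary = f"{service}: error rate {error_rate:.1%} exceeds {threshold:.0%}"
    # PagerDuty Events API v2: creates (or deduplicates) an incident for this alert.
    requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": PAGERDUTY_ROUTING_KEY,
            "event_action": "trigger",
            "payload": {"summary": summary, "source": service, "severity": "critical"},
        },
        timeout=10,
    )
    # Slack incoming webhook: mirrors the alert into a channel.
    requests.post(SLACK_WEBHOOK_URL, json={"text": summary}, timeout=10)

# Example: a periodic check might call alert_on_error_rate("checkoutservice", 0.37)
```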
A
Hopefully, at the end of this process, which could take minutes or hours, you can find out what happened and uncover the root cause. Now imagine if this whole process could be automated. The way it would work is that you simply send your logs and, without any training, setup, or rules, you'd be able to see the root cause without any hunting.
A
So with that, I'll show you what the machine learning found when I broke the microservices demo app I showed you earlier. From the time I broke the app, and while the problem was happening, about a hundred thousand log events were generated. Without any rules, our ML distilled this down to just the seven events that you see on the screen.
A
The very first line that prints out when it starts is a warning message saying the OOM test is starting. You can see that as the first line in this root cause report. The cool thing is that it was picked up simply because it was a very rare, in fact probably never-seen-before, log event. On its own it would have been completely harmless, except that it happened to correlate with a whole lot of other things, which you can see as part of the summary on the screen just a couple of lines later.
A
So this is actually enough, in this case, to tell us the root cause, but we also try to show you the symptoms, in other words what happened when the problem occurred. To see that, I'm going to click the related events button, which pulls in the surrounding errors and anomalies that will hopefully explain the problem, or the symptoms that occurred. And in fact you can see here it's doing a really good job, because it's immediately pulling in a bunch of Kubernetes events, and you can see how all the other pods were impacted.
A
You can see, as it moves through, that there are failed probes on the different services: the ad service, the cart service, the checkout service, the currency service and so on. And if you go a little bit further down, you'll even see it pick up where Redis is impacted and restarts. This is all brought in automatically; remember, I haven't hunted or searched for anything here, I just clicked the related events button. So let me go back to the core events, because I'm also collecting Prometheus metrics.
A
In this case, you can see at the bottom that the machine learning tries to correlate any anomalous metrics with what it's picked up in the logs, and so it's pulled in a couple of stats that you see on the left, node_memory_Buffers_bytes and node_memory_Cached_bytes, which are highly relevant in this case. You can see them going from very high values and then dropping right down, presumably as the OOM killer killed off my rogue process.
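As a hedged sketch of how one might pull such a metric over the incident window using the standard Prometheus HTTP API (the Prometheus URL and time range below are placeholders, not part of the demo):

```python
import requests

PROM_URL = "http://localhost:9090"  # placeholder Prometheus endpoint

def fetch_metric(query: str, start: float, end: float, step: str = "30s"):
    """Pull a metric series over a time window via Prometheus's query_range API."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": query, "start": start, "end": end, "step": step},
        timeout=10,
    )
    resp.raise_for_status()
    # Each result holds [timestamp, value] pairs.
    return resp.json()["data"]["result"]

# Example: inspect cached-memory bytes around the incident window (times are placeholders).
# series = fetch_metric("node_memory_Cached_bytes", start=1626100000, end=1626103600)
```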
A
So these are really useful to corroborate a root cause report that you might be writing. Now, the final thing I'll show you is the coolest of them all. We actually take this log line summary, and remember, this is distilled down from the 100,000 events that occurred at the time, and we pass it through the GPT-3 language model with the right prompt.
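As a rough, hypothetical sketch of that kind of summarization call, not Zebrium's actual prompt or pipeline, using OpenAI's legacy completions API (the model name, prompt wording, and API key are assumptions):

```python
import openai

openai.api_key = "YOUR_OPENAI_API_KEY"  # placeholder

def summarize_root_cause(log_summary: str) -> str:
    """Ask a GPT-3 completion model to turn a short list of log lines into plain English."""
    prompt = (
        "The following log lines describe an incident. "
        "Explain the likely root cause in one or two plain-English sentences.\n\n"
        f"{log_summary}\n\nExplanation:"
    )
    resp = openai.Completion.create(
        model="text-davinci-003",  # assumed model; the webinar only says "GPT-3"
        prompt=prompt,
        max_tokens=80,
        temperature=0.2,
    )
    return resp["choices"][0]["text"].strip()
```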
A
This is an experimental feature, and the reason is that we're still tweaking the way we use GPT-3. In general, GPT-3 is only as good as the internet as a whole and all the text it's been trained on. In this case it completely nailed it. There are times when it produces good English sentences that may not be completely relevant to the problem you're seeing, but in general we're getting a lot of use from these. The last thing I'll point out is the sentence that you see here.
A
That is truly a novel sentence; you can't actually find it anywhere on the internet. It was generated based on what we gave the GPT-3 model as a prompt. Now, with that, I'm going to hand it over to Aran Khanna. He's the co-founder and CEO of Reserved.ai, and he's been a fantastic customer of ours for about a year now. In fact, he tried one of our early beta versions, which was the first thing he saw, and he's continued to use the product since. So, thank you very much, Aran, and I'll hand it over to you.
B
Thank you so much for the awesome demo. To give a little bit of background on Reserved.ai: I'm the co-founder and CEO, and what we enable customers like Zebrium, and a lot of other large customers running on the cloud across Azure and AWS, to do is proactively forecast and manage cloud resources in a completely automated way.
B
We enable folks running across these very complex multi-cloud deployments to do things like commitment management, cash flow forecasting, and tax optimization, and, uniquely, we actually buy back over-committed resources from customers, essentially making a market. On a granular level, that means integrating with tons and tons of different APIs: over 300 APIs in AWS land, 200 APIs in Azure, and a ton of different APIs coming out of the Kubernetes clusters that we're monitoring for cost and attribution.
B
What we really found was that, given this wealth of data and the wealth of systems running in Kubernetes built on top of it, there were things constantly changing within the underlying primitives we were pulling from on the Kubernetes side, the Azure side, and the AWS side. While we had this stack running, each component was generating tons and tons of logs, and when there was an error, not even necessarily an error on our side but often an error on the vendor side or even the customer side, as things like IAM roles change, we were not able to very easily get the actual root cause out the other end, whether to forward it to our engineering team, customer success team, or sales team, what have you.
What that really meant was that our critical engineering resources were getting waylaid. As a startup, we like to move fast and build things on behalf of our customers, but at least once a week our engineers were going through all of these different kinds of debugging procedures to find root causes. Even worse was the fact that a lot of these root causes went unnoticed, partly because in many cases, like the out-of-memory case, we took the tack of just throwing more resources at the problem (kind of ironic for a cost-optimization company), until the volumes exploded to the point where we really had to look at them. So that was the state of the world before, and when I heard about Zebrium, honestly, I was a little bit skeptical, and I think my engineering team was too.
B
We use machine learning as well, but we use it in a much more staid way: building predictive models, doing expected-value calculations, and doing risk modeling and market making on the back end. Those are all established things that folks on Wall Street, for example, have been doing for years. This was something new, so I was, let's say, interested but skeptical about whether it could replace the specific DevOps knowledge that was needed before to really go in and figure out what was going on, with this wealth of data streaming in and the error showing up sporadically.
So this was essentially something that we decided to kick the tires on. We started the free trial with the Zebrium folks and installed it on our Kubernetes cluster. It was pretty quick, actually; I was able to do it as, you know, the semi-technical CEO, which was a testament to how easy it was.
B
I didn't even have to pull my CTO or my DevOps folks into the conversation, and literally in the first week AWS had an API change. If you build on the long tail of AWS APIs you'll know what I'm talking about: they'll just change all the time and not tell you about it if you're not on S3 or EC2, for example, and because we're built on that long tail, we have a number of systems there.
B
This was actually a really important thing to catch, because had it not been caught, if a customer went to a certain page it would have caused, you know, a complete error and essentially a service disruption. So this was what piqued my interest and made me say, hey, this starts to make sense; I think it's kind of working here.
B
It was seeing errors that we wouldn't have caught if we weren't looking at the logs, and as we dug into the system, as Larry was showing before, we saw that the correlations and the root causes were really pointing to the exact system, to the exact pod, in this massive array of different services, that was causing the underlying errors. So it actually led to a faster resolution on our side.
B
At that point we were starting to buy in a bit. That was sort of last year, essentially, as we were scaling up, and we've been running the system for over a year now. As we continued to run with it, we saw that the things being caught were consistent; it wasn't just a one-and-done. As we were building and seeing issues on the customer side and the vendor side, we were consistently getting these reports in our Slack channel from Zebrium. This is an example right here, where a customer had issues with their account because they were messing with an IAM role, and we were basically unaware of it entirely; a complaint from the customer would probably have been the forcing function. But because of Zebrium we got the Slack alert, saw that the customer was essentially messing with the role and had this big issue, and were able to escalate to our customer success team proactively, which is fantastic.
B
So we were really delighted by the fact that Zebrium was not only helping us with the steady-state operational pieces of our cloud infrastructure management, but really helping us surprise and delight our customers, because we can get ahead of a lot of these issues in this complex environment without our team having to build very sophisticated internal monitoring tools.
B
This was very much plug and play. And this is a more recent thing, as Larry was showing: usually when Zebrium sends an alert, I'm just shooting it along to my engineering team and saying, hey, go look at this, go look at this. But now I can actually start with these NLP summaries that are coming out and figure out for myself, hey, what's going on? Do I need to just shoot it to my CTO and have him, you know, route it to the right person?
B
No; often, because of these natural language summaries, I can actually understand, even as the CEO of the company, what the errors are, who is responsible, and who owns that piece of infrastructure, and have a much more targeted loop with them. And even now our dev team is starting to look at these, route them much more quickly to the right place, and easily understand the underlying root causes we're seeing in the stream of errors that we get from Zebrium. This is something that I thought was absolutely science fiction before I saw it live, because, as you saw in the demo, the logs are kind of nonsense to anyone, whether a layman or even a sophisticated engineer; they're not really well structured. So the fact that these natural language summaries could be generated with such high fidelity, and very often right (I've not seen a lot of cases where they're wrong; they're very often spot on), was a big draw for us to lean further into this system.
B
That view of the Zebrium product was really important for us to build over the year, to see how it could help us as it developed, not only to move faster on an engineering basis but really on a customer success basis as well. And that is something I didn't even expect when we first integrated with the product, but I was obviously delighted to see it as we moved further and further down the path of integrating Zebrium into our workflows.
A
So when an incident is created in those tools, Zebrium automatically augments it with root cause. The way it works is that you'll probably have some kind of monitoring or APM tool in place, and when it detects something it'll open an incident in, let's say, PagerDuty. Now, with our integration, as soon as an incident is opened or created in PagerDuty, PagerDuty sends us a signal, which is number two in that diagram, and we respond with a root cause report very similar to what you saw in the UI demo.
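Purely as an illustrative sketch of that kind of round trip, not Zebrium's actual integration: a small webhook receiver could accept the incident-created signal from PagerDuty and attach a root cause summary back to the incident as a note. The endpoint path, payload fields, lookup function, and API token below are all assumptions.

```python
from flask import Flask, request
import requests

app = Flask(__name__)
PAGERDUTY_API_TOKEN = "YOUR_API_TOKEN"  # placeholder REST API token

def find_root_cause_report(incident_time: str) -> str:
    """Hypothetical lookup: return a root cause summary for the incident window."""
    return "Root cause: OOM killer terminated a rogue process on the node."

@app.route("/pagerduty-webhook", methods=["POST"])
def incident_created():
    # Assumed payload shape: a webhook event carrying the incident id and creation time.
    event = request.get_json()["event"]
    incident_id = event["data"]["id"]
    summary = find_root_cause_report(event["occurred_at"])
    # Attach the summary back to the incident as a note via PagerDuty's REST API.
    requests.post(
        f"https://api.pagerduty.com/incidents/{incident_id}/notes",
        headers={
            "Authorization": f"Token token={PAGERDUTY_API_TOKEN}",
            "From": "ops@example.com",  # PagerDuty requires a requester email header
        },
        json={"note": {"content": summary}},
        timeout=10,
    )
    return "", 204
```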
A
You can also use Zebrium without an incident management tool. In this case, when something breaks, you just look at the Zebrium root cause dashboard. We're always proactively scanning for patterns that make up a root cause, so all you need to do is click on the relevant one and you'll see a root cause report that helps you troubleshoot the problem. If you don't see a relevant one, all you need to do is click the blue "scan for root cause" button and enter a time, and the machine learning does the exact same thing on demand.
A
To do this, it first needs to be able to accurately categorize all log events, so the first layer of our machine learning is structuring log events. This is done with unsupervised machine learning and doesn't require any manual training.
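As a loose, simplified sketch of what unsupervised log structuring can look like (Zebrium's actual approach isn't described in this talk; the regexes and grouping below are assumptions for illustration), a structurer might strip variable fields out of each line so that events of the same type collapse onto one template:

```python
import re
from collections import Counter

# Assumed variable fields to mask; a real structurer would learn these, not hard-code them.
PATTERNS = [
    (re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b"), "<uuid>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b0x[0-9a-f]+\b"), "<hex>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]

def event_type(line: str) -> str:
    """Collapse a raw log line onto a template by masking its variable tokens."""
    for pattern, token in PATTERNS:
        line = pattern.sub(token, line)
    return line.strip()

def structure(lines):
    """Count how often each event type occurs; rare types feed anomaly scoring later."""
    return Counter(event_type(line) for line in lines)

# Example:
# structure(["connection to 10.1.2.3 failed after 3 retries",
#            "connection to 10.9.8.7 failed after 5 retries"])
# -> Counter({"connection to <ip> failed after <num> retries": 2})
```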
A
Once the events are structured, the next layer of machine learning is anomaly detection. There are lots of things that go into anomaly scoring, but the two big ones are events that are rare and events that are bad, like errors or high-severity alerts, criticals and so on. As each new event comes in, we essentially give it an anomaly score; as an example, the rarer the event and the higher its severity, the more anomalous it is.
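A minimal sketch of that idea, assuming a score that simply combines rarity with a severity weight (the weights and formula are illustrative, not Zebrium's actual scoring):

```python
import math
from collections import Counter

# Assumed severity weights; higher means "worse".
SEVERITY_WEIGHT = {"debug": 0.0, "info": 0.0, "warning": 1.0, "error": 2.0, "critical": 3.0}

def anomaly_score(event_type: str, severity: str, type_counts: Counter, total: int) -> float:
    """Score an event: rare event types and high severities both push the score up."""
    frequency = type_counts.get(event_type, 0) / max(total, 1)
    # Rarity term: -log(frequency); never-seen-before events get the largest rarity value.
    rarity = -math.log(frequency) if frequency > 0 else math.log(max(total, 1)) + 1.0
    return rarity + SEVERITY_WEIGHT.get(severity, 0.0)

# Example: a warning with a never-seen-before template (like the "OOM test starting" line)
# scores higher than a frequently repeated info line.
```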
A
Once we've done that, we can look at the metric space for any metrics whose anomalies correlate with the times of the log lines we've just pulled together. This is really cool, because you don't have to curate or tell us which metrics to look at: you can point all your metrics at us and we'll pull in the anomalous ones that are correlated with what we find in the logs.
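A rough sketch of that correlation step under simple assumptions (z-score spikes inside the log-anomaly window; Zebrium's real method is not described in detail here):

```python
from statistics import mean, stdev

def anomalous_in_window(series, window_start: float, window_end: float,
                        z_threshold: float = 3.0) -> bool:
    """Return True if a metric series has an outlier sample inside the log-anomaly window.

    `series` is a list of (timestamp, value) pairs, e.g. from the Prometheus sketch above.
    """
    values = [v for _, v in series]
    if len(values) < 3 or stdev(values) == 0:
        return False
    mu, sigma = mean(values), stdev(values)
    return any(
        abs(v - mu) / sigma >= z_threshold and window_start <= ts <= window_end
        for ts, v in series
    )

# Metrics that pass this check (e.g. node_memory_Cached_bytes dropping sharply as the
# OOM killer fires) would be pulled into the root cause report as corroborating evidence.
```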
A
As you saw in the demo earlier, it's really effective at bringing in corroborating metrics, and all of this is wrapped up into an automatic root cause report that you can see either inside your incident management tool or inside the Zebrium UI. Our ML is able to detect a very broad range of root causes for a very broad range of problems.
A
The other important thing here is that this is not meant to be an exhaustive list of what we can detect; it's just a set of examples across some common categories that we've seen happen in real-life situations across our customer base. Zebrium helps automatically uncover root cause without you having to go hunting through logs, and with that we come to the conclusion of this webinar.