From YouTube: Teuthology Training: Analyzing Test Results
Description
* Notes: https://pad.ceph.com/p/analyzing_teuthology_failures/timeslider#2039
* Ceph Developer Guide: https://docs.ceph.com/en/latest/dev/developer_guide/testing_integration_tests/tests-integration-testing-teuthology-intro/
* Ceph Teuthology Documentation: https://docs.ceph.com/projects/teuthology/en/latest/
* Ceph Teuthology project wiki page: https://tracker.ceph.com/projects/ceph/wiki/Teuthology
A: Hello, hello everyone, and thank you for making it to the session wherever you are. I know getting a common time slot that works for everybody is difficult, but I think this is the best we could do.
A: This is the second session that we are doing in terms of teuthology training, and this one will be focused on analyzing teuthology failures: particularly, what do we do after we schedule a run, or after we start running some tests. I know there is already a learning curve involved in getting set up with teuthology and having to understand what a suite means or a workunit means, or getting Sepia access and all that stuff.
A: So, as I shared in the Etherpad and the chat, the first few points that I have are, I feel, basic guidelines, basic things that everybody needs to remember when they start reviewing teuthology runs.
A: I feel it's a responsibility that every developer, or anybody trying to review teuthology runs, has, because what one person does impacts what the state of master, or whichever branch they are running tests on, is going to be the next day. So you need to be a little more careful, and I think if you remember these two things you're going to be good.
A: Cool, so let me just quickly go over this, because we'll probably be going over all of them in more detail later on. The first point is: every failure needs to be reviewed. We cannot take anything for granted, as in "we think something is something, and so we don't look at it." So, every failure: when I say failure, when you run a teuthology run there are n jobs, and m out of n have failed.
A: Why they have failed is something we need to analyze, and to do that we have ways that we will be going over later. The second thing I want to mention is: when you do a test run, a lot of times you'll see there are failed jobs and there are dead jobs. There is a minor difference between them, but both of them fall under the bucket of failed runs: something has failed, something is not right.
A: The difference, if you have to go into the details of failed versus dead: a dead job is something that does not terminate properly within the time frame, so it just keeps running until teuthology actually hits its timeout, which is currently set to 12 hours. So, if you see a job that finished close to 12 hours, it means it was a dead job. We can quickly go over an example to show you what a dead job should look like.
A: When I was talking about dead jobs, I only talked about this. So when I say it's a dead job, you see the run time here is close to 12 hours. This is what implies that this job did not terminate in time, and so it kept running until the teuthology handlers killed it. So if you see a job like this, you do need to look at what was going on and why it went dead.
A: So yeah, that's why I have it in bold letters, because a lot of times it's not clear to folks whether dead jobs need to be reviewed at all, because there's some impression that they're only caused by infrastructure issues or some other randomness, but that's not true.
A: Moving on, then, there's help along the way. When I say there's help along the way: you're not just thrown in to look at all kinds of logs and all kinds of daemon logs and figure out what's going on, or grep through thousands of things. There are tools that are already present that can make this process easier, and there are three things that I've listed here.
A
First,
one
is
sentry,
I'm
not
sure
how
many
of
you
are
aware
of
century,
but
centuries
is
a
tool
that
helps
us
gather
similar
kind
of
failures
and
show
you
a
history
of
when,
when
an
event,
when
I
say
a
failure,
a
failure
is
marked
as
an
event
in
century
and
when
an
event
started
occurring
in
pathology
and
it's
it
has.
You
know
things
like
you
know
which
branch
started
failing
or
which
distro
started
failing
when
it
started
failing
all
that
kind
of
stuff.
A
If
you
want
to
quickly
look
at
what
century
looks
like
I'm
just
going
to
pick
one
sample,
because
every
I
think,
except
for
pathology,
sorry
dead
jobs,
you
will
have
a
century
event
like
this
when
you
click.
What
did
I
do?
I
just
clicked
on
this
and
I
went
to
that
particular
job.
A
So,
as
you
can
see,
I
am
looking
at
one
particular
job
now,
so
these
are
the
stats
of
that
particular
job.
This
told
me
what
job
was
run,
what
what
test
was
actually
run,
and
here
you
see
there
is
the
details
of
the
log
and
there
is
century.
So
when
we
click
on
sentry,
what
do
we
see.
A
You
see
another
dashboard
opened
up
now
going
back
here.
We
were
interested
in
something
like
a
command
failed,
a
error
where
it
was
trying
to
look
for
something
it
was
trying
to
look
for
some
versions
and
making
sure
that
the
number
of
versions
was
one
in
that
total
entry.
That's
not
important,
but
what
is
important
here
is.
You
can
see
that
it
is
tracked
here
as
the
same
branch
name.
You'll
see
that
same
error
is
shown
here
as
the
exception
that
was
caught
by
that
was
used
by
century
to
group.
A
This
particular
failure
and
you
have
all
kinds
of
details
about
what
the
config
was
and
where
the
logs
are
and
all
kinds
of
things.
Now
I
think
what
is
the
most
important
thing
here.
Is
you
see
here
this
right
part
here
last
30
days?
It
tells
you
there
are
two
occurrences
first
seen
last
scene.
This
is
what
helps
you
figure
out
if
a
particular
failure
that
you're
seeing
is
new
or
not
especially,
for
unique
failures.
A
This
is
very
helpful,
which
I
will
talk
about
what
a
unique
failure
is
in
a
bit,
but
this
events,
part
here,
is
something
that
you
can
go
look
at
and
it
will
list
all
those
similar
failures
that
have
been
seen
in
pathology
over
time.
So,
for
example,
you
are
reviewing
a
run
and
you
see
a
failure.
You
click
on
the
century
event
and
you
just
see
one
event
like
example,
this
this
should
ring
a
bell
in
your
mind.
Why?
Because
this
means
that
this
failure
has
never
been
seen.
A
There
could
be
two
reasons.
One
is
that
this
never
appeared
in
somebody
else's
pathology
run
because
of
whatever
reason
there
could.
The
test
was
never
exercised.
It
was
a
race
condition,
blah
blah
blah.
You
can
think
of
whatever
reasons
or
the
the
most
important
thing
that
the
failure
is
related
to
your
branch
stuff
that
you're
testing
is
related.
So
this
is
a
clear
indicator
litmus
test.
I
would
say
where
you
can.
A
You
know
that
you
need
to
look
more
and
not
just
give
up
by
looking
at
the
century
event
a
lot
of
times
century
events
will
list
a
bunch
of
things.
Then
you
get
more
confidence
that
okay,
this
may
or
may
not
have
been
the
same
exact
failure,
but
you
can
go
and
look
at
the
other
failures
to
verify
whether
somebody
else
also
saw
exactly
the
same
thing
or
not.
A
So
that's
the
whole
idea
of
sentry
being
a
help
in
this
process.
Now
there
are
a
lot
of
other
things
that
you
can
do
with
century.
I
think
that
will
take
a
separate
session
on
its
own,
but
I
just
want
to
touch
upon
this
in
terms
of
what
I
was
just
talking
about.
A: Yeah, there's nothing special. If this is the first time you click on one of these Sentry links, you'll be taken to a page that lets you log in; there's a button to log in via GitHub.
A: All right, then we're going to be talking about scrape.log. This is the PR that introduced the concept of scrape.log. I'm not going to go into the details; you can look at it if you want. The idea is that there is a script that runs at the end of every run that you do, and it tries to group similar failures, to make analyzing those failures easier.
A: So here, what I did was... okay, I'm going to be covering this later, but let me just go over it. I have logged into the teuthology host, and I was initially in /ceph/teuthology-archive, where all the teuthology logs live. Next, I just used the run name of that particular teuthology run and cd'd into that directory. Then, as you see here, I can see all the 361 total jobs, and each of them has details about that particular job.
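For reference, the workflow just described looks roughly like the following. This is a minimal sketch: the host name is the usual Sepia teuthology node, and the run name is a made-up placeholder, so substitute your own.

```
# SSH to the teuthology host in the Sepia lab (assumes Sepia access is set up).
ssh teuthology.front.sepia.ceph.com

# All run archives live under the teuthology archive directory.
cd /ceph/teuthology-archive

# cd into a specific run by its run name (example placeholder below).
cd yuriw-2021-08-22_15:20:08-rados-wip-yuri-testing-distro-basic-smithi

# Each numbered directory is one job from that run; scrape.log sits alongside them.
ls
```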
A: Now, looking at scrape.log: you see here there is a file called scrape.log. It basically does a post-processing of all the failures and tries to give you a summary of how many distinct failures there were and how many dead jobs there were. As we saw, there was one dead job, so it says there's one dead job. The grouping mechanism works in a way that you'll see.
A
So
I'm
sure
it
must
have
been
covered
in
the
last
session
that
there
are
separate
facets
that
are
used
when
a
pathology
run
and
schedule,
so
this
script
also
tries
to
find
an
intersection
of
which
are
the
common
facets
that
are
present
in
a
common
failure.
So
that
is
what
is
the
idea
behind
this
so
that
you
can
figure
out
just
by
looking
at
those
if
there
is
something
that
is
directly
causing
this
particular
failure?
A
For
example,
you
know
if
there
was
some,
let's
just
use
a
simple
example,
so
this
d
balancer
on
this
means
that
the
you
know
the
balancer
module
was
on
for
this
particular
test
right
and
if
this
particular
failure
that
we
were
seeing
was
directly
related
to
the
balancer,
then
it
clearly
tells
us
that
okay,
the
balancer,
was
on.
Therefore,
this
failure
was
there.
A
If
there
are
other
examples,
we
can
see
that
you
know
there's
an
object,
store,
blue
store,
something,
and
if
we
let
us
say,
had
a
failure
that
was
not
related
to
bluester
directly.
Then
we
could,
you
know
narrow
down
or
eliminate
things
that
are
not
directly
related
to
a
particular
failure.
Just
by
looking
at
these
facets,
as
if
you
keep
going
down
in
this,
it
tries
to,
as
you
can
see,
there's
one
failure
which
so
this
means
that
it
just
found
one
job
with
this
unique
failure.
A
Here
you
can
see
there
are
two
jobs
that
have
tried
to
group
together
and
it
has
listed
down
all
the
intersection
of
those
facets,
as,
as
you
become
more
familiar
with
pathology
facets
and
what
tests
are
being
run.
This
kind
of
helps
in
analyzing
it,
and
particularly
for
me,
things
like
this
stand
out
a
lot
like
if
there
are
four
jobs
that
are
failing
in
this
same
step:
tool,
dot,
ceph,
tool,
test,
dot,
sh,
clearly,
there's
something
wrong
going
on
right.
A
If
there
is
one
random
failure
happening
somewhere,
this,
in
my
mind,
may
be
something
you
know
sporadic
or
something
which
is
not
causing
whole
bunch
of
failures.
This
is
this
clearly
tells
me
that
something
is
causing
a
whole
bunch
of
failures,
so
things
like
this
are
are
like
small
hints
that
you
can
take
from
the
scrape
dot
log
before
you
delve
further
into
analysis
of
of
logs.
A: So let's go back to the pad, and finally, we've...

A: It also makes it more obvious that there is something like scrape.log. Right now I have to do a session to explain to people that there is a scrape.log, but anyway.
A: I see it as... I'm not saying that one can replace the other, but this can just be one step closer. If somebody, let us say, does not have Sepia access for whatever reason, they can quickly look at scrape.log and say: okay, is there something obvious that I need to be worried about? Somebody looking at something on their phone, maybe, even there.
A
All
right,
okay,
any
more
questions
or
any
more
discussion
around.
A
Scrape.Log,
okay,
let's
move
on,
and
the
final
thing
that
I
have
added
here
is
a
red
mind
tracker,
so
this
is
going
to
indirectly
help
you
analyze
failures.
When
I
say
indirectly,
it's
kind
of
like
you
know,
when
scrape
wasn't
there
or
like
this
is
a
poor
man's
script.
A
If
you
see
a
particular
failure
and
you
want
to
make
sure
that
it
is
an
existing
failure
and
if
scrape
is
not
useful,
you
can
go
and
search
for
some
keywords
that
you're
seeing
in
a
particular
failure
and
see
if
there
is
an
existing
ticket
open
ticket
and
an
open
relevant
ticket,
I
mean,
like
you
know,
same
failure
could
have
been
caused
five
years
ago
and
happening
now
again,
that's
not
relevant
if
something
has
happened
a
week
ago
or
a
month
ago
and
you're
seeing
it
that
makes
it
again
clear
that
the
failure
you're
seeing
is
not
related
to
your
test
run
and
at
that
point,
what
you
do
is
you
update
the
redmi
tracker
with
the
test
run
that
you
store
that
particular
failure
in
what
this
helps
with?
A
Is
the
person
who's
going
to
be
fixing
the
bug
or
analyzing?
The
bug
will
have
more
data
points
to
go.
Look
at
that's
kind
of
the
whole
idea,
and
you
know
I
think,
everybody's
aware
of
red
mine
tracker.
There
are
separate
components
and
components
component
wise.
You
can
do
simpler,
search
as
well
so
so
I
think
those
are
the
three
things
that
I
think
are
tools
that
you
can
use.
After
that
I
want
to
just
go
over
two
points
before
we
go
into
more
examples.
A: Every failure needs to be tracked. When I say every failure needs to be tracked, it's for the same purpose that I just described: if you see a failure, you analyzed it, and you determined that it was not related to your test run, you don't want the next person to be doing the same thing. So you want a tracker where you write up the analysis and you say: okay, these XYZ events happened and therefore this failure happened. Then the next person who sees this in their run can just go look at it and say: oh, I'm seeing the same thing, and my branch has nothing to do with this failure. That saves a lot of time. So it's more like a symbiotic process; we have to try to help each other by creating these tracker tickets. Also, it gives you a real view of what failures are there in a particular branch, because if you don't track things on time and you try to track them after a month, it becomes difficult to even figure things out for some failures.
A: Okay, for some failures you can figure out where and when they started happening and things like that, or you don't even need to and you can just go fix them. But for some things, especially with Ceph and RADOS, I've seen it's very important to understand where a regression was introduced, and tracker issues help with that, especially when Sentry is not being very helpful.
A
Next
thing
is
differentiate
between
noise
and
real
failures.
When
I
went
over
all
of
these
things,
the
one
thing
that
I
did
not
cover
is
a
lot
of
times.
There
are
things
like
you
know:
some
packages
did
not
install
properly,
so
there
was
a
failed
job.
There
was
some
ssh
connection
error
or
something
like
that.
A: The test then didn't run. These, in my mind, are kind of noise, because the test actually didn't run, even though it was supposed to, but you still see those jobs as failed. You should be able to clearly identify such things just by the failure reason you see, and the best thing to do is to rerun such jobs. There is a --rerun option in teuthology-suite that goes and reruns failed and dead jobs.
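For reference, a rerun of just the failed and dead jobs from an earlier run looks roughly like this. It is a minimal sketch: the run name is a placeholder, and flag details such as --rerun-statuses and its value format should be checked against teuthology-suite --help for your version.

```
# Reschedule only the jobs that failed or went dead in a previous run.
teuthology-suite \
  --rerun my-user-2021-08-22_15:20:08-rados-wip-my-branch-distro-basic-smithi \
  --rerun-statuses fail,dead \
  --priority 100
```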
A: So you can do that, to make sure that the test that was supposed to run actually runs. But those are noise, as opposed to real failures; everything that we were discussing earlier was real failures. So then, the next point here is about failure injection.
A: A lot of tests, especially thrashing tests, inject failures. So when you see that there is an OSD that we are not able to communicate with... okay, sorry, I'm being biased towards OSDs and RADOS, but that's the example that comes to my mind. So if you see that there is an OSD that suddenly we cannot communicate with, and that just shows up as a socket closed or something like that in the teuthology log, there could be several reasons for it.
A
There
could
actually
have
been
a
crash
on
the
usd
because
of
which
it
you
know
it
died,
and
that
can
be
confirmed
by
looking
at
a
corresponding
crash
in
the
pathology
law,
and
sometimes
it
could
also
be
because
we
are
trying
to
inject
a
whole
lot
of
messenger
failures:
messenger
socket
failures.
A
In
those
conditions
it
becomes
a
little
harder
to
just
you
know,
diagnose
from
the
topology
law
as
to
why
that
happened.
Like
you
know,
some
in
some
places
you'll
see
that
we
are
waiting
for
even
for
monitors.
We
are
waiting
for
monitors
to
form
quorum
and
due
to
failure
injection.
A: And due to failure injection, there is one monitor that just gets repeatedly declared down, and therefore the test fails because the monitors did not form a quorum. So this, again, is kind of noise, but to determine that it is noise you need to go an extra step and check whether the test was actually injecting failures or not. That's one simple thing to verify. If messenger failures were being injected, then at least in some places this is the keyword you'll see, and there's a directory that we symlink everywhere.
A: That directory has different kinds of configs that we apply to tests to inject messenger failures. If you see such a thing in the test run or in its description, you know that messenger failures were being injected, and then you go do the next step and verify where that happened and whether that was the cause or not.
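One quick way to check, sketched below, is to grep the job's saved YAML for the messenger failure injection setting. The option is usually spelled "ms inject socket failures" in the rados suite's msgr-failures fragments; the file names here are the ones commonly found in a job's archive directory, but treat the exact paths as examples.

```
# From inside a job directory in the teuthology archive:
# check whether this job's config injects messenger (socket) failures.
grep -i "inject socket failures" orig.config.yaml config.yaml 2>/dev/null

# A hit such as "ms inject socket failures: 2500" means the msgr-failures
# facet was active, so "socket closed" style errors may just be injected noise.
```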
A: Now, the next part of it is sometimes complex; I'm not going to go into details, but it is something to look out for, and if you're not sure about things, it's good to discuss with other folks in the team and look at stuff together to verify. But this is also sometimes a reason for noise. I think I already went over the package install and connection failures, and sometimes failing to lock machines. The good thing is that these days we don't have such problems, because the new dispatcher is working very well, or at least it has solved that problem very well. So that's what the whole of point number five is about.
A
All
right,
I'm
just
gonna
move
on.
We
can
do
questions
at
the
end.
Maybe
so
here
what
I
have
is
kind
of
pathology
runs
so
I've
just
you
know
there
may
be
multiple
others
as
well,
but
I've
just
tried
to
put
it
into
three
broad
categories
based
on
which
your
analysis
may
vary
a
little
bit,
and
you
know
some
for
in
some
cases
you
have
to
be
more
careful
versus.
A
In
some
cases
you
have
to
look
for
outstanding
or
open
issues
so
like,
for
example,
the
first
one
here
is
a
whipped
branch
right,
whipped
branches
are
what
you
know.
We
developers
used
for
validation
of
prs
on
on
master
or
it
could
be
any
other
table
branch
as
well.
You
know
for
backwards.
A
So
these
are
usually
runs
where
you
have
you
like
at
least
example
in
rails
is
like
around
350
jobs
or
plus,
but
these
are
usually
larger
runs,
so
you
have
more
failures
to
analyze,
but
these
are
the
most
common
types
of
runs
that
are
done
in
pathology
by
by
developers,
or
you
know.
Anybody,
and
also
these
are
the
ones
you
need
to
be
more
careful
about,
because
this
is
where
new
stuff
gets
merged
into
the
code
base.
So
if
something
is
problematic
and
we
merge
it,
there,
then
on,
we
have
introduced
regression.
A: That's why I have mentioned here that reruns are encouraged. A lot of times there are lab issues: the lab is running out of space, or there are network issues, things like that. In those cases, if you have a run where, say, 50 jobs have failed, out of which 40 were because of noise, I personally would not encourage anybody to merge a particular PR in that state.
A
I
know
it's
cumbersome,
you
need
to
wait,
you
need
to
rerun,
but
I
think
it's
better
to
be
safe
than
sorry
in
these
cases,
because
that
may
affect
the
state
of
your
baseline
runs
or
baseline,
how
your
master,
baselines
or
your
you
know,
octopus
knotless
baselines
are
doing
after
that.
So
that's
why,
when
I
say
rerun
the
it's
the
same
rerun
option:
that's
there
in
topology
suite
that
is
encouraged
to
be
used.
A
Next,
here
is
what
I
call
a
baseline
run
or,
like
you
know,
you
take
the
tip
of
a
particular
branch
and
you
run
a
suite
against
it.
These
are,
you
know,
usually
done
in
our
nightly
tests,
see
what
the
state
of
branches
or
how
how
our
test
suites
are
doing.
It
is
also
used
for
qe
validation
for
our
point
releases.
A
So
as
you,
you
know,
if
you're
aware
of
the
point
release
process,
you'll
see
yuri
finalizes
one
show
and
that
we
decide
is
going
to
become
the
tip
of
the
point,
release
that
we
are
going
to
be
doing
and
that
tip
gets
frozen
and
he
runs
a
whole
bunch
of
sweets
on
that
tip.
So
in
in
those
kind
of
runs,
what
we
are
trying
to
see
is
things
that
have
not
been
caught
earlier.
A
In
our
you
know,
whip
branches
because
of
whatever
reason
there
could
be,
you
know
some
bugs
that
only
pop
up
due
to
a
race
condition,
so
it
may
not
have
popped
up
the
first
time
but
may
pop
up
this
time
or
you
know
other
things
that
are
associated
with
probabilistically,
injecting
failures.
So
if
you
injected
a
failure
with
some
probability,
you
did
not.
You
know
there
is
all
our
thrashing
tests
have
probability
associated
with.
A
You
know
how
many
osds
are
being
brought
down
or
when
they
are
being
brought
in
so
timing
issues
again
it
has
got
to
do
with
race,
but
you
know
some
things
that
may
not
happen
every
time
or
not
very
repeatably.
Failing
can
pop
up
in
these
runs,
so
these,
I
would
say,
are
generally
used
for
baselines
and
they
help
discover
bugs
that
have
not
been
discovered
in
our
web
testing.
A
That
does
not
mean
that
you
know,
if
you
see
a
failure
in
this
run,
you
don't
create
a
tracker.
You
do
create
a
track
tracker
and
try
to
figure
out
where
it
you
know,
merge
where
the
problematic
pr
merged
again.
This
is
going
to
be
more
difficult
because
you're
kind
of
retrospecting
as
to
what
could
be
the
root
cause
of
a
particular
failure,
but
in
general
sentry
and
tracker,
and
everything
are
going
gonna
help
with
this
and
finally,
third,
one
which
I
call
developer
centric
is
these
are
usually
smaller
patches.
A
So,
like
the
example,
I
didn't
go
over
this
example,
but
I
mean
it's
just
like
you
know
the
previous
one,
which
I
was
talking
about
baselines.
So
this
is
a
run
that
has
been
done
by
me.
A: A RADOS run on Pacific, on the 22nd, and this is what the state is, as you can see for yourself; a lot of the nightly jobs that are run look similar. Going over this example: this one is what I would call a developer-centric run, because a lot of times developers want to reproduce a particular bug, or verify whether a fix for a particular test is working. In that case you would not run the entire teuthology suite.
A: At that point, you will try to run one job, or a subset of tests, to see how your fix is doing or how reproducible a particular bug is. I added this example where I was trying to reproduce a particular bug; as you can see, this is where the entire job description is, and I tried to use it and run it on Pacific. My idea was: I saw a particular failure on master, let me see whether I can reproduce it on Pacific or not. And as you can see, I did reproduce it once out of 10 times. This was the same run, using the -N option, just trying to reproduce a bug. So these ones are more developer-centric; you won't see them run as often as baselines, but they are useful when you are trying to debug something.
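A developer-centric schedule of that sort looks roughly like the following. This is a minimal sketch: the branch name, filter string, and machine type are placeholders, and the exact flag spellings (--filter, --num, and so on) should be checked against teuthology-suite --help for your version.

```
# Schedule only the jobs whose description matches the filter,
# and queue each matching job 10 times to check how reproducible a bug is.
teuthology-suite \
  --machine-type smithi \
  --suite rados \
  --ceph wip-my-fix-branch \
  --filter "thrash-erasure-code" \
  --num 10 \
  --priority 100 \
  --email me@example.com
```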
A: I think those are the points I have here, yeah. The other thing to mention is that you can always use the suite repo and suite branch options. Basically, you can run your test changes against existing builds, so that you can make some tweaks to the tests, add extra debugging, or change some config, and then run those tests without having to build a new branch.
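In sketch form, pointing the scheduler at your own qa changes while reusing an existing build might look like this. The --suite-repo and --suite-branch flags are the usual ones for this, but verify them against your teuthology version; the repo URL and branch names below are placeholders.

```
# Run the qa/ suite definitions from my fork and branch,
# but against packages already built for an existing ceph branch.
teuthology-suite \
  --machine-type smithi \
  --suite rados \
  --ceph master \
  --suite-repo https://github.com/myuser/ceph.git \
  --suite-branch wip-my-qa-tweaks \
  --limit 5
```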
D: A question on this. For example, you were talking about that wip branch, right, and suppose a developer sees some failures there. Would it be a good idea to compare that against a baseline run, for example?
A: Yeah. So the easiest way to do this is to go to Pulpito. Say I'm trying to look for RADOS: I just went to Pulpito and I went to the rados suite, so now I have all the RADOS runs that have been done. Now, let us say you're interested; there are two ways to go about this. One way is to look for just the nightly runs. When I say nightly runs: we have these automated test runs that are done against master, Pacific, all the stable branches.
A
Say,
pacific
you'll
see
any
user
when
you
see
a
user
typology.
These
means
these
are
all
pathology
runs,
so
you
can
go
and
look
for
the
latest.
One
here
like
the
latest
here
I
can
see,
is
25th
and
you
can
compare
your
run
against
this
one
or,
if,
for
whatever
reason
you
don't
find
this
latest
run
or
it
didn't
run
for
whatever
reason
you
can
also
go
and
pick.
Let's
let
us
say
you
want
a
run
on
master.
You
can
see
that
sage
was
doing
something
25th
again
yesterday.
A
This
is
going
to
be
more
less
reliable
because
he
may
be
testing
some
patch
which
may
have
issues.
So
a
lot
of
those
failures
here
may
be
related
to
that,
but
this
can
give
you
a
rough
idea
of
what
you
are
seeing
is
that
you
know
if
it's
related
or
not
so,
to
compare
if
you're,
seeing,
let
us
say
an
object,
store
tool.
Failure
in
your
test
run
and
you
want
to
see
if
your
batch
is
the
only
one
that
saw
it.
You
first
go
to
century.
You
don't
find
it.
A
You
second
go
to
tracker,
you
don't
find
it!
Third,
you
go
to.
One
of
these
runs
latest
runs
that
have
been
done
and
try
to
look
for.
You
know
the
failures
and
yeah.
If
you
see
the
failures
you
expanded
and
if
you
let
us
see,
some
object,
store
rule
failure
exactly
the
same
as
what
you're
seeing
you
go
and
look
at
the
pathology
log
and
compare
it.
You
see
the
same
thing,
then
you
you
get
an
idea
of
whether
or
not
your
your
your
failure
is
related
to
your
branch.
A
All
right
any
more
questions
on
this.
A
If
not,
then
I'm
just
going
to
go
to
the
steps
involved
thing.
Now
again,
when
I
have
listed
out
some
steps
that
I
tend
to
use,
this
may
vary
for
every
individual,
but
the
high
level
idea
remains
the
same.
So
what
I've
added
here
is
an
example
of
a
pathology
run.
I
already
went
over
what
the
what
when
I
say
run
there
is
a
run
name
associated
with
it
here.
This
is
the
example
step
one.
As
I
said,
palpito
combined
with
sentry
is
your
step.
One
of
analysis.
A
We
went
over
what
palpito
looks
like
what
you
know
what
century
does
so
even
failure.
What
am
I
doing?
I'm
just
going,
I'm
just
clicking
on
this.
I'm
seeing
okay
century
event
is
present.
Now
I
go
and
look
at
the
century
event.
If
I
find
something
useful,
I'm
going
to
be
like.
Oh
this.
This
is
the
test
tool,
dot,
sh
failure.
A
Okay,
there
are
a
whole
bunch
of
failures.
This
should
make
me
like.
Okay,
this
is
I'm
not
introducing
you
know
it
gives
me
us
a
you
know
some
level
of
confidence
that
okay,
there
is
some
existing
failure
again.
This
is
not
the
end
of
the
story.
You
need
to
go
and
verify
for
yourself
what
these
exact
failures
were,
but
this
is
just
a
guideline
combined
yeah,
so
step.
One
of
analysis
is
this:
then
what
we
do
is
analyze
each
failed
dead
job.
A
There
is
some
good
documentation
that
deepika
added
to
the
existing
documentation
fairly.
Recently,
that
goes
over
details
of
what
exactly
triaging
cause
of
failure
should
look
like.
You
can
use
this
as
a
reference
future.
I'm
going
to
be
going
over
all
this,
I'm
not
going
to
spend
some
time
on
this
documentation
now.
A
So
when
I
say,
if
analyzing
failed
and
dead
jobs,
things
that
are
involved
here,
you
need
sepia
access
to
ssh
to
pathology.
That
is,
I
think,
very
important.
Everything
cannot
be
done
using
pulpit.
It's
just
a
good
step
one,
but
that's
not
where
things
end.
A
So
this
is
this
kind
of
point
two
which
I
have
touched
upon
earlier
as
well.
Now
I
think,
dig
deeper
is
the
part
that
I
want
to
focus
on
in
the
last
15
minutes
or
so
so,
when
you,
when
you
looked
at
step
one,
we
looked
at
palpito
sentry
things
are
not
clear
and
then
you
go
to
the
script.
Scrape.Log
you've
got
a
better
idea
of
things,
but
you're
still
not
sure
about
things
right.
You
saw
that.
Okay,
there
was
one
failure
that
happened
four
times,
but
what
is
that
failure?
A
What
does
the
exact
failure
look
like
for
that?
You
have
to
look
at
pathology,
dot
log
for
each
job.
I
mean
yeah
when
the
failure
reason
is
not
clear.
When
I
say
failure.
Reason
is
clear:
it
is
like
okay,
ssh
connection
lost,
so
we
know
that
that
was
you
don't
have
to
go
and
look
at
you
know
what
crash
happened
there,
particularly
or
like
something
did
not
install
something.
A
Those
messages
are
generally
clear,
so
you
don't
need
to
waste
more
time,
but
when
you
know
command
failed
or
like
there
is
a
crash
osd
something
crushed,
you
have
to
go
and
dig
deeper,
and
so
that's
why
I've
written
that,
when
the
failure
is
not
clear
from
step
one
and
two
looking
at
pathology
log
in
my
opinion,
is,
you
know
mostly
recommended
I
I
do
it
all
the
time
for
every
every
failure
that
I
see,
even
even
like
you
know,
even
though
I
can
see
things
and
you
know
judge
sometimes,
but
I
always
want
to
make
sure
I'm
not
ignoring
anything.
A: So I have an example here that I want to touch upon, where I'm going to show where steps one and two are not enough. This is a job which, you can see, was a dead job, because it ran for 12 hours. It was doing a thrashing test, and the failure reason you see is "hit max job timeout", which also indicates that it was a dead job. Now, as you can see, there is no Sentry event associated with it, so we cannot use Sentry.
A: So I have looked at this log. I know it is huge, so I'm just going to be using less; I usually tend to use vim. One of the first things I'm going to do when looking at this log is search for "Traceback".
A
At
this
point
here
to
identify
the
cause
of
a
failure,
you
need
to
use
some
of
these
keywords
and,
in
my
mind,
traceback
is
the
first
thing
that
will
give
you
at
least
50
percent
of
the
failure.
Reasons
can
be
reached
by
just
searching
for
traceback,
but
this
this
one.
I
picked
this
example
because
this
is
a
classic
example
where
you
need
to
do
everything
to
figure
out
what
is
wrong.
So
what
I
see
here
when
I
search
for
trace
back
it
just
tells
me
os
error
socket
is
closed.
Now.
A: So far we haven't found a symptom of the failure; if there had been some "FAILED" assertion, we would have found it. Now I'm going to search for "Caught".
A: And that finds something that has crashed with a bad allocation problem, and this is the entire backtrace you see. So this is a perfect example of a job gone dead because of an OSD crashing, with the teuthology job not terminating in time, and you've got 12 hours' worth of log.
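In practice, the search just described is a handful of keyword hunts over teuthology.log, done either interactively inside less or with grep. A rough sketch, run from inside the job's archive directory, using the keywords mentioned above:

```
# Open the (often very large) job log and search interactively
# with /Traceback, /FAILED, /Caught and so on.
less teuthology.log

# Or non-interactively, just show the first few hits of each keyword:
grep -n -m5 "Traceback"      teuthology.log
grep -n -m5 "FAILED"         teuthology.log
grep -n -m5 "Caught signal"  teuthology.log
```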
A: This is something I tend to use, and this is just my email. If you see, I found a first hit, which tells me that there was this tracker issue where somebody else has logged this: they had a similar bad allocation issue, there are a whole bunch of instances in the RGW suite, and there's a related ticket, which also has a bad allocation problem, that has been tracked for six days. So if I was... well, this is Sridhar's run, so this tells me the failure was not caused by my test change, or the PR that I'm testing. But if, in any of this, I was not able to find a tracker and I did not see a Sentry event, at that point I need to dig a little deeper and verify whether that crash could have been caused by the PR that I'm testing.
A
If
I'm
sure
I
go,
create
a
new
tracker
issue
for
it,
and
that's
all
I
mean
when
you
create
a
tracker
issue,
you
add
the
technology
run,
you
add
you
know.
Sometimes
we
try
to
also
add
the
description
like
if
you
there
is
a
thrashing
test,
so
we
add
the
entire
description
that
you
see
that
next
time
people
look
at
it,
it's
just
easier.
I
mean
you,
don't
have
to
do
that.
It's
not
a
must
it's
just
easier
just
by
looking
at
it.
B: How often, and when, this is occurring.
A: That is very important, because this is a clear example. You can see what we have done; this is the history: we saw it first, we updated it; somebody else saw it, and they also marked their ticket as related. So it just makes debugging easier. We know that this is spread all across, not just in the RADOS suite. Now there's another ticket that got marked, and then the analysis; and you see that the issue started maybe more than nine days ago, but we still keep updating the tracker, so that we have more data points to analyze things.
A
So
I
think
what
what
all
I
discussed
is
almost
written
here.
You
want
to
go
back
to
refer
things
that
I
also
want
to
cover.
Next
is
the
further
analysis
part,
so
I
want
to
just
quickly
do
an
ls
here.
So
when
you
are
in
in
the
run-
and
you
are
looking
at
a
particular
job-
you
have
a
topology
log
here
that
we've
gone
over.
There
are
a
few
other
things
here
which
may
be
of
use
to
you,
particularly
this
one
orange.
A: This is the original config of the test that was run, and it tells you all the configuration, all the details that were used, even the roles and everything that was used for that particular run. So this can be used to reproduce tests, even locally: you can just take this original config and use it to run against a locked machine, or even not a locked machine; it can be used as a framework to run a test.
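As a rough sketch of that idea (assuming you have the teuthology CLI set up and a test node available to you): the file name orig.config.yaml comes from the job's archive directory, the run name and job id are placeholders, and the exact flags of the teuthology command should be checked with teuthology --help before relying on them.

```
# Copy the frozen job config out of the archive...
cp /ceph/teuthology-archive/<run-name>/<job-id>/orig.config.yaml ./repro.yaml

# ...then replay exactly the same configuration as a standalone teuthology job,
# archiving the new results locally.
teuthology --archive ./repro-archive --owner "$USER" ./repro.yaml
```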
A
You
kind
of
freeze
the
configuration
because,
as
we
just
discussed,
that
there
are
a
lot
of
things
that
are
probabilistic
and
next
time,
when
you
start
a
new
run,
you
may
not
land
up
with
exactly
the
same
values.
This
ensures
that
you
have
exactly
the
same
configuration
when
we
want
to
run
exactly
the
same
test
again.
So
this
is
something
to
keep
in
mind.
A: So here we have, you can see, similar stuff; there's just one thing extra, which is this remote directory. When you have a failure, this remote directory gets created. What it has is all the logs from the corresponding daemons that were run in that test; you also have core dumps, you have crash metadata. You can just go over it quickly and see what all is there. As you can see, this job was run on smithi121, so it creates a directory of stuff.
A
That
is,
it
has
archived
from
that.
There
are
a
bunch
of
directories
here.
The
most
important
things
that
you
will
end
up
using
is
a
log
directory.
It
has
all
the
logs
zipped
logs.
The
last
few
you
can
see
are
the
the
ones
like
for
at
least
for
rails
developers.
These
are
the
ones
we
care
most
about,
so
you
can
go,
look
at
raw
logs
copy
them
somewhere.
Look
at
you
know
from
this
place
or
whatever,
but
this
is
one
useful
piece
of
information
to
know.
A: This one will have some additional debugging from just before the crash happened. That is sometimes useful, because a lot of the time developers want to see what happened, say, 100 lines above where the crash happened. We don't care about the entire osd.0 log; we only care about the last 200 lines, and this file is smaller, easier to open, easier to copy, easier to move around.
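A rough sketch of poking at those archived daemon logs from a job's archive directory; the host name, daemon id, and line count are placeholders, and the remote/&lt;host&gt;/log layout is the one described above.

```
# Daemon logs archived from the test node live under remote/<host>/log/.
ls remote/smithi121/log/

# Look at the tail of a gzipped OSD log without unpacking it on disk,
# e.g. the last 200 lines leading up to a crash.
zcat remote/smithi121/log/ceph-osd.0.log.gz | tail -n 200 | less
```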
A: Okay, going back: I think I've covered everything here. Due to a lack of time, I'm just going to quickly go over the last thing that I have: instances where Sentry is useful and where it is not so useful yet. So, for this example here, this is an instance from yesterday; there's this PR that was being tested by Yuri.
A: Now, the good part is that this is a "command failed" error and it has a unique test name, a workunit name, so you can see that it has this history of events, starting probably two days ago. If you go and click here, what Sentry helped me understand was: this is the same failure that has been seen in these wip-* branches, which are all master-based; but if you look at Pacific, this is the sole branch.
A: This is the only branch that has seen this failure on Pacific, and if we go back, this was the exact branch that Yuri was running. So this is a clear indication that something which had never been seen on Pacific is now being seen on Pacific; that is what Sentry tells us. So we need to make sure whether the PR is related or not, and it turned out that it was: this was a patch that had merged in master and was getting backported.
A: I have one more example here, but I'm not going to go over it now for lack of time. I'm just going to show you the cases where Sentry is not so useful yet. Like here: it tells you that there is a failure in the test health history test, so I would assume, just from the previous example: oh, there's a unique test name, I should be able to group everything. But there are some things that we need to fix in Sentry for it to also consider, or capture, runtime errors correctly.
A: So here, as you can see, if I go to events, it does not show me the same health history failure; it has grouped a whole bunch of things, which is probably not useful in these kinds of scenarios. You'd have to go and look for tracker issues and look at the teuthology log to see what's going on.
A: We are hoping to fix or address this part as well, but as we stand today, we should be cautious about relying on Sentry in these cases, and rely on it for things that fall into the first category, like the first example.
A
B
A
A
Sure,
absolutely
anywhere
I
mean,
we've
got
irc,
you've
got
email,
you've
got
a
thing,
feel
free
to
ask
questions,
but
I
hope
this
is
useful
and
next
time
you
analyze
something
it
should
be.
You
know
any
bit
easier.