From YouTube: Ceph Developer Monthly 2020-09-02
A
Yeah, so I guess we can get started. This month at CDM we actually have one topic on the agenda, which is to do a bit of brainstorming about ways we could improve our ability to diagnose, analyze, and fix failures, especially teuthology ones.

So, let's see: the first item on there, about scheduling, is actually already taken care of by our Summer of Code student this summer. Shraddha has completed rewriting the worker as a dispatcher, which allows jobs to lock machines before they start running, so they never time out waiting for machines. And when jobs finish, or when they don't finish because they time out for some reason, because they get stuck or just take too long, we now gather logs at the end for those dead jobs. That's currently being tested on the mira queue, and hopefully we can roll it out to the smithis maybe next week.
B
Exactly. And then there's the cost of trying to reproduce bugs which are hard to reproduce, not to mention things like upgrade suites, which need more than the regular two machines, more than two nodes, sometimes five nodes. So that is one problem that's going to be addressed with her work as well.
A
Yeah, that's a good point. We could potentially add jobs that take a large number of machines and not have to worry about any kind of lock contention.

So I guess where I wanted to focus today was on the analysis side of things: when there are a bunch of failures, how we can make it easier to diagnose and figure out what's wrong.

The first two things on that pad we just talked about we've actually already done: we get logs for dead jobs now, and Scrape is now integrated, thanks to another student who was applying for Google Summer of Code. And the third thing there, Sentry, is now back up and running thanks to Brad's suggestion and David Galloway, and hopefully we'll be able to upgrade it to the newest version soon, which will allow us to have links back to Pulpito, as well as the ability to create that kind of dashboard.

So you can track over time what failure rates look like in a given suite, for example.

But in general Sentry gives you a history of when a failure occurred and how often it's been occurring, and you can perhaps even see when it was introduced: it shows up initially in some testing branch, and then you later start seeing it in master.
C
Josh, I was thinking this morning that a nice-to-have might be being able to run teuthology and set interactive-on-error from the command line.

So you can specify the original config.yaml without making any changes to it.

Because my workflow is generally: copy into a local directory, modify, run. Whereas with the appropriate command-line switches, you could just specify the original file and run it with, you know, a switch for interactive-on-error. We'd also need to check that we can... actually, I know you can specify an archive directory, and I do, but I currently delete the archive directory that is specified in the yaml file, and I think I did that because the command-line archive doesn't override what's in the yaml file.

Suite path is the only other one I can think of, and that's already there as well. Like I said, the query I have is that I can't remember whether the yaml file has higher priority than the command line, and therefore whether what you specify on the command line gets ignored if it's already set in the yaml file, and those two are always set in the yaml file.
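As a rough illustration of the workflow being discussed, here is a minimal sketch of letting command-line switches win over the job yaml so the original config can be run unmodified. The interactive-on-error and archive_path keys mirror the job-config options mentioned above, but the CLI wiring itself is hypothetical, not teuthology's actual interface:

```python
# Hypothetical sketch: command-line values override the job yaml, so the
# original config.yaml can be run without editing it first.
import argparse
import yaml

def load_job_config(path, cli_args):
    with open(path) as f:
        config = yaml.safe_load(f) or {}
    # CLI takes priority over whatever the yaml already sets.
    if cli_args.interactive_on_error:
        config['interactive-on-error'] = True
    if cli_args.archive:
        config['archive_path'] = cli_args.archive
    return config

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('config', help='original job yaml, left untouched')
    parser.add_argument('--interactive-on-error', action='store_true')
    parser.add_argument('-a', '--archive', help='override the archive directory')
    args = parser.parse_args()
    print(load_job_config(args.config, args))
```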
A
Another way to approach this might be to think about some failures today that are very laborious to debug and analyze, and consider whether there are ways we could make that more automatic or simpler.
B
From my perspective, once we get Sentry in place, a lot of our problems are going to be easier to debug, especially because we will be able to track when a problem started occurring, and sometimes it could just be in a testing branch or in master, which we can at least go back to and look at what merged after that. Things like that are a problem today, because you have to go and look manually, or bisect, or do other things like that.

We do have things like cron jobs which run master or any particular branch at some frequency, but that still doesn't give us the kind of accuracy that we can get with Sentry.
A
Yeah,
that
would
certainly
help
a
lot,
especially
if
we
start
running
the
sweets
more
frequently
like
and
give
you
this
merged
pr.
This
today,
from
yuri
to
run
the
radio's
tweets,
yeah.
B
I did, I mean, for rados that's what I've been pressing for, and now the idea is to be able to run at least 300 jobs. For some reason, I don't know why, I was not able to get the subset value correct to get anything lower than 300; the teuthology-suite command just hung for me.

Maybe there's a bug somewhere there, but we can at least get 300 jobs running every night, and that will give us a better picture of when regressions get introduced. But just for everybody's interest, for people who are not aware of how Sentry looks and how Sentry works, I'm pasting an example of some failures that I have been looking at, and you can use it as a reference. I'm pretty sure the newer version of Sentry is going to have much better features and trackability.
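For anyone unfamiliar with the Sentry side of this, reporting an exception from Python generally looks like the following generic sentry_sdk sketch; this is illustrative only, not teuthology's actual integration, and the DSN is a placeholder:

```python
# Generic Sentry reporting sketch (placeholder DSN, illustrative only).
import sentry_sdk

sentry_sdk.init(dsn="https://examplePublicKey@o0.ingest.sentry.io/0")

def run_job():
    raise RuntimeError("wait_for_healthy timed out")  # stand-in for a test failure

try:
    run_job()
except Exception as exc:
    # Each captured exception is grouped by Sentry into an issue with a
    # history of occurrences, which is the trackability discussed above.
    sentry_sdk.capture_exception(exc)
    raise
```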
A
The way we're capturing the data from teuthology today, we're kind of just categorizing everything by the traceback that teuthology gets from Python, not necessarily the backtrace from the core dump. We might want to add that to teuthology, so we'd be able to correlate it with telemetry as well.
C
We discussed at one stage modifying the generic timeouts, like the generic timeout in, say, the run task in teuthology, so that it actually mentions what script it was running rather than just saying "I'm timing out."

I know we talked about that, but I don't know whether it made it into the document. It's probably something that I could work on anyway.
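The shape of that change might look something like this minimal sketch; the helper name and the example command are made up for illustration, and this is not the actual run task:

```python
# Illustrative sketch: a generic timeout that names the command it was running
# instead of just reporting "timed out".
import subprocess

def run_with_timeout(cmd, timeout_s):
    try:
        return subprocess.run(cmd, check=True, timeout=timeout_s)
    except subprocess.TimeoutExpired as e:
        # Surface the script in the failure reason so the summary is useful
        # without digging through the full teuthology.log.
        raise RuntimeError(
            f"timed out after {timeout_s}s while running: {' '.join(cmd)}") from e

# e.g. run_with_timeout(['bash', 'test_script.sh'], 3 * 3600)
```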
C
With the thrasher, quite often there'll be a job running, the thrasher is involved, something fails, and the thrasher just seems to carry on indefinitely.

Because it's waiting for the cluster status to come back to HEALTH_OK, or waiting for various inconsistencies in the status to clear, it might be a good idea to look at establishing internal timeouts within the thrasher for those sorts of things, so that it doesn't just continue on indefinitely for 12 hours and then time out, and you don't get any logs.
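A minimal sketch of that idea: give the thrasher's wait loops their own deadline rather than relying on the job's global timeout. The check function and the numbers are stand-ins, not the actual thrasher code:

```python
# Illustrative internal timeout for a wait loop inside a thrasher-like task.
import time

class WaitTimeout(Exception):
    pass

def wait_for(check, timeout_s=1800, interval_s=10, what="cluster to become healthy"):
    deadline = time.monotonic() + timeout_s
    while not check():
        if time.monotonic() > deadline:
            # Fail fast with a descriptive reason so the job ends while
            # log collection can still happen, instead of running for hours.
            raise WaitTimeout(f"gave up after {timeout_s}s waiting for {what}")
        time.sleep(interval_s)
```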
C
Yeah, that's the thing: even just loading the teuthology.log file, it's massive.
B
Yeah, but I think the low-hanging fruit that Brad described is at least making the failure messages more obvious about where things are failing, versus just saying that something timed out. I made a small effort towards this earlier, just to improve the ceph task: previously, anywhere we used to fail, it would just say "failed to recover" or something, even if we failed to recover because there was some recovery still going on or some PGs were stuck.
B
The PR that I just pasted now outputs the information about the PGs that have not reached the desired state, so you know exactly which PGs to go look for, in the logs as well, if you want some OSD logs and such. These are small improvements that we can make which can make debugging way easier than it is right now.
C
Yeah, that's a good one. Maybe even just a global timeout on the thrasher.
A
Kind of like the 12-hour timeout there is today.
B
Yeah, and I don't think we should even wait for 12 hours. We've discussed this in the past: our longest-running tests are probably six hours, or six and a half hours, so there's no point waiting 12 hours.
E
That wouldn't be accounted the same way, so that's the correct fix there: you don't account for the waiting-for-machines element. Also, you can create a tighter timeout and audit the tests that really need a longer one, and specifically whitelist them, or specifically add an annotation to those tests that says how many hours they need.
A
Are there other kinds of errors that we see today that are kind of shadowing each other, where they have the same kind of error message?
E
Well, I mean, at a coarse level, any RADOS-level bug that looks like recovery didn't happen or peering didn't complete could have any number of causes. I don't know if there are that many analogs to that kind of problem in other components, though, so that may just be a category on its own.
E
Right, but I'm saying that the top-level effect you get will be "the thing didn't complete before the timeout", but you could get that ten times with different causes, and without further investigation you wouldn't know what the cause was. So I think that might be another category, in addition to recovery and peering not completing.
B
I know cephadm is a new area, but some of the cephadm failures are really hard to understand. When you just look at the failure reason it's cryptic; you have to go through the logs and figure out what the hell is going on. So that's another area. I guess that's going to come in later, but it's something to keep in mind.
F
Another sort of failure is that a daemon could kill itself, and another task, the thrasher test for example, could be expecting a healthy and active cluster and be waiting for that daemon forever. That happens in Crimson's thrasher test, because sometimes Crimson has killed itself without being noticed, even though the process no longer exists.
A
Yeah, Patrick had added some watchdog stuff to try to address that, but I think it wasn't a 100% complete fix; it doesn't catch all the cases where that can happen.
B
Another broad category, I would say, is that a lot of tests fail with wait-for-healthy or something similar. A lot of the time that happens because there is either a health warning or a health error that is there, possibly because we did not ignore it, or it's there for some other reason.

So maybe at that point, when we fail, we could just have the ceph health printed out before failing, so that we know if there is a health error or a health warning, and you don't have to go back through the logs to see whether there was a health error or a health warning. Just simple things to do.
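A minimal sketch of that suggestion, purely illustrative rather than the actual teuthology ceph task: when a wait-for-healthy check gives up, attach the cluster's health detail to the failure message so it doesn't have to be dug out of the logs:

```python
# Illustrative only: include `ceph health detail` in the failure reason.
import subprocess

def fail_with_health(reason):
    try:
        health = subprocess.check_output(
            ['ceph', 'health', 'detail'], text=True, timeout=60)
    except Exception as e:  # the cluster may be unreachable by this point
        health = f'(could not fetch cluster health: {e})'
    raise RuntimeError(f'{reason}; cluster health at time of failure:\n{health}')
```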
A
At that point we could even try to differentiate: not just say "we failed to recover", but say there was a warning, or PGs were stuck inactive, or a daemon crashed, or something else like that.
A
Another category I was thinking of: a lot of the test cases we run are test programs that have their own test cases within them. For example, the objectstore tool tests, or a bunch of other tests that use the gtest framework or the Python unittest framework, run a bunch of different subtests, and the exception that they report to Sentry, and that we list in the failure today, just says that the test command itself failed.

So if there's one particular test that's newly failing, we won't necessarily notice that it's a different instance or a new bug without looking through the logs a bit more extensively. So that's maybe something we could try to parse out from the output.
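One way this could be approached, sketched here against standard gtest output; the function and the example test name are made up for illustration:

```python
# Hedged sketch: pull individual failing sub-tests out of gtest-style output
# so the failure reason can name them instead of just the command that failed.
import re

# Require the Suite.Case form so the "[  FAILED  ] 1 test" summary line is skipped.
FAILED_RE = re.compile(r'^\[\s+FAILED\s+\]\s+([^,\s]+\.[^,\s]+)', re.MULTILINE)

def failed_gtest_cases(output):
    # Deduplicate while preserving order; gtest prints each failure twice
    # (once inline and once in the final summary).
    seen = []
    for name in FAILED_RE.findall(output):
        if name not in seen:
            seen.append(name)
    return seen

# e.g. failed_gtest_cases(log_text) -> ['ObjectStore/StoreTest.Synthetic/2']
```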
B
Yes, the quintessential example is the objectstore tests that fail: we've had instances where master is seeing one failure and a stable branch is seeing a different failure, and we tend to categorize them as the same failure, but they are actually not.
A
We could perhaps do something similar with the cephadm failures. A number of them end up failing maybe because cephadm couldn't even start running for some reason, and there perhaps we could parse some of the data out to say more precisely why it failed, like whether it failed to fetch the image, or the disk ran out of space or became read-only.
A
Yeah, I think that's a good one: maybe there are some common keywords that we could parse out of the output to get a better idea of what actually caused the command to fail like that.
C
I was going to say, maybe the other thing we could do is identify the really generic subsequent failures, such as the example there in the chat. That failure is almost always the result of a test failing and not cleaning up its directory after itself, so maybe we should annotate that message to say "this is probably the result of a previous failure."
C
Yeah, Deepika just said it was confusing for her, and I had that discussion in mind when I was talking about it. Deepika, yeah, it is confusing: the problem with debugging teuthology failures is that you need to backtrack through all the subsequent failures until you find the original failure, which can be difficult.
G
How do we achieve that? Okay, say this is a ceph test failure: what keyword do you grep for, like "failed test", that shows that okay, there might be some test case failing in stdout or somewhere?
C
I
don't
know
whether
it
can
be
ordered,
not
automated.
I
I
suppose
it
might
be
able
to.
But
the
first
thing
I
would
do
is
look
at
the
failure
reason
and
generally
the
failure
reason
if
it's
not
a
generic
error
in
the
failure
reason,
failure
reason
is
going
to
give
you
the
name
of
the
script
that
failed
or
the
test
that
failed.
A
This brings up another item, I think, at the bottom of the pad here, which is documentation. I think we could probably write up some good notes about how to approach analyzing a teuthology failure, and in the process of doing that, maybe come up with some more ideas about improving that process too.
F
Hey guys, I've been thinking about, probably, a vim plugin to view the teuthology log, because the output from the different sources, the different hosts, is interleaved. So sometimes I need to sort or group the output by host in my head, or use another editor to do so. Do you think it's worth it?
C
I'm
I'm
aware
of
a
log
passing
application
that
that
does
that
sort
of
thing-
and
I
can't
remember
the
name
of
it,
but
I
should
be
able
to
find
it
so
this
is
used
already
right.
Are
you
sort
of
partying.
C
With the vim plug-in, but some people use emacs or whatever... no? No, okay.
C
Yeah,
let
me
see
if
I
can
find
the
name
of
that.
I
I
know
a
guy
that
is
submitted
some
patches
for
it.
So
I'll
ask
him
what
the
name
of
it
is
and
come
back
to
you.
A
Like
jeff,
who,
you
can
also
think
about
ways
to
improve
the
technology
log
itself,
instead
of
needing
more
processing
tools,.
F
We could have it show them side by side; this format of output could also be applied to a log file. For example, on the left side you could have the output from host A, and on the other side the output from another host.
C
Yeah, I think the tool I'm talking about will allow you to filter the log so that you can get the output from host A, and then filter it again so you can get the output from host B, and so on.
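Something in that spirit could be sketched as below; the hostname pattern is just an assumption for illustration, since the exact line layout in teuthology.log varies:

```python
# Hedged sketch: split an interleaved teuthology.log into per-host streams by
# matching a hostname in each line (the smithi/mira pattern is an assumption).
import re
from collections import defaultdict

HOST_RE = re.compile(r'\b((?:smithi|mira)\d+)\b')

def split_by_host(path):
    streams = defaultdict(list)
    with open(path) as f:
        for line in f:
            m = HOST_RE.search(line)
            streams[m.group(1) if m else 'unattributed'].append(line)
    return streams

# e.g. for host, lines in split_by_host('teuthology.log').items(): ...
```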
A
Yeah, like we were just saying, that makes a lot of sense, kind of treating the different aspects of the log as an event stream from different sources.
A
I think we're talking about kind of viewing the OSD logs from the same time period in different tabs, side by side.
A
In some cases, but it's helpful to see the sequence of events strictly from a top-to-bottom viewpoint, since you have more vertical space to understand what's happening on one particular OSD.
A
Sure, there's more than zero things we could improve there; at the least, there are tons of abbreviations that don't make any sense unless you already know them, yeah.
E
I'm just not sure it's the kind of topic that lends itself to one top-level directive. It's more like, when you debug something, you should try to identify any shortcomings of the logs that made it difficult for you and fix those as part of your patch; it doesn't work as well as a single line item.
A
It's just something to keep in mind: the next time you're debugging something, try to consider what information would have made it easier.
B
Debugging set to a particular level that will let you know what the hell is going on: we should always make sure that when we fix such failures, we go and add that as a default value to those suites, because a lot of the time failures are not reproducible, so in those cases having that default debug level helps. And vice versa: if there is something that doesn't need too much logging and we can get by with debug levels on just one particular subsystem, we should try to clean those up as well.
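For concreteness, the kind of per-suite default being described usually ends up as a yaml override fragment; here it is expressed as the equivalent Python structure, with the subsystems and levels chosen purely as examples:

```python
# Example (illustrative values) of the default debug levels a suite override
# might carry, dumped in yaml form as it would appear in a suite fragment.
import yaml

overrides = {
    'overrides': {
        'ceph': {
            'conf': {
                'osd': {'debug osd': 20, 'debug ms': 1},
            },
        },
    },
}
print(yaml.safe_dump(overrides, default_flow_style=False))
```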
A
Maybe that's something to think about and consider. Let me give you guys a link in the chat to a bunch of papers regarding root cause analysis.
G
I just googled it, so maybe if something is relevant we can discuss it in that new meeting, like the paper reading we discussed doing in the performance meeting later.
A
Yeah,
that
sounds
like
a
good
idea.
Okay,
add
these
to
the,
and
this
allows
this
link
to
the
performance
pad.
A
I think we've covered a lot here already; there are a lot of interesting ideas we can add to the pad. Any other topics that folks wanted to talk about, or any other ideas around this?
B
If nobody has anything, I have a question, maybe a general question, maybe for Kefu or even Sam: how is debugging Crimson teuthology failures different from debugging regular classic OSD failures? I know you guys have been debugging some recovery failures and stuff.
F
We do have a signal handler in the classic OSD which handles segfaults and other critical signals, and it prints out the backtrace, like the addr2line output, which is very helpful for diagnosing and understanding the root cause. It also dumps a unique ID, for example, to a metadata file; that's where we collect the crash information using the crash module.
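For daemons that lack that in-process symbolication, the manual step being described is roughly the following; the binary path and address are placeholders:

```python
# Rough sketch of resolving raw crash addresses with addr2line by hand.
import subprocess

def symbolize(binary, addresses):
    out = subprocess.check_output(
        ['addr2line', '-C', '-f', '-p', '-e', binary, *addresses], text=True)
    return out.splitlines()

# e.g. symbolize('/usr/bin/ceph-osd', ['0x55d1c2f3a1b0'])
```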
G
Actually, I was working on that, but I halted it because I was focusing more on Jaeger recently. Maybe I can do that this week or next, yeah, Kefu, maybe with some of your help.
F
Take
take
a
look
at
the
signal
handle
to
see
if
we
can
edit
support
in
crimson
or
even
to
sis
sister,
because
because
I
think
they
are
also
suffering
from
it
in
their
home
page
of
system,
they
are
suggesting
us
to
use
this
address
to
line
wrapper
script.
So
I
think
that
also
it's
also
their
pain
point.
G
Yeah, I was thinking of that too, just wondering how. So maybe I'll get the tracing in first, and maybe a second step would be to use it in teuthology.