From YouTube: 2019-03-07 :: Ceph Developer Meeting
Description
Every month the Ceph Developer Community meets to discuss current work in the Ceph codebase, and coordinate efforts to minimize collisions and issues.
This monthly Ceph Developer Meeting will occur on the first Wed of every month via our BlueJeans teleconferencing system. Each month we alternate meeting times to ensure that all time zones have the opportunity to participate.
Meeting planning:
https://tracker.ceph.com/projects/ceph/wiki/Planning
A
A
B
C
C
B
B
So right now we're keeping it very simple. If we had wanted to, since it's in the OSD, it would have been neat to have a bitmask of the error flags, and maybe you could have a restricted set of flags that says we're only going to auto-repair if it's these things and not other things. But that would also impact the way it builds the list of inconsistent objects that are then going to be repaired.
B
So even if we did something like that, it would be to say we don't try to repair every object. We only repair, like you were saying, obvious media errors like an EIO, or, you know, subject to the flags that we keep, and we don't actually have a separate I/O flag. We just have "we got a read error", yeah.
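A minimal sketch of the bitmask idea being discussed, with hypothetical flag names (this is not the actual OSD scrub code): a full set of error flags plus a restricted mask of the ones that would be eligible for auto-repair.

```cpp
// Illustrative only: a bitmask of scrub error flags plus a restricted
// "auto-repairable" subset, so only obvious media errors (a read error /
// EIO) qualify. These names are hypothetical, not the actual OSD code.
#include <cstdint>

enum scrub_error_t : uint32_t {
  SCRUB_ERR_READ        = 1u << 0,  // the read itself returned EIO
  SCRUB_ERR_DATA_DIGEST = 1u << 1,  // data checksum mismatch
  SCRUB_ERR_OMAP_DIGEST = 1u << 2,  // omap checksum mismatch
  SCRUB_ERR_MISSING     = 1u << 3,  // object missing on a shard
  SCRUB_ERR_SIZE        = 1u << 4,  // size mismatch between shards
};

// The restricted set: only these errors are eligible for auto-repair.
constexpr uint32_t AUTO_REPAIRABLE = SCRUB_ERR_READ;

bool can_auto_repair(uint32_t flags) {
  // Repair automatically only if at least one error was recorded and
  // every recorded error falls inside the allowed set.
  return flags != 0 && (flags & ~AUTO_REPAIRABLE) == 0;
}
```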
C
A
That's an unrecoverable inconsistency or whatever, and you might want to raise a health alert only if you try to fix it and couldn't. Oh, the proposal is to add a new PG state flag for a failed repair, right, and maybe a different health warning level. I think "inconsistent"... is that an error or a warning right now? Is this a warning now?
B
C
B
As long as we're trying to repair everything, then at the end, when we're deciding, if the inconsistent state isn't going off after the repair, then we know we should set failed-repair. If we're not trying to repair everything, then yeah, you'd have to figure out which ones you did and did not repair.
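As a rough sketch of that end-of-repair check, with hypothetical state-bit names rather than the real PG state flags:

```cpp
// Illustrative only: if the inconsistent state does not clear once repair
// has run, set a hypothetical "failed repair" PG state bit so a different
// (higher-severity) health warning can be raised.
#include <cstdint>

constexpr uint64_t PG_STATE_INCONSISTENT  = 1ULL << 0;  // hypothetical values
constexpr uint64_t PG_STATE_FAILED_REPAIR = 1ULL << 1;

uint64_t finish_repair(uint64_t pg_state, unsigned errors_remaining) {
  if (errors_remaining == 0) {
    // Everything repaired: both flags can come off.
    pg_state &= ~(PG_STATE_INCONSISTENT | PG_STATE_FAILED_REPAIR);
  } else {
    // Repair ran but the inconsistency did not go away.
    pg_state |= PG_STATE_FAILED_REPAIR;
  }
  return pg_state;
}
```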
A
Then we want to just maintain counts of repairs that we do, so we know if it happened a zillion times on a PG or on an OSD. All right, let's see, on the OSD we don't have a place right now where we persist OSD-level statistics, so we'd have to add a field to the superblock, or add another object, just a little counter or something in the meta collection, to keep track of those.
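A sketch of the superblock variant of that idea, assuming made-up field and function names; the meta-collection-object variant would look similar but write a small dedicated object instead:

```cpp
// Illustrative only: persisting an OSD-level repair count.
#include <cstdint>

struct osd_superblock_sketch {
  // ... existing superblock fields would live here ...
  uint64_t lifetime_repairs = 0;   // objects this OSD has repaired, ever
};

void note_repairs(osd_superblock_sketch &sb, unsigned repaired_now) {
  sb.lifetime_repairs += repaired_now;
  // In the real OSD the updated superblock (or meta-collection object)
  // would be re-encoded and written back as part of a store transaction.
}
```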
B
Right, getting the repair count might be a little tricky, because again, that's part of recovery, and knowing that you did or didn't repair any specific thing. So if you just did the counting the way we output it now, where we say you have ten errors and we're trying to repair them, we assume we fix them when we start recovery. So if we tally it that way and then you had failed repairs, theoretically your repair count would go up like every week.
A
A
B
A
When that inconsistent flag is set, doesn't it have a count for how many objects are inconsistent, how many inconsistencies there are? Yes, so it could be that when we clear that flag, and we reset that inconsistency count, we just use that number as the number that we cleared, right? Actually...
B
A
A
A
A
Yeah, and the current idea for how this would work would be, if we set the default balancer mode to be the crush-compat mode, which means that there's a compat weight-set, a sort of adjusted set of crush weights that are independent from the normal crush weights you set based on the size. So when you add a new OSD, its real crush weight would be set to the size of the device, but the weight-set value for it is separate.
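A sketch of the two weights in play here, with hypothetical names; the gradual ramp-up of a new OSD's weight-set value is the part the balancer would manage:

```cpp
// Illustrative only: each OSD has its "real" CRUSH weight (set from the
// device size) and a separate compat weight-set value that placement
// actually uses, which a balancer could ramp up gradually for a freshly
// added OSD.
#include <algorithm>

struct osd_weights_sketch {
  double crush_weight;       // reflects the device size, set at creation
  double compat_weight_set;  // adjusted independently by the balancer
};

// Walk the weight-set value toward the real weight one step at a time.
void ramp_up(osd_weights_sketch &w, double step) {
  w.compat_weight_set =
      std::min(w.crush_weight, w.compat_weight_set + step);
}
```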
A
D
A
A
I wonder if, eventually, that should be the default behavior; it's kind of all there. The other thing is that it sort of walks you into the crush-compat balancer mode, because then you have this sort of separate set of crush weights that are independently managed from what you expect them to be. If you're using the upmap balancer instead, there is no such analog; you don't have that gradual ramp-in.
A
A
E
A
E
A
E
E
E
You're gonna get garbage. You can dump a gcore, but you have to dump a gcore of each thread individually, so you have to list out all the threads and dump them each individually. Then you can take all of those cores and go through them one by one in gdb, in a debugging environment, but they're not going to make a lot of sense, because they weren't all captured at the same time; they were captured at different times. So I see this is a problem. We need to have some...
A
E
Yeah, because there are certain things... The reason this came up is we were trying to verify that the user was seeing a thread deadlock. Now, the only way I know of to do that is to dump out the threads, or get a core dump, or you can set up abrt on the machine and ask it to dump core, and you will end up with a core dump if you've got the right core pattern and settings in the kernel.
E
A
E
A
E
A
I wonder if we could or should make, like, an admin socket command that will just dump a stack trace for every thread. It'll just iterate across the threads and dump the stack for each one. It won't be as precise as a GDB one, 'cause you won't have all the arguments or anything, but at least we'll know where you are.
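One way such a command could collect per-thread stacks, as a Linux-only sketch (this is not the actual Ceph admin socket code, and backtrace() from a signal handler is only best-effort): signal every task listed under /proc/self/task and have the handler print its own backtrace.

```cpp
// Illustrative only: dump a backtrace for every thread in the process by
// signalling each task ID under /proc/self/task (Linux-specific).
#include <dirent.h>
#include <execinfo.h>
#include <signal.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

static void dump_backtrace_handler(int) {
  void *frames[64];
  int n = backtrace(frames, 64);
  // Write symbolised frames straight to stderr; a real admin socket
  // command would stream this back to the client instead.
  backtrace_symbols_fd(frames, n, STDERR_FILENO);
}

void dump_all_thread_backtraces() {
  struct sigaction sa = {};
  sa.sa_handler = dump_backtrace_handler;
  sigaction(SIGUSR2, &sa, nullptr);

  DIR *dir = opendir("/proc/self/task");   // one entry per thread
  if (!dir)
    return;
  while (struct dirent *de = readdir(dir)) {
    if (de->d_name[0] == '.')
      continue;
    pid_t tid = atoi(de->d_name);
    fprintf(stderr, "--- thread %d ---\n", tid);
    syscall(SYS_tgkill, getpid(), tid, SIGUSR2);  // deliver to that thread
    usleep(1000);  // crude serialisation so output doesn't interleave
  }
  closedir(dir);
}
```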
F
E
A
E
F
A
G
A
A
A
A
A
Couldn't you make it so that it warns if there are any crashes in the last 10 days that haven't been, like, acknowledged, and there's a new command that says "acknowledge this", and then the health warning goes away? Could do that, maybe.
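The warning predicate being proposed might look something like this sketch, with a hypothetical crash record type and the 10-day window from the discussion:

```cpp
// Illustrative only: warn if any crash in the recent window has not been
// acknowledged. The real mechanism would live in the manager, not in a
// free function like this.
#include <chrono>
#include <vector>

struct crash_record_sketch {
  std::chrono::system_clock::time_point when;
  bool acknowledged = false;
};

bool should_warn(const std::vector<crash_record_sketch> &crashes,
                 std::chrono::hours window = std::chrono::hours(24 * 10)) {
  auto cutoff = std::chrono::system_clock::now() - window;
  for (const auto &c : crashes) {
    if (!c.acknowledged && c.when >= cutoff)
      return true;   // at least one recent crash nobody has acknowledged
  }
  return false;
}
```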
G
I mean, it's good to have that data, but it really seems to me like a thing that either gets gathered up when someone is looking into issues with the cluster, or that gets automatically phoned home, not something that admins are gonna pay any attention to in any other circumstance, you know.
G
A
So the reason I noticed this was because I was just staring at a watch on the cluster, and I saw an OSD down and then it disappeared, and I was like, I wonder what that was. There wasn't actually a crash; I just saw it. I finally went onto that machine, and it was the out-of-memory killer, because these mira machines have, like, no memory on them.
A
G
A
G
Like, that's where it might be like, okay, there's clearly something permanently wrong with this daemon. But in most of those cases, you're gonna end up in a different alert system, like, okay, now that daemon's dead because systemd eventually stopped restarting it, or there's a missing logical MDS that is never...
A
I mean, not always. Like, say every time you scrub a PG it crashes, for example. Then once a day I know an OSD's gonna go down and come back up, and then you're not going to notice, right? Unless you happen to be looking right then, and you see it and you're like, oh, the OSD went down and up. 'Cause we don't even monitor that, and I think... I don't know, you're right.
G
Like, that's a great thing to notice, but that's a very, very precise trigger condition, right? Because, you know, unless you think the crashes are so rare, which they might be, that you can just always alert on a crash and make an admin acknowledge it. But yeah, I don't think that's gonna work. It might.
G
G
G
E
G
A lot of those media errors actually aren't turning into crashes anymore, and actually I saw a ticket today where someone was complaining about that. They're like, it automatically recovered because it was erasure-coded, and now I don't know that my disk is going bad. Please tell me something.
C
A
A
There you go, that's thinking outside the box. Yes, that would be so much better. Yeah, yeah, just call exit, like a clean exit. I mean, it could even be exit(1), I guess, but we can give it an exit code that makes it an error condition where systemd doesn't try to restart it, probably.
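A sketch of that shutdown path under the assumption of a made-up exit status; a unit could then list the chosen status in systemd's RestartPreventExitStatus= so the daemon is not restarted:

```cpp
// Illustrative only: exit with a distinguished status on a fatal disk error
// so the service manager can be told not to restart the daemon.
#include <cstdio>
#include <cstdlib>

constexpr int EXIT_FATAL_DISK_ERROR = 57;   // hypothetical, just not 0 or 1

[[noreturn]] void fatal_disk_error(const char *what) {
  std::fprintf(stderr, "shutting down: unrecoverable disk error: %s\n", what);
  std::exit(EXIT_FATAL_DISK_ERROR);
}
```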
E
E
A
Yeah, that would be good, I think. The reason why we... well, yeah, we don't have to crash and core dump and all that stuff right now; we don't have to do that. I think if we were to go one step further, it would be nice to add a nice message to the log that says, by the way, I'm shutting down because of XYZ, and log it and all that stuff.
E
A
A
A
And granted, we can't go through the normal shutdown sequence, but we could log a crash report. So, like, you'll notice in the one I've pasted, if you just crash and hit a segfault, it'll just have the backtrace. But if you hit an assert, it puts all the assert metadata there, like what file and line number and what the condition was. We could do something similar for an I/O error, like a special function that logs a crash report and says...
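A sketch of such a special function, with hypothetical names and output format rather than the actual crash-report plumbing:

```cpp
// Illustrative only: record assert-style metadata for an I/O error before
// shutting down, so the crash report carries file, line, and errno rather
// than just a backtrace.
#include <cstdio>
#include <cstring>

void log_io_error_report(const char *file, int line, int err) {
  // In the real daemon this would go through the same path that writes
  // assert metadata into a crash report.
  std::fprintf(stderr,
               "{\"io_error\": true, \"file\": \"%s\", \"line\": %d, "
               "\"errno\": %d, \"strerror\": \"%s\"}\n",
               file, line, err, std::strerror(err));
}

// e.g. log_io_error_report(__FILE__, __LINE__, EIO); before exiting
```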
A
E
E
E
E
A
A
So this is what the thing that Adam worked on did: it hijacked the threads and, like, made them execute this code and then go back to what they were doing, doing some trickery. I don't know what he did, but, tell you what, you could use something like that, basically. I'll see if I can talk to Adam. Yeah, I think it's on GitHub. It was called PMP, libpmp or something, poor man's profiler, that's what it was, yeah.