From YouTube: CDS Infernalis (Day 1) -- OSD: Scrub and Repair
Description
Videos from Ceph Developer Summit: Infernalis (Day 2.1)
04 March 2015
https://wiki.ceph.com/Planning/CDS/Infernalis_(Mar_2015)
A
B
For anyone who attended the Giant or Firefly or perhaps even Emperor CDS, this one will seem familiar: it is the scrub and repair blueprint. Namely, we would like there to be a RADOS interface where you can find out about the repair state of a PG.
B
The main driver for this is less that you'd use it directly and more that we would use it to build a smarter repair tool: at first simply a command-line utility where you could query which thing is inconsistent, and how, about a PG, and then make a decision about what to do about it. So the interface, if I recall correctly, has a couple of pieces. One is that you need to know which PGs are inconsistent, and you need to do it in a way that doesn't require enumerating all PGs at once.
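One way to picture listing inconsistent PGs without enumerating them all at once is a cursor-style, paginated call. The Python sketch below is illustrative only: `list_inconsistent_pgs`, its `start_after` and `max_results` parameters, and the in-memory `INCONSISTENT_PGS` list are hypothetical names, not part of any RADOS API discussed in the session.

```python
from typing import Iterator, List, Optional, Tuple

# Hypothetical in-memory stand-in for the cluster's set of inconsistent PGs.
INCONSISTENT_PGS = sorted(["1.0", "1.3", "1.7", "2.1", "2.4"])


def list_inconsistent_pgs(start_after: Optional[str] = None,
                          max_results: int = 2) -> Tuple[List[str], Optional[str]]:
    """Return one page of PG ids plus a cursor for the next page (None when done)."""
    remaining = [pg for pg in INCONSISTENT_PGS
                 if start_after is None or pg > start_after]
    page = remaining[:max_results]
    # Only hand back a cursor if there is more to fetch after this page.
    cursor = page[-1] if len(remaining) > max_results else None
    return page, cursor


def iterate_all() -> Iterator[str]:
    """Walk every page; the caller never holds more than one page at a time."""
    cursor = None
    while True:
        page, cursor = list_inconsistent_pgs(cursor)
        yield from page
        if cursor is None:
            break
```

The cursor is simply the last PG id returned, so the server side needs no per-client state between calls.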
B
B
Next is a way to query information about the inconsistent PG. The activation epoch is useful because we may not want the information we keep on inconsistency to persist across epochs, across interval changes. Or maybe we will; it's hard to say. In that case, if the activation epoch is later, you would get back something like EAGAIN, and you'd have to scrub it again before the information was up to date.
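The activation-epoch check can be sketched roughly as follows. This is a minimal Python illustration under assumptions: `PGScrubState`, `query_inconsistent`, and the idea that the client passes the epoch it last saw are hypothetical, chosen only to show the "return EAGAIN if the interval changed" behavior described above.

```python
import errno
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PGScrubState:
    # Epoch at which the PG's current interval was activated (hypothetical field).
    activation_epoch: int
    # Object names the last scrub in this interval flagged as inconsistent.
    inconsistent_objects: List[str] = field(default_factory=list)


def query_inconsistent(pg: PGScrubState, client_epoch: int) -> Tuple[int, List[str]]:
    """Return (0, objects) if the client's epoch is current, else (-EAGAIN, [])."""
    if pg.activation_epoch > client_epoch:
        # The PG re-activated since the client last looked; the recorded scrub
        # results may be gone, so the caller has to scrub again and retry.
        return -errno.EAGAIN, []
    return 0, list(pg.inconsistent_objects)
```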
B
Aside from that, it just gives you information about which objects are inconsistent, so that you can then feed them to the op at the bottom. There's a list of inconsistency information with an as-yet-undefined inconsistent_info_t, which probably has something like a bool for "data is inconsistent" or, you know, "metadata is inconsistent", and then a list of the relevant fields and checksums from all the replicas.
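Since inconsistent_info_t is explicitly undefined at this point, the following Python sketch is only one guess at its shape: two booleans plus per-replica fields. The class names, the field names, and the choice to treat a size mismatch as a metadata inconsistency are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ReplicaInfo:
    size: Optional[int] = None           # object size as seen by this replica
    data_checksum: Optional[int] = None  # checksum of the object data


@dataclass
class InconsistentInfo:
    object_name: str
    data_inconsistent: bool = False
    metadata_inconsistent: bool = False
    # Relevant fields from every replica, keyed by OSD id.
    replicas: Dict[int, ReplicaInfo] = field(default_factory=dict)


def summarize(name: str, replicas: Dict[int, ReplicaInfo]) -> InconsistentInfo:
    """Build the record by comparing what each replica reported."""
    checksums = {r.data_checksum for r in replicas.values()}
    sizes = {r.size for r in replicas.values()}
    return InconsistentInfo(
        object_name=name,
        data_inconsistent=len(checksums) > 1,
        metadata_inconsistent=len(sizes) > 1,  # assumption: size counts as metadata
        replicas=dict(replicas),
    )
```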
B
B
This interface doesn't extend super well to erasure-coded PGs, because "replica to use" is less helpful. But on the other hand, this should almost never be necessary for an erasure-coded PG. So maybe the answer is something like "mark replica unreliable" for this object, and then once you've done that to enough of them, to all the ones that don't agree, the OSD fixes it. That would actually work for [unclear] within an object.
B
B
So, when this was first written, I think it was prior to hit sets, so we didn't have that sort of internally created object. We could do this like hit sets and make them first-class RADOS objects, or, you know, 1.5-class RADOS objects. Or we can make them like the omap thing, where it's a per-OSD-maintained LevelDB index, and choose not to maintain it beyond peering intervals.
B
C
B
So I think this can't be defined much more until we actually start working on it. That's probably the reason why so little has happened on this one: it's hard to define exactly, because the first four person-weeks are just restructuring the existing code until something like this can even fit. Anyway...
C
C
B
So, if part A is to do... okay, if we want to put off actually implementing the repair, we can do that too. That's actually what I mean. That's the hard part, though, because the repair has to be initiated as a librados-started operation with some kind of completion, which mixes the worlds of client I/O and recovery, which are currently asunder.
C
D
C
B
B
B
B
C
Yeah, I mean, I would assume that, and again this is a tangent, in order to make the client-driven EC read work, when you do all the reads you'd look at the version that comes back in all of them, and only if they all match would you do the decode and return the read. And in the case where you're reading an old replica or something, you would see that...
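The version check for a client-driven EC read might look roughly like the Python sketch below. It is a simplified illustration: `ec_read` is a hypothetical name, each shard reply is assumed to be a `(version, chunk)` pair, and the `decode` argument stands in for a real erasure-code decoder (the usage note uses plain concatenation, which is not an actual erasure decode).

```python
from typing import Callable, List, Optional, Tuple

ShardReply = Tuple[int, bytes]  # (object version seen by the shard, chunk bytes)


def ec_read(shard_replies: List[ShardReply],
            decode: Callable[[List[bytes]], bytes]) -> Optional[bytes]:
    """Decode only if every shard reports the same object version."""
    versions = {version for version, _ in shard_replies}
    if len(versions) != 1:
        # No replies, or at least one shard is stale: do not decode.
        # The caller has to retry or fall back to another read path.
        return None
    return decode([chunk for _, chunk in shard_replies])
```

For example, `ec_read([(5, b"ab"), (5, b"cd")], b"".join)` decodes, while a reply set mixing versions 5 and 4 returns `None`.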
C
C
C
B
D
D
C
D
You can import, right, with an offline OSD, or with import-rados, I mean, you know, with RADOS running. Yeah, I see.
D
B
B
We can actually do that. Did that make sense? You wouldn't be able to stop just one OSD or two OSDs, because you might get unlucky and peering happens in the meantime. You have to actually stop all the OSDs, identify a healthy copy, export it, kill all copies, blast them, and then restart all the OSDs, force-create the PG, and then import them. It's actually not that hard, assuming that it works.
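The ordering of that procedure can be written down as a small plan builder. This Python sketch only produces descriptive step labels; `plan_pg_reimport` is a hypothetical helper and the strings are not real Ceph commands. The point it illustrates is that every OSD is stopped before the export, so peering cannot run in the middle.

```python
from typing import List


def plan_pg_reimport(pgid: str, osds: List[int], healthy_osd: int) -> List[str]:
    """Return the ordered steps for the stop-all / export / re-import workflow."""
    if healthy_osd not in osds:
        raise ValueError("the healthy copy must live on one of the PG's OSDs")
    steps = [f"stop osd.{o}" for o in osds]                    # stop all of them first
    steps.append(f"export pg {pgid} from osd.{healthy_osd}")   # keep one good copy
    steps += [f"remove pg {pgid} from osd.{o}" for o in osds]  # blast every copy
    steps += [f"start osd.{o}" for o in osds]
    steps.append(f"force-create pg {pgid}")                    # PG comes back empty
    steps.append(f"import exported objects into pg {pgid}")
    return steps
```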
B
D
B
B
In this case you'd add an additional top-level thrasher option, which is [unclear] import-rados, which would record the currently alive OSDs, kill all of them, and rescue one valid copy of the placement group. Because you have to do it after clean, too, that kind of sucks. But that's fine. Yeah, okay, so this would have to be something that happens some percentage of the time after the wait-for-clean thing.
B
D
C
B
C
B
C
B
C
I mean, right now the transplant is kind of this annoying thing where you rescue a PG from somewhere, and then you have to go find some other OSD and stop it and inject it and start it. It may be nice to just be able to [unclear]. On the other hand, the import workflow is kind of scary, because you blow away all the bad copies and...
C
B
Well, there are sort of two versions of it. One is that something has happened that has just messed up this PG's state, and that's the case where the import-rados one, I think, is valuable. Yeah, but in the case where you actually just lost too many OSDs, you'd rather transplant a [unclear] copy. In that case, yeah, because you strip out all of the offending RADOS-level metadata, or sub-RADOS-level metadata, by doing the import-rados thing, which is a good workaround.
B
B
C
C
Yeah, I mean, injecting the EC shards through RADOS sounds just hugely complicated too, because you have to read all shards in parallel. Hopefully they order the objects the same, so you actually have all the pieces. You have to do the EC reassembly in a tool and then inject the raw data. That's just, actually...
B
C
And I mean, without loss of generality, you can always just take the shards you have and inject them onto one OSD, actually one random OSD. Yeah, yep, exactly, except [unclear]. But we could probably make an option. Probably the simplest thing would just be to make an option so that when you're importing the PG you just zap all the PG-level metadata and reset it to some simple, sane state.
B
C
B
Okay, so the first step on this one is to write out an actual plan for the scrub object. For one thing, we need to define inconsistent_info_t. We need to define what information actually gets written out for each object and where it gets written out, down to how the object is named and how it interacts with the appropriate RADOS read operations.
B
Mostly this isn't too bad, because the RADOS read operation is going to be a read on a funny-looking object, which doesn't give you the actual bytes but interprets them for you and gives you back structured output. That is easy; there's machinery in the OSD for that. I think [unclear] is just better, for my money, anyway. Yeah, I don't know if anyone has a different idea on that one, but I mean...
C
A
C
C
B
Because we need to keep more information than [unclear]. Okay, so either in the info, or on this thing, or just volatile in memory, there will be information about the current state of this thing, and as we go through the scrub chunks we will be updating a set of bounds on which objects we do have inconsistency information on. Remember, the scrub could be stopped partway, and I think we could [unclear] the scrub. Oh, and that means the interface needs to include information on which objects we have information on.
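Tracking "which objects we have information on" as the scrub advances chunk by chunk could be sketched like this. The Python below is a simplification under assumptions: `ScrubProgress` is a hypothetical class, object names are plain strings, and chunks are assumed to arrive in sorted name order so that a single upper bound describes the scrubbed range.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ScrubProgress:
    # Highest object name scrubbed so far; None means nothing scrubbed yet.
    scrubbed_through: Optional[str] = None
    inconsistent: List[str] = field(default_factory=list)

    def record_chunk(self, chunk_objects: List[str], bad_objects: List[str]) -> None:
        """Advance the bound after one scrub chunk (names sorted, chunks in order)."""
        if chunk_objects:
            self.scrubbed_through = chunk_objects[-1]
        self.inconsistent.extend(bad_objects)

    def status(self, obj: str) -> str:
        """Answer 'unknown' for anything beyond the range scrubbed so far."""
        if self.scrubbed_through is None or obj > self.scrubbed_through:
            return "unknown"
        return "inconsistent" if obj in self.inconsistent else "consistent"
```

With this shape, a scrub that stops partway still lets a client tell "known consistent" apart from "not yet examined".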
B
You want to know that the scrub is in progress, and that you have information on this but not that, which is very annoying, because the...
B
B
C
D
B
D
D
B
B
D
B
C
C
B
Which things make sense to the client? The version makes sense to the client. So let's say the object infos don't agree. Let's distill down which things should be propagated to the client. So, for each shard, the object version certainly should be. The size should be. The checksum should be; that one's easy.
B
No, I think the actual checksum should be sent. There could be a known-error field, so if it's just not there, then this error field would say "missing shard" or "missing checksum". Likewise if we know the checksum is going to be wrong, because it's an erasure-coded object, or because it's a replicated object and it doesn't agree with the full-object digest. But let's say everyone is there, we don't have a full-object digest, they didn't return EIO, but they don't agree. Then for each one we should actually return the version.
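The per-shard report with a known-error field could be pictured as below. This is a hedged Python sketch: `shard_report`, the dictionary keys, and the error strings `"missing_shard"` and `"missing_checksum"` are illustrative names taken loosely from the discussion, not a defined wire format.

```python
from typing import Dict, Optional


def shard_report(shards: Dict[int, Optional[dict]]) -> Dict[int, dict]:
    """Build what would be sent to the client for each shard.

    `shards` maps a shard id to the scrub result for that shard, or None if
    the shard was not found at all.
    """
    report = {}
    for shard_id, info in shards.items():
        if info is None:
            report[shard_id] = {"error": "missing_shard"}
        elif info.get("checksum") is None:
            report[shard_id] = {"error": "missing_checksum",
                                "version": info.get("version")}
        else:
            # No known error: send the actual checksum and version so the
            # client can see how the shards disagree.
            report[shard_id] = {"error": None,
                                "version": info.get("version"),
                                "checksum": info["checksum"]}
    return report
```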
B
They should return the set of attrs they do have. Well, if the object info isn't present, then we know that's practically an EIO. I think if any of the OSD-specific metadata is missing from a shard, that's an EIO. Yeah, but so: eversion_t version. What about the CRC32 checksum case? It's actually valid, say...
B
C
B
B
C
C
C
B
B
C
B
I mean, a file system goes to some effort to arrange for it to be asymptotically better than that, right, usually. Well, I guess not. Still, if it's [unclear], it's a massively different ballpark of pain. It should probably have the ability to remove a range of keys; it could do it very efficiently, but it doesn't do it.
C
Nice, [unclear]. Anyway. Okay, that's probably enough to go on. Yeah, all right, so plan of record: finish the RADOS import of a PG export, plus testing, and then shift to this. Probably we need to [unclear]; start by storing the scrub state. Yeah, but we'll need to fix the read-from-replica stuff at some point, in parallel.