From YouTube: CDS G/H (Day 1) - OSD: Locally Repairable Codes
Description
https://wiki.ceph.com/Planning/CDS/CDS_Giant_and_Hammer_(Jun_2014)
24 June 2014
Ceph Developer Summit G/H
Day 1
OSD: Locally Repairable Codes
A: Hey, so, should I start now?

B: Yep, fire away. Okay, I will start, for the benefit of people who are not familiar with this, with a short introduction, which is at the URL I passed in IRC.
B: So the idea of locally repairable codes is basically to take an erasure code and apply it recursively to a subset of chunks. That is, we compute coding chunks for, let's say, 10 blocks, and we create four parity blocks, or coding blocks, and then for each five blocks we compute one more, which is presumably located in a nearby place.
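The recursive idea just described can be sketched with plain XOR standing in for a real erasure code. This is purely illustrative: the data values and the two-chunk groups are made up, and a real deployment would use a proper code (e.g. jerasure) rather than XOR.

```python
# Toy sketch of the locally repairable code idea: a "global" parity over
# all chunks, plus one "local" parity per group, so a single loss can be
# repaired inside its own group. XOR is a stand-in for a real code.

def xor(chunks):
    """XOR a list of equal-length byte chunks into one parity chunk."""
    out = bytearray(len(chunks[0]))
    for c in chunks:
        for i, b in enumerate(c):
            out[i] ^= b
    return bytes(out)

# Four data chunks, split into two "racks" of two chunks each.
data = [b"aaaa", b"bbbb", b"cccc", b"dddd"]
global_parity = xor(data)        # computed from all four chunks
local_left = xor(data[0:2])      # lives with the left rack
local_right = xor(data[2:4])     # lives with the right rack

# Losing data[0] can be repaired from the local group alone:
recovered = xor([local_left, data[1]])
assert recovered == data[0]
```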
B: In theory, you only need to move blocks within the rack. You do not need to cross rack boundaries and go fetch the blocks that are in the next rack, which may save you bandwidth, and that's basically the goal of locally repairable codes. It was previously called a pyramid code because, well, I don't understand much of all that, but my understanding is that pyramid codes involve more complex mathematical tricks to do that kind of thing.
B: So during the Giant summit I proposed a way to do that which was inconveniently complicated to explain, and hopefully I have figured out something that is simpler. So in the pad you see chunk numbers like 027, but the idea is that, let's say, we have an erasure code CRUSH ruleset that gives us eight OSDs. Let's also say that half of them are in a rack and the other half are in another rack, and we want to apply locality so that we can recover from a failure within a rack. Step one would be to do...
B
The
global
encoding
are
using
the
two
racks.
So
when
you
see
in
the
line
under
027,
you
see
addy,
it
means
the
original
code
plug-in
is
going
to
use
the
soil
to
saw
data
and
then
the
plugin
computes
coding
chunks,
which
are
see
and
the
coding
chunks
are
stored
in
the
corresponding
OSD.
That
is
for
step
one.
We
have
one
cutting
chart
in
OSD
one
and
one
cutting
chunk
in
OSD
five.
B
Then,
once
we
have
that
reply,
step
two,
which
is
to
compute
an
additional
coding
chunk
designed
to
leave
exclusively
in
one
rack,
so
we
assume
that
the
g
0
1
2
3,
are
in
rack,
and
for
that
we
take
the
content
of
the
OSD
123
to
be
data,
and
we
compute
a
coding
chunk
that
will
be
stored
in
OS,
d0
and
step
three.
We
do
the
same
in
the
other
bag,
where
we
take
the
last
30
s
DS,
to
be
data
and
store.
The
coding
chunk
that
is
produced
in
OSD,
for
does
that
make
sense.
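The three steps above can be sketched with an eight-slot OSD array. XOR stands in for the real erasure code here, so the two "global" coding chunks come out identical (a real code would produce two distinct chunks); slot roles follow the example in the talk.

```python
# Three-step layered encoding over 8 OSD slots, as described above:
# step 1 puts global coding chunks in OSDs 1 and 5, step 2 covers
# rack A (OSDs 0-3), step 3 covers rack B (OSDs 4-7). Toy data.

def xor(chunks):
    out = bytearray(len(chunks[0]))
    for c in chunks:
        for i, b in enumerate(c):
            out[i] ^= b
    return bytes(out)

osd = [None] * 8
data = [b"d2", b"d3", b"d6", b"d7"]   # four data chunks
osd[2], osd[3], osd[6], osd[7] = data

# Step 1: global coding chunks (XOR stand-in; a real code would
# compute two independent chunks here).
osd[1] = xor(data)
osd[5] = xor(data)

# Step 2: rack A (OSDs 0-3). OSDs 1, 2, 3 are treated as data,
# and the resulting coding chunk is stored in OSD 0.
osd[0] = xor([osd[1], osd[2], osd[3]])

# Step 3: rack B (OSDs 4-7). OSDs 5, 6, 7 are treated as data,
# and the resulting coding chunk is stored in OSD 4.
osd[4] = xor([osd[5], osd[6], osd[7]])

# Losing OSD 2 can now be repaired entirely within rack A:
repaired = xor([osd[0], osd[1], osd[3]])
assert repaired == b"d2"
```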
B: So at the moment we assume that, when the CRUSH rule gives us eight OSDs, the first K ones will be used to store the data chunks. But that may not be convenient if we want to do what we want to do. So let's say we decide that the erasure code backend is able to query the plugin and ask it: where are the data chunks? Will it still work? The assumption there is that we have systematic codes, that is, the data...
B: The first D will have the first twenty-five percent of the bytes of the object, of the stripe if you like, and then the next D will have the next twenty-five percent, etc. We do not have a way to control that the beginning of the data will go at the end and so on; it does not seem to be necessary for that. I proposed a pull request, which is 1911 and contains a small change to the EC backend, which is linked in the pad after the pull request.
B: So there, it's within a profile, an erasure code profile. There would be one additional key, which is "layers", that will contain the strings that I explained first, only within a JSON object, and the string that follows would be the specification of the erasure code plugin to use. So the idea is that the LRC plugin does not actually implement anything: it relies on another plugin to do the actual encoding and decoding. Specifically, an empty entry...
B: It means: use whatever default you have. And then you can also change that for something else, such as your own plugin that you want to try. And then there would be the ruleset steps. So I had a hard time with those; I see what has been added. The thing is, at first I thought it would be more convenient to specify something related to the ruleset at the same time as the specification for the coding chunk placement and so on, but it does not map.
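A profile along these lines might look something like the following. This is a sketch of the proposal being discussed, not a final API: the exact key names, the layer string syntax (D = data, c = coding, _ = slot ignored by that layer) and the profile name are illustrative, and the empty string means "use the default plugin", as described above.

```shell
# Hypothetical LRC profile for the 8-OSD example: one global layer
# across both racks, plus one local layer per rack.
ceph osd erasure-code-profile set LRCprofile \
    plugin=lrc \
    layers='[
        [ "_cDD_cDD", "plugin=jerasure technique=reed_sol_van" ],
        [ "cDDD____", "" ],
        [ "____cDDD", "" ]
    ]'
```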
B: So I propose that we specify it in a separate variable. Now, I chose not to describe the whole ruleset, but instead stick to what is strictly relevant to the layers. The idea is to help the system administrator who wants to try something that is not too far from the example that is in the documentation, just to tweak it a little. But in reality, if you want something that is advanced, you're more likely to create your own ruleset from scratch and not use the ruleset steps.
C: So this triggers the path that's in charge of creating the default ruleset? Right now it's just hard-coded to specify whatever the default failure domain type is, that's in your configuration, right? It creates a generic erasure rule.

B: Yeah, right. So we can extend that code so that the erasure plugin, if it sees that there are multiple layers, says: oh, there are three layers, well, actually any number, so that's how many failure domains.

C: Yeah, yeah.
B: Okay, so that would be exposed to the admin, and he or she would just use an LRC profile and...
B: So it starts, for encoding: we take the first layer, then we take the second one, and then the last one, and we apply the encoding to the results of the previous one. Of course, the constraint here is that the sysadmin has to know that he should not come up with layers that override the results or the data.
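The encoding pass just described, applying each layer in turn to the results of the previous ones, can be sketched as follows. To keep plain XOR honest, this toy layout uses a single global parity chunk (slot 3) instead of two, so the slots and layer strings differ from the eight-OSD example; they are illustrative only.

```python
# Encoding walks the layers in order: each layer reads its 'D' slots
# (which may include coding chunks produced by an earlier layer) and
# fills in its 'c' slot. XOR is a stand-in for a real erasure code.

def xor(chunks):
    out = bytearray(len(chunks[0]))
    for c in chunks:
        for i, b in enumerate(c):
            out[i] ^= b
    return bytes(out)

def encode(layers, chunks):
    for desc in layers:
        data = [chunks[i] for i, ch in enumerate(desc) if ch == 'D']
        for i, ch in enumerate(desc):
            if ch == 'c':
                chunks[i] = xor(data)
    return chunks

layers = ["_DDc_DD",  # global layer across both racks
          "cDDD___",  # local layer, rack A (slots 0-3)
          "____cDD"]  # local layer, rack B (slots 4-6)

# Data lives in slots 1, 2 (rack A) and 5, 6 (rack B); the layers must
# not overwrite each other's results, as noted above.
chunks = [None, b"d1", b"d2", None, None, b"d5", b"d6"]
encode(layers, chunks)
assert None not in chunks
```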
B: Now, if it turns out that we miss one chunk that can be recovered from the two independent local layers, each of them will recover the chunk, and again the iteration will stop before reaching the first layer, which is the more generic one. But that's the second case. Now, it may be the case that two chunks are missing and they are both in a local layer which is not able to recover them, because in this case we have local layers that can only recover one missing chunk.
B
B
B
You
only
have
two
missing
when
you
climb
up
the
layers,
and
so
when
you
which
layer,
one
which
is
able
to
cover
two
missing
chunks,
then
you're
in
luck,
because
you
can
do
that,
and
the
last
case
is
when
you
cannot
recover,
because
you
have
three
chunks
that
are
missing
from
in
a
place
where
you
cannot
combine
the
effects
of
the
layer
together
to
get
all
the
all
the
chunks
back.
So.
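The recovery walk described above can be sketched with the same toy layout as the encoding example (single global parity so plain XOR stays valid; slots and layer strings are illustrative). With XOR, each layer can repair at most one of its missing chunks, so the loop keeps cycling through the layers until no more progress is possible; the second assertion shows the "combine layers" case, where the global layer repairs a data chunk and a local layer then rebuilds its own parity.

```python
# Iterative layered recovery: any layer missing exactly one of its
# chunks repairs it; repeat until nothing more can be fixed.

def xor(chunks):
    out = bytearray(len(chunks[0]))
    for c in chunks:
        for i, b in enumerate(c):
            out[i] ^= b
    return bytes(out)

LAYERS = ["_DDc_DD",  # global layer across both racks
          "cDDD___",  # local layer, rack A (slots 0-3)
          "____cDD"]  # local layer, rack B (slots 4-6)

def decode(chunks):
    progress = True
    while progress:
        progress = False
        for desc in LAYERS:
            slots = [i for i, ch in enumerate(desc) if ch in 'Dc']
            missing = [i for i in slots if chunks[i] is None]
            if len(missing) == 1:  # XOR repairs exactly one loss
                have = [chunks[i] for i in slots if chunks[i] is not None]
                chunks[missing[0]] = xor(have)
                progress = True
    return chunks

# Fully encoded object: data in slots 1, 2, 5, 6 (see encoding sketch).
full = [None, b"d1", b"d2", None, None, b"d5", b"d6"]
full[3] = xor([full[1], full[2], full[5], full[6]])
full[0] = xor([full[1], full[2], full[3]])
full[4] = xor([full[5], full[6]])

# Case 1: one loss in each rack; both local layers repair independently.
damaged = list(full)
damaged[1] = damaged[5] = None
assert decode(damaged) == full

# Case 2: rack B loses its local parity AND a data chunk; the global
# layer repairs the data chunk first, then the local layer finishes.
damaged = list(full)
damaged[4] = damaged[5] = None
assert decode(damaged) == full
```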
C: In that last case, if 102, 143 and 106 are missing: like, 102 is a coding chunk, we always rebuild that one; 143 and 106 are one of the data chunks and one of the coding chunks from the first step. So we should be able to reconstruct both of those from everything from step one, right? Like, by actually moving forward down the steps instead of in reverse?
C: No? Oh yeah, you sound fine, but you talk slower than [inaudible]. So, in that last case you're missing 143 and 106, which is two chunks out of that first step, where you have four data and two coding chunks. So out of those six you're still only missing two, so you should be able to rebuild by going forward, by taking, I guess, 177, 223, 285 and 207. Can't you move forward, starting from the original remaining chunks and the coding chunk from step one; can't you rebuild?
C: Line 75 is the question. So in these other examples of repairing, the logic is sort of working backwards through the steps, but you can also move forward: like, if you lose a coding chunk from an early step and you have the data chunks, or you have enough of that layer to recover them. You can use any of these layers, effectively, to recover at any time, assuming you have enough chunks. It's...
C: Okay, okay, good, that's better. OK, so on line 75 I wrote out my question, I guess. I mean, your other examples make sense, but it sounded like you were suggesting that you had to sort of work backwards through the layers in order to reconstruct everything, and it seems like it's simpler than that. Almost, at any point in time, you look at the chunks you have and the chunks you don't have, and you look at every layer and you see:
C
Is
this
layer
able
to
recover
any
missing
chunks
based
on
chunks
that
I
have
and
if
so,
at
what
cost?
And
then
you
just
pick
the
one?
That's
the
least
cost,
or
something
something
like
that.
So
so
in
this
case,
like
in
your
last
example,
143
and
106
are
missing.
But
if
you
just
look
at
just
look
at
the
first
step
layer,
one
you're
missing
two
chunks
and
you
have
4
remaining
there.
You
can
do
one
recovery
in
theory.
You
could
do
one
recovery
to
to
build
those.
B: So if we use layer one, for example, we have to read four chunks, one from the local rack and three from a remote rack, and I think we need a cost function that would associate the I/O cost with that: something like three times the remote cost plus one times the local cost, or something. But it could be that there is also a way that you could use the local recovery codes, depending on how the layers were...
C: ...drafted, and that's kind of the point, actually. Because, yeah, if you lose one chunk, you could recover using either layer one or layer two; if you lose the third slot, there are two different possible recovery plans, and you need to put a cost associated with them and decide. I think the problem with what you were doing before is you were just assuming that later layers are cheaper, and instead you should just look at all possible recovery scenarios, assign a cost, and minimize that cost.
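The cost idea being discussed can be sketched as follows: given the set of chunks a plan would read, charge more for reads that cross racks than for local reads, then compare plans. The rack assignment, cost weights and the two example plans are all illustrative.

```python
# Illustrative recovery-plan cost function: local reads cost 1,
# cross-rack reads cost 3 (made-up weights).

RACK = {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'B', 5: 'B', 6: 'B', 7: 'B'}

def plan_cost(read_slots, primary_rack='A', local_cost=1, remote_cost=3):
    """Total I/O cost of reading the given slots from the primary's rack."""
    return sum(local_cost if RACK[s] == primary_rack else remote_cost
               for s in read_slots)

# Two hypothetical plans to repair a chunk in rack A:
local_plan = [0, 1, 3]         # three reads, all inside rack A
global_plan = [1, 2, 3, 5, 6]  # five reads, two of them cross-rack

# The cheaper plan wins: 3 vs 3 + 3*2 = 9.
assert plan_cost(local_plan) < plan_cost(global_plan)
```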
D: You can go the other way if you can't resolve it. The trouble is, I don't think it's that simple, because even in the case where you're missing two, it might be that using layers 2 and 3 separately is faster; it depends on the cost of crossing racks for the recovery.

B: Yeah. In other words, we want the weighting function to be expressive enough to capture that case, so we have to actually consider it, and...
C: I think it's also going to change over time, too. Like, the current or the initial implementation is going to have all of this done by the primary, and so using layer 3 to recover something in layer 3 is going to be expensive kind of no matter what: you're sending it all between racks and then back again.

B: Oh yes, right.

C: So I think being able to have that cost function and express it is going to be important.
D: Sort of all possible... okay, so I think there are two things with the recovery plan: we want it to be not wrong, and we want it to be not slow. So the first step, I think, is to actually generate all possible recovery plans and find the shortest one; then we adopt whatever not-hideously-expensive heuristic we come up with to replace it.

B: Yeah, yeah. Well, by writing the brute-force version first, I think that's the right way, and...
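The brute-force step just described could be sketched like this: exhaustively search every sequence of single-layer repairs, cost each sequence, and keep the cheapest. The layer strings, the one-loss-per-layer repair rule (an XOR-style assumption) and the cost weights are all illustrative; a validated heuristic would later replace the exhaustive search.

```python
# Exhaustive search over recovery plans: at each step, any layer
# missing exactly one of its chunks may repair it; recurse on the
# remaining missing set and minimize total read cost.

LAYERS = ["_DDc_DD",  # global layer (data slots 1,2,5,6; coding slot 3)
          "cDDD___",  # local layer, rack A (slots 0-3)
          "____cDD"]  # local layer, rack B (slots 4-6)

def best_plan(missing, cost_of_reads):
    """Return (total_cost, [(layer_index, repaired_slot), ...]),
    or (inf, None) when the missing set cannot be recovered."""
    if not missing:
        return 0, []
    best_cost, best_steps = float('inf'), None
    for li, desc in enumerate(LAYERS):
        slots = {i for i, ch in enumerate(desc) if ch in 'Dc'}
        lost = slots & missing
        if len(lost) == 1:  # XOR-style layer: repairs one loss
            slot = next(iter(lost))
            step = cost_of_reads(slots - {slot})
            rest_cost, rest_steps = best_plan(missing - {slot}, cost_of_reads)
            if step + rest_cost < best_cost:
                best_cost = step + rest_cost
                best_steps = [(li, slot)] + rest_steps
    return best_cost, best_steps

# Illustrative cost: slots 4-6 are remote from the primary, so x3.
cost = lambda reads: sum(3 if s >= 4 else 1 for s in reads)

# Losing slots 4 and 5: the global layer must repair 5 first (cost 6),
# then the rack-B local layer rebuilds 4 (cost 6).
assert best_plan({4, 5}, cost) == (12, [(0, 5), (2, 4)])
```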
D: Actually, it's extremely straightforward. I don't think there's any reason to push the ordering and consistency stuff out of the primary. I think the only change is that, instead of pulling the relevant pieces and then pushing them, the primary sends, for lack of a better word, a remote... wait, what's the word? A remote method call, an RPC, to the replica with the relevant information, and it just does it. Yeah. But I don't think it's conceptually hard at all.
C: Wait, okay, one of them can be inferred somehow; I can't remember how they did that, but I'm just wondering if that can be captured in here, like in layer 2, or if it would just collapse into one big layer, basically. Do you remember what the win was? It's one less, it's one less chunk to store, I know.
C: I think, basically, it's that that parity block could be reconstructed from S1 and S2 also, somehow, or from all the others; I don't remember exactly, but they had to store one less. I'm just wondering, not that it's that important, because ultimately I think the flexibility is probably more useful and the overhead is going to be like three percent or something, but I'm just wondering if this particular encoding scheme could be captured within the framework. Yeah.
C: So, was that at all clear? Did you hear that? Did you hear the question? Okay. So, in the Facebook paper, the locally repairable code that they did, this is the paper where the figure came from that you've got in the blueprint, they didn't have to store the final implied parity block, because they used, you know, a weird linear algebra trick to make sure that it was carefully chosen so that it could be inferred from the other parity blocks. Basically, something like that.
D: Right now, well, I mean, actually, I guess if you think of each layer as a parity declaration, then yeah, you can't. But if you think of it as a logical dependency graph, then sure you can: you just add an additional phantom layer, and you just make sure that whatever piece of code is invoked knows what its role is, I mean.
D: Right. So, to describe the diagram in the blueprint: it seems like level one would define 10 data blocks and four parity blocks; level two would define five data blocks and one parity block; level three would define five data blocks and one parity block; and level four would define four data blocks and one parity block.
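The four levels just described could be written as layer descriptor strings in the style discussed earlier (D = data slot, c = coding slot, _ = slot ignored by that layer). The 17-slot layout, with slots 0-9 for data, 10-13 for the global parities and 14-16 for the local parities, is a hypothetical rendering of the blueprint figure, not an agreed format.

```python
# The Facebook-paper figure as hypothetical layer descriptors:
# 10 data blocks, 4 global parities, and 3 local parities.
levels = [
    "DDDDDDDDDDcccc___",  # level 1: 10 data, 4 global parity
    "DDDDD_________c__",  # level 2: first 5 data, 1 local parity
    "_____DDDDD_____c_",  # level 3: second 5 data, 1 local parity
    "__________DDDD__c",  # level 4: the 4 global parities, 1 local parity
]
assert all(len(level) == 17 for level in levels)
assert levels[0].count('D') == 10 and levels[0].count('c') == 4
assert all(level.count('c') == 1 for level in levels[1:])
```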
A: So, do we have anything else we want to add to that one? Do you want to jump into the erasure coding? I mean, the two kind of seem to blend together there.