Description
Videos from Ceph Developer Summit: Infernalis (Day 2.1)
04 March 2015
https://wiki.ceph.com/Planning/CDS/Infernalis_(Mar_2015)
A
So here we are with the first session, which is the discussion about erasure coded pool overwrite support. Sam, you want to take it away?
B
All right, so when we went ahead and implemented erasure coding last year, or whenever we did it, we made a choice to limit the interface to append-only, xattrs, and delete: basically the things that we can roll back easily without needing to stash a bunch of extra information.
B
So that works very well for the base tier in a cache tiering situation. It works well for radosgw. It works abysmally for RBD, since you need to be able to do partial overwrites with RBD.
B
So there's been some noise about trying to add overwrite support to erasure coded pools, so that you can run RBD directly on an erasure coded pool without needing either a cache tiering solution, which doesn't work that well for RBD, or needing to put up with 3x replication for your VM images.
B
So the short version of the reason why we didn't do this to begin with is that, unlike a RAID controller, we can't just use NVRAM to make sure we don't lose the updates.
B
So, as I see it, there are two sorts of approaches to how we do this: either we apply the update in place and atomically maintain a rollback log with the old data, or we journal ahead and then apply the update in place once all replicas have committed. We'll call those the rollback log and two-phase commit approaches.
B
So the rollback log would be an extension of what we already do. The way it would work is: when you accept a write, you read the corresponding data from the object, and you create a transaction that writes that data to a write-aside log atomically with the update that you write into the object.
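As a rough illustration of the rollback-log idea just described (the single-node model and all names here are invented; a real OSD would do this inside a replicated transaction), the old bytes are stashed atomically with the in-place update so the write can later be undone:

```python
class RollbackLogStore:
    """Toy object store that stashes old data so overwrites can be undone.

    The data update and the log append happen in one step here, standing
    in for the atomic transaction the OSD would use.
    """

    def __init__(self, size):
        self.data = bytearray(size)
        self.rollback_log = []  # list of (version, offset, old_bytes)
        self.version = 0

    def write(self, offset, buf):
        # Read the bytes we are about to clobber...
        old = bytes(self.data[offset:offset + len(buf)])
        self.version += 1
        # ...and record them atomically with the in-place update.
        self.rollback_log.append((self.version, offset, old))
        self.data[offset:offset + len(buf)] = buf
        return self.version

    def rollback_to(self, version):
        # Undo writes newer than `version`, most recent first.
        while self.rollback_log and self.rollback_log[-1][0] > version:
            _, offset, old = self.rollback_log.pop()
            self.data[offset:offset + len(old)] = old
```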
B
This requires, first, a read of the relevant piece of the object. Either we send the data to the primary and the primary packages the rollback information into the repop, or each replica performs the read locally when it receives the transaction. Either way, we still need to read the object before we can perform the write, and this adds some additional complications, like with pipelined writes.
B
You can't just read the object, since it might have a pending write on it, and you need the most recent logical version of the object, not simply whatever happens to be on disk. So this implies that we'll need to buffer any unstable portions of the object in the OSD, all of which is kind of doable.
B
So the other option is the two-phase commit approach, which at first blush seems worse, because we need to perform two commits and they're dependent. But we can reply that the write is complete when all replicas respond with a prepare. So once all replicas are prepared, we know the write won't be rolled back, and we can reply with success, as long as we make sure that any reads are blocked until the write is committed.
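A toy model of that two-phase commit flow (class and method names are invented for illustration): replicas journal the update on prepare, the client can be acknowledged once every prepare has been acked, and the update is applied in place on commit:

```python
class Replica:
    def __init__(self):
        self.data = {}       # committed object data, by name
        self.prepared = {}   # txn id -> (name, value), journaled ahead

    def prepare(self, txn, name, value):
        self.prepared[txn] = (name, value)   # journal the update
        return True                          # ack the prepare

    def commit(self, txn):
        name, value = self.prepared.pop(txn)
        self.data[name] = value              # now apply in place


def two_phase_write(primary, replicas, txn, name, value):
    nodes = [primary] + replicas
    # Phase 1: every replica journals the update.
    if not all(r.prepare(txn, name, value) for r in nodes):
        return False
    # Once everyone is prepared the write can no longer be rolled back,
    # so the client could be answered at this point; reads of the extent
    # must still block until the commit below lands.
    for r in nodes:
        r.commit(txn)
    return True
```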
B
So maybe the significant piece that I haven't gamed out here is that this changes the semantics of the pg info. Right now the pg info has a last_update, which is just sort of correct.
B
This adds another piece, which is a last_update_prepared versus a last_update_committed, with the notion that the versions between last_update_committed and last_update_prepared we might or might not choose to include, depending on which replicas have which values. So peering would need to be extended slightly.
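One purely illustrative rule peering could apply to the new pair of versions (this is a guess at what "might or might not choose to include" could mean, not the actual design): everything up to the highest committed version must survive, and merely-prepared versions are kept only as far as the surviving replicas can supply them:

```python
from dataclasses import dataclass


@dataclass
class PGInfo:
    last_update_committed: int
    last_update_prepared: int   # always >= last_update_committed


def choose_last_update(infos):
    """Pick the version to recover to during peering (toy model).

    Committed versions can never be rolled back, so the highest
    committed version anywhere must be kept.  Versions beyond that
    were only prepared somewhere; keep them only as far as every
    surviving replica can supply them.
    """
    must_keep = max(i.last_update_committed for i in infos)
    can_keep = min(i.last_update_prepared for i in infos)
    return max(must_keep, can_keep)
```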
B
Exactly, yeah. But I mean, the rollback stuff is almost transparent to peering. Yeah, yeah. This is a little different, and I think you're right, though; I don't think it actually makes a big difference. Incidentally, both of these, both the two-phase commit and the rollback log, only actually work the way I've described if the writes are full-stripe aligned. If they're not, then you still need to perform a read-modify-write. That's probably a livable restriction, though, since RBD could always choose to write out pages or whatever.
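The read-modify-write for a non-stripe-aligned write can be sketched like this (a simplification over an in-memory buffer; `stripe` stands in for the full stripe width):

```python
def rmw_write(data, offset, buf, stripe):
    """Turn a write that is not full-stripe aligned into a
    read-modify-write over the enclosing stripes."""
    start = (offset // stripe) * stripe
    end = -(-(offset + len(buf)) // stripe) * stripe        # round up
    chunk = bytearray(data[start:end])                      # read old stripes
    chunk[offset - start:offset - start + len(buf)] = buf   # modify in memory
    data[start:end] = chunk                                 # write full stripes
    return start, end
```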
C
Is it actually a good idea, or practical, to have erasure codes that are aligned to 4k stripes? No, no.
B
No, I mean, 4k sounds awfully small. But what I mean is, RBD already has an alignment that it chooses to deal with, and you can just choose to deal with the bigger one if that makes it happy. So when it performs a read, it always reads whatever the necessary stripe size is, which would be bigger than 4k. Or are you arguing that that will be impractically large?
B
Someone would have to do the benchmarking to find out how small a stripe size you can get before jerasure starts getting noticeably slower. So there is a number, and if it's as small as 4k, then you're right; if it's bigger than that, it's a trade-off. Yeah.
B
Well, hang on, I'm saying both of the approaches above also torpedo the way we do checksumming for erasure coded pools. Append-only has this nice property that you can just maintain a checksum for each of the shards, and you always know which one is correct during scrub, which I like.
B
If we go to this sort of approach, then we'll have to maintain checksums at a granularity of whatever the minimum stripe size or write size is, which is sort of a bummer to me. Also, any way we structure it, it's going to perform worse than appends.
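The checksum trade-off can be sketched as follows, using CRC32 purely as a stand-in (the classes are invented for illustration): an append-only shard extends one running checksum cheaply, while an overwritable shard has to keep a separate checksum per stripe unit, since an overwrite in the middle invalidates any single rolling checksum:

```python
import zlib


class AppendOnlyShard:
    """Append-only shard: one running checksum covers the whole shard."""

    def __init__(self):
        self.data = b""
        self.crc = zlib.crc32(b"")

    def append(self, buf):
        self.data += buf
        self.crc = zlib.crc32(buf, self.crc)  # cheap incremental update


class OverwritableShard:
    """Overwritable shard: checksums kept per stripe unit, so an
    overwrite only invalidates the checksum of the stripes it touches."""

    def __init__(self, stripe, nstripes):
        self.stripe = stripe
        self.data = bytearray(stripe * nstripes)
        self.crcs = [zlib.crc32(bytes(stripe))] * nstripes

    def write_stripe(self, idx, buf):
        assert len(buf) == self.stripe
        self.data[idx * self.stripe:(idx + 1) * self.stripe] = buf
        self.crcs[idx] = zlib.crc32(buf)
```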
B
It will be more expensive. So the question is: might it be possible to support RBD without supporting partial overwrites? One way we might consider doing this is by changing the way RBD lays out its data into some form of, let's say, four-megabyte block, which is essentially immutable, plus a journal of pending updates.
B
We could also update the rbd OSD class to do the coalescing. RBD would have to always choose to read full blocks, but the OSD could do the job of reading the block and the updates, coalescing them, and sending a single block back to RBD. RBD would then send small writes in the form of these incremental updates to the OSD, until it hits a full block or until it hits some heuristic.
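A sketch of the immutable-block-plus-journal layout just described (the names and the flush heuristic here are invented): small writes append to a journal of pending updates, and reads coalesce the base block with the journal, which is the job the OSD class would do server-side:

```python
def coalesce(base, updates):
    """Apply a journal of (offset, bytes) updates to an immutable base
    block and return the current logical contents."""
    out = bytearray(base)
    for offset, buf in updates:
        out[offset:offset + len(buf)] = buf
    return bytes(out)


class ECBlock:
    """Immutable base block plus a journal of pending updates.

    Small writes append to the journal; once a heuristic limit is hit,
    the block is rewritten whole and the journal is cleared.
    """

    def __init__(self, base, journal_limit=4):
        self.base = base
        self.journal = []
        self.journal_limit = journal_limit

    def write(self, offset, buf):
        self.journal.append((offset, buf))
        if len(self.journal) >= self.journal_limit:
            # Heuristic hit: rewrite the (append-only) block in full.
            self.base = coalesce(self.base, self.journal)
            self.journal = []

    def read(self):
        return coalesce(self.base, self.journal)
```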
B
Right, so that's the cost. The reason why I think this might be viable is that there's a paper from Microsoft from a couple of years ago, and this is exactly how they implemented their erasure coded block device thing.
B
Append-only blocks, and they have this log-structured block device, not unlike a flash translation layer. Actually, well, somewhat unlike a flash translation layer. Okay, very unlike a flash translation layer; more like what I'm just... yeah.
B
Actually, we can even... so we don't really want, for example, the CephFS fuse client to have to implement the same thing. So if we can abstract it into a library, that would really be the ideal thing: something that encapsulates the relevant...
B
No, the actual client-side thing needs to cooperate as well, because the client-side thing is already buffering whole objects. We don't want the OSD to read it; we actually may want to send it over the wire.
B
Which then librbd talks to. That's true, I mean, because we don't really want every librados user to have to implement this. Another piece is: is it practical for RBD, or whatever this thing is, to choose to cache object-size pieces, and is a four-megabyte EC object even a good idea?
B
No, but... right, yeah. Specifically for the RBD case, there are no appends; it's always an overwrite. Yeah, yeah. In other use cases you wouldn't use this; you would structure it so that you had write-once objects, because it's just easier to do that in RADOS anyway. Yeah, everything's easier to reason about when objects are... Oh, and the other catch is that everything I just outlined for RBD would only work in the situation where writeback...
B
To that effect, someone has to be caching the dirty state, or it's not a win.
D
Just spitballing here: what if you could take overwrites in as, like, a replicated type of operation and send that out, and then something behind that (let's just say you had one something behind that) would re-erasure-code it and merge it into the object. But you'd have to keep track that that's happened, so you can't read without doing that cleanup.
B
Well, actually, that one also poses a bit of a problem, because the problem is the reconstruct-on-read problem. If we don't turn the erasure coded object into a replicated object and then update in place, then you still have to read the erasure coded object and the replicated diffs to satisfy the read, which doesn't really help us.
C
The write latency on EC is always going to be much higher than on a replicated pool anyway, because you're touching so many more nodes, right? That's true. Like, you'd kind of want to do these writes replicated. It feels like the quote-unquote right solution to this is really a cache tier that does partial objects. It's super complicated, but that's the thing that actually sort of avoids all these problems.
B
Oh, that's a problem! So if you do a partial promote, you can't flush the whole object.
B
I mean, the two-phase commit one does have a lot of virtues. It's deterministic, it's strict, it does basically what RAID does, so people are used to the costs associated with that sort of... well, not all of it. In this case we only have to buffer unstable objects; only unstable extents, actually.
C
Which isn't so bad. And does it really have to be stripe aligned in that case, or can it be...?
E
So one thing that might help is the mirroring stuff: since we actually have a journal of writes, and it's replicating to a potentially different pool, we can actually do the writeback from the journal and do the read-modify-write at that point, so we're not making the extra latency of that read-modify-write cycle to an EC pool visible to the guest.
E
I mean, you have to have some local caching; it doesn't have to be an exclusive writer. It...
E
But that might be tolerable. We could certainly try it out, prototype it, but you know, with padding to the stripe size... we don't have enough data.
B
Okay, does anyone have any sort of input? I don't know if anyone's paying attention or has thought about actually using this in real life. Half the reason I wanted to have a session on this was to see if anyone had opinions on, or insight into, anything they want to use it for that these don't really address, or that one of these does address.
B
You know, if you could reference a previous journal entry in the journal, then at least we could avoid... no, it's not worth it. I was thinking that, because you're writing the same extent to two different objects, hopefully close together in time, if you knew that the previous extent was still in the journal, you could reference it by a unique id. But no, there's no way to... there's...
B
No, that would do it. So, oh, I see what you mean. Yes, that's true, but the clone-range one does kind of require an fsync.
B
I don't know that it's going to be done. I don't think it's a good idea to implement this unless we have someone willing to say: yeah, I am a driver for this thing, and I would consider it successful if it accomplished X with Y properties. Right, yep, exactly. Otherwise we're just implementing something which might or might not be useful, just for someone, maybe.