From YouTube: 2015-JAN-22 -- Ceph Tech Talk: RADOS
Description
A detailed look at the inner workings of the Ceph RADOS data store.
http://ceph.com/ceph-tech-talks
Ceph's design stems from a few fundamental principles. To handle today's massive storage requirements, each component must be able to scale horizontally. There must be no single point of failure. The system must be self-managing to maximize flexibility, and it is open source and runs on commodity hardware, which cuts down on costs and increases flexibility.
Let's talk a little bit about how these components are able to use RADOS; that'll give us a basis for talking about how RADOS works. Let's start with the RADOS Gateway. The RADOS Gateway provides an S3 interface to applications which want to consume an object interface but don't necessarily want to deal with the complexity of using librados directly.
Similarly, RBD provides a block interface backed by a RADOS cluster. The hypervisor uses librbd to translate block reads and writes into librados operations on objects in the RADOS cluster. Each RBD image ends up chunked, or striped, across four-megabyte objects strewn across the RADOS cluster according to the RADOS placement policies.
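To make the striping concrete, here is a small sketch (not librbd itself) of how a block-device offset could map onto fixed-size backing objects; the 4 MB chunk size is RBD's default, while the object-naming scheme shown is an assumption for illustration:

```python
OBJECT_SIZE = 4 * 1024 * 1024  # RBD's default 4 MB chunk size

def block_to_object(image_id: str, byte_offset: int):
    """Return (object_name, offset_within_object) for a block I/O."""
    index = byte_offset // OBJECT_SIZE   # which 4 MB chunk of the image
    within = byte_offset % OBJECT_SIZE   # offset inside that chunk
    return f"rbd_data.{image_id}.{index:016x}", within

# A read at image offset 6 MB lands 2 MB into the second object:
print(block_to_object("abc123", 6 * 1024 * 1024))
```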
Finally, for applications looking for a file system interface, there is CephFS. A CephFS deployment requires a set of metadata servers in addition to the RADOS cluster. These servers do not store metadata locally; as with RBD and RADOS GW, they use the librados interface to store the data inside the RADOS cluster. Clients send metadata requests to the metadata servers, but perform file data operations directly on the backing objects in the RADOS cluster. That is CephFS.
I'll be focusing on RADOS for the remainder of the talk. Separating the storage and replication out in this way also allowed us to slide erasure coding and cache tiering in at the RADOS level, allowing, for the most part, the other services to use them transparently. So the RADOS interface tries to make it simple to reason about accessing distributed storage. Objects are divided into flat, name-based pools.
Users can write applications for RADOS using the librados interface, available for C, C++, Python, and several other languages. The interface is quite rich.
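As a minimal sketch of what that looks like, here is the Python binding in action; the pool name and config path are assumptions, and error handling is omitted:

```python
import rados

# Connect to the cluster and open an I/O context on a pool.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("mypool")   # pools are flat namespaces
    ioctx.write("greeting", b"hello ", 0)  # write at offset 0
    ioctx.write("greeting", b"rados", 6)   # partial overwrite at offset 6
    print(ioctx.read("greeting"))          # b'hello rados'
    ioctx.close()
finally:
    cluster.shutdown()
```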
First, we support partial overwrites of objects, rather than requiring objects to be overwritten in their entirety. Partial overwrites make something like RBD pretty simple: the block device is simply broken up, striped or chunked, across four-megabyte pieces, each of which is a RADOS object. Writes and reads are then simply translated into writes and reads on the underlying RADOS objects.
Each object can also have a set of user-defined xattrs, which can be useful for storing small amounts of frequently accessed metadata. We also associate with each object an ordered key-value mapping, which we call an object map. This object model is currently implemented by keeping a LevelDB instance within each OSD; each object's object map is simply a prefixed portion of that LevelDB instance.
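A sketch of both features through the Python bindings, reusing the ioctx from the earlier example (the key and value contents are invented):

```python
# Small, frequently accessed metadata as an xattr.
ioctx.set_xattr("greeting", "owner", b"alice")

# Ordered key-value pairs in the object map, applied atomically.
with rados.WriteOpCtx() as op:
    ioctx.set_omap(op, ("color", "size"), (b"blue", b"large"))
    ioctx.operate_write_op(op, "greeting")

# Omap entries list back in key order.
with rados.ReadOpCtx() as op:
    vals, ret = ioctx.get_omap_vals(op, "", "", 10)
    ioctx.operate_read_op(op, "greeting")
    print(dict(vals))   # e.g. {'color': b'blue', 'size': b'large'}
```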
This key-value mapping is useful, for example, for representing a RADOS GW S3 bucket index, which we need to be able to efficiently insert and remove entries from, and also list in order.
We also support atomic read and write transactions on a single object. You might use an atomic read transaction to atomically fetch an attribute and an extent of the data payload, or you might use an atomic write transaction to atomically check an attribute and conditionally add a set of key-value mappings.
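As a rough sketch of an atomic write transaction, assuming a reasonably recent python-rados where WriteOpCtx exposes data operations; either every mutation in the op is applied, or none of them is:

```python
with rados.WriteOpCtx() as op:
    op.write_full(b"new payload")                    # replace the data...
    ioctx.set_omap(op, ("state",), (b"committed",))  # ...and update omap
    ioctx.operate_write_op(op, "greeting")           # one atomic unit
```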
Fundamentally, RADOS is a cluster of individual processes running on servers in your data center. Most of these processes provide access to the data stored on disks; a few provide cluster management services, which allow the other cluster components to intelligently handle changes in the cluster like node addition and failure.
Sorry about that; okay. A ceph-osd process manages an individual storage device in a storage node. We would generally recommend that you run one OSD per disk, rather than aggregating the disks into a single RAID. This way you can exploit Ceph's failure recovery, which, particularly with erasure coding, can be much more efficient than rebuilding a disk in a RAID array.
The other component of a RADOS cluster is the monitor cluster. The monitor cluster is responsible for maintaining a consistent cluster map via Paxos. When OSDs are added, removed, die, or change location, the monitors create a new cluster map reflecting the change. These maps are then propagated via gossip to the OSDs and used by the OSDs to independently rebalance or heal stored data, depending on the nature of the change.
Let's talk a bit about object placement in Ceph. This is kind of where the magic is. When a RADOS client tries to access an object, I've said that it is able to talk directly to the storage node without involving a gateway. So how does it know which one to talk to? One option would be to add a location service, perhaps backed by a traditional database, to provide an authoritative location for objects. There are some downsides, though: when a node is added or fails, what handles rebalancing the data?
Instead, we first map the very large number of objects into a pretty small number of placement groups, or PGs. Why? Consider what must happen when the cluster map changes: we must go through the things placed by CRUSH and move some of them to a new home. But there might be many, many objects even on a single disk, and we don't want to rerun CRUSH on each and every one of them. So instead we first hash the objects into a set of placement groups, typically about a hundred per OSD.
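A toy sketch of that first step (not Ceph's actual rjenkins hash or PG masking): a stable hash of the object name, folded onto the pool's PG count:

```python
import hashlib

def object_to_pg(pool_id: int, obj_name: str, pg_num: int) -> str:
    # Stable 32-bit hash of the name, reduced to one of pg_num PGs.
    h = int.from_bytes(hashlib.md5(obj_name.encode()).digest()[:4], "little")
    return f"{pool_id}.{h % pg_num:x}"   # PG ids look like "2.7f"

print(object_to_pg(2, "rbd_data.abc123.0000000000000001", 128))
```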
Each placement group is then run through CRUSH, along with the current cluster map, in order to output an ordered set of OSDs. Architecturally, placement groups also serve a number of other nice functions. They act as the unit of ordering and the unit of locking within each OSD, so the OSD acts more like a collection of placement groups than a collection of objects. The ordered set of OSDs determines the primary and the replicas for that placement group and the objects contained therein.
A nice property of CRUSH is that the placement groups are declustered. That means that if two placement groups share an OSD, typically their replicas won't. So if an OSD with 100 placement groups dies, rather than one new OSD having to receive a hundred placement groups, around a hundred different OSDs will each re-replicate about one placement group's worth of data.
This greatly speeds up recovery and prevents any single OSD from being a bottleneck, for the most part. So what is CRUSH? CRUSH is a pseudo-random, deterministic placement algorithm. We fundamentally take two inputs, a cluster map and a placement group, and return the OSDs on which the placement group should be placed. The cluster map can include rules and can model your data center's physical layout, allowing you to specify placement policy. Thus, while CRUSH is pseudo-random, you still retain a great deal of influence over placement.
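Rendezvous hashing makes a handy toy analogue of those properties (this is not CRUSH itself, which also walks a hierarchy of buckets and applies placement rules): it is deterministic, pseudo-random, and moves little data when the OSD set changes:

```python
import hashlib

def place(pg_id: str, osds: list, replicas: int = 3) -> list:
    # Rank every OSD by a hash of (pg, osd); take the top `replicas`.
    def score(osd: int) -> bytes:
        return hashlib.md5(f"{pg_id}:{osd}".encode()).digest()
    return sorted(osds, key=score)[:replicas]   # first entry = primary

# Same inputs always give the same ordered set of OSDs:
print(place("2.7f", [0, 1, 2, 3, 4, 5]))
```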
For example, you can write a rule that ensures that no two replicas are placed in the same host, or in the same rack. You can also configure CRUSH to place more data on some OSDs than others if, for example, you add newer OSDs which happen to be larger. CRUSH also tends to move close to the minimum amount of data when the cluster map changes, which, if you think about it, is a good property for a placement algorithm.
Okay. You can also divide up your objects by pool. Each pool can have its own CRUSH rule, and therefore has its own placement groups and its own replication level. You can also have some erasure-coded pools and some replicated pools in the same cluster. You can use this feature to place different applications on different kinds of storage: for example, you might want to back a RADOS GW S3 workload with spinning disks, while using a separate pool backed by SSDs for more latency-sensitive virtual machines.
So if the placement is that dynamic, how is an OSD ever to be sure it has actually seen all of the writes it needs to see before serving reads? The answer is peering. Each OSD map generated by the monitors is assigned an increasing epoch number. The monitors and OSDs remember all OSD maps back to some epoch e such that every placement group has been clean since epoch e.
This history allows the primary, after a mapping change for a particular placement group, to determine which OSDs it must contact in order to be sure that it has learned about all completed writes. We represent the state of a PG at a particular OSD by keeping a log of the most recent operations on the placement group witnessed by that OSD. Peering results in an authoritative PG log being decided on for each replica. The primary then checks whether each replica's log overlaps with the authoritative log.
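A simplified sketch of that overlap check, with plain version numbers standing in for PG log entries: a replica whose log shares no entries with the authoritative log cannot tell which of its objects are stale, so log-based recovery is off the table for it:

```python
def needs_backfill(authoritative: list, replica: list) -> bool:
    # No shared entries means we can't enumerate the stale objects.
    return not (set(authoritative) & set(replica))

print(needs_backfill([101, 102, 103], [99, 100, 101]))  # False: log recovery
print(needs_backfill([101, 102, 103], []))              # True: needs backfill
```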
The trick, however, is that if we don't know which objects are invalid, because the logs don't overlap, we certainly can't serve reads or writes from such a peer. Thus, if after peering the primary determines that it or any other peer requires backfill, it will request that the monitor cluster publish a new map with an exception to the CRUSH mapping for this PG, mapping it to the best set of up-to-date peers that it can find.
Suppose CRUSH maps some placement group to OSDs 0, 1, and 2. Then, for some reason, perhaps because a user changed around the CRUSH hierarchy, in the next map CRUSH maps this placement group to 3, 4, and 5. For the record, you generally won't see a map change like that due to something like an OSD failure, but it works for instructive purposes. OSD 3 will then peer by requesting PG logs from 0, 1, 2, 4, and 5, because those are all of the OSDs that could have had a mapping in the past.
At this point, 3 concludes that it, 4, and 5 require backfill, because the authoritative log has stuff in it and does not overlap its completely empty log, and it will request a new mapping to 0, 1, 2 from the monitors. The monitors will then insert this mapping into a new OSD map as an exception and publish that map.
During the next peering interval, 0 will learn that it is the primary. It will request PG logs from 1, 2, 3, 4, and 5, and it will determine that 3, 4, and 5 should be the eventual home of this placement group, but that they require backfill. So it will leave the exception mapping in place and serve reads and writes while backfilling 3, 4, and 5. Once backfill completes, 0 will request that the temporary mapping be cleared.
So now that we've got background on RADOS and peering and recovery, let's talk a little bit about cache tiering. Conceptually, there are two ways you could think of doing tiering with Ceph. First, you can embed a combination of fast and slow storage under each OSD and let the backing OSD storage handle the placement of hot data in the fast storage and cold data in the slow storage; dm-cache, bcache, flashcache, or an even larger variety of caching controllers could be used in this way under the OSD, without any change to the OSD code.
But there are some drawbacks to that. For example, you must choose the ratio of hot to cold data as you provision each node, and it's difficult to change afterwards without going back to each node and changing the ratio. Or we could perform the tiering above the OSD. This would allow us to use different hardware for different tiers and would also allow us to dynamically change the hot/cold balance by increasing the appropriate set of machines.
I mentioned different placement rules earlier; this goes a step further. One of the main advantages to this approach is that it is largely transparent to the librados user. From the librados user's point of view, they still connect to a particular pool, in this case the backing pool, and behind the scenes librados takes care of redirecting reads and writes to the appropriate caching pool. Thus RBD, RADOS GW, and CephFS should work without modification.
In writeback mode, the librados client operating on the backing pool will transparently direct all writes to the cache pool instead. On a cache hit, the write completes when the cache pool write completes. On a miss, the cache pool will delay the write while it promotes the object from the backing pool.
Similarly, reads are directed to the cache pool. A cache hit can be served directly out of cache, since the cache will see all writes. In the event of a cache miss, the read can be redirected or proxied to the backing pool or, if the policy dictates, the read can be delayed while the object is transparently promoted, as with writes. So we were able to make the cache tiering features fit nicely within the existing Ceph architecture.
Other than the tiering relationship, base and cache pools are fully fledged RADOS pools, complete with independent placement rules. Each OSD in the cache pool is able to handle caching decisions for its objects independently, ensuring scalability and avoiding the need for an external tiering agent. The promotion and eviction operations are themselves RADOS operations: the cache OSDs actually act as RADOS clients to the base pool OSDs.
There might be many more placement groups in the base tier than in the caching tier; it doesn't matter, because the primary in the cache tier doesn't actually know about the object mapping for the base tier; it uses librados for that. The point is that RADOS clients can see the cache configuration in the cluster map and use it to intelligently route requests between the cache and base pools as needed.
To make intelligent decisions about which objects to evict as the cache fills up, we need a way to estimate the hotness of an object. Each placement group maintains an in-memory bloom filter of recent operations. Each filter is filled for a specified period, or up to a tuned false-positive probability, and then written to disk. You can then walk backwards through the filters on disk to estimate a most recent access time for a particular object, by simply checking each filter for a positive result.
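The idea, sketched with plain sets standing in for bloom filters (real filters answer membership probabilistically, so this walk can overestimate hotness):

```python
def last_access_period(obj: str, filters_newest_first: list) -> int:
    # Walk newest-to-oldest; the first filter containing the object
    # gives an estimate of its most recent access period.
    for age, f in enumerate(filters_newest_first):
        if obj in f:      # bloom filters may return false positives here
            return age    # 0 means touched in the most recent period
    return -1             # cold: not seen in any retained period

periods = [{"obj_b"}, {"obj_a"}, {"obj_a", "obj_c"}]
print(last_access_period("obj_a", periods))   # 1: one period ago
```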
Using that hot/cold information, each cache pool PG primary can asynchronously scan its store and estimate the hotness of its objects. We call that process the tiering agent; it's a per-PG agent which the OSD will schedule as needed. Once the pool reaches the target dirty ratio, the primary will begin attempting to flush sufficiently cold objects. As the placement group approaches the target size, the primary will also begin evicting clean objects.
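A toy rendering of those two thresholds (the 0.4 and 0.8 defaults below are invented for illustration):

```python
def agent_actions(dirty_ratio: float, full_ratio: float,
                  target_dirty: float = 0.4,
                  target_full: float = 0.8) -> list:
    actions = []
    if dirty_ratio >= target_dirty:
        actions.append("flush cold dirty objects to the base pool")
    if full_ratio >= target_full:
        actions.append("evict cold clean objects")
    return actions

print(agent_actions(dirty_ratio=0.5, full_ratio=0.85))  # both kick in
```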
Okay, let's talk erasure coding, which I think is one of the more exciting things that has happened in the last year. Up to now, Ceph has supported only conventional replication: if you wanted two copies' worth of redundancy, you needed 3x replication, a 200% overhead in storage costs. Disk may be cheap, but not quite that cheap. Enter erasure coding. Erasure codes allow you to take an object, break it into four chunks, create two additional parity chunks, and distribute those six chunks among six different OSDs, possibly split among three or six different racks.
Ceph's approach to erasure coding requires the user to create an erasure-coded pool with a specified erasure code, in this case one with four data chunks and two parity chunks. That pool contains the erasure-coded placement groups. Each placement group, as with replication, is assigned to an ordered set of OSDs; with four data chunks and two parity chunks, CRUSH would spit out an ordered set of six OSDs.
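A toy illustration of the k=4 chunking (not a real two-parity code: a single XOR parity only survives one failure, so the second parity below is just a placeholder, whereas Ceph's plugins compute two independent parities, e.g. via Reed-Solomon):

```python
def encode_k4(data: bytes) -> list:
    n = (len(data) + 3) // 4
    chunks = [data[i * n:(i + 1) * n].ljust(n, b"\0") for i in range(4)]
    parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*chunks))
    return chunks + [parity, parity]   # placeholder second parity

shards = encode_k4(b"an object worth erasure coding")
print([len(s) for s in shards])        # six equal shards for six OSDs
```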
The first four are the data chunks, while the last two are the parity chunks. As with replication, one of the OSDs, usually the first one, will serve as the primary. Client requests, both reads and writes, will go to the primary. The primary is then responsible for fetching the data required to serve the request from the other OSDs, decoding it, and responding to the client. Using the primary OSD to do the decoding greatly simplifies consistency, since it sees writes as well.
So, as I mentioned, as with replication, one of the OSDs, usually the first one, will serve as the primary. In order to read an object from an erasure-coded pool, the client computes the location of the object using CRUSH, as with a replicated placement group, and sends a read request to the primary. The primary then determines which chunks need to be read in order to fulfill the request and requests those chunks from the other placement group OSDs. Here, because we have all of the data chunks, there is no need to read the parity chunks.
The primary then uses the pieces to reconstruct the requested data and responds to the client. Using the primary rather than the client to fetch the individual chunks has some advantages. First, it greatly simplifies the problem of ensuring that we are reading the same version of the object on all of the replicas, since the primary is able to order writes and reads on the placement group.
Second, the erasure coding CPU and memory overheads happen on the OSD rather than on the client, which might be beneficial if, for example, the OSDs are more numerous or more powerful than the clients. Lastly, the OSD-to-OSD network might be significantly faster than the client-to-OSD network, which would decrease the cost. The downside, of course, is the additional hop and the additional bandwidth used.
As with a replicated pool, we send the write request to the primary. The primary breaks the write down into per-chunk operations, sends them off to the other OSDs in the placement group, and waits for a reply. Once all replicas have replied, the primary responds to the client with success. In the event that we have a degraded set of OSDs in the placement group, we simply write to the OSDs we do have. But suppose there is a brief power failure.
After the power comes back, our primary will peer and find three chunks with logs reflecting the new version B, and three chunks with logs missing the update for version B. In fact, either A or B would be a valid value for the object, since the client has not yet received a response. With a replicated pool, the primary would simply choose one or the other and recover whichever copies ended up incorrect; the client, via the cluster map, will see that the OSDs restarted and resend the write.
With an erasure-coded pool, however, we have a problem, in that we cannot recover either version A or B, since neither has the required four chunks left. Here we have some choices. Simply writing the data in place, as with a replicated pool, won't work, for the reason I've mentioned. We could try writing the data in place more carefully, by first writing to a pending operation log on each replica and then performing a second commit operation, but that would require a second round trip of network operations and disk commits, increasing latency even further.
We chose instead to restrict the interface available to erasure-coded pools to only those operations that we can make rollbackable using only local operations and the placement group log: create, append, delete, and xattr updates. Fortunately, each placement group already maintains a PG log. The primary sends the new PG log entries representing each update to be committed on the replicas atomically with the update, and they're written out to disk atomically as well.
This is how we normally determine which objects need to be recovered after an OSD has been down for a short period of time, but erasure-coded pools also stash enough information in the PG log entries to locally undo the operation. For xattr updates, we stash the old values of the changed xattrs in the PG log entry. For appends, we stash the old size of the object, allowing the replica to roll back by truncating to the old size. For deletes, we instead rename the object out of the way and record in the PG log entry where to find the old version. We then clean up these old objects a bit later, once all replicas have committed a particular update.
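In sketch form, one rollback record per operation type (the field names here are invented for illustration):

```python
def rollback_info(op: str, obj_state: dict) -> dict:
    if op == "xattr_update":
        return {"undo": "restore_xattrs", "old": dict(obj_state["xattrs"])}
    if op == "append":
        return {"undo": "truncate", "old_size": obj_state["size"]}
    if op == "delete":
        return {"undo": "rename_back", "stash": obj_state["name"] + ".old"}
    raise ValueError("operation is not locally rollbackable on EC pools")

print(rollback_info("append", {"size": 4096}))
```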
We disallow object map operations altogether. First, it wasn't clear what there was to gain by applying erasure coding to a key-value mapping subject to small partial updates anyway; second, it would be hard. So we punted on that one. Because the operation updating A to B can be locally rolled back, the primary will observe that version A is the best version, since the client cannot have actually received a response for version B and will resend, and we can roll back B on all of the replicas, leaving us with only version A.
This restricts erasure-coded pools to a subset of RADOS operations, notably no overwrites. Of course, the disallowed operations are also the ones which are inefficient to do on an erasure-coded object anyway: if you try to overwrite in the middle of an erasure-coded object, either you have to maintain some kind of complicated log structure, or you have to read-modify-write on each partial overwrite.
Users can put a replicated cache pool in front of an erasure-coded backing pool, and the tiering agent will simply refuse to flush anything with key-value data. This is one of the reasons why we introduced cache tiering and erasure coding at the same time. So which erasure code are we actually using? Well, there are a lot of algorithms for generating parity blocks, and you probably have noticed that I haven't specified which one we're using. That's because we made the erasure code algorithm and implementation pluggable.
Each erasure code provided by a plugin must provide a few simple things, like a way to encode and decode data, a way to determine which chunks are required to perform a read, and a way to determine which chunks are required to recover a set of missing chunks. The OSD talks to those interfaces and handles everything else.
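A sketch of that contract in Python (the method names are illustrative; the real interface is a C++ plugin API):

```python
from abc import ABC, abstractmethod

class ErasureCodePlugin(ABC):
    @abstractmethod
    def encode(self, data: bytes) -> list:
        """Split data into k data chunks plus m parity chunks."""

    @abstractmethod
    def decode(self, chunks: dict) -> bytes:
        """Reassemble the object from a sufficient subset of chunks."""

    @abstractmethod
    def minimum_to_decode(self, want: set, available: set) -> set:
        """Which available chunks must be read to serve a read or recovery."""
```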
Erasure coding also introduces another wrinkle. Let's say we have a one-terabyte OSD which contains only replicated placement groups, and which dies.
With replication, maybe each OSD participating in recovery only writes out a few gigabytes, but still, one terabyte needs to be read and one terabyte needs to be written. Suppose, however, that the OSD in question is storing erasure-coded placement groups. In order to recover the one terabyte of chunks lost with our dead OSD, we must actually read four terabytes of data. As with replication, the recovery would be declustered and many OSDs would split up the work, but the total amount read and written wouldn't change.
The improvement in storage overhead thus comes at a cost in disk and network to recover from failed nodes. There is also a CPU cost associated with reconstructing the lost chunks from the recovered chunks. This can be particularly troublesome if the erasure-coded chunks are distributed across racks, so locally repairable codes (LRC) provide some help here. Suppose each dotted box is a rack: LRC allows you to layer on an additional set of, in this case, per-rack parity blocks, allowing you to recover from a single failure within a rack using only chunks from that rack.
So that's the end of most of what I've got covered. As for future work: future work for erasure coding will probably involve allowing optimistic client reads directly from shards. This shouldn't be too hard because, as I mentioned, you can only do appends, deletes, or xattr updates on those objects, so it should be fairly straightforward, when doing a data read, to simply make sure that you handle the case where you fail to find the data there and retry.
It should be possible to embed version numbers in the responses and allow the client to retry in the case that they get a torn read. We're also interested in some kind of erasure code plugin work to allow optimizations for ARM, for jerasure. For the cache pools, a lot of the future work will revolve around improving the way the agent makes decisions, so that we're smarter about which objects to flush and evict.
Alright, I'm not hearing any questions; we can always do follow-ups later. This will be posted on YouTube, and I'll get the slides and post them on our SlideShare as well, so that they're linked. As always, we will answer questions on the mailing lists and IRC. So thanks for coming, everybody, and thanks, Sam, for running through this.