From YouTube: 2016-JUN-21 -- Ceph Tech Talks: Bluestore
Description
A detailed update on the current state of the Bluestore backend for Ceph.
http://ceph.com/ceph-tech-talks
So this is an updated version of a similar talk I gave at Vault about a month ago that covers BlueStore, a new, faster storage backend for Ceph. A little outline of what I'm going to cover: I'm going to spend a fair amount of time just giving some background on what part of Ceph this actually is, where it fits in, what we used to do and currently do with FileStore, and why the current approach we've had doesn't work.
I'll talk a little bit about NewStore, which was a first attempt to do something different, and then move on to BlueStore, which is the current effort: a totally new backend for the OSD. I'll talk at a high level about how it's structured, how the metadata is stored and handled, and how the data path works.
So I'll start by just motivating why we care, with a little bit of background. Ceph, obviously, is an object, block, and file storage system providing all those interfaces in a single cluster. It's designed so that all components scale horizontally: you can just keep adding nodes to get more capacity and performance.
It's architected to have no single point of failure, to be agnostic to the kind of hardware you deploy on — usually commodity hardware — to be self-managing wherever possible, and, of course, it's open source under the LGPL copyleft license. So Ceph is great. If you look at our early papers and the documents we used to write about Ceph, we would describe the system using phrases like "a scalable, high-performance distributed file system."
That's what the original Ceph paper called it, and we would usually say that Ceph is designed to provide performance, reliability, and scalability all in the same system. I mention that because performance has often been sort of a challenging piece, at least when you compare it to the raw performance you could theoretically get out of a piece of hardware, and a lot of that is due to the way that we're ultimately storing that data on disk.
Good, OK — the BlueJeans screen doesn't show it. Okay, so Ceph provides these three interfaces. The RADOS Gateway gives you an S3- and Swift-compatible object interface; RBD gives you a virtual disk device, a virtual block device, and it's used extensively in OpenStack; and CephFS is a distributed POSIX file system akin to NFS, but scalable and distributed and all that good stuff. All of this sits on top of RADOS, which is the piece that actually replicates all your data, distributes it across lots of nodes, makes sure that it's safe, and moves it around.
When nodes are added and removed, RADOS heals the system, so RADOS is sort of the key piece here. A RADOS cluster is structured as a series of hosts. Each host typically has a whole series of object storage device (OSD) daemons, each of which sits in front of a hard disk. So the hard disk is plugged into the system.
There's a file system sitting on top — normally it's XFS, but we can also use Btrfs or ext4 — and then the OSD daemon sits on top of that and writes files into that file system. So that's how it's deployed today. In reality, there's a module within the OSD called FileStore that's responsible for actually writing that data to the local file system sitting on that disk.
That particular piece is what we're looking to replace. FileStore implements an interface that we call ObjectStore. This is an abstract interface that describes how each OSD daemon stores data on its local disk; Ceph, the larger system, is responsible for replicating across multiple OSDs, but this particular part is just how to store data on one disk.
Originally we implemented a file system called EBOFS and a backend called FileStore, so we had two implementations of this interface. The ObjectStore interface is built around a couple of basic abstractions. There are objects, which are sort of like files: they store data, a bunch of bytes.
And there are collections, which are like little directories. The other key property of this local storage interface is that all writes are transactions: the interface has to make sure that whatever you hand it to do is applied atomically, consistently, and durably — it will actually be there when you lose power. We don't worry about the "I", the isolation, in the ACID sense of things, because that part is provided by an upper layer, so we don't worry about conflicting transactions.
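The all-or-nothing contract described above can be sketched as follows — a toy Python illustration with invented names, not Ceph's actual C++ ObjectStore interface. A transaction batches operations, and applying it either lands every one of them or none:

```python
# Hypothetical sketch of the ObjectStore-style transaction contract.
class Transaction:
    def __init__(self):
        self.ops = []  # queued (op, args...) tuples

    def write(self, obj, offset, data):
        self.ops.append(("write", obj, offset, data))

    def setattr(self, obj, key, value):
        self.ops.append(("setattr", obj, key, value))


class MemStore:
    """Toy backend: applies a transaction to a staged copy, then swaps it in."""
    def __init__(self):
        self.objects = {}   # name -> bytearray
        self.attrs = {}     # (name, key) -> value

    def apply(self, txn):
        import copy
        staged_objs = copy.deepcopy(self.objects)
        staged_attrs = dict(self.attrs)
        for op in txn.ops:  # any failure here leaves self.* untouched
            if op[0] == "write":
                _, obj, off, data = op
                buf = staged_objs.setdefault(obj, bytearray())
                buf[len(buf):off] = b"\0" * max(0, off - len(buf))
                buf[off:off + len(data)] = data
            elif op[0] == "setattr":
                _, obj, key, value = op
                staged_attrs[(obj, key)] = value
            else:
                raise ValueError("unknown op")
        # "commit point": both maps are replaced together, never partially
        self.objects, self.attrs = staged_objs, staged_attrs
```

A failed transaction leaves the store exactly as it was, which is the durability/atomicity half of ACID the talk is describing; isolation is the upper layer's job.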
They're always non-conflicting and isolated, but we do have to make sure that they're atomically applied to disk, so that if we lose power you get all of it or none of it. The first implementation of this ObjectStore interface was EBOFS, a userspace extent-based object file system we wrote back in like 2005, 2006, 2007, somewhere in there. It was a copy-on-write, B-tree-based extent file system in user space that implemented our customized interface.
Part of the reason we had transactions was that we had full control of the stack, and it was the most natural interface to accomplish what we needed to accomplish given the requirements of building the larger system. In that sense, I think it was great that we wrote it. In the end we got rid of it and instead switched to writing files into Btrfs around 2009, because Btrfs was just coming onto the scene.
It had all the features that EBOFS had and more, a whole community of people were excited about it, and we expected that we wouldn't have to write this part of the system ourselves — we could just rely on other tools. That didn't really pan out, but it brought us to where we are today, where everything is written using the FileStore implementation of this interface, so called because we write objects as files.
So each of these placement groups, or collections, maps to a directory in the file system, and each object maps to a file. We also have an instance of LevelDB that we store other stuff in: sometimes the attributes that we put on the object files are too big — each file system has its own peculiar limits — so to avoid hitting those we put many or large attributes in LevelDB instead, and we also store the omap key/value data in there.
Originally, the FileStore implementation was there just for development, so we could write code and run it in our home directory on a random machine without having to have dedicated disks that could be formatted and so on, but we sort of morphed it into something that we actually use for production in the real world. The structure is pretty simple: the OSD directory has a directory for each placement group, with a bunch of objects in them.
So the first problem area is the fact that our interface wants us to provide atomic transactions. We need this because the OSD is very carefully managing the consistency of all the data it stores locally, so that if it fails, it can very quickly recover and resynchronize with the other replicas. And so we need that transactionality.
Our initial attempt to get this atomic transaction support was to hook into Btrfs: we would mark when our transaction started and do all our work inside the transaction. This would prevent Btrfs from doing a commit while we were halfway through our work, so we would either get the whole thing or none of it committed atomically to disk as part of its internal checkpoints. That got us most of the way there.
We'd mark when the transaction started and when it ended, and bracket all our operations with those two calls, so that Btrfs would never do a checkpoint and commit everything to disk with only half our writes — we had to get all of it or none of it. So that got us part of the way there. We also had a mount option to make sure that when it did do a checkpoint, it would flush everything out.
But what would happen if the OSD crashed and we didn't finish writing our full transaction to the file system? Btrfs would have seen a transaction start and a bunch of writes — it would have done a bunch of stuff, scribbled all over the page cache and possibly the disk — but there would be no end, and we would never get the second half of the transaction, because the OSD process died. The only real way we could think of to get around that
would be to add this other, really very horrible mount option that would basically make Btrfs deliberately wedge itself — crash — if you didn't close a transaction. That was necessary because internally there was no support for rollback in this transaction machinery; Btrfs and other file systems aren't really meant to be transactional in that way. They're just trying to manage their own internal consistency, not to provide a higher-level transactional concept, and it's very hard to shoehorn that in later. So that didn't really work.
So instead, what we did is write a write-ahead journal: every transaction handed to this ObjectStore interface, we would serialize into a sequence of bytes and write to a journal on the disk. Once that fully committed, the transaction was stable, and we could take that same data and scribble it again across all the objects and metadata and so forth — write it back to the file system.
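The write-ahead journaling scheme just described can be sketched like this — a minimal, hypothetical Python illustration (invented names; the real journal is an on-disk ring of binary records):

```python
# Sketch of write-ahead journaling: serialize each transaction, append it to
# a journal, and only apply it to the backing store once the record is durable.
import json, hashlib

class Journal:
    def __init__(self):
        self.records = []  # stands in for an on-disk sequence of records

    def append(self, txn_ops):
        payload = json.dumps(txn_ops).encode()
        csum = hashlib.sha1(payload).hexdigest()
        self.records.append((csum, payload))  # "durable" once this returns
        return len(self.records) - 1

    def replay(self):
        for csum, payload in self.records:
            if hashlib.sha1(payload).hexdigest() != csum:
                break  # torn/corrupt tail record: stop replay here
            yield json.loads(payload)

journal = Journal()
store = {}

def submit(txn_ops):
    journal.append(txn_ops)      # step 1: journal write + flush
    for obj, data in txn_ops:    # step 2: apply lazily to the file system
        store[obj] = data

submit([["obj1", "aaa"], ["obj2", "bbb"]])
```

After a crash, replaying the journal reproduces any transaction whose record fully committed, which is exactly the stability point the talk describes.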
On Btrfs we could be a little bit clever. We would periodically take a snapshot of the file system, which does a full checkpoint, and then, after we did a checkpoint, we could trim journal entries from the past. Then, if the OSD ever restarted, we would just roll back to the most recent snapshot and replay the journal from that point forward. So we'd have a nice consistency model in that sense. On non-Btrfs systems, it wasn't so elegant.
We still did periodic syncs and trimmed journal entries, but on restart we just replayed the journal blindly, which meant that we might be repeating certain operations — and unfortunately, the operations that the ObjectStore interface supports aren't all idempotent. We have things like renames and clones and so on, and so there's a whole bunch of really ugly hackery in there to make sure that we don't replay those operations twice and don't scribble old events over new data and corrupt things. It's kind of gross, but it works.
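One common way to make blind replay safe — sketched here as a simplified, hypothetical illustration rather than FileStore's actual mechanism — is to stamp each journaled transaction with a sequence number and persist the last sequence actually applied, so replay can skip non-idempotent operations that already reached disk:

```python
# Sketch: sequence-numbered replay so repeating the journal is harmless.
class Backend:
    def __init__(self):
        self.data = {}
        self.applied_seq = 0  # persisted alongside the data in a real system

    def apply(self, seq, ops):
        if seq <= self.applied_seq:
            return False  # already applied before the crash: skip
        for name, value in ops:
            if value is None:
                self.data.pop(name, None)   # e.g. the source of a rename
            else:
                self.data[name] = value
        self.applied_seq = seq
        return True

def replay(backend, journal):
    for seq, ops in journal:
        backend.apply(seq, ops)

b = Backend()
# A "rename" of a -> b is a delete plus a write: not idempotent on its own.
journal = [(1, [("a", "x")]), (2, [("a", None), ("b", "x")])]
replay(b, journal)
replay(b, journal)  # replaying a second time changes nothing
```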
Obviously it works, because this is how every current Ceph system is deployed, using this layer of code — but it's awkward, hard to maintain, painful, and not particularly efficient. The main thing is that because we have this full data journal, everything we write, we write twice: we write it first to the journal, then we write it again to the file system, which roughly halves the available disk throughput. So that's sort of unfortunate.
The other area where POSIX is really getting in our way is enumeration. Ceph objects are distributed in a pool based on a 32-bit hash value, and we do enumeration of those objects in hash order. We do this in lots of different cases — we do it for scrubbing.
We do it when we're doing backfill, syncing objects across OSDs, and we also do it when you request an enumeration via the librados API and you're just listing objects in a pool. The problem is that using POSIX readdir to list these object files on the underlying file system doesn't work: readdir order is totally random — there's some internal hash function that's used, and it varies between different file systems and so on. So that's problematic.
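The contrast with a sorted key/value store becomes obvious when you encode the 32-bit hash as a fixed-width, big-endian prefix of the key: the store's native key order *is* the hash order you want. This is a sketch with an invented key layout (not BlueStore's real on-disk format) using crc32 as a stand-in hash:

```python
# Sketch: hash-ordered enumeration falls out of sorted keys for free,
# unlike readdir(), whose order depends on the file system's internals.
import zlib

def object_key(pool, name):
    h = zlib.crc32(name.encode()) & 0xFFFFFFFF  # stand-in 32-bit hash
    # fixed-width pool prefix, then big-endian hash, then the name
    return b"%08d." % pool + h.to_bytes(4, "big") + name.encode()

names = ["rbd_data.1", "rbd_data.2", "rbd_header.1"]
keys = sorted(object_key(1, n) for n in names)
# decoding the hash field back out shows the keys sort in hash order
hashes = [int.from_bytes(k[9:13], "big") for k in keys]
```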
We also need the ability to take a given collection and, in a fixed, constant amount of time, split it into halves or quarters or whatever. As part of the process of scaling Ceph clusters up, we need to repartition our data collections — our placement groups — and you can't do that with POSIX. You can't take a directory of a million files and in fixed time split it into two separate directories; that just doesn't happen.
So in practice, what we do in FileStore is build this sort of ugly tree of directories and then files, where the directory names are based on the prefix of the hash for that particular file. You get the sort of deep nested structure that might look familiar from what lots of other projects do, and then, when we have to do an enumeration, we can do it in a fixed order.
It's not particularly efficient, though, because you have this complicated directory structure, and when you hit thresholds — a certain number of objects — you suddenly have to split all these directories into smaller directories, and you get all this extra I/O to the disks, which some people notice as they're filling up the cluster. So it's definitely far from ideal, and we decided it was time to do something different: POSIX was causing more trouble than it was worth, and we wanted to do something else.
So we wanted to make a new implementation of this ObjectStore interface with several goals. We wanted more natural transaction atomicity. We wanted to avoid all these double writes, so we get better efficiency. We wanted object enumeration to be efficient. We wanted clones to be efficient: internally, we frequently have to take an object and make a clone of it — that's copy-on-write for snapshots — and we want to do that without actually copying any data.
On Btrfs we can do that, but on XFS we literally copy objects when you first touch them after a snapshot, which is less than ideal. We were targeting current-generation storage devices — hard disks, SSDs, and NVMe cards — and not really worrying about persistent memory, because we think that's going to be pretty different and the hardware isn't huge yet anyway. We wanted to make sure the code was structured so that there was minimal locking and better parallelism.
So we could, you know, go really fast on SSDs. We also wanted to finally implement the things that we were hoping the file system would do and that never really got delivered. We want full data and metadata checksums on everything we write — Btrfs does this, but we don't use Btrfs in production, for stability reasons — and we also wanted inline compression, which would be great because it lets you store more data.
Instead, an ordered key/value database is sort of a perfect match, because we have objects that have a very well-defined order, we want efficient enumeration and fast lookup, and there are lots of these databases out there. So NewStore was a combination of RocksDB — which we sort of picked semi-randomly — to handle all the metadata, and then the actual data for the objects was still stored in POSIX files with simple names. The idea was that you plug in your key/value database.
RocksDB was what we were targeting, but LevelDB would work — any key/value database wired into that sort of abstraction would work — and then the actual data for an object would just be written to a simple file with a very short name, in nice, big, efficient directories on XFS or whatever, to keep it very simple. So that's the idea with NewStore. It didn't really work very well. The main issue is that RocksDB has a write-ahead log.
That log is a journal RocksDB uses to manage its consistency, and the file system it's sitting on also has a journal that manages the file system's consistency. This whole journal-on-journal thing — you'll find papers written about it — has a very high overhead, because each journal is essentially managing only half of the overall consistency of the system, and so you pay the overhead twice.
You'd hit the device twice: NewStore would try to update the metadata for an object, so it appended a record to RocksDB, which would append to RocksDB's log file and then fsync it — so that would be another two I/Os, one to the RocksDB log file and then again to the file system journal to update the metadata for the log file. You'd end up paying like four I/Os when you really only want to pay two — four flushes instead of two flushes.
The real solution is to put everything in one big journal that manages the consistency of the whole thing. The other problem is that we still need atomicity for being able to do overwrites within the system. In POSIX, you can't take a file that already exists, with data in it, and overwrite some of that data as part of a larger transaction, because POSIX doesn't understand transactions and such.
You could make each object map to a whole bunch of different files with some weird mapping structure, but that gets complicated and inefficient. And so we sort of ended up again where we started, with write-ahead logging: we would log the data to RocksDB in a WAL record that says "I'm going to write this overwrite," commit that atomically with the metadata, and then asynchronously go overwrite the data in the file system. As a general approach that works fine — except that with NewStore it didn't.
With NewStore, the double write of the data ultimately doesn't pan out — which brings us to BlueStore, which is what we're actually doing. BlueStore is so named because it's a combination of "NewStore" and "block device," and we decided that spelling it "BlewStore" would not go over particularly well, so it's BlueStore with a "u." The basic idea here is that we consume raw block devices — a raw disk, /dev/sdb or whatever.
Allocation was previously something that we were picking up from ext4 or XFS or whatever we were using, and now we just have to implement it ourselves — but in exchange we get full control over the I/O path, and some things actually get much, much simpler. The key challenge here is that we have to share the block device with RocksDB, because RocksDB normally writes a bunch of files — its SST files and a log file — and we need to make that sit on top of the same block device.
We actually just want to get XFS or Btrfs or whatever completely out of the picture, and we do that by implementing our own RocksDB backend. RocksDB has a nicely abstracted Env class that captures all of the platform-dependent stuff it has to do, including file I/O, and we implement a very, very simple file system in user space called BlueFS.
So BlueFS, as I said, is a really simple file system, as you'll see. All metadata for BlueFS is stored in RAM: when you start up BlueStore and BlueFS, it loads the metadata and just keeps it in memory — super simple. Which means we don't have to store a free list, because we have all the forward pointers, so we can regenerate it as we mount. It uses really coarse allocation units.
One-megabyte blocks, just to keep things really simple, because RocksDB only ever writes big files and never writes small files — there's no reason to deal with small blocks. And all metadata has a single place it can be written on disk: everything lives in a journal. So the idea is that you just write to the journal.
RocksDB's write-ahead log — its WAL log file, its journal — is written into one directory, and its SSTs into one or two different directories, and we can map those to different block devices, so that the RocksDB log, for example, might go to an SSD or NVRAM, whereas all the other metadata might go to a slower device. The last thing is that BlueStore and BlueFS communicate, so that as BlueFS runs out of space, BlueStore gives it more, and as BlueStore runs out of space, BlueFS gives some back.
RocksDB was written to use these log files that are essentially just a journal, and it would just write a new log file every time, but that results in a pretty inefficient I/O pattern, because it has to append to the log file and then fsync it, and that has to update the metadata for that file. So every RocksDB log commit is at least two I/Os, for the data and the metadata. In contrast, every file system and database in the world that does logging uses a circular buffer.
A circular log just overwrites the same disk blocks over and over again, and it can detect where the end is based on checksums or some other scheme. So we implemented that in RocksDB: it recycles previously used log files and overwrites them, so that a commit is one I/O. That works with RocksDB on regular files and also on top of BlueFS.
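The "detect the end" trick for a circular log can be sketched like this — a simplified, hypothetical illustration, not RocksDB's actual recycled-log format: records carry a rolling sequence number, slots are overwritten in place, and a reader replays until the sequence stops being contiguous.

```python
# Sketch of a circular log: a fixed region of slots reused forever, with
# replay finding the logical end from the sequence numbers alone, so no
# separate metadata update is needed per append.
class CircularLog:
    def __init__(self, slots):
        self.slots = [None] * slots  # stands in for a preallocated region
        self.seq = 0
        self.head = 0

    def append(self, payload):
        self.seq += 1
        self.slots[self.head] = (self.seq, payload)  # overwrite in place
        self.head = (self.head + 1) % len(self.slots)

    def replay(self):
        entries = sorted(s for s in self.slots if s is not None)
        out, last = [], None
        for seq, payload in entries:
            if last is not None and seq != last + 1:
                break  # sequence break marks the end of the live log
            out.append(payload)
            last = seq
        return out

log = CircularLog(4)
for i in range(6):   # wraps around: slots are reused, no new "files"
    log.append("rec%d" % i)
```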
It helps our workload and also benefits other RocksDB users, and that's been upstream for probably three or four months now. So that's good. I sort of mentioned this before with BlueFS, but I'll make it explicit: BlueStore is designed to deal with multiple devices, so you have a couple of different scenarios.
The simplest is that you just have one device — just a hard disk or an SSD — and it puts everything on that one device: RocksDB is there, your object data is there, the journal's there, and it's just really simple and it just works. A slightly more complicated deployment is two devices. In this case, you could have a very small SSD or NVRAM device, put just the RocksDB log there, and then have the main device for everything else.
This is more or less equivalent to how people currently deploy Ceph FileStore with a journal device and a large device, at least as far as what it's doing internally, except that in BlueStore's case it's only a metadata journal, and so the journal device can be much, much smaller — 128 MB is generally enough. So you could have a single, very small, very fast SSD in a system and have lots and lots of disk devices sharing it, or maybe some NVRAM — 128 MB isn't that much.
A third option is three devices: the RocksDB log on the fastest device, the warm RocksDB data on an SSD, and then the really cold RocksDB data and the regular object data on the slow device — so there are several different options. The one thing that we don't support is BlueStore automatically tiering actual object data onto a fast device; we're only using the fast devices for metadata, but that's something that we may explore in the future.
So that's, at a high level, what BlueFS is and what it does. I'm going to talk a bit about BlueStore internally: how it represents metadata and how its internal data structures look. As I mentioned, BlueStore stores all of its metadata inside that key/value database. We partition that namespace into a bunch of different sections. The superblock section is just metadata about the whole system.
What the block size is, what configuration options you chose, and stuff like that. There's a section that we use for block allocation metadata — essentially keeping track of which portions of the disks are free and unused. There's a section we use for stats, just for counting up, you know, how many bytes are written, how many compressed bytes.
How many objects there are, that sort of thing — things you'd see in df. And then the interesting pieces: there's a namespace for collections — this holds the metadata that denotes the collection (placement group) information — and then a larger section that has the mapping of object names to object metadata; that's where most of the data goes. There's also a section we use for write-ahead log entries — I mentioned that, like NewStore, we do write-ahead logging for data.
Collection metadata is stored in a cnode, and in practice there's actually only one field in the cnode that matters: the number of bits. Each collection represents a shard of the overall pool namespace: all objects in a pool have a 32-bit hash associated with them, and a collection is sort of a fraction of that overall 32-bit space.
So basically, the name of the collection is the value of the hash, and the bits indicate how many bits of that hash are significant — how many have to match the object's hash in order for the object to be considered part of that collection. That's represented graphically here: you'll notice the placement group has this particular prefix, which maps to a hash prefix, and everything beneath that prefix belongs to it.
Where those 19 bits match, the object is in the collection, and the remaining 32 minus 19 bits can be different. This has a couple of nice properties. Obviously, we get ordered enumeration of objects: we carefully construct the key for each object pair in the database so that it sorts in exactly the order that we want Ceph to sort objects, and you'll notice the objects are in hash order.
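The cnode "bits" idea reduces to a prefix comparison on the hash, which is a sketch worth making concrete (simplified illustration; field names invented). Notice that splitting a collection is then just bumping `bits` by one — membership of every object is recomputable with no per-object work:

```python
# Sketch: a collection owns every object whose 32-bit hash matches the
# collection's hash in the top `bits` bits.
def in_collection(obj_hash, coll_hash, bits):
    shift = 32 - bits
    return (obj_hash >> shift) == (coll_hash >> shift)

coll = 0b1011_0000_0000_0000_0000_0000_0000_0000
a    = 0b1011_1111_0000_0000_0000_0000_0000_0001   # top 4 bits match coll
b    = 0b0011_1111_0000_0000_0000_0000_0000_0001   # top 4 bits differ
```

With `bits=4`, object `a` is in the collection and `b` is not; after a split to `bits=5`, `a` falls into the sibling child, which is exactly the constant-time repartitioning POSIX directories couldn't give us.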
That's a nice property that previously we had to do a lot of work to accomplish, and it's sort of trivial with the simpler key/value model. (And let me make my phone stop making that noise.) Most of the interesting stuff is actually in the onode. The onode stores per-object metadata, and it lives directly in a key/value pair: the key will be, roughly, the name of the object, and then the onode is all the metadata about it.
It serializes to hundreds to thousands of bytes — it might be a few kilobytes, depending on whether you have checksums enabled; we're doing some tuning there to make it smaller. But the main pieces of information in the onode are the size of the object in bytes — the logical size — and the attributes associated with it. Remember that an object has sort of small inline attributes, like "version=2", that sort of thing, that are stored inline with the metadata.
It has data pointers that indicate where the byte data associated with that object is stored on disk, and this is a two-level mapping. You'll have a logical extent that maps a range of the object to a range of a blob, and a blob maps to a particular range on disk and may or may not be compressed or have different checksums associated with it.
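The two-level mapping can be sketched as follows — a simplified illustration with invented field names, not BlueStore's actual encoded structures:

```python
# Sketch: logical extents carve up the object's byte range and point into
# blobs; each blob records where its (possibly compressed) bytes live on disk.
from dataclasses import dataclass

@dataclass
class Blob:
    disk_offset: int   # where the blob's bytes start on the device
    length: int

@dataclass
class Extent:
    logical_off: int   # offset within the object
    blob: Blob
    blob_off: int      # offset within the blob
    length: int

def resolve(extents, offset):
    """Map a logical object offset to a disk offset (None for a hole)."""
    for e in extents:
        if e.logical_off <= offset < e.logical_off + e.length:
            return e.blob.disk_offset + e.blob_off + (offset - e.logical_off)
    return None

blob = Blob(disk_offset=1 << 20, length=0x20000)
extents = [Extent(0, blob, 0, 0x10000),
           Extent(0x10000, blob, 0x10000, 0x10000)]
```

The indirection through the blob is what lets several logical extents (even from different objects, after a clone) share one region of disk.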
So onodes have extents that map to blobs, and blobs map to some region on disk. And then, finally, there's a field, the omap head, that indicates a prefix in the omap keyspace for all the key/value pairs associated with that object. So if you have user omap data — sorted key/value data — this tells you where to go find it.
So that's the onode. There's one other structure I haven't mentioned yet, called a bnode, and the reason it exists is that we also need to store the metadata about those blobs. You'll notice that the onode has the mapping of the object space to logical extents, which map to blobs, but it doesn't actually contain the blobs themselves.
Usually we store the blobs next to the onode, so the key/value pair will be the encoded onode with a bunch of blobs sort of appended to the end of it. But occasionally we have blobs that are referenced by multiple objects. This happens when we take an object and clone it — for example, for a snapshot — and then we'll have two separate objects that both have logical extents pointing to the same blob. In that case, we can't put the blob in the onode.
But regardless of whether those blobs are stored in the bnode or the onode, they're the same thing: just a map from an identifier — 1, 2, 3, 4, whatever — to the blob metadata. And when you point to a blob, a positive value means the blob is in the onode; a negative value means it's in the bnode.
But that means that if you have a bit that gets flipped on disk, there's some window where it's wrong before we actually detect the error, which is unsettling, because maybe you read the object before you notice that it doesn't match its replicas, which is unfortunate. And even if you scrub and do find the inconsistency, you might not necessarily know which replica is the wrong one — you might know that there are two copies that are the same and one that's different.
If you have three replicas, that's a pretty good indicator, although not necessarily a guarantee, because maybe there was some recovery or migration and you just copied the bad copy to another location — so you're never really sure. So with BlueStore, we want to validate a checksum on every read: we store a checksum for everything we write, and whenever we read something, we always check the checksum to make sure that it is actually what we meant to get.
That means that BlueStore blobs have to store more metadata than just where the data is stored — they also store the checksums for that data. We can use multiple checksum algorithms; the default is crc32c, which is sort of the industry standard. The only real problem is how much checksum metadata you end up storing.
It's doable, but it's sort of a lot, and there are lots of cases where we don't have to store that much and we can have larger checksum blocks. So, instead of checksumming every 4K, we could checksum 16K blocks or 128K blocks, whatever it is. We could also use smaller checksums.
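The granularity tradeoff being described can be sketched as one crc32 per chunk (using Python's zlib crc32 as a stand-in for crc32c): larger chunks mean proportionally less stored checksum metadata, at the cost of having to read and checksum a whole chunk to verify any small portion of it.

```python
# Sketch of the checksum-granularity tradeoff: one checksum per chunk.
import zlib

def checksum_blob(data, chunk_size):
    return [zlib.crc32(data[i:i + chunk_size]) & 0xFFFFFFFF
            for i in range(0, len(data), chunk_size)]

def verify_chunk(data, csums, chunk_size, index):
    chunk = data[index * chunk_size:(index + 1) * chunk_size]
    return (zlib.crc32(chunk) & 0xFFFFFFFF) == csums[index]

data = bytes(range(256)) * 1024          # 256 KiB of sample data
fine = checksum_blob(data, 4096)         # one checksum per 4K: 64 entries
coarse = checksum_blob(data, 131072)     # one per 128K: 32x less metadata
```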
What we do is sort of a matter of policy, but because we control the whole stack, we have a bunch of hints now that we can use to drive these choices. So, for example, if the client hints that this object is going to be written sequentially and read sequentially — for example, it's written by the RADOS Gateway and we never have small overwrites — then we might as well have large checksum blocks, so that we have compact checksum metadata on the side of the object.
Since we can't overwrite a small piece of something that's compressed anyway. And in the end, the plan here is just to have policies that you define on a per-pool basis. So you might say that this pool is used for RBD, it's going to have random 4K I/O, and so I'm going to have maybe smaller checksums, but at fine granularity, so I can do efficient overwrites, that sort of thing — and maybe this other pool is used by the RADOS Gateway.
It's all sequential, and so I'll have other hints that indicate to use different checksum policies. So that's the plan there. And then there's compression. 3x replication is obviously expensive: you have to buy three times as many disks as the amount of data that you're storing, and in fact, even besides that, anything scale-out is just inherently expensive, because you're buying a lot of something. At the same time, lots of the data that we're storing is highly compressible, so it seems like we could do better.
So BlueStore implements inline compression: it'll sort of magically compress things before they get written to disk, so it uses less space. It's a little bit tricky to actually implement it efficiently. We need largish extents on disk to get a compression benefit: you can't take a 4K write, compress it to, you know, 2K, and then write 2K, because the block size of the disk is 4K — it can't really get smaller than that anyway.
So we take larger blocks — say 64K or 128K — compress those down, and then write them in less space, which means you have sort of largish blobs, chunks, that are compressed into smaller pieces. And that works fine if you're just writing data and objects in their entirety into the system. The trick is when you need to support overwrites, so hopefully this diagram helps.
B
hopefully it makes sense to people. You have this logical mapping of an object, which starts on the left and ends on the right. Maybe initially you write the object sequentially, which translates into two big chunks that get compressed. These gray blobs indicate the uncompressed regions of the data, and then each gets compressed down to that blue thing, which gets written somewhere on the disk. Maybe it does that twice.
B
You have two big chunks, and then say later you come back and overwrite certain parts of it; maybe you overwrite a little bit over the first region. What BlueStore basically does is it just says: oh, you're occluding, you're overwriting something that was compressed before. We can't really touch the compressed thing, because it's all compressed, so we're just going to write that data somewhere else on disk and we'll logically point to it.
B
So effectively, part of the compressed data is obscured: it's still on disk, but you logically can't read it, because it's been overwritten by something else. And it might even be that we write something really small, smaller than the unit of allocation; say we write ten bytes that are inside a 4K block. We might still allocate that full 4K block for the data and then only point to that small logical region. That is the other reason BlueStore does this.
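The occlusion behavior can be modeled as a toy extent map where reads resolve to the newest write covering each byte; the class and method names here are illustrative, not the real C++ structures:

```python
# Toy model of extent occlusion. Each write appends a logical extent;
# reads resolve to the newest data covering each offset, so older
# (possibly compressed) blobs become partially "occluded": the bytes
# are still on disk, but no longer logically reachable.

class ExtentMap:
    def __init__(self):
        self.writes = []  # (logical_offset, data), in write order

    def write(self, offset, data):
        self.writes.append((offset, bytes(data)))

    def read(self, offset, length):
        out = bytearray(length)
        # Replay oldest-to-newest: later writes shadow earlier bytes,
        # exactly like a newer small extent occluding a compressed blob.
        for woff, data in self.writes:
            lo = max(offset, woff)
            hi = min(offset + length, woff + len(data))
            if lo < hi:
                out[lo - offset:hi - offset] = data[lo - woff:hi - woff]
        return bytes(out)
```

A small overwrite on top of an earlier extent leaves the old bytes in place but unreadable, which is the diagram's gray-under-blue situation.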
B
It has a bunch of heuristics to try to keep the resulting structure as simple as possible, and then the idea is that if it starts to get too complicated, where you have too many layers of occlusion, then it'll flip a little trigger that tells the OSD: this is getting too crazy, I'm just going to read the data and rewrite it in a more efficient format.
B
In the general case, you get relatively efficient I/O patterns and layouts, but if it starts to get too crazy, then we'll force a compaction, effectively, and write it more efficiently. I think in practice, most cases where you're going to enable compression are going to be sequential, and so you really won't trigger any of this weird layering and overwrite stuff.
B
So that's how BlueStore represents the mapping of an object to logical extents to blobs, and then the blobs map to disk. You'll notice that this blob structure tells us, you know, which type of checksum algorithm we're using and the checksum metadata, and if the compression flag is set, it will also tell you which algorithm was used to compress the actual data, so we know how to read it. Right now we implement zlib and snappy; you can plug in whatever other algorithm you want.
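A rough sketch of the per-blob metadata just described; the field names are assumptions on my part, and plain crc32 stands in for the crc32c that BlueStore actually uses:

```python
from dataclasses import dataclass, field
from typing import List, Optional
import zlib

@dataclass
class Blob:
    """Illustrative per-blob metadata: where the bytes live on disk,
    how they are checksummed, and optionally how they were compressed."""
    disk_offset: int
    disk_length: int
    csum_type: str = "crc32c"          # checksum algorithm in use
    csum_block: int = 4096             # granularity of each checksum
    csums: List[int] = field(default_factory=list)
    compression: Optional[str] = None  # e.g. "zlib" or "snappy" when set

def checksum_blob(blob, data):
    """Store one checksum per csum_block of data."""
    blob.csums = [
        zlib.crc32(data[i:i + blob.csum_block])
        for i in range(0, len(data), blob.csum_block)
    ]

def verify_blob(blob, data):
    """Recompute checksums on read and compare with the stored ones."""
    expect = [zlib.crc32(data[i:i + blob.csum_block])
              for i in range(0, len(data), blob.csum_block)]
    return expect == blob.csums
```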
B
It's very easy to do that. So, the data path is what I'll cover next. This is basically how the code flows when we're taking data off the wire in the OSD and actually trying to write it to disk. There are a few basic concepts. We have the notion of a sequencer at the object store layer, which basically represents an independent stream of transactions being fed to the object store that need to be ordered with respect to each other.
B
So normally there's one sequencer per placement group, and each placement group is emitting an ordered set of transactions for that placement group that are updating an object and adding an entry to the PG log. But you have lots of placement groups on your OSD, so you have lots of these sequencers; you typically have like 50 or 100 independent streams of transactions that you can be working on concurrently, and only some of the actual transactions have to be ordered with respect to each other.
B
Each transaction is represented in memory inside BlueStore with what's called a TransContext. This is sort of a transaction in progress and all the state describing what it is currently doing. Then, at a high level, there are two ways that BlueStore will write data. Most of the time, we'll just do a new allocation; at its heart this is a copy-on-write, write-anywhere type file system, so any write that's larger than the min_alloc_size goes to a new, completely unused, unwritten, freshly allocated region of disk.
B
So we just find some new empty space on disk and we write the data directly out, and then we have to get the metadata to point to that new region of the disk. Once the I/O completes, once it's all done, we can commit the RocksDB transaction that actually points to it. So that's all fine and good in general. Sometimes you have small writes, though: writes that are smaller than the min_alloc_size.
B
For those we do a WAL-style, write-ahead-log update, where we commit the transaction that updates all the metadata, and part of that transaction will be an entry, a temporary entry, in RocksDB that says: I promise to overwrite this data over on these blocks of disk. Then after that commits, it'll asynchronously go and actually do that update to that previous location. So this is effectively data journaling; it's sort of what we used to do with NewStore and with FileStore. But you'll notice
B
we only do it when the write is very small, and the idea is that you'll have a knob, essentially, that you tune, so that if it's faster to do the write-ahead log, then do that, and if it's not faster, then you'll write it to a new place first and then update the transaction. And which is faster sort of depends on the properties of your storage device.
B
On a hard disk, you know, generally, anything under 64K, it makes more sense to do the write-ahead logging. On an SSD, currently we only do it for something less than 4K, but I have a feeling that even stuff that's larger than 4K, maybe 8K or 16K, might still be a win to do the write-ahead logging, and we need to do some testing to actually find out, because we don't all agree on whether that's the case. But it might be.
B
The main nice thing about write-ahead logging is that you have a single I/O to commit the transaction, and then you can acknowledge the write; it's stable once it's committed along with this write-ahead promise, and you can ack it back to the client. Whereas, if you're doing a new allocation, what you have to do is a write to the new space, wait for that to be durable, and then do the transaction commit, and then wait for that to be durable.
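The choice between journaling a small overwrite and allocating fresh space can be sketched as a simple threshold test; 64K and 4K are the numbers quoted in the talk, and the function name is hypothetical:

```python
def use_wal(write_len, device="hdd",
            hdd_threshold=64 * 1024, ssd_threshold=4 * 1024):
    """Return True if a small overwrite should be journaled in the
    key/value DB (one I/O, ack immediately, apply asynchronously)
    rather than written to freshly allocated space (data write,
    wait for durability, then commit the metadata transaction).

    The thresholds are tunable per device type; the talk notes the
    SSD value may well move up to 8K or 16K after more testing.
    """
    threshold = hdd_threshold if device == "hdd" else ssd_threshold
    return write_len < threshold
```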
B
B
So basically the idea is that each TransContext starts out in a prepare stage, where we prepare all the updates that we're going to make to the metadata: we figure out where we're going to write the data, we choose our disk blocks and everything. If there is data that's being written to disk first, then we'll initiate some I/O and then we'll go into the aio_wait state.
B
At that point, in the queued state, we might actually have to wait for a while, and the reason is that you'll have multiple TransContexts within a sequencer that have to commit in order, and maybe the one in front of us was doing I/O and it's waiting for its I/O, and we come after it, and we can't commit until the one in front of us does; it also has to commit first. They sort of are in a chain and they have to go in order.
B
So on the right you sort of have a picture of this, where you have a request that's in the kv_queued state, and the one in front of it is in aio_wait, waiting for its I/O, and so we're sort of blocked. But once we have a bunch of stuff in the queued state that isn't waiting on I/O, then they all go into the committing state; we give them to RocksDB to commit, we wait for that to actually happen, and if that succeeds, if we're lucky, then we're just done.
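The per-sequencer ordering can be sketched as a queue that only drains its ready prefix; the state names loosely follow the ones in the talk, and the class shapes are mine:

```python
from collections import deque

class TransContext:
    """A transaction in progress, with a simplified state field."""
    def __init__(self, name, needs_aio):
        self.name = name
        self.needs_aio = needs_aio
        self.state = "prepare"

class Sequencer:
    """Transactions within one sequencer (one PG) commit in order."""
    def __init__(self):
        self.q = deque()

    def submit(self, txc):
        txc.state = "aio_wait" if txc.needs_aio else "kv_queued"
        self.q.append(txc)

    def aio_finished(self, txc):
        txc.state = "kv_queued"

    def commit_ready(self):
        """Pop the prefix of the queue that is ready to commit. We stop
        at the first transaction still waiting on I/O, because nothing
        behind it may commit before it does."""
        batch = []
        while self.q and self.q[0].state == "kv_queued":
            txc = self.q.popleft()
            txc.state = "kv_committing"
            batch.append(txc)
        return batch
```

A later transaction with no I/O of its own still waits behind an earlier one that is in aio_wait, which is exactly the blocking case described above.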
B
It's actually not too bad; there are really just sort of two queues. Caching: so BlueStore implements its own cache in user-space memory. Remember, it's sitting directly on top of a block device; all of the I/O it does to that block device is now using direct I/O, so it's not using the kernel for any caching whatsoever. That's mostly the case; there are a few bits of code where we're in the process of removing that, but that'll be the end result, at least.
B
There's a structure called an onode space that caches a mapping of object names to the onode metadata. These are onodes that we've recently touched, that are in memory, that are all decoded; they're ready to use, we have everything ready to go. There's also a buffer space structure that's a mapping of object offsets to buffers, and that is attached to each onode in memory. Actually, it's at the blob level: each blob in memory has a buffer space that might have some buffers associated with it, so we sometimes cache actual data, too.
B
So both buffers and onodes have life cycles that are linked to another structure called the cache. It has a couple of different implementations: we have one implementation that is a trivial LRU, and we have another one that implements the 2Q cache replacement algorithm, which is much better than LRU; it's resistant to sequential scans, preventing those from pushing out hot data. So the cache manages the overall life cycle and trimming, and the basic idea here is that the cache is sharded for parallelism.
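A minimal sketch of the 2Q idea, assuming a probationary FIFO plus a protected LRU; the real 2Q algorithm, and BlueStore's implementation of it, have more detail:

```python
from collections import OrderedDict

class TwoQCache:
    """Simplified 2Q replacement: new entries go to a probationary
    FIFO, and only entries touched a second time are promoted to the
    protected LRU. A one-pass sequential scan therefore churns the
    FIFO without evicting the hot working set, which is the failure
    mode plain LRU has."""
    def __init__(self, fifo_size, lru_size):
        self.fifo = OrderedDict()
        self.lru = OrderedDict()
        self.fifo_size = fifo_size
        self.lru_size = lru_size

    def access(self, key):
        if key in self.lru:
            self.lru.move_to_end(key)           # refresh hot entry
        elif key in self.fifo:
            del self.fifo[key]                  # second touch: promote
            self.lru[key] = True
            if len(self.lru) > self.lru_size:
                self.lru.popitem(last=False)
        else:
            self.fifo[key] = True               # first touch: probation
            if len(self.fifo) > self.fifo_size:
                self.fifo.popitem(last=False)

    def __contains__(self, key):
        return key in self.fifo or key in self.lru
```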
B
So BlueStore might have many, many cores that are processing these transactions, and so we basically take the collections and we shard them and map them to different cache shards, and then effectively those will have some affinity to different cores or CPUs in the system. And we use the same mapping that the OSD also uses, a layer up, in its work queue.
B
So the OSD already shards requests across multiple cores by their collections, and then we use an identical sharding scheme lower down, so that the same CPU context will do all the OSD-level processing of the request and then the request to the object store, which will do a bunch more processing and, you know, twiddle the buffers' positions in the LRU or whatever, all within the same CPU context. So you won't have cache lines bouncing around between CPUs. We think that's going to work pretty
B
well. We haven't done extensive performance testing on that yet, but we think it'll work. The one thing that hasn't been done yet is that the I/O completions currently all happen in a second thread, and those might do some updates as well. So we may end up sharding the I/O completions also, so that the completions also happen on the same core, but we're not sure; we'll see what happens. I think that'll only matter on really fast devices like NVMe.
B
Let's see. There are a couple of other things that happen there. There's a FreelistManager, which is sort of a module that keeps track of the space on disk that is unused, that's in the free list; it's responsible for having a persistent representation of what parts of the disk aren't being used. The initial implementation was just based on extents, so you have a bunch of key/value pairs in the database that each have an offset and a length: a region of the disk that's not being used.
B
It would have an in-memory copy, so it'd know which keys to delete and update when you do an allocation or deallocation. The problem with this approach was that it enforced an ordering, because you had to delete the old keys and insert new keys, and if you reordered those transactions it would corrupt its representation. So you ended up, with this older implementation, having to serialize all of those allocations and deallocations in a single thread.
B
We replaced this with a bitmap-based approach, where we basically have an offset on the disk mapped to a bitmap that represents a bunch of blocks starting at that offset, and these are relatively small key/value pairs. So a key might only cover 128 blocks, which, divided by 8, is only 16 bytes of actual bits for a particular region.
B
We have a bunch of these keys, and then we leverage the merge operator in RocksDB, which sort of does a deferred XOR of the operands when RocksDB does its compaction and so on. So instead of doing a put into the key/value database, we do a merge, which basically just tells it which bits to flip, either because they were allocated or deallocated, and RocksDB does that efficiently in the background later, when it goes and compacts things. The nice thing about that representation,
B
this new scheme, is that there's no in-memory state and there's no ordering constraint, as far as having to serialize all the transactions in the system. It might sound kind of weird that I say there's no in-memory state, because it's our free list and we need to know what is or isn't used. That's because this is the FreelistManager module, which is just responsible for persisting the free-list representation.
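The merge-operator trick can be modeled like this: allocations and frees both queue XOR masks, which commute, so no ordering and no in-memory free-list state is needed. The class and sizes here are illustrative, not the RocksDB API:

```python
# Toy model of the bitmap free-list update via a merge operator.
# Instead of read-modify-write (which forces ordering), each allocate
# or free just records "flip these bits"; the store XORs the queued
# operands together lazily, so the operations commute and need no
# serialization.

class MergeStore:
    def __init__(self, blocks_per_key=128):
        self.db = {}        # key -> committed bitmap (as an int)
        self.pending = {}   # key -> XOR of queued merge operands
        self.blocks_per_key = blocks_per_key

    def merge_flip(self, block):
        """Queue a bit-flip for one block. Alloc and free look the
        same: both toggle the bit, which is why XOR works."""
        key, bit = divmod(block, self.blocks_per_key)
        self.pending[key] = self.pending.get(key, 0) ^ (1 << bit)

    def compact(self):
        """What RocksDB would do in the background: fold merges in."""
        for key, mask in self.pending.items():
            self.db[key] = self.db.get(key, 0) ^ mask
        self.pending.clear()

    def is_allocated(self, block):
        key, bit = divmod(block, self.blocks_per_key)
        val = self.db.get(key, 0) ^ self.pending.get(key, 0)
        return bool(val >> bit & 1)
```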
B
We have a separate module, called the Allocator, that's responsible for deciding where we should allocate new data, and that obviously does still need to have state, because it needs to know what parts of the disk are in use. So it's also an abstract interface; we can plug in different implementations.
B
The first implementation was affectionately called the stupid allocator. It was extent-based; it would sort of bin free extents by how big they were, and when you allocated something, it would try to get something that was big enough, but not too big, sort of nearby to wherever you hinted for the allocation. It works pretty well. Unfortunately, though, it has a very variable memory usage if your device gets fragmented.
B
So there is a new implementation, called the bitmap allocator, that SanDisk wrote, that uses bitmaps to indicate which parts of the disk are in use. And it's not just a single bitmap where it's one bit per block; it also has a hierarchy of indexes layered on top of that to indicate whole regions of blocks that are either completely used or completely unused. So if you're looking for a large extent, you can look at the higher-level indexes to find big chunks more efficiently.
B
And the nice thing about this implementation is that it has a fixed memory consumption per terabyte of disk space: it uses about 35 megs of RAM per terabyte, which is predictable, and you can plan around it, you can just budget for it, and it's relatively compact, pretty reasonable. So that's sort of the new default; that's what's going to happen. I mentioned that the allocator is pluggable, so we also have an intern, a Google Summer of Code student, working on adding native support for SMR hard disks.
B
These are these new, annoying devices that the manufacturers are producing that write data on disk in an annoying way that prevents you from doing arbitrary overwrites. The disk is separated into all these zones, or bands, that have to be written sequentially; not necessarily all at the same time, but you have to sort of write in order. If you go back and overwrite something, it'll sort of corrupt what comes after, so you have to write them in these sort of stripes.
B
So there's a library, libzbc, that lets you query the zone layout and manage the write pointers and so forth, that we're going to be using. The current crop of prototype devices that we're experimenting with are host-aware, which means that the disk will sort of let you do whatever you want, but it'll get really inefficient as it sort of tries to fix things up behind the scenes.
B
But the goal is to make BlueStore work with either host-aware or host-managed drives, so we'll sort of carefully make sure that we're writing in an appropriate way, so that we're using the disk efficiently. The SMR allocator will become very simple: we'll just need to keep track of the write pointer per zone, and we know that everything after the write pointer is unused, because of the way that we're forced to write to these disks. That is nice, because the allocator has almost no state.
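That near-stateless SMR allocator can be sketched with just a write pointer per zone; the zone size and names here are illustrative:

```python
# Sketch of an SMR-style allocator: each zone keeps only its write
# pointer. Everything past the pointer is known to be free, and
# allocation within a zone is always sequential.

class SMRAllocator:
    def __init__(self, zone_size, n_zones):
        self.zone_size = zone_size
        self.write_ptr = [0] * n_zones   # the allocator's only state

    def alloc(self, length):
        """Place a write at the write pointer of the first zone with
        room, advancing that zone's pointer. Returns a disk offset,
        or None if every zone is full."""
        for z, ptr in enumerate(self.write_ptr):
            if ptr + length <= self.zone_size:
                self.write_ptr[z] = ptr + length
                return z * self.zone_size + ptr
        return None

    def reset_zone(self, z):
        """Reclaiming a zone rewinds its pointer; the zone must then
        be rewritten sequentially from the start."""
        self.write_ptr[z] = 0
```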
B
So this is sort of a work in progress; we'll see how it goes. These devices are going to be cheaper and bigger, and so it'll pay to support them. They're never going to be as fast as sort of more normal hard drives or flash, obviously, but they're good for capacity plays, where you're building archival clusters, where you're just shoveling in as much data as possible, and you're probably compressing it, and so on.
B
So that's BlueStore; that's how the data and metadata paths work, at a high level. Now I'm going to talk a little bit about performance. These graphs were produced a month and a half or more ago, with an earlier prototype version, so they're pretty preliminary and they're not super detailed, but they give you a taste of what we expect the final version to look like. This is just looking at sequential writes on a standard spinning
B
hard drive. You'll notice that for large I/Os we're exactly twice as fast, which is basically what you'd expect. Notice that FileStore is blue and BlueStore is red, so that's sort of annoyingly inconvenient. Alright, so in the large I/O case, FileStore has to double-write everything, to the journal and to the disk. We don't do that.
B
We write it just once and then update the metadata, so we're twice as fast. For small I/O we're not quite twice as fast, but, like, you know, 60% faster, and much more predictable, and the latency is, yeah, it's better. So we're pretty happy with this. Random writes look good; they're also about twice as fast, almost twice as fast. On the left is streaming throughput, and on the right is IOPS, so you can sort of see detail on both ends.
B
The one sort of interesting thing here is that there's this little kink between the 32K and 64K writes. That's because that's where we transition from doing the write-ahead logging, where we'd sort of journal the overwrite update and then go do it asynchronously, versus just writing to a new region of disk. And you'll notice that it's around this region that there's a trade-off, and we set it around 64K because we don't want to have highly fragmented objects on the disk.
B
B
B
Those were all on hard disks. We did do experiments with SSDs and NVMe; unfortunately, I don't have graphs for them. On NVMe, random writes are good; they're much faster. Our testing when we were doing this was seeing some anomalies because of the kernel; there was like a weird issue with the driver on the CentOS kernel we were using on the machine, and it was confusing. That got sorted out later, but I didn't end up redoing the tests. But it's faster; it's just not as fast as we want it to be.
B
It's similar to the hard disk result, in that BlueStore is basically two times faster, except that on the SSD the small-I/O benefit was more pronounced. So, whereas with hard disks BlueStore is a little less than two times as fast, with the SSD BlueStore is a little bit more than two times as fast for small I/Os; for large I/Os it's like exactly 2x. Yep, so there's more work to do there, but overall we're pretty happy.
B
Just sort of 2x across the board, as a very approximate improvement, is pretty compelling. So, current status: lots of the stuff is done. We have a fully functional implementation; it's in the master branch. The I/O path works, it's stable, it does checksums and inline compression; it all works. There's an fsck that you can run either explicitly, or you can set an option
B
so that it does it every time it starts up and shuts down, which we do during QA. And we have these new bitmap-based allocators and free lists that are implemented and that work. Our current development efforts are focused in a few areas. The main thing right now is we're focusing on making the encoding of the metadata for our onodes more efficient.
B
Once we added compression and checksums, suddenly we're storing a lot more metadata per object, and we need to make sure we store it very efficiently in RocksDB, or else RocksDB gets big, and when it does its compaction, it just generates lots of I/O. So we're doing a lot of that performance tuning. There's also an effort underway at SanDisk to take ZetaScale, which is a key/value database that they designed specifically for flash and recently open-sourced,
B
and have that plug into BlueStore as an alternative to RocksDB. So ZetaScale is a B-tree-based implementation; it sort of generates I/O that is friendly to SSDs, whereas RocksDB is this log-structured merge tree that sort of lends itself better to spinning disks. So, assuming that works well, and their initial performance tests are very promising, the plan is that when you create an instance of the key/value database, it'll look and see:
B
is it an SSD or a hard disk? And based on that, it'll decide whether to use RocksDB or ZetaScale for its back-end database, and you just sort of get performance that maps to whatever the best choice is. There's also some implementation work to do, still, around making sure that when we have these compressed blobs and we overwrite data, we sort of prevent the metadata from getting too complicated; I sort of alluded to that earlier.
B
That's in progress right now. And then, what's coming after that: we want to add per-pool properties to RADOS, so that, as a policy, for a pool or an entire cluster, you would set options that say this particular pool should have this type of checksum, or should use compression, or should not be compressed, based on the type of data that you're storing there. That'll inform BlueStore to do whatever it needs to do, compress or not compress, that sort of thing. Lots more performance optimization is coming, and stabilization.
B
This native SMR support I mentioned is in progress, and there's also a patch set, most of which is actually merged, integrating SPDK, which is an Intel library for kernel bypass for NVMe devices. It puts the driver for the NVMe card all the way in user space and talks directly, sort of over memory-mapped PCI, to the device, to have very fast access to NVMe.
B
So most of that's there, but it's sort of awkward to use and test. But the goal is that, eventually, if you're doing NVMe cards and BlueStore, you'll have sort of a full user-space stack that's very efficient and fast for those devices. Where can you get BlueStore? So, it is still experimental. It should not be used with production data. It will lose your data, most likely; it's early days still. We do have an experimental implementation in Jewel, which was just released two months ago.
B
You have to enable this scary option, 'enable experimental unrecoverable data corrupting features = bluestore rocksdb', and we mean it: you will lose your data, most likely. It's stable enough to do benchmarking and, like, some basic testing, but the disk format has already changed, so if you then upgrade your cluster from Jewel, you won't be able to read the data.
B
So don't put anything important there. ceph-disk in Jewel has basic support for BlueStore, but it only supports the full-device mode, where everything is all on one disk, and it doesn't automatically set up multiple partitions to put the RocksDB WAL on a different device and so on; you have to do that kind of manually, and tediously, unfortunately. The Jewel implementation also predates checksums and compression; it doesn't do that stuff yet, but it does have sort of the overall I/O flow, so you'll get the ballpark of the performance that we expect to see.
B
If you pull the current master code in git, then we have the new disk format, and that does do checksums and inline compression, and it works. It's still changing, so again, if you write data and you then upgrade, you won't be able to read it back; probably, definitely I should say. But it's looking pretty good; again, it's pretty stable and functional. The goal for BlueStore is to have a stable version for Kraken.
B
So that's due to be released in October of this year: a version that you can deploy on a production cluster, that will have a stable disk format, and that will not eat your data. That's the goal. We've done most of the development as far as all the feature work; it's really about optimizing the on-disk data structures and then doing lots and lots of testing, and performance testing, and optimization, and so on. And so I'm feeling pretty good about being able to meet that goal.
B
But that's what we're aiming for. And then the secondary goal is that, by the next release after Kraken, which would be Luminous in the spring of '17, it will be the default backend for the OSD. So if you stand up a new cluster, or add OSDs to an existing cluster, it will use BlueStore by default instead of FileStore, and we can finally sort of deprecate all the legacy stuff.
B
So that's the plan. So, in summary: obviously, Ceph is great. Originally we built it as software-defined storage on top of POSIX file systems, but that was sort of a poor choice. It works, but it has huge performance disadvantages, and there's lots of complexity in working around things that don't do what we need them to do. So we built a new backend called BlueStore. We embedded RocksDB, which is great; RocksDB rocks.
B
It's easy to embed. BlueStore is cool: it does full data checksums and inline compression, and it's fast; it's roughly twice as fast as FileStore. That's what I have. Any questions about anything I've talked about? You can ask a question either verbally or post it in the chat. Let's see, I've got a couple of things.
B
How is cache reclaiming performed in the case of tight memory conditions on the OSD server side? So, there is a tunable that you set on the OSD that basically tells BlueStore how much memory it's allowed to use, and that's how much memory it uses. Usually you want to just set it at, you know, a few hundred megs, probably. I think the default is going to be set at 512 megs, but you can set it to whatever you want.
B
If you set it really small, data that's in the cache while it's still being written, that hasn't been committed to disk, is effectively pinned in memory, because the transaction might be doing a read of stuff it wrote previously, whatever. So it should behave okay if you set it to a really low number, but we haven't tested that yet. And yes, the kernel page cache is no longer used for BlueStore; it is in the Jewel version.
B
We do rely on the kernel in Jewel, but in the master version and going forward, we won't be using the kernel cache at all. On-disk fragmentation does happen, because we are a copy-on-write system. Right now there's no defragmentation, except sort of implicitly if you rewrite the data. It's unclear what we're going to do there. It might be that we don't do anything at all; that probably isn't the case.
B
It might be that, if the data is fragmented, we just let it stay fragmented until you read it, and then, if we read a bunch of data and we notice that it was fragmented, we just sort of, in the background, queue a rewrite to defragment it, sort of opportunistically, as you read fragmented data. That way there's no background read cost; there's only a background write cost. But we'll see. What algorithm is used for compression? We use snappy and zlib; those are the two that are currently plugged in there.
B
There are a couple of others out there that look interesting; there's something that looked cool, I forget exactly what it was called. It claims to be very fast and as small as zlib, but who knows. The nice thing about zlib is that it's, like, inflate/deflate, which have been around for a million years, and there are versions of it that are optimized with, like, special CPU support on Intel processors. Once that stuff is wired up, then hopefully it'll go super fast; it'll be basically free, which will be nice.
B
But we do let you configure the allocation unit and the size at which we allocate new space versus overwrite old space, so that's sort of the only knob you have right now. Yeah, we'll see. Let's see... oh, is there deduplication in BlueStore? There's no deduplication, and there are no plans for deduplication.
B
The expectation is that dedupe in Ceph is going to be implemented at the RADOS level, as part of the tiering infrastructure, because RADOS is sort of randomly spraying objects across OSDs. If you write the same data twice, it's usually not going to land on the same device, and so you're not going to be able to dedupe it. So any dedupe you do on a single OSD is going to have a very limited benefit; the plan is to have sort of an indirection layer.
B
B
Is there a tiering feature in BlueStore? Only sort of. BlueStore tiers across multiple devices just for metadata versus data, and RocksDB sort of does it with its metadata: the colder metadata goes on the slower device. BlueStore doesn't tier the data itself yet; we're unsure whether we're going to do that or not. You can do that in the block layer using dm-cache or flashcache or bcache, and those seem to work reasonably well.
B
So the real question is whether those are going to work well enough, or whether we think we can do better and it's worth the additional complexity to do tiering in BlueStore itself. In theory, we might be able to do better, because we know more about the workload and the data than you can sort of discern from the block layer, and so I think eventually we'll probably end up doing it, but initially it's not sort of on the table just yet. There's a question in the chat about memory requirements.
B
B
So I set the default amount of memory to use for the buffer cache at 500 megs; I just sort of picked that randomly out of a hat as a reasonable number per OSD, given what hardware people usually deploy today. But you can make it smaller or bigger. If it's smaller, then you'll have more cache misses and you'll do more I/O and it'll be slower; if it's bigger, then, well, you can sort of choose. For the OSD itself, independent of the object store, the memory utilization, you know, seems to fluctuate between,
B
you know, 150 megs to, like, three hundred megs or more, sort of depending on how many PGs you have and how much I/O you're doing and that sort of thing, and so the BlueStore overhead will be part of that. It's going to appear as though the OSDs are using more memory than they did before, because with FileStore, all of the cache is managed by the kernel.
B
It's not part of the OSD process, and so the OSD would appear to be, like, 200 megs, and then the kernel would have, you know, like 16 gigs of page cache memory that the OSD is taking advantage of. With BlueStore, it's all going to be part of the process memory, so the OSD process is going to look bigger and the kernel is not going to have any cache, basically; so it'll be a little different.
B
There's a question about EC pools: it will be compatible. The EC pools are going to set all the hints to tell BlueStore that the I/O is sequential and has large blocks and all that stuff, so that BlueStore will trigger everything that it can: it'll be more likely to compress the data, it'll be more likely to do larger checksum blocks, which will make the metadata smaller, all that good stuff. So it'll work, it'll work fine with EC. Is there BlueStore support for trim on SSDs? Currently, no, not yet.
B
That is still on the to-do list. It's a little tricky, because before SSDs pay attention to your trim, the trim needs to be big enough to sort of align with whatever the unit of trimming on the SSD is. So this will have to be supported by being wired into the allocator, the bitmap allocator, so that when it sees that an entire large chunk gets freed up, then it will issue the trim. And it needs to be easily turned on and off, because some SSDs just do bad things when you issue trim.
B
Like, trim is not efficient on them, and it, like, stalls the I/O pipeline, or sometimes even, like, corrupts things, so we have to be careful. But eventually, yes, it will be supported; it just isn't yet. I should note, though, that at the RBD level, trim is supported; that just sort of logically throws out that part of the object. At the object store layer, when you do a trim, you're just punching a hole in the object; it's just dropping the logical references to that data.
B
We don't have to deallocate it right away; we just drop the lextents, so trim, from BlueStore's perspective, a layer up, is sort of trivial. It just twiddles this metadata, which is kind of nice. Let's see, Steve Taylor asks: are there tests that might elicit the performance differences in RBD clone writes using BlueStore?
B
Yes, so this is going to be a big difference. Right now, if you run Ceph OSDs with FileStore and you snapshot a block device and then you do a write, that first write is going to trigger a literal four-megabyte copy on the OSD before the write happens, and with BlueStore that's going to be free, essentially; it's a copy-on-write.
B
It's just a metadata update. So we're trying to think of what a good way to trigger this would be. I mean, you could just take a snapshot of a block device and then go into that VM and then, like, write a small file and do an fsync, or overwrite an existing file.
B
Let's see: does the OSD-side cache support different strategies, write-back and write-through? The cache is always write-through, because everything the OSD ever writes has to be committed before it can reply to the client, and so a write-back policy on the OSD just doesn't make sense, because the OSD is a server, and you never want to say you wrote something when you didn't, because you'd lose data. So it's always write-through.