Ceph creator Sage Weil speaking at the Storage Developer Conference in Santa Clara in Sep 2012.
My name is Sage, I'm from Inktank, and today I'm going to talk about the Ceph distributed storage system. Just a brief outline of what I'm going to talk about: a little bit about why you should care about another storage system, what Ceph is at a high level, how it works and what it does. I'll talk a bit about the distributed object storage layer, and specifically about some of the interesting features of the object API that talks to that distributed object store.

So to begin with, why should you care about yet another storage system? You've heard many of these talks during the course of the conference. I think there are a few reasons. One is simply a matter of requirements: people have very diverse storage needs. Some people need object storage because they're building the next web 2.0 application and they're going to dump a bazillion images into some big distributed server farm.

Other people need block devices for running virtual machines and so forth for their public or private cloud infrastructure, or maybe they want to replace their legacy SAN that's too expensive. Other people need a shared POSIX file system because they're running legacy applications, or because that's what their users demand, and so forth. And other people are doing big data types of things, where they have structured data and, frankly, they don't really know what they want.

Maybe it's file, maybe it's object; it's maybe not particularly clear. But common across all of these things is that people really need systems at scale. So when you're building out a large infrastructure for your enterprise, you need to be able to incrementally add nodes to go from terabytes to petabytes to exabytes.
Ideally, you want cost to be a linear function of the size of your cluster, and performance as close to linear as you can get. You don't want the sort of exponential curve that you get when you're buying more expensive options. You would like incremental expansion, so you don't have to deal with forklift upgrades, and ideally no vendor lock-in, so you have a choice in what kind of hardware you run on and what kind of software you run on. So what a lot of organizations are demanding is really an open source solution that they can run on whatever hardware they choose. That's sort of the ideal situation, of course, from our perspective, and that obviously begs the question of what exactly Ceph is. Well, it tries to address many of these concerns. First and foremost, Ceph is a unified storage system.
The idea here is that we can deal with multiple interfaces to storage: object storage, using our native APIs or RESTful APIs compatible with S3 or Swift; virtual disks or block devices, with features like thin provisioning, snapshots and cloning; and finally a POSIX distributed file system, so you can actually store files and directories and so forth in the cluster. And you do all of this with the same unified storage infrastructure, so it's sort of an API stack. At the bottom you have this component called RADOS; that's a reliable distributed object store, and that's the thing that scales to thousands or tens of thousands of storage nodes, makes sure that all your objects are replicated across multiple nodes, dynamically moves data around as cluster state changes, and handles the key reliability and scalability pieces. And then, on top of that, you can talk to that distributed object store in a number of ways.
You can use the native librados API directly if you just need raw object storage for your custom application or something. There's the RADOS Gateway component that sits on top of that API and gives you S3 and Swift compatible object storage using RESTful APIs. There's an RBD component that gives you a virtual disk: that's essentially a logical disk that's striped over objects that are then stored in this distributed object store, so it's shared, reliable, all that good stuff. And finally, there's a distributed file system that also leverages the reliable storage abstraction to build a higher-level service where you have POSIX semantics, with files and directories and so forth.

Ceph is open source; it's licensed under LGPLv2. It's a copyleft license, but you're free to link to proprietary code, so it's very easy to integrate into other projects. And, unlike some other projects, there's no copyright assignment, so the project copyright isn't held hostage by a single company that can relicense it and extort money and so forth; it tends to be very friendly in that sense. There's an active community of users and developers, and there's also commercial support.
You manage it as a single unit, and for the most part the cluster works away behind the scenes to deal with all the details of moving data around, all in software. The Ceph distributed object store is based on an object storage model, and the basic idea is that you have some number of pools of storage that are durable, logical collections of objects; each pool is effectively an infinite namespace that can collect many objects.

So the question is: why do we start with objects? Why build a distributed object store instead of a distributed file system first? The first reason is that objects are much more useful than starting with blocks. In contrast to, you know, just a drive, where you have blocks that are sequentially laid out, you can't really name them, and you have to deal with all the allocation details and so forth, objects are named.
So that's what the Ceph architecture does, and here's a slightly different picture of it. You start with a number of disks. On top of each disk you slap a local file system, typically Btrfs or something like that; you can also run on ext4 or XFS, although Btrfs is sort of where we're going in the future. And then on top of that file system you have an object storage daemon, the Ceph OSD, which manages that particular local set of data and then communicates with the other daemons in order to provide a higher-level abstraction. You typically have a whole bunch of these inside a node, and then you have a bazillion of these nodes to form your larger storage cluster.

You additionally have some number of monitor nodes that are responsible for essentially herding the cats. They deal with cluster membership and state; they use Paxos to make sure that we know who is participating in the cluster and what their role is at a particular point in time. But these guys aren't actually involved in any of the data path; they're only involved in cluster management and cluster state, in contrast to the object storage daemons. You need at least three of them.
One of the key problems in designing a system like this is deciding how your data should be distributed. Our requirements are pretty simple: we want all objects to be replicated some number of times; it's totally tunable, but usually two or three is what people choose. We also want those objects to be automatically placed and balanced in a dynamic cluster, because these systems are going to change over time as disks fail and new storage is deployed and so forth. And we also want to consider the physical infrastructure.

So we want to make sure that if we're replicating objects, we place replicas in different racks of the data center, so that a single power circuit failure won't affect the availability of my data, for example. There are sort of three basic approaches you can take for deciding where to store data. One is to pick a location and remember where you put it, so when you come back a week later you try to go back to the same place, and hopefully your data is still there.
The problem with that is that if something happens, say that host failed or a rack failed or so forth, that won't actually be the case, so it's not a very good strategy. A more typical approach is that you pick a location for the data and then you write down where you put it, in some sort of metadata server or index.

A very different approach is to use a hash function to determine where your data should be stored. Essentially you calculate a location based on the current state of the cluster, and that tells you where to store it; and then, when you read it, you perform the same calculation and it tells you where to go to find it. To do this, Ceph uses a function called CRUSH. It's a pseudo-random placement algorithm that is a fast calculation of where to store data; there's actually no lookup involved.
So we don't have to bother with maintaining an index; we just calculate the result whenever we need to find it. It's a repeatable, deterministic calculation, of course, but the key property is that it maintains a stable mapping, so that if you have 100 servers and you add one, typically 1% of the data is going to move to that new server. You don't get the sort of random reshuffling that you would have with a naive hashing algorithm. The other key thing about CRUSH is that it's very flexible, in that you can specify rules that determine how your replicas are placed in the cluster. So, for example, you can specify that I want three replicas and I want them all to be in the same row of the data center, so that my replication traffic doesn't traverse my spine network core routers or something like that.
In a bit more detail, what this looks like is that you have essentially a pool with a bunch of objects, and you hash the names of the objects, essentially modulo the number of what we call placement groups. We're sharding this logical pool into a bunch of different pieces, and this gives us some number of placement groups, which are rainbow colored here if the animation works correctly. Then, for each of these placement groups, we feed it into the CRUSH algorithm, and that calculates where in the cluster those guys are going to be stored.
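To make that two-step mapping concrete, here is a minimal sketch: hash the object name to a placement group, then map the placement group to a set of OSDs. This is not the actual Ceph code; the hash choice, the PG_NUM value and the crush_map_pg placeholder are illustrative assumptions, and real CRUSH is a hierarchical pseudo-random descent over the weighted cluster map rather than a simple lookup.

```python
import hashlib

PG_NUM = 64  # assumed number of placement groups in the pool

def object_to_pg(object_name):
    """Step 1: hash the object name into one of the pool's placement groups."""
    h = int.from_bytes(hashlib.md5(object_name.encode()).digest()[:4], "little")
    return h % PG_NUM

def crush_map_pg(pg_id, osd_ids, replicas=2):
    """Step 2 (placeholder): a stand-in for CRUSH that deterministically picks
    `replicas` distinct OSDs for the placement group.  Real CRUSH walks a
    weighted hierarchy (rows, racks, hosts) described by the cluster map."""
    chosen = []
    i = pg_id
    while len(chosen) < replicas:
        candidate = osd_ids[i % len(osd_ids)]
        if candidate not in chosen:
            chosen.append(candidate)
        i += 1
    return chosen

# Any client holding the same map performs the same calculation; no lookup table.
pg = object_to_pg("myimage.0000000000000001")
print(pg, crush_map_pg(pg, osd_ids=[0, 1, 2, 3, 4, 5]))
```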
So you get this declustered approach, where your placement groups are scattered pseudo-randomly across the cluster. In this particular example, our rule specifies that we will always choose one node from the top row and one node from the bottom row; say they're different hosts or different racks or something like that. The way this works is that the distributed object store, RADOS, periodically publishes what's called an OSD map. That's essentially a snapshot of the current state of the cluster: which OSDs are participating, the CRUSH map that specifies how data is mapped onto those nodes, all the current IP addresses, and so forth. Using that particular map, we can calculate where any particular object or piece of data should be stored in that storage cluster. The object storage daemons are then responsible for safely replicating the data in the storage cluster, and a new map is published over time whenever, say, a node comes up or goes down.
Those nodes are responsible for using peer-to-peer protocols to migrate data to the new location specified by that map, and they use gossip protocols to efficiently share these map updates, so that they stay in sync about what the current distribution of data should be in the system. This is a very decentralized, distributed approach that allows massive scale, because you don't have any central coordination. Aside from the fact that you're publishing maps that say which nodes are up and down, nobody has to say, "you, take this piece of data and move it over there." Instead, the leaf nodes, the OSDs, can do that on their own, because they all have a shared view of reality based on these OSD maps. A client then also gets a copy of this map. Say it needs to store a particular object: it can do the CRUSH calculation and it'll know that, you know, it's in the green placement group stored on these two nodes. It has complete knowledge of where all data is stored.
That's all by virtue of this mapping algorithm. What happens, then, if you have a node that fails? Say we lose the node holding the yellow and the orange placement groups. In that case, the other OSDs that are replicating those placement groups realize that their peer went down, essentially because they got a map update. They can see that the replicas are no longer there, they identify who the new home is for those data objects, and they actually migrate them. This is a fully peer-to-peer process; nobody has to do it for them, so again it scales very well. And then a client, if it needs to read that object, will get a new copy of the map, and it can go to the new location and find the data where it should be.

So librados is the low-level API that talks directly to this distributed object store.
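As a rough illustration of what consuming that API looks like, here is a small sketch using the Python librados bindings; the pool name, object name and ceph.conf path are assumptions made up for the example.

```python
import rados

# Connect to the cluster described by the local ceph.conf (path is an assumption).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Open an I/O context on a pool, the logical container of named objects.
ioctx = cluster.open_ioctx('data')

# Write and read back a named object; placement is computed with CRUSH,
# so no central lookup is involved on either operation.
ioctx.write_full('hello-object', b'hello from librados')
print(ioctx.read('hello-object'))

ioctx.close()
cluster.shutdown()
```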
librados also provides a number of other interesting features that make this particular object API very interesting to consume. One of those is atomic transactions: a client request can actually contain multiple operations in a single request, and those will be sent to the object storage node and applied atomically. Either they all succeed and commit atomically, or none of them commit and it fails, which is kind of nice. So this gives you atomicity, which is very helpful.
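A sketch of what such a compound request might look like through the modern Python bindings, where a write op batches several mutations that the OSD applies as one transaction. The WriteOpCtx and operate_write_op names come from current python-rados, not from the talk, and the object and key names are invented; treat the exact method set as an assumption.

```python
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')

# One request, several mutations: the OSD applies the data write and the
# key/value update as a single transaction, so both commit or neither does.
with rados.WriteOpCtx() as op:
    op.write_full(b'new contents')
    ioctx.set_omap(op, ('state',), (b'clean',))
    ioctx.operate_write_op(op, 'my-object')

ioctx.close()
cluster.shutdown()
```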
You can also do conditional requests. You can send a request, for example, that says: make sure this xattr is equal to 1, and if so, apply this operation, but if it's not, don't do anything. So you can do atomic compare-and-swap type operations, and that's all mediated by the object storage cluster.

Each object also carries key/value data. It's based on Google's leveldb implementation, which is pretty nice; it's the BigTable SSTable design, which gives you efficient range queries, insertion, that type of thing. So you can insert, update and remove keys, and the key thing this allows you to do is efficient read-modify-write type workloads, where you can say, for example, that I just want to remove certain keys, and that operation will happen efficiently on the OSD without having to read the entire object over the wire, make some small change and write it out again.
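A hedged sketch of that key/value interface through current python-rados (these omap calls exist in today's bindings, though not necessarily in this exact shape in 2012; the object and key names are invented):

```python
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')

# Insert a few key/value pairs into the object's omap.
with rados.WriteOpCtx() as op:
    ioctx.set_omap(op, ('alice', 'bob'), (b'1', b'2'))
    ioctx.operate_write_op(op, 'index-object')

# Range-read the keys back without fetching any object data.
with rados.ReadOpCtx() as op:
    it, ret = ioctx.get_omap_vals(op, "", "", 10)  # start_after, prefix, max
    ioctx.operate_read_op(op, 'index-object')
    for key, value in it:
        print(key, value)

# Remove one key; the mutation runs on the OSD, with no read-modify-write round trip.
with rados.WriteOpCtx() as op:
    ioctx.remove_omap_keys(op, ('alice',))
    ioctx.operate_write_op(op, 'index-object')

ioctx.close()
cluster.shutdown()
```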
One of the other interesting things we can do is what we call watch-notify. Essentially, you can establish an interest, or watch, on a particular object in the object store. Multiple clients can observe a particular object, and then they can send notify messages to each other. So you can use an object as a meeting point and a message communication channel for clients to coordinate, which allows you to do similar things to what you might do with something like Apache ZooKeeper, but using the distributed object store as the basis for that type of coordination.

Here's an example of what this actually looks like. A number of different clients each send a watch on the object in question; they get a commit back that says they've registered that watch interest in the object. Then later, if somebody sends a notify request, that notify gets distributed to all the different clients who have watched it, and when they all acknowledge, the notifier finally gets a notify acknowledgement that says: yes, I've notified everybody who's watching. And so you can use this for a number of different things.
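As a rough sketch, the current Python bindings expose this as Ioctx.watch() and Ioctx.notify(); the callback signature and object names below follow recent python-rados and should be treated as assumptions rather than something shown in the talk.

```python
import time
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')
ioctx.write_full('rendezvous', b'')  # the object used as a meeting point

def on_notify(notify_id, notifier_id, watch_id, data):
    # Called whenever some other client notifies on the watched object.
    print('got notify:', data)

# Register interest in the object...
watch = ioctx.watch('rendezvous', on_notify)

# ...and (typically from another client) broadcast a message to all watchers.
ioctx.notify('rendezvous', 'cache invalidation, please reload')

time.sleep(1)   # give the callback a moment to fire in this single-process demo
watch.close()
ioctx.close()
cluster.shutdown()
```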
One example of what we use it for is the RADOS Gateway component that implements the S3 gateway: it uses watch-notify to manage its own cache consistency. It sends invalidate messages, essentially, to invalidate entries out of the other RADOS Gateway instances' caches, because you typically deploy, like, a hundred different RADOS Gateways behind a load balancer or something, and they're all caching things, so they use this to keep their caches coherent, which is pretty useful.

One of the more exciting things you can do, though, is actually implement what we call RADOS classes. You can dynamically load a shared object into the OSD that implements new functionality for objects, built on top of the existing functionality. So, for example, you can implement new read methods on these objects, if you will, that will run arbitrary code, essentially, and then for a read request they'll do some transformation and give you a response; or for a write they can do some higher-level mutation on the object and then atomically commit that to disk.
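The class itself is written as a plugin loaded into the OSD (the upstream examples are C++), but from the client side invoking one of its methods looks roughly like the sketch below, assuming the execute() call in current python-rados; the class name 'hello' and method 'say_hello' refer to the sample class shipped with Ceph and are assumptions as far as this talk is concerned.

```python
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')
ioctx.write_full('greeting', b'')

# Ask the OSD hosting the object to run a method from a loaded object class.
# The transformation happens server-side, next to the data.
ret, out = ioctx.execute('greeting', 'hello', 'say_hello', b'world')
print(ret, out)

ioctx.close()
cluster.shutdown()
```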
Moving on, there's also the radosgw component that's built on top of librados, which gives S3 and Swift compatible API access for people who want a drop-in replacement in their own infrastructure for applications that are targeted at S3 or Swift interfaces. It's built on top of librados and is relatively straightforward.

There's also the RADOS Block Device, which gives you a virtual disk. Essentially, we just take a single disk image and stripe it across lots of objects and then distribute those across the cluster. That's pretty well integrated and works pretty well: you can imagine essentially taking all these OSDs, distributing little blocks across them, aggregating those into a virtual disk, and then attaching that to a computer, or more typically to multiple VMs. There are a number of nice things you can do: you can take snapshots of these virtual disks, and it's linked directly into the QEMU/KVM virtual machine framework, so you can have a virtual machine that's backed by the Ceph cluster without any kernel support.

So again, you're taking lots of objects, linking them through the librbd library, integrating with some virtualization container, and presenting a virtual disk that's consumed by a virtual machine. And of course, because we're dealing with shared storage here that's reliable and so forth, you can also do nice things like live migration of virtual machines on top of that.
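For a feel of the block layer, here is a small sketch using the Python rbd bindings on top of librados; the pool name, image name and size are made-up values, and these are the modern bindings rather than anything specific to 2012.

```python
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')

# Create a 1 GiB image; behind the scenes it is striped over many RADOS objects.
rbd.RBD().create(ioctx, 'vm-disk', 1024 ** 3)

# Read and write it like a disk; a snapshot is just another metadata operation.
with rbd.Image(ioctx, 'vm-disk') as image:
    image.write(b'boot sector bytes', 0)
    print(image.read(0, 17))
    image.create_snap('before-upgrade')

ioctx.close()
cluster.shutdown()
```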
But probably the more interesting piece here, as far as complexity goes, is the Ceph distributed file system. It's probably about as many lines of code as everything else combined, because file systems are complicated, even when you start on an object storage foundation. The key idea here is that clients mounting the file system talk to metadata servers to deal with everything related to the file system namespace, so resolving paths, traversing the hierarchy and so forth. But when they actually need to read and write file data, they talk directly to the OSD nodes to read and write the objects that store that file's data. This allows the data path to be highly parallel, distributed, scalable and so forth, and then we just have to deal with having a file system hierarchy that's spread across multiple metadata servers. That's really where the complexity comes in.

So we have these new metadata server components, the Ceph MDSs. They're responsible for managing the POSIX file system hierarchy and dealing with all the file metadata: owner, mode, uid, gid, all that good stuff. They store all of their metadata in RADOS, so again we're leveraging the fact that we already have this magical, reliable, distributed, scalable storage abstraction. These daemons are caching lots of stuff in memory and then writing everything out to RADOS on the back end, and, of course, they're only necessary if you're using the Ceph distributed file system.
If you're only talking to the object store, the metadata servers don't even get involved, because those objects are trivially parallel and all that good stuff. But building the distributed metadata server was an interesting design problem. Part of it was due to the fact that the legacy metadata storage approaches are sort of a disaster. You typically have a file name that maps to an inode, which is stored in some other table, and that inode has a block list, which is all the blocks on disk, so you have to go look those up before you find the data. That's multiple levels of indirection before you actually read a file, which is sort of annoying. And the inodes are stored in a different table; often the locality isn't very good, and the inode table gets fragmented, so even though you're looking at sequential file names, the inodes are scattered in random places on disk. Things get fragmented, it's lots of seeks, and it's difficult to partition.
The first observation is that block lists aren't necessary. We're storing our file data in objects; objects are variable-sized and we can name them, and so we can just name the objects that store the file data using the inode number, maybe with a block number. So we don't have any block metadata at all, and Ceph inodes are very small and relatively packed, because they're basically fixed size.

The other observation is that inode tables are usually useless, because most of the time you only have a single file name linking to a single inode, and so in those cases we can embed the inode directly in the directory entry that refers to that file. That means that in Ceph we just store an object that holds the contents of the directory: it has all the file names and, most of the time, all the inodes that those file names refer to.
So we can do a single I/O to the OSD, read a single object, and get all the file names for a directory and all the inodes, so we can do things like ls -al very quickly. We leverage the key/value objects on the back end to make this all efficient and easy to manage. The real challenge is that we have this one big tree, one big file hierarchy, and we have multiple metadata servers, so how do you make that work?

What Ceph does is dynamically carve up the tree hierarchy. We take big chunks, subtrees of the overall directory hierarchy, and assign them to different metadata servers based on the current workload, based on how busy those subtrees appear to be. And we can do this more or less arbitrarily, so there's all the logical complexity to migrate subtree management between metadata server nodes and arbitrarily partition the hierarchy across metadata servers.
A
We
call
dynamic,
subtree
partitioning,
and
the
nice
thing
about
this
approach
is
that
it's
scalable
we
can
take
a
hierarchy,
sort
of
arbitrarily
carve
it
up
into
little
pieces.
So
that's
that's
nice,
because
we
want
to
be
able
to
have
hundreds
of
metadata
servers.
The
other
nice
thing
is
that
it's
adaptive,
so
the
monitor
the
metadata
servers
are
sort
of
monitoring
how
busy
the
file
hierarchy
it
is
that
they
have
cached
is
at
any
point
time
and
if
they
decide
that
they're
overloaded,
they
can
sort
of
take
about.
A
Twenty
percent
is
about
this
big
sub
tree
over
here
and
they
can
shunt
it
off
to
another
metadata
server,
and
this
is
based
on
the
current
workload.
So
if
your
work
load
shifts
later
and
suddenly
or
have
a
total
different
file
set
that
you're
working
with
the
metadata
server
will
adapt
by
splitting
that
file
set
into
smaller
pieces
and
distributing
across
the
cluster,
so
that
you're
sort
of
always
utilizing
all
available
metadata
server
resources.
It's
efficient,
we
as
a
sub
tree
based
partition,
so
that
we
preserve
locality
within
the
workload.
A
One of the challenges, though, is dealing with metadata I/O. Metadata tends to be very small and it's updated very frequently, and you want to avoid a situation where you have lots of small writes to the object store, because no matter how well you optimize it, that tends to be a nightmare. So the way we approach this is to view the Ceph metadata server as sort of a big cache, with a journal in front.

The idea is that the journal is essentially a large sequential file, or log, whatever you want to call it, that we stripe over objects in the object store. Whenever there's an update to the system, we write it out to the journal, and at that point it's durable and committed and we can move on. So we end up with two tiers: all the recent updates get consumed by writing things out to the journal, and then later, when the journal gets big and we start trimming things off the end of it, we take all those updates and push them out to the long-term storage, which is the per-directory objects that store the file system hierarchy on the back end. The nice thing, of course, with the journal is that you have very fast failure recovery.
If a metadata server crashes, you can just read the journal back in. But the more important thing is that, as the journal grows over time, you notice an interesting effect. The things you write at the beginning of the journal start out dirty; those updates exist only in the journal. But metadata tends to be updated multiple times repeatedly: you might change the same directory many times as you do a compilation or so forth, and so, as the entries in the journal get older and older, they tend to become stale and don't actually contain any useful information anymore. By the time we get to the very end of the journal, which is maybe an hour old, most of the metadata we wrote out there isn't actually even dirty anymore; it's since been updated more recently in the journal. And when we do have a directory that needs to be updated, we can take all the updates that have happened to that directory over the last hour, build them up into one single large transaction, and generate a single I/O that goes out to the object store and updates that directory. So even though our overall write pattern is random, because we're updating these directories, we tend to consolidate the writes to each directory over a long period of time and ship them out efficiently, and the overall aggregate I/O pattern generated by these metadata servers tends to be very good.
One of the big questions is what actually gets put in the journal, and it's a trade-off. There's lots of state in any complicated system like this, and what you actually do with that state is a trade-off. On the one hand, if you journal that state, it's expensive up front, because you actually have to write it to the journal, but it's very cheap to recover: when you restart the metadata server and recover from a crash, you read it in sequentially and you get it all back. So that's expensive up front but cheap to recover. On the other hand, if you don't journal state, you tend to need complicated protocols during recovery to reconstruct that state. Some examples of things that you would journal would be the fact that client sessions are open, which particular clients are accessing the file system, and of course actual modifications to the metadata in the file system; those things have to go in the journal, because they're important and you want to recover them later.

On the other hand, things like cache provenance, you know, the fact that I have a particular piece of metadata in one metadata server's memory but it's replicated in other metadata servers' caches or in client caches: that type of information is very expensive to journal, because there's a lot of it and it's happening all the time, and we don't want to generate all that I/O. That means there's a trade-off when we do recovery: when the metadata server restarts, clients have to reconnect and resynchronize to re-establish the shared state in order to move on. But one of the key things that we do do is flush these updates early: whenever there are client modifications that the client is sending to the metadata server, the metadata server is queuing them up and getting ready to send them out to the journal.
The client protocol, where the Ceph clients talk to the metadata servers, is, generally speaking, highly stateful. We aim for strict POSIX consistency: we would like processes interacting with the file system to behave the same whether they're on the same host or on different hosts, just maybe a little bit slower when they're on different hosts. That's the level of consistency we enforce, in contrast to protocols like NFS, which are notoriously weak in this area. The clients get a seamless handoff between metadata server daemons, because they're using our own protocol rather than a legacy protocol like NFS; they understand the fact that they're talking to lots of different metadata servers and they can behave intelligently as a result. So when a client is traversing the hierarchy, it seamlessly moves over to the different metadata servers that are managing that part of the file tree, and so forth. And when the metadata servers are doing their load balancing and moving things around, they tell the clients about it, so the clients can shift their cache state and so forth to make that work well. Of course, when they're actually reading and writing file data, they talk directly to the OSDs.
So here's an illustrative example of what this interaction looks like. You have a client here, and he's happy; he's going to mount the file system, so he does mount -t ceph with the IP address of one of the monitors; that's how you identify the particular cluster. Initially there are going to be a few round trips to the monitor as he authenticates and gets a ticket that says he's allowed to talk to these metadata server daemons; he's also going to learn who the metadata servers are and what their IP addresses are, and what the OSDs are and what their IPs are. Then there will also be a couple of round trips to the metadata server as he opens up the root directory: he opens up a session and gets a handle, essentially, on the root directory, so he can mount the file system.

The metadata server is going to journal something to the OSDs, because it wants to record the fact that it now has a persistent session open with this particular client. Then, say, the client traverses into a directory: there are going to be a couple of round trips, a pair of round trips to the metadata server, as he looks up /foo and then /bar inside that directory. If the metadata server has a cold cache,
it will load those directories off of disk, so there will be some corresponding I/O requests to the object store to populate the metadata server cache, but that's generally pretty quick. Then, if the client does an ls -al because you want to see what's inside this directory, there'll be an open operation that actually involves no interaction, because he already has a handle and a lease on that directory inode, so there's no MDS interaction necessary there. And then, when he does the readdir to fetch all the directory entries, there's going to be a single round trip to the metadata server to fetch all the directory names; again, if there's a cold cache, it'll load all that stuff off of disk in a single I/O to load that directory in. But the reply is actually going to contain not only the directory names, with leases that say they're valid until otherwise invalidated, but also all the inodes that those names refer to, which we get for free because they're embedded in the directory. So when the client then does a stat on every single file, there's no additional metadata server traffic necessary: he already has it all in his cache, it's all right there in the VFS, and he just plows right through it. And when he closes, that's essentially a no-op as well.
Finally, if the client is going to copy all the data in that directory to somewhere else, he now has all of the inodes for all those files, and leases on those inodes saying that they're not going to be changed, and so he can go directly to the OSDs that store the file data and copy those objects to a local file without any further interaction with the metadata server. So again, this means that the metadata server workload is very low and efficient: the client has these highly stateful leases and all the prefetching and caching and so forth, and when he actually does start to do file I/O, we can spread that across the entire cluster and do it all in parallel, and it's going to be fast and wonderful.

One of the other interesting things the metadata server does is what we call recursive accounting. Because we're essentially implementing a file tree from the ground up, we can do all sorts of interesting things.
For each directory, we keep a summation of all the file sizes nested beneath that point in the hierarchy, stored in that directory's inode. So, for example, when you do an ls -al, the file size you see for a directory, instead of being a sort of meaningless number that's a multiple of 4K or something like it is on ext3, is actually the sum of all the file sizes nested beneath that point. It's essentially what you would get from a du, but it's free; it's accumulated over time, efficiently, by the MDS. We also maintain file and directory counts and the most recent modification time. For example, if you dump the extended attributes on any of these directories, you can see all these different statistics, which is interesting. And the key thing is that it's efficient: whenever there are changes, this information is lazily propagated up the hierarchy by the metadata servers and stored. So it's not one hundred percent accurate at any point in time, but it's way cheaper than doing a du to try to figure out why your disk is filling up, which user is writing data, and so forth. So it's pretty great for system administrators.
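For example, those recursive statistics are exposed as virtual extended attributes on each directory, so reading them from a mounted Ceph file system can look like the sketch below; the mount point path is an assumption.

```python
import os

mount_dir = '/mnt/cephfs/projects'  # assumed CephFS mount point

# Recursive statistics maintained by the MDS, exposed as virtual xattrs:
for name in ('ceph.dir.rbytes',    # total bytes nested beneath this directory
             'ceph.dir.rfiles',    # number of files beneath it
             'ceph.dir.rsubdirs',  # number of subdirectories beneath it
             'ceph.dir.rctime'):   # most recent change time in the subtree
    print(name, os.getxattr(mount_dir, name).decode())
```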
One of the other interesting things we do is snapshots, on a per-directory granularity. One of the problems is that when you have a petabyte-scale file system, you don't necessarily have a single data retention or snapshot policy that makes sense for all the different types of data you're going to store in it, and so instead we empower users to create snapshots on any subdirectory, and that applies recursively to everything nested beneath that point. And we do this with a very simple interface, without any special tools.

There's a hidden .snap directory, and if you want to create a snapshot, you just do a mkdir inside this hidden directory with some name you choose, and that, poof, essentially creates the snapshot. It has sort of the usual semantics: you'll notice, if you look inside a subdirectory, that its .snap directory shows it's also part of that snapshot, although the name is mangled to avoid collisions. And the semantics are what you'd expect: you delete a file, it disappears, but if you look inside the hidden snap directory, it's still there. Then, when you're done with a snapshot and you want to delete it, you can just do an rmdir on that magic directory, and it goes away and is efficiently cleaned up on the back end.
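In other words, snapshot management is just plain directory operations against the hidden .snap directory; a minimal sketch, assuming a CephFS mount at a made-up path:

```python
import os

project = '/mnt/cephfs/projects/alpha'   # assumed directory on a CephFS mount

# Create a snapshot of this subtree: just mkdir inside the hidden .snap dir.
os.mkdir(os.path.join(project, '.snap', 'before-cleanup'))

# Files deleted afterwards remain visible under the snapshot...
print(os.listdir(os.path.join(project, '.snap', 'before-cleanup')))

# ...and removing the snapshot is just an rmdir; cleanup happens on the back end.
os.rmdir(os.path.join(project, '.snap', 'before-cleanup'))
```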
The file system client is implemented in a number of different ways. There's a native client in the Linux kernel that's been upstream for two or three years now, which you mount with mount -t ceph; you can re-export that as NFS, if you want to do it sort of the unusual way. There's also a FUSE version of the client, implemented in user space, that uses the generic FUSE API to mount. And there's a shared library that you can link directly into an application, if you want to build something on top of the Ceph file system but don't actually want to mount it as a native file system in your kernel. There are a number of things we've done with that.

One example is that there are patches to glue libcephfs into the Samba VFS, so you can directly re-export Ceph as CIFS without actually mounting it as a kernel file system. Another example is the Ganesha user-space NFS server; those patches actually support pNFS on top of Ceph, which is sort of an interesting thing. And another is Hadoop, so you can use Ceph in place of HDFS and run all your MapReduce stuff on top of Ceph and still get all the data locality features, where you run the computation on the same node that the data is stored on, and so forth. So it's very easy to consume.
Here's a picture of the current status of the project. RADOS, librados, the RADOS Gateway and RBD are very stable; people are using them in production, and they're generally pretty awesome. The file system is a bit more complicated; it's nearly awesome. It needs a bit more deliberate QA effort, and the story there is that I bit off a lot initially. I was working on this project for a long time, sort of on my own, and implemented all kinds of features in there, with the snapshots and the recursive accounting and the scalability and so forth, and it's been being used for a long time. But it hasn't had the sort of deliberate QA effort that you need, with a real QA team and all sorts of automatic regression testing and failure testing and so forth.
A little bit (I don't know why my animations are all screwed up) about why we work on Ceph. There are limited options for open source, scalable storage: there's Lustre in the HPC space, there's Gluster, and a few other things, but there aren't that many options that really scale big, and that's an emerging requirement for things like public cloud and private cloud infrastructures in particular, and also for big data and so forth. The proprietary solutions that people use instead tend to be very expensive. People want to run it on commodity hardware, so they can choose, you know, the cheapest SATA drives they want, or really expensive Fusion-io drives, and then run a distributed, scale-out system on top of that and grow out from there.

Ceph was originally created at UC Santa Cruz; it grew out of some Department of Energy grants for petascale storage. After I finished my dissertation work, it was developed at DreamHost for several years, sort of as a skunkworks pet project of my own. More recently, we spun out a company called Inktank that's dedicated to supporting Ceph properly as an open source project, so that companies wanting to deploy this as a storage system can actually buy level 2 and level 3 support, consulting, performance tuning, that sort of thing, so they can actually run it in their environment. And there's a growing community: the Linux distros are picking it up, there are lots of users, it's integrated with OpenStack and CloudStack, system integrators are looking at it, and OEMs are looking at it as a basis for their future scale-out storage products.
It's a little complicated, but it works; it ends up meaning that the cost to resolve a remote link in that sense is a little bit more expensive. It's roughly logarithmically expensive to find the inode, versus a typical inode table, which is order one, a fixed cost. So it's not that bad, but it's not quite as good for workloads where you have bazillions of hard links.

How do you deal with running out of space, is the question. The easy answer is: you don't. Generally, as the cluster starts to fill up, you just deploy more storage nodes and things rebalance out of the way. Part of the problem is that we're using a hash-based distribution, so when you're writing a piece of data, you don't get to choose where the data is stored; the hash function does that for you. So the key is to make sure that the variance in the utilization of the different nodes is relatively tight, and there are a number of features to actually make that happen. But essentially, once you start having devices that are approaching full, an OSD map is published that basically says: everybody slow down, switch to synchronous writes; and eventually, when it reaches a certain point, it says: everybody stop writing, because we're full. So, yes.
Yes, so the question is: how do you deal with the semantics of cross-directory renames? There are sort of two cases there. One is when the target directory is on the same metadata server; that's easy, you just journal it and update the trees and so forth. The harder case is when you're renaming across metadata servers, which sounds hard. In reality, the fact that we have this ability to dynamically move subtrees between metadata servers, and that we're already describing the distribution in terms of these trees that are mapped to servers, means we can leverage some of that. When we rename a directory somewhere else, we're actually only moving the inode, and in the new location it appears as if that subtree has been remapped back to the server where it already was. So we're updating the subtree map, moving that one inode and updating the hierarchy, but other than that there's no expensive bulk migration that has to happen to make it work. It is complicated: there are, you know, several messages that go back and forth, and there's a two-phase commit going on in the journal and so forth, but it works; it's just slower. That's one of the reasons why we try to maintain a coarse subtree partition, so that most renames aren't across metadata servers; they tend to be localized in the same part of the hierarchy. Yeah.
Yeah, the question is how you deal with having heterogeneous storage in the same cluster. Yes, so there are a couple of different ways to deal with that. One is that the CRUSH hashing algorithm essentially lets you weight each device, and that determines proportionally how much data each one gets. So if you have drives that are twice as big as other drives, you just set the weight twice as high, and they get twice as much data, and twice as much I/O. So that's the first answer, but that doesn't really deal with different performance characteristics.

That's a bit more tricky. The answer there is that the RADOS object model lets you create different pools of storage, so you might create one pool of storage that's backed by, you know, slow SATA disks and put one type of data there, and you might create another object pool that's backed by flash or something and put other data there. And then, in the file system, you can say this particular directory is mapped to this pool of storage.
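In the current CephFS this directory-to-pool mapping is exposed through layout virtual xattrs; a hedged sketch (the pool and path names are invented, and the xattr mechanism is how it works today rather than necessarily how it was configured in 2012):

```python
import os

# Map everything created under this directory to a specific RADOS pool.
scratch = '/mnt/cephfs/scratch'          # assumed CephFS directory
os.setxattr(scratch, 'ceph.dir.layout.pool', b'sata-pool')

# New files inherit the layout; their objects will be placed by CRUSH into that pool.
print(os.getxattr(scratch, 'ceph.dir.layout.pool').decode())
```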
So, you know, /tmp has one replica and it's over here on this crap hardware, and then /home has four replicas and it's over here on the super-fast storage, or whatever it is. And the final thing you can do is, in the CRUSH rules that specify how your data is distributed, you can say something like: I want three replicas of all my data; I want the first replica to be on this fast tier of storage that's servicing both the reads and the writes, maybe it's, you know, SAS or flash or whatever; and I want the additional replicas to be on this slow tier that's all SATA and is only getting writes, unless one of the front-end nodes fails, in which case we fall back to it, but that's the exceptional case. So there are several different games you can play with it.
There are several different ways to build the CRUSH hierarchy. You can have an administrator that sort of decrees: this is the CRUSH hierarchy that I want; I've meticulously figured out how it should be constructed, mapped it all out, and said this is the map to use. The other way is that you can tell each node, in the ceph.conf file, which rack and which row and which host it is, and then the startup script, when it starts up, will say: okay, I'm starting OSD, you know, 712; update the location for this OSD in the CRUSH map to be in this part of the hierarchy, in this row, rack, host, whatever. So it'll be placed at the right point, and then, when it boots up and starts getting allocated data, it'll be serving from the right location. That's sort of the direction we're moving with that.

There's a lot of work going on right now on improving integration with tools like Chef and Juju and Puppet and all those sort of dev-ops-y deployment tools, to make it extremely painless to deploy this on thousands of servers. And that's one of the things: as long as you can tell each host sort of what its row and location is and so forth, then, when you deploy OSDs, they'll dynamically allocate OSD IDs, put themselves in the hierarchy appropriately, and start up automatically and so forth.
The question is about geographic replication. Yes, there are sort of two different projects there that are both on the roadmap, but they're a bit of a ways out. One of them is sort of disaster-recovery type replication, where you just want to have an asynchronous mirror that's streaming off to another location, so that if the primary cluster fails, you can have, you know, a less-than-five-minute-old copy somewhere else that's in a consistent state. And that's sort of the easier of the two.