From YouTube: CephFS in Jewel, Stable at Last
Description
The Ceph upstream community is declaring CephFS stable for the first time in the recent Jewel release, but that declaration comes with caveats: while we have filesystem repair tools and a horizontally scalable POSIX filesystem, we have default-disabled exciting features like horizontally-scalable metadata servers and snapshots. This talk will present exactly what features you can expect to see, what's blocking the inclusion of other features, and what you as a user can expect and can contribute.
Hello. Is my audio working? Yes, it is working. Okay, excellent. So, I'm Greg Farnum. I'm the CephFS tech lead; I've been working on the project for, wow, like seven years now, and I'm here to talk about stable CephFS in the upstream Jewel release.
I realize there have been a lot of Ceph talks, but I haven't seen many that talked about how Ceph actually works. I'm going to blaze through that pretty quickly, and then we're going to talk about CephFS:
what actually works in the upstream stable release; all these things that you might have heard about over the past many years that aren't done yet, so that you know what to expect; and some pain points that I expect people to see, or maybe not, and I'd like to know if you don't see them.
So, Ceph is built on top of RADOS, the Reliable Autonomic Distributed Object Store. It started as a long-term research project at UC Santa Cruz, where Sage did his PhD thesis, and it's now supported by Red Hat and a whole bunch of other people.
I apologize if I missed anyone; I don't actually keep track anymore. So it's, you know, a bunch of people providing commercial support for this open source upstream project, with whatever their downstream spins on it are. In the Ceph project, we have RADOS at the bottom. That's sort of our base storage layer that provides all the primitives that all the other projects use to build up their services.
There's the librados API library that allows clients in the system to talk to the actual storage cluster, and then we have the RADOS Gateway, which is a RESTful S3- and Swift-compatible object storage service; the RADOS Block Device, which is a virtual block store; and CephFS. For a long time we've been calling those first two awesome, and CephFS almost awesome, and now CephFS has many awesome things, so I'm very excited about that.
So within the RADOS cluster, you have a whole bunch of servers. Some of them will be monitors, the MONs, and then you'll have OSDs, or object storage daemons, and the application just talks to whichever systems it needs to. An object storage daemon is a regular Linux process running, at the moment, on top of a Linux file system on top of a disk.
There are experimental and developmental backends in progress that strip out the Linux file system entirely and just run on the disk directly. Within the cluster you'll have tens to tens of thousands of OSDs. These provide the actual data storage, and unlike in many clustered storage systems, each of the OSDs is intelligent and, with a very small amount of data, works together with the others to maintain the replication and consistency of objects. So it's not like you have a central manager. We have monitors, but they're not going around saying, 'hey, OSD 5,
you need to push data to OSD 3.' The monitors maintain a very small amount of state, and that state is the cluster membership: saying, you know, we have OSDs 0 through 10,000, and all of them are up except for OSDs 5, 8, and 9,873. The purpose of the monitors is just to say who's alive, who's dead, what actually exists, and sort of what the rules are for how data goes into the cluster.
So when you want to read or write an object, you need to find out where it is. There are a couple of different strategies for doing that. In a lot of storage systems you just have some kind of central service that says, 'hey, object foo is over on these storage nodes,' but that's sad, because it means you incur the lookup latency every time you want to access an object, and because it means you need a storage server that can hold the locations of all the objects.
So within RADOS we use a calculated placement algorithm. It's called CRUSH: Controlled Replication Under Scalable Hashing. I think the important part about it is that it's a mathematical algorithm which takes a very small amount of input: it takes in the map that the monitors maintain and the name of the object that you want to look up, and it says, 'okay, that object right now, according to this map, lives on these two or three OSDs.' If OSDs get marked as failed in the cluster, then CRUSH automatically says, 'okay, I know the data doesn't live there;
now it lives over here in these different places.' It's a very fast calculation, and it's stable, so when you run it from different clients you get the same results. And it allows you to do much cleverer things than sort of normal consistent hashing: you can do replication across racks, or across machines, or across data centers. You can set up your own rules, so that maybe you want to have power supplies or power circuits as failure domains, maybe you don't. It's very configurable.
Within the storage system you have a bunch of different namespaces called pools. That's relevant to CephFS because, as you'll see later on, we have a data pool and a metadata pool, but you might also have a pool for your RADOS block devices and a pool for your RGW objects. Each of those pools is sliced up into shards, which we call placement groups.
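As a rough illustration of the calculated-placement idea, here is a toy Python sketch. It is not the real CRUSH algorithm, just the shape of it, with invented pg_num and replica counts: hash the object name to pick a placement group, then make a deterministic pseudo-random choice of OSDs from the current cluster map, so every client holding the same map computes the same answer without asking a lookup service.

```python
# Toy illustration of calculated placement (NOT the real CRUSH algorithm).
# Any client holding the same cluster map computes the same mapping; there is
# no central lookup table, and a new map yields a new mapping.
import hashlib
import random

def place(object_name, osds_up, pg_num=128, replicas=3):
    """Map an object name to (placement group, list of OSDs holding it)."""
    h = int.from_bytes(hashlib.md5(object_name.encode()).digest()[:4], 'little')
    pg = h % pg_num                      # object name -> placement group (shard)
    rng = random.Random(pg)              # deterministic choice, seeded by the PG
    return pg, rng.sample(sorted(osds_up), replicas)

cluster_map = set(range(8))              # OSDs the monitors currently report as up
print(place('foo', cluster_map))         # every client computes the same answer
print(place('foo', cluster_map - {3}))   # marking an OSD out changes the result
```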
Those placement groups are what actually get moved around on the OSDs; they're called that because they move as a unit. So when an OSD fails, you don't move every object to different nodes in the cluster individually; you move the placement groups as units, and the way that works is through the peering process. The monitors maintain OSD maps describing the state of the cluster, and each of those OSD maps is numbered with an epoch; the epochs just increment forever.
So in this example, we've just pushed out a new map epoch, and the placement group has members 11, 5, and a third OSD. Let's say that OSD 11 was not previously part of the set of OSDs serving this placement group, but OSD 5 was. So OSD 5 gets the new epoch of the map and says, 'okay, 11 is now a member of this placement group, and he wasn't before, so
I'm going to tell 11 that he needs to come up, that he is a member of this placement group.' And 11 then gets that notification and says, 'okay, well, now I need to talk to everybody so that I can get all the data for this PG 42.' So the OSD says, 'all right, let me go back; I have a history of the maps, and I want to see who all is responsible for storing this data.'
A
But
that's
not
you
know
the
specific
reason
doesn't
matter.
It
goes
through
the
same
process
for
basically
any
kind
of
cluster
change
and
he'll
go
and
you'll
say
all
right.
So
in
this
case
maybe
OST
11.
Actually
just
the
placement
groups
have
logs.
So
the
OST
11
might
just
say:
hey
five
I
need
everything.
That's
changed
in
seatback
1984,
because
hey
I
was
in
charge
of
it
back
then,
but
maybe
instead,
it's
more
complicated,
and
so
you
know
what
we
don't
actually
want
to
talk
about
this
right
now.
Sorry. Okay, so the librados API provides access to sort of the functionality of the RADOS cluster. It is an object-oriented API: you say, 'I want to do this operation, or this set of operations, on object foo.' But it's a very rich API; it's not just put, get, and delete. You can say, 'I want to write these 100 bytes at offset 57'; you can say, 'hey, if the object has this version number, I want to set this xattr to something different.'
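To make that concrete, here is a minimal sketch using the python-rados bindings; the pool name 'data' and the config path are assumptions, and error handling is omitted.

```python
# Sketch: a few librados operations from Python (package: python-rados).
# Assumes a reachable cluster, a client.admin keyring, and a pool named "data".
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('data')          # a pool is a namespace of objects
    ioctx.write_full('foo', b'hello world')     # create or replace the whole object
    ioctx.write('foo', b'X' * 100, offset=57)   # rich API: 100 bytes at offset 57
    ioctx.set_xattr('foo', 'color', b'blue')    # per-object extended attribute
    print(ioctx.read('foo', length=16, offset=0))
    ioctx.close()
finally:
    cluster.shutdown()
```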
Similarly, we have the RADOS Block Device. It runs as a user-space library inside of QEMU/KVM, or as part of the Linux kernel, and it translates block device commands at those layers into operations on the RADOS cluster. It's got all kinds of great features; it's the number one OpenStack Cinder storage solution. You should use it, hooray. All right, so: CephFS. And feel free, by the way (I usually start out
my talks this way): if you have any questions, just raise your hand or get my attention during the talk, because I'm not quite sure how much time we'll have left at the end. Maybe we'll have gobs of time, I've re-jiggered this a little bit, maybe we'll run out.
Okay, so CephFS. It is, in fact, a file system. Hooray, everyone loves a scalable file system. You mount it from multiple clients; you can write from client A, and read the data that client A wrote from client B.
It is a Linux, basically-POSIX file system, in the same way that all Linux file systems are basically POSIX. It's not close-to-open semantics like NFS; it's like, you know, you write to your ext4 volume and you read from your ext4 volume, and it works that way. So that's sort of the catch-all: it's got coherent caching between all the clients and the servers. Your Linux host mounts it either via the upstream Linux kernel module or via our user-space client, ceph-fuse.
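For a feel of the client side, here is a hedged sketch using the python-cephfs (libcephfs) bindings; the same thing works as plain file I/O on a kernel or ceph-fuse mount. Paths are made up and method signatures can differ slightly between releases.

```python
# Sketch: talk to CephFS through libcephfs from Python (package: python-cephfs).
# Assumes a running cluster and a readable /etc/ceph/ceph.conf; check
# `pydoc cephfs` for the exact signatures shipped with your release.
import cephfs

fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
fs.mount()                                    # attach to the filesystem
fs.mkdir('/demo', 0o755)                      # metadata operation: goes to the MDS
fd = fs.open('/demo/hello.txt', 'w', 0o644)   # file data goes straight to the OSDs
fs.write(fd, b'written from client A\n', 0)
fs.close(fd)
print(fs.stat('/demo/hello.txt'))             # another client sees this immediately
fs.unmount()
fs.shutdown()
```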
So the client gets the cluster maps: here's your metadata server map, and here's your OSD map. Then the client talks to the metadata server for all metadata updates: for saying, 'hey, what does the root directory look like,' and 'what are the contents of my home directory,' and 'hey, I want to change the mtime on this file.' And it talks directly to the OSDs for all data updates, so for all writes. The filesystem is very consistent, and that also means that under many circumstances it's much, much faster than you'd expect from a POSIX filesystem.
If you have a bunch of different clients mounted, but they each have their own sort of hierarchy, like they're each in their own home directory and that's the only thing that user cares about, then they can just cache that entire tree locally on the client side. All the stats will be satisfied locally from the client side, without going over the network or anything. But if they are sharing things, then sort of the server will say, 'hey, I'm making a change;
your cache is invalid; throw away this information.' So that means that clients can be very fast when they're the only one working on stuff, but if there are people sharing data, then they never see anything stale. There's no opportunity for any kind of split-brain that you might have seen in other storage systems; it just works. Scaling the data path within CephFS is pretty trivial: all the data is stored in RADOS, and the file system clients write directly to RADOS.
You scale it the same way you do an ordinary RADOS cluster. If you want more throughput, you can put in faster SSDs; you might be able to say, 'hey, I'm writing files and they're split into four-megabyte chunks, but these are 10-gigabyte files, so I want to use 64-megabyte chunks when I'm splitting them up across the RADOS cluster'; sort of all the other tricks
you want, at least until you're limited by latency instead of throughput, at which point, you know, we need to make the OSDs faster, and that's being worked on.
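The chunking described above is the file layout. On a mounted CephFS you can adjust it per file (or per directory) through the ceph.file.layout virtual xattrs before any data is written; a hedged sketch, with the mount point as an assumption:

```python
# Sketch: ask for 64 MiB objects for a big file via CephFS layout vxattrs.
# Assumes CephFS is mounted at /mnt/cephfs (kernel or ceph-fuse) and that the
# file is still empty; layouts cannot be changed once data has been written.
import os

path = '/mnt/cephfs/bigfile.dat'
open(path, 'w').close()                                          # empty file
os.setxattr(path, 'ceph.file.layout.object_size', b'67108864')   # 64 MiB objects
os.setxattr(path, 'ceph.file.layout.stripe_unit', b'67108864')
print(os.getxattr(path, 'ceph.file.layout').decode())            # inspect the result
```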
Scaling metadata is a little harder, but we do have some good tricks. First of all, unlike in some storage systems, you don't store the entire file system hierarchy and namespace in the metadata server's memory.
When you want to access a directory, the metadata server goes and looks it up off of disk, caches it in its in-memory cache, and throws it away when it runs out of room. That means that your metadata server's cache can be sized for how much active data you have: if you have, you know, 100 million one-gigabyte files, but you only ever look at 50,000 of them in a day, the cache only needs to hold those.
Second awesome thing within CephFS: we actually have a security model now. For a long time we didn't, and, you know, there's still a ways to go, but we do have a way to deny clients certain levels of capabilities. Clients start out with nothing at all; it's a capability model, and you grant accesses. You can say, 'I want this client to be able to, you know, read the entire file system but not write to it,' or you can say, 'hey, I want this client to be able to read and write only to,
you know, /home/client-a, or whatever.' You can say that they are allowed to act only as file system user ID 98, or 1017, or whatever, for real security. The MDS capabilities control only what happens on the metadata server: what metadata they're allowed to look at and what metadata they're allowed to change. They do not impact what actual file data they can read and write from the OSDs.
These capabilities are reasonably secure: they're encrypted by the monitors and can't be tampered with by the client; they just get passed along when the client opens up sessions to the metadata servers and the OSDs, and they say what the client is allowed to do.
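As an illustration, restricting a client to one subtree looks roughly like this; a hedged sketch driving the ceph CLI from Python, where the client name, path, and pool are invented and the exact capability syntax varies a bit between releases.

```python
# Sketch: mint a CephFS client key that may only read/write under /home/client-a.
# The entity and pool names are examples; cap syntax is release-dependent.
import subprocess

subprocess.run([
    'ceph', 'auth', 'get-or-create', 'client.client-a',
    'mon', 'allow r',
    'mds', 'allow rw path=/home/client-a',   # metadata access limited to this subtree
    'osd', 'allow rw pool=cephfs_data',      # data access limited to the fs data pool
], check=True)
```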
Yep, okay. Another awesome thing: we have features for doing scrub and repair on the file system.
Now, a couple of years ago, people would test the file system and they would say, 'usually it works great, but I had this crash and now my MDS won't start up, and it says that there's a journal error.' And we would be like, 'well, okay, can you zip up the journal for us and send it to us?' And we'd look at it, and we're like, 'all right...'
So the first thing we have is what we call forward scrubbing. With forward scrubbing, you can give the metadata server a path and say, 'I want to scrub from here,' and it will go off in the background, start at whatever path you gave it, and say, 'all right, what do all the files in here look like? Oh look, I have some directories; let me go down into the next directory,' and so on; and when it reaches a leaf directory, it looks at all the files.
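On releases that ship it, forward scrub is kicked off through the MDS admin socket; a hedged sketch, where the daemon id is an assumption and `ceph daemon mds.<id> help` lists what your build actually supports.

```python
# Sketch: ask the active MDS (assumed to be "mds.a") to forward-scrub everything
# under / in the background. Run on the host that has mds.a's admin socket.
import subprocess

subprocess.run(['ceph', 'daemon', 'mds.a', 'scrub_path', '/', 'recursive'],
               check=True)
```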
For repair, you say: take everything that's in the journal and flush it down, and then you go to the data-scan tool. The data-scan tool makes use of the fact that RADOS is an object store. Unlike on your normal hard drive, when we're doing a file system repair we don't need to crawl over each block and say, 'does this block look like maybe it's an inode? I think maybe it is, so I'm going to try and reclaim this file and put it in lost-and-found.'
Instead, we can iterate through all the objects in the RADOS pool and say, 'hey, we know what the object names look like: this is, you know, a file object, and so I know that I have a file whose inode number is 1776.' We do that iteration using some of our helper classes; we examine the object name and, presuming it's a file, we send the information about that object back to where the file is rooted.
So the first object in every file has a special piece of data on it called a backtrace, and a backtrace is just the path of the file, but it's versioned, so it can be stale.
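If you're curious, you can see that piece of data from plain librados; a hedged sketch, where the pool name and object name are invented and the backtrace itself is a binary-encoded blob (commonly stored in an xattr named 'parent') rather than readable text.

```python
# Sketch: peek at the backtrace stored on the first object of some file.
# Data objects are named "<inode in hex>.<object index>"; this inode is made up.
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('cephfs_data')
raw = ioctx.get_xattr('10000000000.00000000', 'parent')   # first object of the file
print(len(raw), 'bytes of encoded backtrace')
ioctx.close()
cluster.shutdown()
```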
So we say, 'hey, you know, once upon a time we were in the home directory, and it was version 2 of the home directory, and then it was in the greg directory at version 9, and then it's the file foo.' And so if we find this object 1000.1, we would say, 'hey, there's a 1000.0'...
well, I've gotten my numbers confused, anyway. For that zeroth object we do a second pass that goes and looks at just the first object of each file, and it will say, 'hey, I believe that I am in the greg directory, which has this inode number, so I'm going to send off my information to that directory,
saying I exist.' And with that we can reassemble it. It might be slightly out of date, but we can reassemble a tree with everything in the cluster that we know is coherent. And because we are running directly against RADOS through the librados API, and running part of the code on the OSDs, we can do this in parallel across the cluster. It's not one serial worker; we can spin up a whole bunch of them on different machines.
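That parallelism is exposed through the data-scan tool's worker options; a hedged sketch, with the pool name assumed and the flags worth checking against your release's disaster-recovery documentation.

```python
# Sketch: run worker 0 of 4 of cephfs-data-scan; the other three workers can run
# the same phases on other machines. All scan_extents workers should finish
# before any scan_inodes worker starts.
import subprocess

DATA_POOL = 'cephfs_data'   # assumed data pool name

for phase in ('scan_extents', 'scan_inodes'):
    subprocess.run(['cephfs-data-scan', phase,
                    '--worker_n', '0', '--worker_m', '4', DATA_POOL],
                   check=True)
```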
All right, so more awesome things: we have a hot-standby MDS. Nothing ties metadata to a particular server. As I've sort of implied, we keep a journal of the metadata mutations in RADOS, and we keep the actual metadata file and directory objects in RADOS. So if we want to, we can just move the metadata server over, and the way you'd do that, if you were being polite, is you'd say, 'hey, turn off this metadata server, turn on this other one;
you are running as that guy now.' But in particular, you can spin up as many backup ones as you want. We call these standbys and standby-replay servers, and the standby-replay ones in particular are nice, because they will actually sit around and read the MDS log and replay all the operations in memory. They don't make any writes, but they'll just run it over and over in memory and say, 'hey, did you do more operations? Let me do that operation in my memory too.' And the reason you might want to do
that is because it warms up the metadata server's cache. So if your active metadata server dies, your passive one can go, 'oh hey, I just happen to have all of the things in memory that people are interested in, and I don't need to go around and grab those hundred thousand, or million, or however many inodes off of disk in that number of I/Os; I just have them ready to go.'
So if you do have a crash, the replay is reasonably fast. You need to replay whatever amount of the metadata server log you haven't already replayed; you need to load all the necessary inodes out of the cluster, if you don't already have them in memory; and then we have a very short window, I think it defaults to 30 seconds, where clients can come back and say, 'hey, I had some operations that you haven't acknowledged.'
So that's the end of the happy things for the moment. There are some parts of CephFS that you might have heard about in the past that are not ready yet. One of those that's almost awesome is having more than one active MDS server. If you've been in a talk about CephFS, or maybe even just Ceph, in the last six years, you've probably seen a slide that looks sort of like this,
where we say, 'hey, no metadata is stored on the MDS servers, so we can just split up the metadata between more than one active MDS server.' And, you know, it's great; it's cooperative partitioning. Each server keeps track of how hot the metadata it's working on is, and if one of them gets too much hotter than the others, it will migrate subtrees in order to keep the heat distribution across the cluster similar. This is pretty cheap.
We've been building repair tools so that, if there's a disaster, we can get you your data back, so that you can run away as quickly as possible, or come back for another bruising because it wasn't our fault, whatever. But in general, MDS failure recovery is a lot more complicated if you have more than one active MDS. The picture I painted when you have a single one is pretty simple, but when you have more than one MDS, operations like renames that might cross directories get a lot more complicated.
Also almost awesome: directory fragmentation. Directories are generally loaded from disk as a unit, which means that if you have a hundred-thousand-file directory, which you can have (I mean, depending on your workload, that's not unreasonable), then whenever you access one file in that directory, the MDS goes off and gets back all hundred thousand of them into its cache and says, 'hey, now I have the file I want.
Oh, but also, you know, my cache size is only a hundred thousand inodes, and so I had to throw everything away.' Which means that if you're doing repeated accesses on one very large directory, it can be very sad. Or, once we have multiple active MDS servers running, maybe you have one really hot directory and you want to split it up across the different servers for faster throughput. So we have a feature where you actually can split up directories into multiple objects; that's the fragmentation part.
It probably works; honestly, it's just not tested well enough, so we have it turned off by default. We need to turn it on in our nightly testing: we don't have a lot of large-directory workloads, and we don't have anything specifically going in and, like, making a large directory, making sure that the split works the way we expect it to, making some change, making sure that things keep working.
It's basically just a QA workload and, honestly, it's the kind of thing we could put off, so we put it off. Also almost awesome: snapshots. Everyone likes snapshots, and our snapshots are almost really, really awesome. Instead of being divided into subvolumes and taking snapshots of a subvolume, you can just say, 'hey, I want a snapshot of that guy's home directory; I want a snapshot of this person's home directory; you know what, I want a snapshot of just the log directory inside of that guy's home directory.' It doesn't need to be the whole thing.
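Mechanically, taking one of those snapshots on a mounted CephFS is just creating a directory inside the hidden .snap directory of the subtree you care about; a sketch, with the mount point assumed and snapshots assumed to be enabled on the filesystem.

```python
# Sketch: snapshot just one user's log directory on a mounted CephFS.
# Assumes the filesystem is mounted at /mnt/cephfs and snapshots are enabled.
import os

target = '/mnt/cephfs/home/greg/logs'
os.mkdir(os.path.join(target, '.snap', 'before-cleanup'))   # take the snapshot
print(os.listdir(os.path.join(target, '.snap')))            # list snapshots

# The snapshotted files are readable under .../logs/.snap/before-cleanup/, and
# removing the snapshot is just rmdir on that same directory.
```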
And it's really cool that the file data is stored using RADOS object snapshots; that's a primitive RADOS has, and it's very efficient at that level. But it makes the directory structures and the inode structures a lot more complicated, because you can do those sorts of snapshots inside of existing snapshots, and you can rename files from inside of a snapshot to outside of a snapshot.
We need to keep tracking all the metadata to keep things consistent, and it's just complicated. So every so often one of our developers will go off and be like, 'hey, I wrote a bunch of new snapshot tests, I found a bunch more new bugs, I fixed them all, it's passing now.' But then, you know, he writes more tests and it's like, 'oh, we found more bugs.'
So we need lots of testing, lots and lots of testing. And then, especially when you add this in with multiple active metadata servers, you could have snapshots where part of the snapshot data is on metadata server A and part of it is on metadata server B, and that makes things even more exciting from a coding perspective, and from sort of a recovery perspective, when you have snapshot operations that are happening but one of the servers fails and you need to recover stuff.
Really, the only thing missing here, or the biggest thing missing here, is just testing, and of course back in February we didn't want to turn it on for a long-term-support release, and we didn't want to make such a brand-new feature part of our stable announcement. We do have a few very small known issues under edge cases; I think for all the ones we have, we actually have pull requests pending, they just aren't done yet. And the security model here is just a little bit iffy.
But this will probably be turned on for Kraken, which is our next release in about six months, unless we come up with something very surprising; and I think that's the first time I've said that out loud, so there we go. All right, so: some pain points that you might see if you deploy CephFS in testing, or do something with it that we weren't expecting. Number one is file deletion.
File deletion works; don't get scared. Like, you delete a file, it does go away. But, you know, a file can be very large; it can consist of thousands of RADOS objects, and, depending on how fast your cluster is, it might take a lot of time to actually send out that number of operations and do the actual deletes on the disks.
So when you unlink a file from the client side, it sends an operation to the MDS, and the MDS says, 'all right, this file is in the deleting state; I unlinked this file, and hey, no one else in the tree has a link to it, so I'm deleting it.' In fact, what we do is we move it into what we call the stray directory, saying that it's not part of the filesystem anymore, and then later we say, 'oh hey, we can in fact delete this file.'
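If you want to keep an eye on that queue of unlinked-but-not-yet-purged files, the MDS exposes counters for it over its admin socket; a hedged sketch, with the daemon id assumed and counter names that can shift a little between releases.

```python
# Sketch: check how many "stray" (unlinked, not yet purged) inodes mds.a holds.
# Run where mds.a's admin socket is available.
import json
import subprocess

out = subprocess.check_output(['ceph', 'daemon', 'mds.a', 'perf', 'dump'])
counters = json.loads(out)
print(counters.get('mds_cache', {}).get('num_strays'))
```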
We do have fixes in progress: we have a pull request pending that reduces the memory pressure a great deal, so that will help. And one of the more urgent things in our task queue is that we need to build a proper throttling system, so that we can say, 'hey, I know this file is being deleted, and by the way, I just finished deleting this one, so give me the next one from the queue.'
It's not terribly complicated work; it'll probably be done in the not-too-distant future, but it needs to happen. So, you know, if you're deleting lots and lots of things, that's something to be aware of. The second major pain point is client trust. Sort of inherent to the CephFS protocol right now is that clients are, on some level, trusted.
We have coherent caching, which means that if a client has information cached about something, we can't change it until the client has told us that it's dropped the cache.
Now, if a client goes unresponsive, then we will time it out after 30 seconds or whatever. But if it keeps on saying, 'yes, I'm alive, but I can't give you this capability back,' or 'I can't give you the locks on this information back, because I'm still writing data out and I can't release it until I'm done writing this data,' then we can't kill it. So clients can deny writes to anything
they can read. And because the data lives on the OSDs, which have their own security capabilities, anything the clients can read or anything the clients can write in the OSD cluster, they can trash. You can fix that by giving clients separate namespaces; you can fix the write-denial by not sharing files across tenants; and sort of the biggest one is that clients can mount a DoS attack against the MDS they attach to: they can just keep on saying, 'hey, create this directory at a deeper and deeper level, create these hundred thousand files,' whatever.
Once we have multiple file systems in a cluster, then that will work, but at the moment there's sort of a minimum level of trust you need to have in all of the clients which are connecting directly to your RADOS cluster or directly to your Ceph filesystem. If you don't trust your clients, you should put them through Samba or Ganesha as a gateway instead. And finally, the final pain point is debugging live systems. We have some pretty cool tools: you can go to the metadata server and dump every operation that's currently in flight;
we can see what clients are connected to a metadata server and some of the information about what they've got going on; and we can dump the contents of the metadata server's cache and get a lot of useful information about the state of the system out of there.
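Those live on the MDS admin socket; a hedged sketch of poking at them, with the daemon id assumed (and `ceph daemon mds.<id> help` as the authoritative list for your version).

```python
# Sketch: the main live-debugging entry points on an MDS admin socket (mds.a).
import subprocess

for cmd in (['dump_ops_in_flight'],                   # metadata ops currently in flight
            ['session', 'ls'],                        # connected clients and their state
            ['dump', 'cache', '/tmp/mdscache.txt']):  # write the cache contents to a file
    subprocess.run(['ceph', 'daemon', 'mds.a'] + cmd, check=True)
```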
But we can't say what's happening on one specific inode. We don't have a real great way, if you have an operation on a client that's stuck, to answer, 'why is this rename taking forever?'
These are the ones we have on the list; we have some stuff in progress for them, but it's sort of... we need to get it out and see what people actually want, because I don't know: I'm a developer, my file systems last for, like, 12 minutes before I tear them down and start new ones, and it's great. We need to see what diagnoses people actually need before we can start building the appropriate tools to track them.
[Audience question] So, the message I've heard for a long time is 'don't use it in production,' and now you're saying it's awesome. But is it awesome, or is it almost awesome? I'm still a little shaky on production usage.
The upstream community is leery of using the words 'production ready'; you can talk to downstream people who provide actual support for that, because in the upstream community it's all, you know, 'hey, I have this problem,' and someone on the mailing list is interested in it. What we're saying is that it's stable. That means that we are really very confident that, if you run the system the way we tell you to run it, you won't have problems, and you have to not turn on the features that I've said are almost awesome.
You have to go set flags via the monitor to turn those on; they're locked out, so you have to acknowledge that we don't think you should turn them on yet, and that if you turn them on, you might lose data. They also irrevocably mark your cluster state, so that if we're debugging things, we know these have been turned on.
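For example, on Jewel, turning snapshots on means explicitly telling the monitors you accept the risk; a hedged sketch (the exact subcommand has moved between releases, and similar acknowledge-the-risk flags gate multiple active MDSes and directory fragmentation).

```python
# Sketch: opting in to an "almost awesome" feature (snapshots) on a Jewel cluster.
# The command form changed in later releases; treat this as illustrative only.
import subprocess

subprocess.run(['ceph', 'mds', 'set', 'allow_new_snaps', 'true',
                '--yes-i-really-mean-it'], check=True)
```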
But if you run it in the configuration that we tell you is stable, then we're very confident that you aren't going to lose data. It might or might not have the performance characteristics you're looking for, which is sort of where the concern is, right? Like, depending on what you want out of a file system,
performance is sort of part of the basic requirements, but very different people have very different needs about what exactly 'performant' means and in which ways. So we're saying it's stable: we're not going to lose your data, we're very confident we're not going to lose your data, and if some disaster does befall you, then we've built the recovery tools so that we can get your data back and let you move on to something else. That's where we are.
Is erasure coding supported under CephFS? The Ceph file system expects a replicated pool, so no erasure coding right now. There are RADOS features coming, in Kraken if we're very lucky, maybe in L, which will allow overwrites on erasure-coded pools, and then I think it should be fine. And we don't like to recommend cache tiering, but that does make it look like a replicated pool from CephFS's perspective; it doesn't care.
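For reference, the cache-tier arrangement described above looks roughly like this; a hedged sketch where the pool names and PG counts are invented, and real deployments need hit-set and sizing tuning beyond what's shown.

```python
# Sketch: put a replicated cache pool in front of an erasure-coded base pool so
# that clients (including CephFS) effectively talk to a replicated tier.
import subprocess

def ceph(*args):
    subprocess.run(['ceph'] + list(args), check=True)

ceph('osd', 'pool', 'create', 'cold-ec', '64', '64', 'erasure')   # EC base pool
ceph('osd', 'pool', 'create', 'hot-cache', '64')                  # replicated cache
ceph('osd', 'tier', 'add', 'cold-ec', 'hot-cache')
ceph('osd', 'tier', 'cache-mode', 'hot-cache', 'writeback')
ceph('osd', 'tier', 'set-overlay', 'cold-ec', 'hot-cache')
```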
The question is: if you have standbys or standby-replays in the map, and the active metadata server dies, do they take over automatically? And the answer is yes. The active metadata server maintains a heartbeat connection with the monitors (I think the default timeout is 30 seconds, but you can tune it); the monitors declare it dead, and they say, 'oh hey, we have someone else who can take over right now,' and they push out a new map that says this one is in charge.
So, we don't have really great performance numbers. The information we've had most recently says that, depending on what you're doing, you can expect on the order of five to ten thousand metadata server operations per second. Keep in mind that's not like an HDFS namenode, where a stat counts as an operation; that's operations that change the state of the system, because otherwise the clients have it cached. And an inode plus a dentry takes, depending on what you're doing, between about 2 and 4 kilobytes of memory.