From YouTube: 2015-JUL-30 -- Ceph Tech Talks: CephFS
Description
A detailed look at CephFS. What it is, how it works, and what the development roadmap looks like.
Finally, things like permissions and directories are intrinsically useful: they're not just holdovers. If you build a new system based on objects, you might find yourself implementing some kind of hierarchical concept on your own. So sometimes, if it's what your application needs, having a file system that has those things baked in is very useful. That said, I wouldn't use a file system for everything.
Some applications that expect a file system interface use it badly, in a way that assumes the latencies associated with local file systems. The classic example is people running a lot of 'ls -l' calls from their applications, which will go through and stat every file in a directory. That worked fine on a local file system, but it presents challenges for us when implementing a distributed file system: providing that kind of functionality without unacceptable latency overhead. And finally, the statefulness of file systems is challenging, whereas an object store is much simpler in that respect.
That limits your ability to talk to all the existing applications that have been developed and debugged against other file systems, so implementing a POSIX interface makes us compatible with a whole lot of software. We get great scalability in our data storage because we store our file data directly in RADOS and inherit all of its useful properties.
We get scalability in our metadata by allowing users to have multiple metadata servers that act as a cluster. We also add some functionality on top of the basics you would expect from a POSIX file system: we have snapshots, which can be taken at a per-directory level, and we have recursive statistics, which let users see statistics at a per-directory level without having to recurse down the file system.
The way all of that is built is outlined in a fair bit of detail in the paper I reference at the bottom of this slide, which is Sage's from way back in 2006. Some of the stuff in there has changed since, but a lot of it is still relevant, and that longevity is kind of amazing when you consider how rapid the development on Ceph and CephFS is. The project is actually over 10 years old now, and the file system was one of the earliest parts to exist.
So this diagram will probably be a little familiar to people who've been to previous talks about RADOS as well. The little turquoise squares are OSDs, and the new part of this is the little squares with an 'M', which are our metadata servers, or MDSes. At the top of this diagram you have your client host, which is running some CephFS client code. When we say client, we essentially mean a mount: when you have /mnt/cephfs or something like that,
that's a client, and the client is sending two types of information to the cluster: data and metadata. Metadata is things like opening a file or getting the attributes of a file, and data is the actual reads and writes within a file. As I mentioned, the data goes directly to RADOS, so it doesn't have to go through the metadata servers; there's no extra bottleneck there. And there are multiple metadata servers within the cluster.
The way that we store the file data within RADOS is worth describing. In a similar way to RBD and RGW, we support striping and chunking of the data in CephFS. We already have a unique identifier for every file, and that's its inode number, so objects are named after the inode, followed by a period and then the offset within the file, where the offset is a count in units of the chunk size
that's selected for the file, which is 4 megabytes by default. Users can change those settings on a per-file or per-directory basis, and they do that using virtual extended attributes, which are accessible using any existing system tools. So you don't need special Ceph-specific tools on the client to do that. And in addition to specifying the striping of a file, the layout lets you say which RADOS pool you want to store the data in, so you can have multiple RADOS pools in use for CephFS.
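As a concrete illustration (a hedged sketch of the virtual extended attribute interface, not taken from the slides; the pool name cephfs_data_ssd is made up for the example), reading and changing a layout from a client mount looks roughly like this:

    # Read the layout of an existing file (run on a CephFS mount)
    getfattr -n ceph.file.layout /mnt/cephfs/myfile1

    # Change the layout for new files created under a directory:
    # 8 MB objects instead of the default 4 MB...
    setfattr -n ceph.dir.layout.object_size -v 8388608 /mnt/cephfs/mydir
    # ...and send their data to a different (hypothetical) pool
    setfattr -n ceph.dir.layout.pool -v cephfs_data_ssd /mnt/cephfs/mydir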
We put the metadata into directory fragment objects within RADOS, and these are things that take advantage of RADOS's omap interface: RADOS lets you create an object and then use it as a key-value store, an omap. In CephFS we want to support really large directories, so we don't want to create arbitrarily large omap objects. So we break directories up into fragments based on a hash of the directory entry names within a directory, and within that omap
the keys are the file names, or their dentry names, and the values are the directory entries, which in CephFS also include the inode. So we embed the inodes directly with the dentries, so that when somebody retrieves a directory they get all the data they need right there, and that reduces latency when somebody is traversing the file system. The takeaway from that is that there is locality in the way that we store files within a directory.
This is a simple example of what kind of objects you end up with after creating a directory and a file within your CephFS file system. We create a directory called mydir and write 12 megabytes of data to myfile1 within it. On the left-hand side of this slide you see two objects in the metadata pool. These are directory fragment objects. The top one is for the root directory, the slash directory, which has a magic inode number that is already known to everybody working with the file system.
That inode number is just 1, so that directory fragment object contains a single entry for mydir. These are omap key-value pairs, and the value contains the inode for mydir, which is 10000000001 (forgive me if I don't pronounce the right number of zeros; that's just how we print them). And then for that mydir inode there is a directory fragment object again, which contains an omap key for myfile1, whose value contains an inode numbered 10000000002, and in the data pool you've got the objects for the file.
There are three because the default chunk size is 4 megabytes and we wrote 12 megabytes, and they just contain the data. The first object in a file has this extra extended attribute called parent, which contains the full path to the file as of the time it was created, and the use for that will become clear very shortly.
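A hedged sketch of how you could poke at those objects with the standard rados CLI; the pool names and the exact object names (hex inode plus chunk index or fragment id) are illustrative, based on the inode numbers described above:

    # Directory fragment objects in the metadata pool
    rados -p cephfs_metadata ls            # e.g. 1.00000000, 10000000001.00000000
    rados -p cephfs_metadata listomapvals 10000000001.00000000   # dentry -> embedded inode

    # Data objects in the data pool (3 x 4 MB chunks of myfile1)
    rados -p cephfs_data ls                # e.g. 10000000002.00000000 ... 10000000002.00000002
    rados -p cephfs_data getxattr 10000000002.00000000 parent    # the (binary) backtrace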
So this structure on disk is optimized for the lookup-by-path case, where something says: I want to open /mydir/myfile1. We can go and read the root directory fragment, find mydir, read the mydir directory fragment, find myfile1, and then go and access the objects on disk for that file. That's fairly straightforward.
In order to look up by inode number, we have this extra attribute on the first data object of files. We call it a backtrace, and we can get to that directly by inode number, because we name our data objects after the inode number, and that allows us to have support for hard links and for NFS file handles.
So, for example, if you lose your metadata pool and you want to try to rebuild your data as best you can, you do have some record in the data pool of what the path originally was for the files. And this is what it looks like when you do a lookup by inode: you initially go and read the backtrace from the data object, and then the rest of the lookup is just the same by-path process going through the directory fragments, except that the path came from the backtrace instead of having been given to us by the client.
So that's how the data and metadata are stored within RADOS. All of that work of storing the metadata is done by the metadata servers. These are daemons, much like the mons or the OSDs, written in C++, and initially, when you start up an MDS daemon on a host, it does nothing at all: it communicates with the mon and goes into a standby mode.
The MDS ranks also have some of their own data, per-rank data, such as a journal, and things like an allocation table for assigning new inodes, which is per rank as well. By storing those in RADOS we can fail over MDS ranks really quickly, because there is just no data at all left behind on the MDS servers themselves. You start a new one, it gets assigned the rank, and it picks up where the old one left off by reading all that metadata from RADOS.
The actual assignment of which piece of metadata goes to which rank is done dynamically, and it's done in terms of subtrees. So in this diagram the colors represent different MDS ranks, and the gray, MDS rank 0, starts at the root; that's where all the metadata would have been initially, and then over time, as the system is used, hot directories get reassigned to different MDSes. And if there is a particularly hot directory within a parent directory that was also quite hot, this can be recursive as well.
So you can see in this diagram that there are colors within colors. That might seem like a lot to keep track of, but in reality all the MDSes actually need to know about this situation is not where each individual inode is, but just where the subtree boundaries are. I'm just going to pause for a second and see if there are any questions in the chat. Okay, I should have said at the start: feel free to just drop any questions that pop into your head into the chat as we go.
So when the MDS is making updates to this metadata, if you had a whole bunch of clients making updates to a whole bunch of files, you would find that things like incrementing the size of a file to reflect data appended to it would generate a lot of little pieces of I/O to update the metadata objects on disk. So we don't do that. Metadata ops are written to a journal; there's a journal for each MDS rank, and when we have received some updated metadata, we've written it to the journal.
We also have that in an in-memory cache, a cache of inodes and dentries and directories. It will remain in that in-memory cache until it falls off the end of the journal, and we actually use pretty big journals: the default size of the journal is in the hundreds of megabytes, and you can make it a lot bigger than that
if you want. That's partly because it's nice not to have to worry about hurrying to evict, or rather expire, things from the journal, but it also lets us do failover even more efficiently, because after we replay the journal, which is a great big journal with a large collection of recent metadata operations in it, our cache will be warmed up with everything that was recently operated upon by clients.
Typically you want to size this for the amount of RAM in your metadata server, and you want to provision servers that have plenty of RAM for use as MDSes. The mds cache size parameter lets you control that; it's a limit expressed as a number of inodes. Controlling cache size has actually been kind of a tricky area recently, and that's because the clients have to be involved in the process too. If a client has a file open, or a file in its cache, the MDSes can't necessarily remove it from their cache.
They have to pin it as long as a client is using it, and so in order for the MDSes to shrink that cache, they have to ask the clients to shrink the client caches. As you can imagine, that's a distributed systems problem, and therefore it's hard. If you follow the mailing list, you will have seen various people asking about some of the warnings that we've added recently for clients failing to respond to capability releases and that kind of thing, and those messages are about this.
So I mentioned that the client maintains a cache. The client-MDS protocol is kind of interesting. It's implemented twice: once in our userspace client, which has a FUSE interface, and once in our kernel client, which is part of the upstream Linux kernel. The clients start up and they learn the addresses of the MDSes from the mons, so when you mount a file system you don't type in the address of an MDS, because the MDSes are completely dynamic.
The capabilities a client holds roughly summarize to: for this file, you are allowed to write to it, you're allowed to read from it, you're allowed to update the metadata for it, you're allowed to update extended attributes for it, and that kind of thing. And that means that when you've got multiple clients which are taking an interest in the same file, although we have to do this locking to maintain POSIX semantics, it's not an all-or-nothing thing.
When a failure happens, there can also be client operations that were in flight but weren't quite finished yet. That might sound like kind of a low-level implementation detail, but this stuff is worth knowing even as an administrator, because you will see the MDS going through these stages as it starts up after a failure or when you first start it: it'll go through replay of its own journal, it'll go through a reconnect phase where it's waiting for the clients to come back, and it'll then go through a client replay phase where the incomplete client operations are getting replayed.
So if you see a system stuck, or seemingly stuck, in any of those states (they can take some time), then it's useful to know what that means and what's actually going on within the system. That's a common failure mode in distributed file systems; it's not just CephFS that has this dance you have to go through after a failure. If the clients are unresponsive or one of them has died, we do cope with that, but it's not necessarily immediately obvious.
In case it is still alive, we do something that in some systems would be called fencing, and then we let another MDS start and take on that role. That all happens completely autonomously; there's no admin intervention required. Clients do a similar thing, except instead of pinging the mons (because there may be a very large number of clients, and we don't want to overload the mons with that),
the clients ping the MDSes, and then the MDSes individually decide if a client has been too late, and if it has, they will drop any resources it's holding, so that other clients can get access to them. Let me drop into the chat again. Okay, there's a question: is the dynamic subtree partitioning in place already, as it's said to be unstable? So yes, it's in place, and if you want to know how stable it is, you have to test it.
So that's what CephFS is and how it works. Now, how do you use it? First of all, how does one get it? Well, it's packaged and released as part of Ceph; on some systems it might be a separate package called ceph-mds that you need for your MDS daemons, but it's within the whole Ceph release cycle. You can use ceph-deploy to create MDS daemons, and there's a manual process you can do too, which is documented, and the various orchestration frameworks that have modules for Ceph, many of them, I imagine, know how to do it.
Some low-level things are exposed in the form of admin sockets. Mons, MDSes, and OSDs all have these things called an admin socket, which allows you to log into the node and talk to the daemon locally, and there are quite a few admin socket commands on the MDSes, some of which we will eventually expose up via the mons as well. We tend to add new functionality in the form of an admin socket command first, and then that's exposed elsewhere later.
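For example (a minimal sketch, assuming an MDS daemon whose id is 'a' and the default socket path), run locally on the node hosting the daemon:

    # List the admin socket commands the MDS supports
    ceph daemon mds.a help
    # Equivalent form, talking to the socket file directly
    ceph --admin-daemon /var/run/ceph/ceph-mds.a.asok help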
So what deploying it looks like in a terminal is actually pretty brief. You've got one command with ceph-deploy to deploy an MDS, you need to create a data pool and a metadata pool for your file system, and then you use the fs new command to configure the file system. Once you've done that, the MDS that you just deployed will be informed by the mon that the file system is now available; it'll come up and take rank zero, start operating as an active MDS, and at that point you can go ahead and mount it.
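Roughly what that terminal session looks like, as a hedged sketch with made-up host names, pool names, and PG counts:

    ceph-deploy mds create mds-host1                  # deploy an MDS daemon
    ceph osd pool create cephfs_data 64               # data pool
    ceph osd pool create cephfs_metadata 64           # metadata pool
    ceph fs new cephfs cephfs_metadata cephfs_data    # configure the file system

    # Then mount it, either with the kernel client...
    mount -t ceph mon-host:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
    # ...or with the FUSE client
    ceph-fuse /mnt/cephfs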
Here's another practical example. I mentioned we have recursive statistics that give you real numbers for what is inside a directory. Sorry, this is all a bit difficult to read because of the line wrapping. The top half of this is what we're all familiar with from a local file system like ext4: you go and do an ls on a directory, and ext4 claims the directory is four kilobytes.
It's a little bit strange, but we're all very familiar with that if we're familiar with Linux. Now see how that works with CephFS: if I go and look at one of my directories in a CephFS file system, it's telling me 16 megabytes in this example, and that is the size of the files within the directory, or within any children of the directory.
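The recursive statistics are also exposed as virtual extended attributes; a hedged sketch of querying them on a directory (attribute names as I understand them, values illustrative):

    getfattr -n ceph.dir.rbytes   /mnt/cephfs/mydir   # recursive total bytes, e.g. 16777216
    getfattr -n ceph.dir.rfiles   /mnt/cephfs/mydir   # recursive file count
    getfattr -n ceph.dir.rentries /mnt/cephfs/mydir   # recursive files plus subdirectories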
Now, snapshots are also exposed directly within the file system, and no special tools are needed to create and manage snapshots. Every directory in the CephFS file system has a magic .snap directory. This doesn't correspond to a real piece of on-disk metadata; it's something virtual: the MDSes see you accessing the .snap directory and translate that into internal operations. In this example I'll just step through it: we create a file called history, and then we take a snapshot by making a directory inside .snap.
It's a bit counterintuitive that we're sort of repurposing mkdir, but file systems don't give you a way of adding new commands, so instead of a 'make snapshot' command you have a make directory command. At the point that we've created that snapshot, we can go back up into the backups directory, delete the history file,
do an ls and see that it's really gone, but then, if we do an ls in .snap/snap1, we'll see that it's still there. So once you've created a snapshot, they show up as if they were directories within the .snap folder. And similarly, if you want to get rid of a snapshot, you can get rid of it with the remove directory command.
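The whole walkthrough condenses to a few ordinary shell commands; a sketch of the sequence described above (directory and snapshot names illustrative; depending on the release, snapshots may first need to be explicitly enabled since they are still experimental):

    cd /mnt/cephfs/backups
    mkdir .snap/snap1        # take a snapshot of this directory
    rm history               # delete the file from the live tree
    ls                       # it's gone...
    ls .snap/snap1           # ...but still present in the snapshot
    rmdir .snap/snap1        # remove the snapshot when done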
The statistics that you can get out of any Ceph daemon are particularly useful for MDSes, so you can run this command on any type of daemon, not just MDSes, to get insight into what's going on in your system. If you want to know 'why is my client stuck?' or 'why is my file system not pushing RADOS as hard as I'm expecting it to?', it's very useful to look at these stats, especially the rates of client requests. The sixth column from the left-hand side, the hcr or handle_client_request
column, is kind of interesting, especially when you compare it to the next column along, which is objecter writes; the objecter is an internal name for the component that issues RADOS writes from the MDS. So you can actually see the journaling going on here: we're getting a fairly steady stream of client requests coming in, but we're going through several seconds of not doing very much in terms of RADOS writes, and then a little flurry of updates, and that corresponds to expiry of log segments within the MDS's log.
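To the best of my knowledge the command behind that table is the daemonperf view of the MDS performance counters; a hedged sketch (daemon id assumed, column names abbreviated and version-dependent):

    # Print a top-style table of MDS perf counters once per second
    ceph daemonperf mds.a
    # The underlying raw counters are also available via the admin socket
    ceph daemon mds.a perf dump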
There are varying degrees of testing that have been done so far on this. We recently had some very useful feedback about how the NFS server especially was handling, or not handling, cache pressure properly. So as with any of these upstream components, please do try these things out, but if you find bugs, then be ready for that and be ready to report them to us.
Okay, there's a question again: does the MDS use CRUSH? So CRUSH is the algorithm that's used within RADOS for deciding how to locate data across a population of disks. In order to place something within that population, well, CRUSH tells you where to put placement groups, and then, to decide what placement group an object goes into,
you take the hash of the object name. So in RADOS we scatter the objects across all of those placement groups, and the placement groups get placed on OSDs using CRUSH. The MDS, however, doesn't use CRUSH, and that's because the MDS isn't necessarily aiming for a uniform distribution of data. What the MDS is aiming to do is continuously monitor what the hot spots are in the metadata hierarchy and then decide, based on that, dynamically where to move things around.
So in the metadata servers, when we're assigning metadata from one MDS rank to another, we're doing that explicitly, and the placement in terms of which MDS a piece of metadata belongs to is determined dynamically, whereas within RADOS the place that an object lives is implicit in its ID and the place that the placement group lives is implicit in the CRUSH calculation. So I hope that answers that. And there's a question about permissions and access control, which I will come to a little bit later in the presentation.
Now I want to talk about the most recent developments in CephFS, what's happened over the last year or so. Our focus at the moment is very much on getting to the point where more people will be comfortable using CephFS in production. So, at the moment, CephFS is functional. It works. You can install it and put your data on it and go and read it back and it'll still be there.
One of the downsides to POSIX file systems generally is that, because of the tightly coupled nature of all the inodes and directories and dentries, if you poke a hole anywhere in there and just knock out one piece of metadata, you're potentially going to remove access to all the metadata beneath that point in the tree, and that makes it more fragile.
So these numbers are a little out of date, because Hammer has been out for a while now, but during the Firefly-to-Hammer period you can see many hundreds of commits and many thousands of lines of code added to the CephFS directory, and to the Ceph QA suite as well for testing specifically the file system, and a pretty steady turnover of bug tickets. So, as bugs come up in the file system, either reported by users or found by automated tests, we're continuously fixing them, and we're also backporting some file system fixes.
Although there isn't quite the same level of support for long-term releases of the file system yet, because it's not in production as widely as the other components, we are still making an effort to certainly not break things, not break backwards compatibility (we don't do that), but also to make sure that when there are bugs which are affecting the people who are early adopters of CephFS, they get taken care of. So I'm going to really quickly run through a sort of grab bag of the various things that have been added.
Similarly, if something seems to be stuck or not progressing, you can now use the op tracker component to get a very detailed, into-the-internals view of what's going on within the MDS. This is a little less user friendly, because it is very internal information, but if you've got a system that's stuck and you need to send some information to developers, or to whoever is supporting your system, it allows you to give them that level of information.
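A minimal sketch of pulling that information out of the op tracker via the admin socket (daemon id assumed; exact command availability may vary by release):

    # Show metadata operations currently in flight in the MDS
    ceph daemon mds.a dump_ops_in_flight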
We want to check that if a directory says it has 16 megabytes of data in the files within it, those files really exist and really do have that total size; that the metadata we have in memory in our cache really matches what's on disk; and that our metadata for files matches reality. So if we've got metadata saying the size of a file should be 200 megabytes, is it really 200 megabytes? This is partly about detecting damage from loss of objects on disk, although RADOS is very resilient.
If you do have, for example, a three-disk failure and you lose some subset of your placement groups, we want to make it so that that won't kill your entire file system. You'll only lose the data you've really lost, rather than having your whole file system go down. But as well as that data loss case, it's also about making the system resilient to bugs, because bugs happen, and we need to make sure that when somebody hits an issue, we can take them through a process of recovering and fixing their system.
So some parts of this recovery and repair capability are starting to come online, in the master branch at least. There's a brand new cephfs-data-scan tool which enables you to scan through the data pool and essentially scrape out the files by exhaustively examining the objects in the pool. A little bit more selectively, we can identify which files in the data pool appear not to be referenced by the metadata, so things which are orphans, and take actions such as removing them, or recovering them, so creating metadata that will reference them.
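A hedged sketch of what using that tool looks like (run offline, with the file system and MDSes stopped; the pool name and exact argument forms vary by release):

    # Pass 1: scan all data objects and reconstruct file sizes and layouts
    cephfs-data-scan scan_extents cephfs_data
    # Pass 2: scan inode backtraces and rebuild or relink metadata for orphans
    cephfs-data-scan scan_inodes cephfs_data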
There will also be a new capability at the RADOS level to have many workers share out the namespace in a RADOS pool and work on it in parallel. So this tool will take advantage of that, and what it will look like is that you'll have tens, or however many you want, instances of this program running in parallel to scrape the data out of your system in the case of a disaster.
There is a sort of sibling tool that has existed for a little bit longer; this has been in the last couple of releases. It's called cephfs-journal-tool, and what this gives you is the ability to recover from damaged journals. We've seen at various points bugs or incidents that have led to people having damaged journals, and historically that would break your system really badly. It would not necessarily be completely unrecoverable, but it was pretty hard to recover from, because without these journals the MDSes couldn't even start up.
So this is an offline tool which, if your MDSes won't start up because something's gone wrong with your journals, lets you interrogate what's there, identify specifically which parts of the journal appear to have become unreadable or unusable, and then take action to fix that, such as by blanking out parts of the journal that you no longer want to touch because you know they're broken, or by trying to scrape out as much metadata as we can from the journal before purging it from disk.
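A hedged sketch of the kind of session that might look like (run offline, against rank 0 by default; subcommand names as I recall them):

    cephfs-journal-tool journal inspect                   # report journal integrity and damage
    cephfs-journal-tool event recover_dentries summary    # scrape recoverable metadata out of the journal
    cephfs-journal-tool journal reset                     # then blank the damaged journal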
There are a number of new features which enable you to have some visibility of your clients on your cluster. So there is a session ls admin socket command that allows you to list what client sessions exist. We've also added client metadata, which is transmitted from the clients to the MDSes, and it tells you things like the kernel version, the hostname, and the path that something has mounted, which means that rather than saying 'client 237x had an issue',
we can say 'the client with this hostname had an issue'. It's a simple thing, but it makes a real difference if someone's trying to work out which client is clobbering their system. Client eviction, so killing the session of a client which is known or believed to be dead, or which is misbehaving, used to operate in a slightly best-effort way, and that has been tightened up now, so it's now possible to properly blacklist a client and fence a client.
So even if you have a misbehaving client, you can ensure that you can safely remove it from the system. And that's an example of what it looks like when you run session ls: you get a bunch of useful information about your clients. In the future it would be useful to extend this to environment-specific pieces of information, like what HPC job a client is part of, what VM a client belongs to, or what container it belongs to.
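A minimal sketch of listing those sessions on an MDS (daemon id assumed; the output is JSON whose exact fields vary by version, but it includes the client id plus the hostname, kernel version, and mount path metadata mentioned above):

    ceph daemon mds.a session ls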
There have been various client improvements, especially to cache trimming, which didn't work too well a year ago and now works a lot better. In the FUSE client there is new flock support, so if your application relies on that, you can now use it with the FUSE client. And there is a new quota feature in the FUSE client. This is implemented client-side, so at the moment it's specific to the FUSE client and not available in the kernel client, and there are some caveats around the quota support.
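The quotas are configured through virtual extended attributes too; a hedged sketch with illustrative limits:

    # Limit a directory subtree to roughly 10 GB and 10,000 files (enforced by the FUSE client)
    setfattr -n ceph.quota.max_bytes -v 10737418240 /mnt/cephfs/mydir
    setfattr -n ceph.quota.max_files -v 10000       /mnt/cephfs/mydir
    getfattr -n ceph.quota.max_bytes /mnt/cephfs/mydir   # read a limit back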
So, in addition to adding all of those useful capabilities, there's been a lot of work going into testing and QA as well, and that's really, ultimately, the answer to whether a given release of CephFS is ready: does it pass the tests? Ceph has the teuthology test framework, which a lot of people in the community will be familiar with and which is used across Ceph, and a bunch of new functional tests have gone in there.
There's some ongoing work to improve the access control features around CephFS. Historically you could kind of do a trick where you would set layouts on particular folders that use particular pools and then make sure that certain clients could only access certain pools at the RADOS level, but it didn't prevent clients from doing naughty metadata operations. So it was kind of a little bit fuzzy, and there are new features going in here now to have robustly enforced client access controls using the auth caps mechanism that exists throughout Ceph. And then there are
all these other areas, like hardening the multi-MDS support and the rebalancing, all that good stuff; getting the snapshots working even better than they are now and getting the testing around that which gives us the confidence that they're working; and integration with cloud and container environments. So, for example, the Manila project in OpenStack is of a lot of interest to us, because that provides an avenue by which people can use CephFS with their cloud environment. And just really quickly,
at the end: if you are trying out CephFS right now as an early adopter, these are links you definitely should know if you don't already: the mailing list, the IRC channel, the issue tracker, and the documentation, including the troubleshooting documentation. And when you encounter an issue, check whether the most recent release fixes it, because stuff is getting fixed all the time, and that includes in the kernel, if you're using the kernel client.
If you're reporting an issue, tell us as much about your configuration as you can, especially what versions you're using, whether you're using the kernel client or the FUSE client, what you are doing with the file system, and what kind of workload you are running. And ideally, if you can reproduce an issue with verbose logging enabled, that makes us really happy, and it makes for a really good tracker ticket, if you can do that. So with that I will wrap up and go see if there are any more questions in the chat, or if anyone's talking in IRC.
I don't know off the top of my head if there's a hard limit, but there's a practical limit, because when you create a pool you're creating PGs, and PGs consume resources on the OSDs. So you don't want to create an indefinitely large number of pools. This is a RADOS thing, by the way, not a CephFS thing.
The solution to that is something called RADOS namespaces, and what that allows you to do is create subdivisions, namespaces, within a pool without creating any more PGs and without consuming any more physical resources. So one of the things we'd like to do in the near future is allow CephFS layouts to specify not just a pool but also a namespace, so that people can divide things up using namespaces rather than pools and avoid spurious pool creation.
Yes, please. It may also be affected by what version of FUSE itself you're using, but I imagine if you're using kernel 4.1 you're probably using a recent version of FUSE, so there is a possibility that it could be our bug, but I'm not aware of a bug in that area, so we'll see what's going on. Question: am I going to post the slides? This whole talk is videotaped; it's going to be on YouTube. Then someone asks: how does MDS fencing work?
How do you guarantee that a fenced-off MDS does not subsequently modify metadata? So fencing, or blacklisting, is something that's implemented at the RADOS layer; it's one of the very, very useful primitives that RADOS gives us. When you fence a RADOS client, in this case an MDS, what we do is write an entry to the OSD map.
Everyone then learns that there is a more recent OSD map ('here it is'), and at that point you can guarantee that the clients won't be allowed to work with an older version of the OSD map, so they will have seen the blacklist, and, more importantly, the OSDs enforce the blacklist. On top of that RADOS mechanism, within CephFS we also have a similar structure called the MDS map, and that has an OSD map version in it that reflects the version at which we last did something. So what happens is
we blacklist MDS A and we create a new version of the OSD map, let's say version 99, that includes that blacklist entry. We write 99 to our MDS map, and in the same transaction we write to our MDS map that MDS A has failed. So anybody seeing the MDS map that says MDS A has failed will also see that they need OSD map 99 before taking any actions which assume that MDS A is dead, for example after the point that MDS B has taken over the rank.
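The CLI view of that mechanism, as a hedged sketch: CephFS writes these blacklist entries automatically during failover, but the same RADOS primitive is visible and usable by hand (the address and nonce shown are illustrative):

    ceph osd blacklist ls                               # show current blacklist entries
    ceph osd blacklist add 192.168.0.10:6800/12345      # manually fence a client instance
    ceph osd blacklist rm  192.168.0.10:6800/12345      # remove the entry again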
There's a question: will CephFS work with a cache tier of SSD pools in front of slower backing pools, and are there any specific implications? So CephFS can use a cache tier, because RADOS exposes cache tiers pretty transparently when you're using an overlay mode. You essentially just create a cache tier on top of the pool, set it as the overlay for that pool, and then point CephFS at the underlying pool, and it will pick up the overlay just the same as any other RADOS client.
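A hedged sketch of setting that up with made-up pool names, following the generic RADOS cache tiering commands:

    ceph osd tier add cephfs_data cephfs_cache           # attach the cache pool to the base pool
    ceph osd tier cache-mode cephfs_cache writeback      # run it in writeback mode
    ceph osd tier set-overlay cephfs_data cephfs_cache   # make it the overlay for the base pool
    # CephFS keeps pointing at cephfs_data and picks up the overlay transparently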
The caveat is that you can't use erasure-coded pools directly, so you would have to use a cache pool if you want to use erasure-coded pools. You should also be aware that that's not something we've tested a lot, and you might find you have really quite interesting performance characteristics if you do something like that. So, for example, if your data pool was on a cache tier and you had a lot of hard links, which we have to resolve by going and reading backtraces from the data objects, you might see some surprising behaviour there.
Okay, last question: thoughts on newstore and its effect on CephFS. The only knock-on effect of newstore on CephFS is that it's coupled to some of the sharded object listing feature that we need for making cephfs-data-scan scalable. So that's the only coupling there is; other than that, nothing specific.
I don't see any more questions coming in, so I think that's just about the end. Stay tuned for our next Ceph Tech Talk, which will be on the 27th of August; that's a Thursday again. Keep an eye on the Ceph Tech Talk page in case that changes, but other than that, thanks John, this was great, and we'll see you guys next month.