From YouTube: Ceph Month 2021: CephFS update
Presented by: Patrick Donnelly
Full schedule: https://pad.ceph.com/p/ceph-month-june-2021
A
So, to begin, we'll again be going through the various features we had in Pacific. You'll notice that many of these slides are based off of what Sage already presented, with some modifications; I'll try to go into more depth during this talk, and feel free to stop me at any time to ask any questions you might have.
A
I've also included in the slides, which will be distributed later, the blog post I wrote up, which also covers a lot of these discussion topics in somewhat more detail.
So, the project has been following five themes revolving around the features we've been developing for the last few releases, and we'll begin with usability.
A
So the highlight feature for Pacific is that multi-file-system support is now stable, and it's fairly simple now to create file systems. No longer do you need to do a bunch of pool creation, then a `fs new`, and then spin up MDSs; in Pacific this is now largely automated for you.
A
We have this `fs volume` interface that allows you to create a file system quite easily, including all of its pools, following the best recommendations for CephFS, and automatically deploy MDSs using the deployment backend for Ceph, either cephadm or Rook. You can bring up as many file systems as you need and then also remove them as needed.

Another usability feature we've developed in that same vein is the MDS autoscaler, which will start and stop MDSs based off of the needs of the file systems you have.
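As a sketch of that interface (the file system name `myfs` is a placeholder; consult your release's docs for exact flags):

```shell
# Create a file system: data/metadata pools are created with recommended
# settings and MDS daemons are scheduled via cephadm or Rook.
ceph fs volume create myfs

# List the file systems managed this way.
ceph fs volume ls

# Tear it down again, including its pools and MDSs.
ceph fs volume rm myfs --yes-i-really-mean-it
```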
A
This is a module that's not enabled by default; you have to turn it on, but once on it should create MDSs using the deployment tool, either cephadm or Rook, in response to changes in the configuration of your CephFS file system. For example, if you increase `max_mds`, the module will automatically deploy another MDS in response to that change.
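A minimal sketch of turning that on (the module name is `mds_autoscaler`; `myfs` is a placeholder):

```shell
# Enable the autoscaler module (off by default).
ceph mgr module enable mds_autoscaler

# Raising max_mds now causes the orchestrator to deploy another MDS;
# lowering it stops the surplus daemons.
ceph fs set myfs max_mds 2
```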
A
We've also developed a new tool, currently of developer-preview quality, called `cephfs-top`, which lets administrators monitor the usage of the Ceph file system with details that were previously not trivial to gather. In particular, clients now communicate certain information to the MDS, which we call metric gathering, concerning performance statistics that the MDS would not otherwise be aware of: for example, how effective the client caches are and how much read and write I/O they're putting on the Ceph cluster.
A
This is communicated to the MDS, and the MDS is able to communicate with the Ceph manager to provide a summary of statistics for all the clients in the cluster and display a simple ncurses UI for the administrator, showing the various client sessions in the Ceph file system, what they're doing, and who the top consumers are. Again, this is a tech preview, but we are always spending time to improve it, and we are eager for any feedback that the community has.
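To try it, something like the following should work on Pacific (the `stats` manager module feeds cephfs-top, and `client.fstop` is the user the tool looks for by default):

```shell
# Enable the manager module that aggregates client metrics.
ceph mgr module enable stats

# Create the read-only user cephfs-top uses by default.
ceph auth get-or-create client.fstop \
    mon 'allow r' mds 'allow r' osd 'allow r' mgr 'allow r'

# Launch the ncurses UI.
cephfs-top
```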
A
In that same vein, we've also been working on `cephfs-shell`. To my knowledge, we haven't had too much feedback from the community; this has been around for about two or three releases now. It's a simple Python utility that mounts CephFS and allows you to execute some commands on the file system without having to mount it using FUSE or the kernel.
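For example (invocation details may vary by release):

```shell
# Interactive session against the cluster in /etc/ceph/ceph.conf.
cephfs-shell

# One-off commands without entering the shell.
cephfs-shell -c /etc/ceph/ceph.conf "mkdir /dir1"
cephfs-shell "put localfile.txt /dir1/localfile.txt"
cephfs-shell "ls /dir1"
```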
A
We also have a new snap-schedule manager module, which also must be enabled, that allows you to schedule snapshots on a Ceph file system at a given period and also retain snapshots according to a retention schedule. That lets you, for example, set up snapshots to be taken every day for a week and then delete any snapshots that are older than that time frame.
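The daily-for-a-week example looks roughly like this (paths are relative to the file system root):

```shell
# The module must be enabled first.
ceph mgr module enable snap_schedule

# Snapshot the root once a day...
ceph fs snap-schedule add / 1d

# ...and retain 7 daily snapshots.
ceph fs snap-schedule retention add / d 7

# Inspect the schedule.
ceph fs snap-schedule status /
```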
A
This is a module that's designed to be used in tandem with the new cephfs-mirror project that we'll talk about later.
A
There is an `nfs` manager module that allows you to create NFS clusters to export CephFS and to set up a group of exports to be served by that NFS cluster.
A
You can set up the NFS clusters in active-active configurations, and in the near future you'll even be able to set up HA using cephadm with haproxy; that's work that Sage Weil is doing right now. You can also use Rook for the HA component, although the Rook CRD is under active development right now, so that's not quite ready yet. The NFS clusters can be automatically deployed using the Ceph orchestrator.
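A sketch using the Pacific-era syntax (cluster name, file system name, and placement are placeholders; the export syntax changed in later releases):

```shell
# Deploy a two-daemon NFS-Ganesha cluster via the orchestrator.
ceph nfs cluster create cephfs nfs1 "2 host1,host2"

# Export the file system "myfs" under the pseudo-path /cephfs.
ceph nfs export create cephfs myfs nfs1 /cephfs
```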
A
We've also been working on adding encryption support to the kernel client, and there have been a few changes that have also been necessary in the MDS.
A
Some of these changes are being adjusted constantly, and the MDS side is not quite ready because the kernel client is still under active development, but we have several changes already on the MDS side for Pacific, and we plan to backport the remaining changes necessary to get complete encryption support ready in the MDS, so that the kernel client may mount CephFS subvolumes and have them be encrypted.
A
All right, I think we might have a question. Yes, we're doing an fscrypt deep-dive talk tomorrow at 9:30 a.m. Eastern time, if you want to learn more about that. Coming back to robustness: we now have feature-bit support for turning on and off required file system features.
A
This replaces the previous behavior of setting a minimum client release, which has historically not been a great design, especially when the kernel client may only selectively backport certain features to bring it to parity with a Ceph release.
A
So a more robust solution is to selectively enable and disable the features that you want; the API for that is all set up and documented in the CephFS docs, if you want to learn more.
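As a sketch (`myfs` and the feature name are placeholders; `ceph fs feature ls` lists the valid names):

```shell
# List known client features and their bit numbers.
ceph fs feature ls

# Require all clients of myfs to support a feature; clients lacking it
# are rejected or evicted.
ceph fs required_client_features myfs add metric_collect

# Drop the requirement again.
ceph fs required_client_features myfs rm metric_collect
```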
We've also stabilized multi-MDS file system scrub. In the past, before Pacific, you would have to bring your CephFS file system down to `max_mds` equals one, reducing the number of ranks to one, before you could run a scrub.
A
This was due to bugs that we knew existed, producing false scrub errors or incorrect scrubbing when executing a scrub with multiple ranks.
A
We did a significant amount of development to improve that, and we now support having multiple actives: you no longer need to change `max_mds` in order to execute a scrub.
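So with Pacific a scrub can simply be started against rank 0 while other ranks stay active, roughly:

```shell
# Recursive scrub of the whole tree; no need to set max_mds to 1 first.
ceph tell mds.myfs:0 scrub start / recursive

# Check on progress.
ceph tell mds.myfs:0 scrub status
```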
In the kernel client, we now have support for messenger v2. Thanks to efforts by Ilya, you can just specify a kernel mount option to turn that on. Also in the kernel client, we have support for recovering mounts from blocklisting, so you no longer need to remount your kernel client to bring it back after it becomes blocklisted by the MDS; that can be turned on with `recover_session=clean`.
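Both options go on the kernel mount, for example (user and path are placeholders; `ms_mode` also accepts `secure`, `prefer-crc`, and so on):

```shell
# Mount with messenger v2 (crc mode) and automatic recovery from
# blocklisting instead of a forced remount.
mount -t ceph :/ /mnt/cephfs \
    -o name=fsuser,ms_mode=crc,recover_session=clean
```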
A
This doesn't actually allow workloads that were doing writes to continue, because that's generally a non-recoverable error.
A
If an application writing to the mount tries to come back, it won't, but applications that had read handles on files will continue to function, and new applications running on the mount can do reads or writes. ceph-fuse has a similar behavior with the `client_reconnect_stale` configuration; note that if you want to do this, you should disable the page cache.
A
Finally, as far as testing CephFS is concerned, we've dedicated a lot of effort to cleaning up our test infrastructure and making sure we're testing client and MDS configurations across different types of tests in more consistent ways. We've doubled the number of tests for CephFS as a result, from about 2,500 jobs to 5,000, so we're now testing a lot more upstream and gaining a lot more confidence in the stability of the system.
A
Moving on to performance: another big feature, which was partially available in Octopus as a development preview, is ephemeral pinning; this is now stable. This is policy-based subtree pinning. Pinning subtrees has been around since, I believe, Luminous, allowing you to assign a given directory subtree to a particular MDS rank. Ephemeral pinning lets you set a policy saying how you want a directory or a group of directories to be pinned, without actually assigning the pinned directory to a particular rank.
A
It allows the cluster to intelligently distribute the ephemerally pinned directories and also rebalance them if the number of MDSs in the cluster changes. There are two different kinds of ephemeral pins: distributed pins, which automatically shard subdirectories (think of a home directory), and random pins, which give a probabilistic chance that a directory loaded into the cache from the metadata pool, or a newly created directory, is ephemerally pinned.
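The policies are set with virtual extended attributes on a mounted file system, for example:

```shell
# Distributed pin: shard the children of /home across all active ranks.
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/home

# Random pin: each descendant directory has a 0.1% chance of becoming
# an independently pinned subtree.
setfattr -n ceph.dir.pin.random -v 0.001 /mnt/cephfs/scratch
```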
A
This is a pretty exciting change for us, and we believe it could have some very attractive performance aspects for a lot of workloads; we're hopeful we'll get some good feedback on this.
A
Next, we've also spent a lot of time improving the capability and cache management in the MDS for larger clusters. This has been an issue we've been dealing with for several releases now, and we believe the current state of CephFS is much better and a lot more stable than it used to be. The most recent changes we've made improve the cap-recall defaults for larger production clusters, in response to some feedback.
A
That feedback came from Dan van der Ster of CERN. We also improved the capability-acquisition throttling for some client workloads. What that means is that certain workloads, like the `find` command executed on CephFS, could acquire huge numbers of capabilities (caps) via readdir calls; that would cause the mount they were executing on to get way more caps than it should, and the MDS, due to its own throttles, would not recall them fast enough.
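The relevant knobs can be tuned through the central config store; these are illustrative values rather than recommendations (Pacific already ships improved defaults):

```shell
# Let the MDS recall more caps per client, faster.
ceph config set mds mds_recall_max_caps 30000
ceph config set mds mds_recall_max_decay_rate 1.5

# Throttle how many caps a single session may acquire before readdir
# replies are delayed.
ceph config set mds mds_session_cap_acquisition_throttle 500000
```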
A
Many of these changes are in the kernel client, which is currently the only client that does this, and have already been backported to some of the more stable kernels; in particular, RHEL 8.4 has this. This feature needs to be turned on with the `nowsync` flag, which allows the kernel client to perform directory operations asynchronously.
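Opting in is just a mount flag (only on kernels that carry the feature):

```shell
# Allow asynchronous creates/unlinks on this mount.
mount -t ceph :/ /mnt/cephfs -o name=fsuser,nowsync
```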
A
Moving on to multi-site: as I spoke of earlier with the snap-schedule module, we also have snapshot-based mirroring through the cephfs-mirror tool. This allows you to configure replication targets, remote Ceph clusters to be mirrored to, configured for any directory.
A
The cephfs-mirror daemon is an analog of the rbd-mirror daemon; it is used to push data from the locally snapshotted CephFS file system to another CephFS file system located on the remote cluster.
A
The feature is snapshot-based, so you need to have a snapshot of the file system before it'll actually sync anything to the remote cluster.
A
We have an initial implementation in Pacific that we're doing aggressive feature development on right now. It supports a single-daemon configuration, but that'll soon change to allow for multiple active cephfs-mirror daemons that automatically balance the workload; HA support is already present.
A
We also recently updated it to improve incremental updating: it looks at the directory tree, similar to how rsync does, to choose which files to actually sync. If you have multiple snapshots, it'll only incrementally send what is necessary to do the sync rather than resyncing the entire tree. Again, we'll be developing this aggressively, and there will be more backports to Pacific to improve how it functions.
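End to end, a minimal setup on the source cluster looks roughly like this (peer names and paths are placeholders; a bootstrap-token flow also exists):

```shell
# Deploy the mirror daemon and enable the mirroring manager module.
ceph orch apply cephfs-mirror
ceph mgr module enable mirroring

# Enable mirroring for the file system and register the remote peer.
ceph fs snapshot mirror enable myfs
ceph fs snapshot mirror peer_add myfs client.mirror_remote@site-b myfs

# Mirror snapshots of one directory.
ceph fs snapshot mirror add myfs /projects/build
```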
A
Moving on to ecosystem: we've also spent a lot of time improving the use of CephFS in Kubernetes CSI environments, where CephFS is used for RWX or ROX volumes (PVCs and PVs). This is all orchestrated through the volumes plugin in the Ceph manager, which provides an API for creating and deleting PVs through what we call the subvolume interface.
A
For Pacific, we stabilized the snapshots interface for subvolumes and stabilized the interface within CSI. We have also been adding new authorization API support for OpenStack Manila, which is in the process of being updated to use this new API, and we've added ephemeral pinning support for subvolumes.
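The subvolume interface that CSI and Manila drive looks like this from the CLI (names and sizes are placeholders):

```shell
# Create a 10 GiB subvolume in file system myfs.
ceph fs subvolume create myfs vol1 --size 10737418240

# Snapshot it (the interface stabilized in Pacific).
ceph fs subvolume snapshot create myfs vol1 snap1
ceph fs subvolume snapshot ls myfs vol1

# Resolve the path a client should mount.
ceph fs subvolume getpath myfs vol1
```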
A
The current code is a developer preview, but we've heard a lot of positive feedback on the community mailing list already. It's under active development, and one of the next major steps will be to integrate ceph-dokan into our upstream QA so it's tested more regularly.
A
For performance, what we would like to do next is add support for asynchronous rmdir and mkdir, and potentially also link and rename. This will get us to a point where rsync can run extremely quickly; those are the last remaining RPCs that we would need to make asynchronous to improve that kind of workload tremendously.
A
We also have asynchronous metadata operation support in libcephfs currently under development; we expect that should be merged for Quincy. There are also some opportunities to improve the performance of ceph-fuse: there have been some recent changes to libfuse that we believe we could integrate into ceph-fuse to take advantage of the latest kernel modifications, helping us achieve better performance.
A
We are also developing fscache support within the kernel client; Jeff Layton has been working on this. The fscache interface is currently being refactored and reworked by David Howells, and the Ceph kernel client is being constantly updated to track his latest changes. We're hopeful that we can get something like that complete by the Quincy release.
A
But of course, the kernel has its own release schedule, so they may not match up, or the kernel support may be finished before Quincy is even released.
A
Another performance feature that we're looking into is a recursive unlink RPC. Recursive unlink on a distributed file system is usually pretty slow, and a lot of workloads just want to throw a tree away and forget it.
A
That's especially true within the new volumes plugin, which needs to regularly delete subvolumes; it would be much faster to just tell the MDS that an entire directory tree can go away. That's a workload we'd like to support through a new recursive unlink RPC, and probably it'll be exposed to the user as some kind of dot-trash directory, to allow applications to do that without linking to libcephfs.
A
From a usability standpoint, we're looking to support MDS rolling upgrades. We realize it's very difficult for Ceph clusters to upgrade the MDS due to the awkward procedure that must be followed, and this is something we want to try to address with the Quincy release.
A
There are already some changes in flight to improve how we're handling the compatibility sets for the MDSs and the file systems, to eliminate the need to actually stop MDSs and standbys prior to doing any package upgrades.
A
This should simplify the procedure a lot, and then we can move on to rolling upgrades of multiple-active file systems. We're also working on a libcephfs SQLite VFS, which will be another client for CephFS; this will be the companion VFS to the new libcephsqlite.
A
libcephsqlite is in Pacific; there's a link to the blog post about it in the slides. The libcephfs SQLite VFS will function similarly, in that it allows you to put a SQLite database on CephFS without actually mounting CephFS using the kernel client or ceph-fuse; it can speak the CephFS protocol directly and avoid doing any mounts.
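For comparison, the existing RADOS-backed libcephsqlite is used like this (pool and database names are placeholders); the CephFS-backed VFS described here would presumably look similar:

```shell
# Load the ceph VFS and open a database stored directly in a RADOS pool.
sqlite3 -cmd '.load libcephsqlite.so' \
    -cmd '.open file:///mypool:/mydb.db?vfs=ceph'
```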
A
Then we've also got the MDS memory target, which is a configuration companion to the `osd_memory_target` variable. This is to improve how the MDS utilizes memory, by monitoring its own memory use and then adjusting its own cache as appropriate in order to meet the desired memory target. We have some code that's already been worked on and just needs to be picked up and polished for Quincy.

For CephFS multi-site replication, we're currently working on more testing for the HA components of cephfs-mirror that are already present, and on adding active-active support. Both of those features we're planning to backport to Pacific.
A
You won't need to wait for Quincy for those to be available. We also want to make it more automatic to set up snapshots and sync them to the remote cluster, to make that interface a little cleaner. Some of the other things we're looking at are more sophisticated models for doing multi-site replication, without any firm plans to do this for Quincy: potentially bidirectional or loosely, eventually consistent synchronization mechanisms, and some kind of simple conflict-resolution behavior in tandem with bidirectional sync. The actual details of that have yet to be sorted out, as has how much that feature is in demand by the community. So if you have any thoughts or a strong desire for this type of system, please do share on the mailing list.
A
So we've got a few minutes for Q&A, if anyone would like to ask questions.
B
I have a question; Dan from CERN. Hi, Patrick. I'm wondering if there's any work on multi-threading?
A
Any work on multi-threading: the answer is there has been work, but due to the size of the problem not a lot of progress has been made, mostly just a lot of design discussion. So I would say no, not really any progress made.
B
Okay, yeah, because we really noticed our MDS has hit the CPU limits. We were running 10 before, but this was operationally kind of weird, so now we try to run maybe three or four, and CPU seems to be a bottleneck.
B
Okay. The other thing I was wondering is: it happens very rarely, but whenever users hit a crash of an MDS and then the journal can't be replayed, they have to enter into the disaster-recovery procedure. This is extremely scary. I'm wondering if there are any ideas to make that somehow less scary, and maybe even more automatic?
A
Yeah, disaster recovery on CephFS can be frankly terrifying for users. Which steps you should follow is not always clear, and I think the MDS rightfully decides to turn itself off if it detects any kind of metadata damage and expects the administrator to come in and take a look.
A
There are tools to do that, of course, but which tools to use is not always clear. I think, as far as improving the usability of CephFS, we could go a long way by being more clear about the kind of metadata damage we've discovered, and in places where we do detect it there should be some suggestions on where to look.
A
We could suggest what tools to use, and maybe try to classify the errors numerically so that they can be looked up more easily, so administrators have somewhere to go immediately to see what kind of metadata damage has been discovered, whether it's important, and whether it can just be deleted.
B
[inaudible question]
A
Yeah, there were plans to do that, so you're not crazy, but because of the recent changes in RADOS to reduce the number of PGs that a pool needs to have, you can have a pool with one PG and you don't see any errors or warnings anymore in Ceph.
B
[inaudible question]
A
Obviously, yeah, it would be possible, but the benefits there... yeah, I'm uncertain.
A
Okay, thank you all for attending. Please check the Etherpad for the upcoming events on the calendar; I'll go ahead and link that in the chat.