From YouTube: Ceph Developer Summit Quincy: CephFS
Description
00:00 - Quincy open issues
00:59 - Dashboard and FS/NFS
14:05 - mds_memory_target
18:19 - FScrypt integration
24:32 - async rmdir/mkdir and link/rename
30:55 - recursive unlink
42:14 - MDS rolling upgrades
50:00 - CephFS mirroring metrics (in line with RBD mirroring metrics)
A
Welcome, everybody, to the CephFS Quincy CDS. The agenda is in the chat.
A
First stop: we have the dashboard and FS/NFS current status and next steps. Did Ernesto put that up?
B
So please let me know when you're seeing my screen. Yep, you see it. Okay, thanks. So, this is for those of you not familiar with the dashboard, or with the file-system-specific component of the dashboard.
B
This is how it looks. I'm running a vstart cluster, so some things will be slightly different compared to a cephadm one, but I think it will basically work for the purposes of this demo. Basically, in vstart there's this default file system, so currently this is what you get in the dashboard. I've also enabled a couple of MDSs, so we have an active and a standby one.
B
That's currently the info, plus the pools associated with this file system, and we have this chart. I think there was a proposal to remove it because we have the Grafana ones, but this is coming from the information in the manager, so it's essentially free to get, so we may leave it. Apart from that, we have the clients; right now there is no client connected here. And then the directory listing.
B
So now we're able to see a list of directories, and we also have the possibility of creating snapshots. The snapshot comes with a pre-built, suggested name based on the date. For the directories we also have the ability to set quotas, so basically we can set quotas there, and that's mostly it regarding the operations for the file system.
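(For reference, the dashboard operations described above correspond to plain CephFS primitives. A rough illustration, not the dashboard's actual code: with the file system mounted, a directory snapshot is just a mkdir inside the hidden `.snap` directory, and quotas are set through the `ceph.quota.*` virtual extended attributes. The mount point and names below are assumptions for the example.)

```python
import os

# Assumed CephFS kernel mount and target directory (hypothetical paths).
mount = "/mnt/cephfs"
dirpath = os.path.join(mount, "projects", "demo")

# Create a snapshot of the directory: a mkdir inside its hidden ".snap" dir.
# The date-based name mimics the suggested name the dashboard pre-fills.
os.mkdir(os.path.join(dirpath, ".snap", "snap-2021-06-15"))

# Set quotas on the directory via CephFS virtual extended attributes:
# maximum bytes and maximum number of files under this directory.
os.setxattr(dirpath, "ceph.quota.max_bytes", b"10737418240")  # 10 GiB
os.setxattr(dirpath, "ceph.quota.max_files", b"100000")
```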
B
There is the Grafana dashboard as well. What else is missing here? That would be it for pure CephFS. If we go to NFS, which we didn't demo for RGW, we may use this for demoing that as well. This is how it looks; this one was created yesterday, so with cephadm it would be a bit different. I configured this.
B
One point that has also been recently investigated: it seems like we're able to create NFSv3 exports, though I'm not sure if, from the dashboard perspective, the NFSv3 mount point is there, and I'm not sure if the Ganesha daemon, we haven't tested that yet, is fully working with this config. I'm also not sure how this works with the volume manager or the new nfs module, so we'll need to double check that. But basically, you can create an export from here.
B
D
B
I think it will probably be worth having a follow-up discussion on this, because there are different, well, upstream/downstream discussions around this topic. So at least where it's possible to do that, we can simply disable or hide it, or whatever. It's just for our convenience.
E
Yeah, and also, I would probably specifically want to be using NFSv4.1 or greater. 4.0 is actually quite a different protocol, and it may be fine on 4.0, but it's not well tested, and one thing it doesn't give you: if you use 4.0, then your grace periods have to last the full length. You can't lift the grace period early; if the server goes down, it has to come back up. And that applies to any server in the cluster as well.
B
Okay, I think for this we're still directly creating, correct me if I'm wrong, the export files in RADOS, right? We are not using the nfs or the volume module for this, right?
C
In that cluster that you have, I think it's user-defined here? Yeah, well, in this case it's user-defined, yeah. I think the orchestrator was the only thing this worked with before, but now we also allow user-defined exports, so they can set it this way. But this is one of the debates: whether we definitely drop this for Quincy or not.
A
I put it in the agenda: there's a tracker ticket for the NFS export update to integrate the dashboard with the nfs plugin. There's actually quite a bit of work that needs to be done, but it's all laid out in that ticket.
B
Okay, I will resume this and just create a sample there, for example. This form will create this directory in CephFS. We can, for example, drop the NFSv3 option, and then we lose the tag option; we put another name for the export, select this option, and that's it. So we have the other export created here, and if we go to the file system, the directory is there, so there's the sample directory created there.
B
Currently we don't support the creation of files here. I don't really know the reason for that, but we may start adding more fine-grained operations, maybe also dealing with objects. It's something we haven't considered yet, but as we progress towards more fine-grained operations we may include that as well, just as some way of checking that there are some objects. I think that's all of it. Not sure if you have any questions about this, or anything you would miss here from the CephFS or NFS perspective.
A
I had a few comments. The snapshotting of directories is interesting. What would be really cool is if we could get that to hook into the new snap_schedule module so that you can manipulate the schedules from within the dashboard.
A
So yeah, you've already got being able to snapshot individual directories with arbitrary names; that's great, but I think the next step there would be the scheduling. On the performance metrics: I'm trying to recall where exactly the MDS is sending those metrics. I think it's in the manager report message that periodically gets sent out; that hasn't been touched since John Spray was working on this.
A
What we have now, though, are the new metrics that are reported for cephfs-top. Venky's not here, but I think the vision is: we have the cephfs-top tool for the command line interface, and then the dashboard would also consume the same metrics to display them in the browser.
A
So I think that would be the direction we would want to go next: you'd be able to get the client listings and various information about the clients, like how many caps they're using and how effectively they're using their capabilities, like how well their caches are being used, and things like that.
G
So, I don't want to divert, but I'd like to register interest here: I work on Manila and OpenStack.
G
And I noticed that the way the dashboard handles exports is more up-to-date, or a different method than we use, in particular for Ganesha.
G
We use a D-Bus mechanism to update the exports rather than using a watcher on the RADOS URL, even though we stick the URL in a RADOS object. So I'm wondering if we should be updating; we've recently updated to use the CephFS subvolume module rather than the old volume client library, and we have an interest in staying up to date with what's happening here.
G
I don't want to divert the conversation about the dashboard. The UI aspects of it are primarily of interest to us from a read-only cloud administrator perspective, rather than in terms of updating exports, all of which is API automation for us, but underneath there's clearly a different API. I've played with the dashboard and the orchestrator with Pacific a bit, with cephadm, and they're both more up-to-date than what we're doing.
G
Maybe it's a question mainly for Jeff, Patrick and Ramana, but just letting you know: I noticed what's happening here, and if we should be changing, let us know and we'll work with you on it.
G
And the other aspect of this is that you can deploy Ganesha active-active with the orchestrator, and that's complicated. For us, if we lose a node, systemd is not able to handle that migration to another node today, and we had talked about Kubernetes and so on, but simpler than that problem right now is just the export maintenance issue, and that may be the place to start.
A
Yeah, the nfs plugin right now also does all the orchestration handling for setting up NFS clusters. But I think there is interest in making it work with a statically defined set of NFS servers, which is something both the dashboard, from a legacy-support perspective, and also Manila are interested in, but that work hasn't been scoped out yet.
G
B
Okay, so I think that's all from our side. If you have any extra feedback, please feel free to reach out to us, and we may also discuss that later. Thank you. Thanks, folks.
A
Next on the agenda is the MDS memory target.
A
So the tracker ticket is in the agenda. Unfortunately the PR associated with this ticket has atrophied quite a bit; it needs to be revived, but it's a fairly simple PR that just dynamically modifies the MDS cache memory limit in response to the total memory usage of the MDS. The target would be defined by this new mds_memory_target variable, analogous to the other ones in the OSD.
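(A minimal sketch of the idea behind that PR, with made-up names and a deliberately simplistic policy: periodically compare the MDS's total memory usage against an overall target and nudge the cache limit down or up so the process converges toward it. The real change lives in the MDS itself; this only illustrates the feedback loop.)

```python
def adjust_cache_limit(rss_bytes: int,
                       memory_target: int,
                       cache_limit: int,
                       min_limit: int = 128 * 2**20,
                       step: float = 0.05) -> int:
    """Return a new mds_cache_memory_limit nudged toward mds_memory_target.

    Hypothetical helper: if total MDS memory (rss_bytes) exceeds the target,
    shrink the cache limit; if there is ample headroom, let it grow back,
    never dropping below min_limit.
    """
    if rss_bytes > memory_target:
        # Over target: shrink the cache limit based on the overshoot.
        overshoot = rss_bytes - memory_target
        new_limit = cache_limit - max(int(cache_limit * step), overshoot // 2)
    elif rss_bytes < memory_target * 0.9:
        # Comfortably under target: allow the cache to grow slowly.
        new_limit = int(cache_limit * (1 + step))
    else:
        new_limit = cache_limit
    return max(new_limit, min_limit)
```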
A
F
One of the to-do items for cephadm, also, is to have it set these limits and be able to scale them automatically to the size of the node or whatever it is. So it'd be nice if it was using a consistent setting across all the different daemon types, but that's very much a nice-to-have. That's all. Yep.
D
E
F
It's something that Mark built for BlueStore. If you have a pool of memory that's consumed by multiple caches, you can set various policies around how you would like that memory to be used. So, like, these caches get at least this much memory, and then above that they scale at this ratio, and then if it's above a certain point this one gets all of it, or whatever it is. You know, sort of tiered.
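(A loose illustration of the tiered policy just described, not the actual PriorityCache code: each cache declares a minimum reservation, the remainder is handed out proportionally to per-cache ratios, and anything still left over goes to one designated cache. Names and numbers are assumptions.)

```python
def assign_memory(total: int, caches: dict) -> dict:
    """Toy version of a tiered cache-memory policy.

    `caches` maps name -> {"min": bytes, "ratio": float}, ratios summing to
    <= 1.0. Every cache first gets its minimum, the remainder is split by
    ratio, and the final leftover is given to the last cache, mimicking the
    "above this point, one cache gets all of it" tier described above.
    """
    out = {name: spec["min"] for name, spec in caches.items()}
    remaining = total - sum(out.values())
    if remaining <= 0:
        return out
    for name, spec in caches.items():
        out[name] += int(remaining * spec["ratio"])
    leftover = total - sum(out.values())
    out[list(caches)[-1]] += max(leftover, 0)
    return out

print(assign_memory(4 * 2**30, {
    "inode_cache": {"min": 256 * 2**20, "ratio": 0.6},
    "cap_cache":   {"min": 64 * 2**20,  "ratio": 0.3},
}))
```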
F
So it's helpful if you have multiple consumers. In this case we just have the MDS cache as sort of a single knob, so I'm not sure it would be appropriate, but it would be if, for example, we were independently scaling memory consumed by caps versus inodes.
D
F
A
I think the one challenge with integrating with the priority cache is that just dropping cache entries is not as simple as it is in BlueStore. It might require a client cap recall, which probably complicates it.
D
E
At least in the near term, if you have to do a cap revoke, you're probably doing memory allocations and making it even worse, right? So, I mean, is this really a bug per se? It almost looks like, if we're not hitting the mds_cache_memory_limit correctly, maybe our math is just off.
F
D
A
All right, next on the agenda is fscrypt integration. Jeff, would you like to go through where that's at right now?
E
Yeah, I've got the file names piece pretty much done, and we've got the alternate name feature that we need; Zhang picked that up, started it and then finished it up. We need to be able to give entries a secondary name in case they're very long, because we can't just encrypt and hash them at that point, or rather, we can't just encrypt and base64-encode them at that point.
E
Because we want to keep all the file names under NAME_MAX, it's a little complicated, but effectively, yes, I've got the file name portion of it done.
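(The NAME_MAX problem is easy to see with a little arithmetic: fscrypt encrypts the name, padded to a multiple of its padding size, and the ciphertext then has to be base64-encoded to stay a legal file name, which inflates it by roughly 4/3, so near-255-byte names no longer fit and need the alternate-name mechanism. A back-of-the-envelope sketch, not the exact encoding the kernel uses; the padding value is an assumption.)

```python
import math

NAME_MAX = 255   # usual Linux limit on a single path component
PAD = 16         # assumed fscrypt name-padding granularity

def encoded_len(name_len: int) -> int:
    """Rough length of an encrypted + base64-encoded file name."""
    padded = math.ceil(name_len / PAD) * PAD   # ciphertext length
    return math.ceil(padded / 3) * 4           # base64 expansion (~4/3)

print(encoded_len(150))  # ~216: still fits under NAME_MAX
print(encoded_len(200))  # ~280: too long; needs the alternate (secondary) name
```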
E
What I'm stuck on now is the fact that the MDS is what handles truncates, and for content encryption, when we set the size to a particular length, the MDS will come back and truncate the thing down to that byte. But that could be in the middle of a crypto block, and at that point we can't decrypt the tail of the file.
E
We have to ensure that we somehow teach the MDS to round up to the end of the next crypto block as we do this. There are a couple of different approaches we could take, and I think Greg, Greg Farnum, is going to help out with some of this. In any case, we've got a meeting scheduled for later today just to discuss where to go with this.
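(The round-up itself is simple arithmetic; the hard part is plumbing it through the MDS. A sketch, assuming a 4096-byte content-encryption block size purely for illustration: a truncate to an arbitrary length is rounded up to the next block boundary so the last block stays decryptable, while the requested logical size still has to be tracked separately.)

```python
FSCRYPT_BLOCK = 4096  # assumed content-encryption block size for illustration

def rounded_truncate_size(requested_size: int) -> int:
    """Round a truncate target up to the end of its crypto block.

    Truncating mid-block would leave a partial ciphertext block that the
    client could no longer decrypt, so the stored data is cut at the block
    boundary and the logical (requested) size is remembered separately.
    """
    if requested_size % FSCRYPT_BLOCK == 0:
        return requested_size
    return (requested_size // FSCRYPT_BLOCK + 1) * FSCRYPT_BLOCK

print(rounded_truncate_size(10000))  # -> 12288, the end of the third 4 KiB block
```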
E
So I've got it about halfway done, and the rest of it doesn't look too bad, sans the part where we actually have to do that. We also have a code path in the Ceph client that handles uncached I/O, depending on caps.
E
For instance, it will just write through synchronously to the server whenever it needs to go to the OSDs, whenever it needs to do a write or a read. If we do a write in that situation, we need to be able to do a read-modify-write cycle, because we may have to slurp in the beginning and end of a crypto block to make sure we can handle those.
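(A rough sketch of the read-modify-write alignment involved, with hypothetical helper names: an unaligned synchronous write on an encrypted file is expanded to crypto-block boundaries, the partial head and tail blocks are read and decrypted, the new bytes are spliced in, and the whole aligned extent is re-encrypted and written back. This is only to illustrate the shape of the problem, not the client's actual code.)

```python
BLOCK = 4096  # assumed crypto block size for illustration

def aligned_extent(offset: int, length: int) -> tuple[int, int]:
    """Expand a byte range to crypto-block-aligned start and end offsets."""
    start = (offset // BLOCK) * BLOCK
    end = ((offset + length + BLOCK - 1) // BLOCK) * BLOCK
    return start, end

def write_encrypted(file, offset: int, data: bytes) -> None:
    """Sketch of a synchronous (uncached) write on an encrypted file.

    `file.read_decrypted` and `file.write_encrypted` are hypothetical
    stand-ins for the client's read/decrypt and encrypt/write paths.
    """
    start, end = aligned_extent(offset, len(data))
    # Slurp in the existing aligned extent so the partial head/tail blocks
    # can be decrypted, then splice the new bytes into the plaintext buffer.
    buf = bytearray(file.read_decrypted(start, end - start))
    buf[offset - start:offset - start + len(data)] = data
    # Re-encrypt and write the full aligned extent back out.
    file.write_encrypted(start, bytes(buf))
```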
E
So we've got some code that does all this; it's pretty complicated, and I'm hoping I can clean it up a little bit before it's ready for merge. I'm hoping that maybe by the summer we'll have something that's ready to go, but we need some MDS support for the truncate. The file name portion actually looks pretty good now; it's working fairly decently.
E
Luis Henriques says that he was occasionally hitting a problem with it, and he was trying to track it down, but I haven't been able to reproduce the problem. So anyway, that's where we are. I don't have a lot more to discuss here.
A
Yeah, I think as far as the direction we're going to go with handling truncate right now, the preferred solution is having two sizes per file: one is the actual size that we're already using, and the other is a size that's used for the purposes of truncate, so the MDS will not truncate off the last few bytes that we need. Both sizes will be protected by the same locks. Or maybe, sorry, I have that reversed.
A
E
We could even take that field that the crypto-enabled clients use and encrypt it too, so we could cloak the full size of the file if we wanted. The catch there is that when you do the crypto, the crypto blocks have to be at least 16 bytes, so we would be consuming an extra word in the inode somewhere to do this.
D
A
Also, for those who aren't aware of this project, we didn't really introduce it: fscrypt is in the kernel tree, and it's a generic library for file systems to use to encrypt files and file names within the file system, currently supported by ext4. I think it was originally a project for Android, and we're now trying to use it in Ceph.
A
E
Yeah, the neat thing about this is that it allows you to set keys as an unprivileged user. So if you've got, say, a VM or something that you're using it for, you're not sharing the keys with the MDS at all.
E
So someone like a hosting provider could give you a chunk of space on a CephFS, and you can then use the fscrypt code to encrypt all that data. So even though it's hosted in a public place, you still have a lot of protection for the data.
E
D
A
All right, let's move on to the next agenda topic: asynchronous rmdir/mkdir and link/rename.
A
D
A
A
But the problem is that once it reaches a mkdir or an rmdir, that becomes a barrier for future operations. So it would be nice to make all those system calls completely asynchronous, so that you can rm -rf a whole subtree and it almost instantly completes, and then you can do an fsync to ensure that it's actually durable.
A
Neither untar nor rm -rf actually do those fsyncs, so from the user's perspective it would be an instant change. That would be the direction we'd like to go. So the next step is getting rmdir and mkdir to definitely be asynchronous, and then link and rename is sort of a nice-to-have stretch goal, because those are operations that are used by rsync, for example, in some configurations.
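(From the application's point of view the model is the one POSIX already allows: metadata operations may complete asynchronously, and a program that actually cares about durability has to fsync the parent directory. A tiny sketch of what "rm -rf, then make it durable" looks like under that model, standard Python, nothing CephFS-specific:)

```python
import os
import shutil

def remove_subtree_durably(path: str) -> None:
    """Remove a subtree and make the removal durable.

    With asynchronous unlink/rmdir, rmtree() returns almost immediately;
    only the explicit fsync on the parent directory forces the deletions
    to be persisted. Plain rm -rf and tar do not issue this fsync, so for
    them the change would look instant from the user's perspective.
    """
    parent = os.path.dirname(path.rstrip("/"))
    shutil.rmtree(path)                       # may complete asynchronously
    fd = os.open(parent, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(fd)                          # durability barrier for the removals
    finally:
        os.close(fd)
```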
A
So with that said, Jeff, did you have any thoughts on where we are with that? Is there any current code, or do you know if you'll have time to work on it? Putting you on the spot.
E
Yeah, I don't have any current code. The hard part is how we handle recursion, you know, recursive unlinks and stuff, right? The catch here is this:
E
when you start doing these things asynchronously, you kind of lose control over the ordering. So it's harder to ensure that, at the point where we might have deleted all the dentries in a directory on the client and then go to do an rmdir, that rmdir doesn't somehow get ordered before all the unlinks on the MDS.
E
So we have to make sure that doesn't happen, because that rmdir won't work if there's still a file in the directory. That's the hard part, and that's why we have made rmdir a synchronization point. Maybe we need to revisit that and think about how to do it; I don't know a way right off hand to make that simple. I haven't really figured out anything.
A
We already have that issue, though. Like, if I have two applications that are unlinking files in a directory, and then finally, at least from the client's perspective, all the files have been deleted, and then one of the applications does an rmdir, isn't that rmdir already ordered against the other unlinks in flight?
E
I'd have to think about it. I don't know, I mean it's worth experimenting with; we could probably draft something, throw it together, and see if we can make it work. It's not too hard to fix the client to do this. There are some guard rails in there, barriers basically, that keep you from proceeding until all the dentries have synchronously been deleted from the MDS.
E
We could remove that and see how it goes. I hit problems with this before, but again, the catch is that when you issue these things, the unlinks come back very quickly and the calls are still very much in flight; they may not have been transmitted yet.
E
So it's not too hard to imagine that the rmdir you're calling ends up getting the mutex for the socket before the unlink gets its chance to be sent. We have a lot of competing mutexes and stuff too, so it's hard to know which way it's going to fall out. Anyway, when I did some experimenting a while back I hit problems where you get "directory not empty".
E
D
A
Yeah, I think right now this project is going to sit until we have resources available to work on it. Maybe Xiubo will have time; he's developed an interest in the kernel client.
A
All right, the next one is recursive unlink, which is kind of similar. This is an idea to just add an actual RPC to the MDS that does a recursive unlink of a subtree. One of the main target use cases we have, at least in an immediate sense, is the volumes plugin being able to recursively delete an entire subvolume without the manager having to list and rm the entire subvolume in an asynchronous fashion.
A
It could just shoot the RPC off to the MDS and let the MDS chew through it over time. Venky is planning to work on this, hopefully for this release. I think actually supporting it will be a little weird for the MDS, or require some changes, because the MDS assumes that every directory that's actually in the stray directory is already empty.
A
So it needs to be taught to deal with directories that still have entries in them, and also subdirectories, etc. And this will be explicitly not POSIX compliant, because link counts will necessarily not be updated for
A
the entire recursive subtree; you wouldn't be able to iterate through all the files to update the link counts until the MDS actually performs the unlink. To make this available across all of our drivers, and not just as a special-purpose RPC in libcephfs, I was thinking we would also plumb in support by maybe having a hidden .trash directory that you can move things into, in both the FUSE and kernel clients, that would be translated into that recursive unlink operation.
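(To make the client-side plumbing concrete: the idea, still only a proposal, is that moving a directory into a hidden, hypothetical `.trash` directory at the root of the mount would be translated by libcephfs and the kernel client into the recursive-unlink RPC. From an application's point of view that could look as simple as a rename; everything below, including the `.trash` name, is an assumption for illustration.)

```python
import os

def recursive_unlink(mountpoint: str, path: str) -> None:
    """Sketch of the proposed user-visible interface for recursive unlink.

    Renaming a subtree into a hidden ".trash" directory (hypothetical name
    from the discussion) would be translated by the client into a single
    recursive-unlink RPC; the MDS then chews through the subtree over time
    instead of the caller listing and unlinking every entry itself.
    """
    trash = os.path.join(mountpoint, ".trash")
    os.rename(path, os.path.join(trash, os.path.basename(path.rstrip("/"))))

# e.g. deleting an entire subvolume without the mgr walking the tree:
# recursive_unlink("/mnt/cephfs", "/mnt/cephfs/volumes/group/subvol0")
```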
F
Just a random thought: since you mentioned that .trash, would it make sense to have that be like a proper trash feature? Like, if you rename something into that directory it'll just stay there until the MDS decides to delete it based on some policy.
A
Yeah, I don't think we really thought about that, but that could be a very good add-on feature: having special stray directories that are only unlinked after the entries have been there long enough. We probably wouldn't call it trash, though.
H
So there's some discussion in the tracker ticket, but hard links are going to be really irritating to deal with on a recursive unlink. I'm not sure if there are other issues, but that's the most immediate one that pops up in my head.
A
H
A
Wait, all right, so are we talking about handling, like, the subtree authorities after doing an...
H
A hard-linked file: you have a file in /foo/bar, and there's a hard link outside to one of those files, and then you issue a recursive delete on foo/bar. You need to, like, move the file up or out somewhere; it can't just sit in the existing purge queue and then go away, because something else might be pointing at it.
H
H
And I'm saying that, before we start on this, that needs to be figured out.
H
I don't remember what invariants we hold around the link counts and their locations, because, like, the current backtrace might be out of date, and so we look at it and say, oh, it's in the recursively deleted section, we can just throw it away, but it actually got moved out.
F
H
F
H
H
H
A
Yeah, we haven't thought about those details quite fully, Greg, to be honest, but that's a good point. In my head, logically it's the same as just renaming a subtree somewhere special, and the MDS just happens to be unlinking those entries. So yeah, we're not really sure.
D
Just curious, this kind of brings up another idea in my head: does CephFS generally run a scrub or fsck periodically in the background currently?
D
Okay, I was just thinking that if there are structures we end up needing to clean up in some way, like tracking indexes of hard links or something like that, that could potentially be a way to do it.
A
That is a problem right now: the stray directory can grow unbounded. If you have a huge subtree of hard-linked files, and one of those subtrees gets deleted, and the MDS never touches the remote entries, it'll never do the reintegration, so the stray directory just keeps growing. So it may make sense to just have the MDS periodically go through the entire file system so that it can do that reintegration if it's not driven naturally by client I/O.
F
Well, it seems like if we add a schedule for scrub, so it runs once a week or whatever it is, then this would sort of happen organically as well.
A
All right, the next topic is MDS rolling upgrades. This has been a kind of troublesome part of CephFS for a while now. In fact, we have a rather complicated upgrade procedure for CephFS that users should follow, involving pretty much turning off all MDSs except the one active on a file system, reducing max_mds to one, and then actually doing the upgrades of all the MDS daemons.
A
And then finally restoring max_mds. The reason for turning off all the standbys has been that if there's been any change to the compat set for the file system, which is uniform across all the file systems, then all of your MDS daemons commit suicide, which is fairly scary, because you have aborts in all of your MDS logs.
A
A patch set makes it so that the MDSs no longer do that, no longer abort, if the compat set on the file system changes. We now store the compat set for each file system, and the MDS daemons report a compat set in all of their beacons describing what they support; what an MDS supports doesn't change for its entire instance. The monitors will do standby promotion only if the MDS is compatible with whatever the file system's compat set is. Furthermore, they'll only upgrade the file system compat by promoting a standby that has a newer incompat in its compat set if it meets certain requirements, namely that the file system has only one rank, max_mds one, and standby-replay is disabled. If that's true, then they'll actually do the promotion, and then finally the file system's compat/incompat will also be updated.
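(A condensed sketch of the promotion rule just described, with simplified, illustrative types rather than the actual MDSMap structures: a standby may only be promoted if its compat set covers the file system's, and the file system's compat is only upgraded through a newer standby when the file system is down to one rank and standby-replay is off.)

```python
def can_promote(standby_compat: set, fs_compat: set) -> bool:
    """A standby is eligible only if it supports the fs's incompat features."""
    return fs_compat.issubset(standby_compat)

def promote(standby_compat: set, fs: dict) -> bool:
    """Toy model of the monitor-side promotion/upgrade rule.

    `fs` is a dict with "compat" (set of incompat feature names), "max_mds"
    and "standby_replay"; all names here are illustrative assumptions.
    """
    if not can_promote(standby_compat, fs["compat"]):
        return False
    if standby_compat > fs["compat"]:
        # Standby knows newer incompat features: only allow the upgrade
        # when the fs has a single rank and standby-replay is disabled.
        if fs["max_mds"] != 1 or fs["standby_replay"]:
            return False
        fs["compat"] = set(standby_compat)  # record the upgraded compat
    return True
```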
A
In CephFS we only use the incompat feature set within the compat sets, so that's the only thing I'm talking about here. That works pretty well, and it's something I want to backport to Pacific; I promised it would be in Pacific for Rook, because tearing down all the standbys, turning them off, is a rather onerous upgrade procedure for the orchestrators to have to deal with, and this gets us pretty far toward making that process a little easier.
A
The next step in supporting rolling upgrades is to make it so you can have mixed versions of the MDSs, mixed-version actives, on a file system. I think supporting that is going to require some significant changes to the MDSMap; some of those things I've laid out in the ticket. There have also been some efforts to version a lot of the messages that go back and forth between the MDSs, so that we can note that something is a newer version, and then I think there needs to be some work to gate features with flags in the file system, so that MDSs don't make breaking changes to the metadata until all the MDSs are on the same version, similar to what we do with the monitors and OSDs.
A
It's a much more significant undertaking. The compat set changes are much simpler and get us pretty close, but the rolling upgrades are something I would like to get done for Quincy. So that's where we are currently. Any thoughts on that?
F
Just before thinking about the mixed-version thing, with the changes that you have so far I want to make sure that the cephadm upgrade procedure is updated accordingly. But it occurs to me, something you mentioned that I didn't realize before, that you also need to have no standby-replay.
F
A
F
A
F
I don't know, because when you're doing standby-replay, do you have to decide which rank each standby is replaying for, or does the cluster sort of automatically figure that out?
A
F
Okay, and what do they do if you have, say, four MDSs and you only have two standbys and you have that option turned on?
A
F
D
A
All right, next on the agenda is CephFS mirroring metrics in line with RBD mirroring metrics.
F
A
So unless someone has something they'd like to discuss about it, I think we'll just table that, and the same with exporting client metrics too.
F
D
E
F
The idea there is to have as much alignment as possible between the metrics that rbd-mirror is presenting and what cephfs-mirror presents, so that they're hopefully as close to identical as possible.
D
A
I know Venky's been working with some folks on ensuring that. All right, the last topic on the agenda is fencing.
A
So the target scenario for this, or at least one of the target scenarios, is that we have two Kubernetes clusters and one Ceph cluster, and if an entire Kubernetes cluster becomes unavailable for whatever reason, you want to be able to blocklist all of the active instances from that cluster. So the open question is: how do we...
A
How do we express that with how we currently do blocklisting? We have Sean here, he was kind of leading this session... no, he's not here. So there are a few different ways we can do this, or that we've thought about doing this, and one was to have a tag of some kind on client credentials.
A
Maybe add something to the caps list that indicates what kind of availability zone it's associated with and so on, and then you'd be able to blocklist an entire set of auth credentials, or the instances that are using those auth credentials, based off of that tag. I don't recall the exact details, but there were discussions on the mailing list that indicated that wouldn't be quite workable.
A
In particular, I think one of the main challenges is that you need to be able to not blocklist new instances, so that if a new instance comes from that cluster it is not blocklisted. So you perhaps need some kind of generation ID or epoch for how you blocklist an entire group based off of a tag, and then I think the development discussions kind of stalled there.
A
I'm not sure if there have been any new pushes for supporting this. I know it's a feature that's been desired by the ceph-csi folks for Rook deployments, but I haven't heard a lot about it recently.
A
D
I've heard about the desire for it from the folks working on ceph-csi more recently, so I think it's still something that they're interested in.
D
But nothing concrete is going on.
A
A
So I guess, depending on interest, we should have some discussions, Josh, about how this would look.
D
Yeah, it sounds like we could get the interested folks together at some point and try to figure that out. The generation concept sounds reasonable, but I'm not sure exactly how that would look.
A
You just blocklist all the instances that existed before a certain OSD epoch, and each instance has an epoch associated with it, based on when it first appeared, the birth of the instance itself.
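(A conceptual sketch of that generation/epoch idea, with entirely hypothetical names: fencing a site records the OSD-map epoch at fence time, every client instance carries the epoch at which it was born, and only instances born at or before the fence epoch get blocklisted, so new instances from the recovered site come back cleanly.)

```python
from dataclasses import dataclass

@dataclass
class ClientInstance:
    addr: str
    site: str
    birth_epoch: int   # OSD-map epoch when this instance first connected

def instances_to_blocklist(instances: list[ClientInstance],
                           fenced_site: str,
                           fence_epoch: int) -> list[ClientInstance]:
    """Return the instances a site-wide fence would blocklist.

    Only instances from the fenced site that already existed at the fence
    epoch are blocklisted; an instance from the same site that is (re)born
    after the fence epoch is left alone, which is the property the simple
    tag-based scheme was missing.
    """
    return [c for c in instances
            if c.site == fenced_site and c.birth_epoch <= fence_epoch]
```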
A
H
Is it actually important to them to be able to blocklist only a specific site, or can we just force all the clients to redo their sessions?