Cephalocon APAC 2018
March 22-23, 2018 - Beijing, China
Patrick Donnelly, Red Hat CephFS Lead
Today I'm going to talk about the status and future of the Ceph file system. My name is Patrick Donnelly; I work at Red Hat and I'm the CephFS team lead. To begin, I'm going to give an introduction to CephFS in case you're not familiar with it, then give an overview of the features we released in Luminous last year, and finally wrap up with the changes we've made in time for Mimic, which will be released in a few months.
So CephFS is one of the components on top of Ceph. As Sage talked about earlier in the keynote, Ceph is a unified storage system: it offers many different ways to access the storage, all built on RADOS underneath, and CephFS is just one of those use cases. In fact, it was the original use case of Ceph. CephFS is a POSIX-compatible distributed file system.
The FUSE client is sometimes preferred if you are not able to use the kernel client, for example because you can't use the latest kernel version for whatever reason, or because you don't have control of the kernel in use by your clients. The kernel client is generally going to give you much better performance than the FUSE client.
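As a quick illustration of those two access paths, here is roughly what each mount looks like; the monitor address, mount point, and credentials are placeholders rather than anything from the talk:

    # Kernel client: mount CephFS with the in-kernel driver
    sudo mount -t ceph 192.168.1.10:6789:/ /mnt/cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret

    # FUSE client: the same file system through ceph-fuse in user space
    sudo ceph-fuse --id admin /mnt/cephfs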
One of the hallmarks, and a necessary requirement, of CephFS as a POSIX distributed file system is coherent caching across all of the clients. The MDS issues capabilities to the clients to give them permission to read and write to files, so you don't have to worry about clients reading stale data, or about any kind of eventually-consistent file system model you may be familiar with from other vendors.
Here we have three metadata servers, shown above in red. There are two active metadata servers and one standby. The two active metadata servers cooperatively distribute the metadata load from the clients, that is, the clients looking up files or mutating metadata, and all of these metadata reads and changes go to the metadata pool, which is stored in RADOS. The clients interact with the active metadata servers through the CephFS protocol to do opens, mkdirs, and directory listings.
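To make the pool layout concrete, this is roughly how a CephFS file system is created on top of RADOS pools; the pool names and placement-group counts here are illustrative, not from the talk:

    # One RADOS pool for file data and one for metadata
    ceph osd pool create cephfs_data 64
    ceph osd pool create cephfs_metadata 64

    # Tie them together into a file system served by the MDS daemons
    ceph fs new cephfs cephfs_metadata cephfs_data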
That wraps up the introduction to CephFS; now I want to move on to one of the features we debuted in Luminous last year. In Jewel we released CephFS as a stable system, with the caveat that only one active metadata server was considered a stable configuration. In Luminous we've corrected that, and now you can have multiple active metadata servers, which allows you to scale the metadata load linearly with the number of active metadata servers you have. Setting the number of actives that you want is as simple as running a ceph fs set command: you modify the max_mds setting on the file system to control the number of actives you want. A short time after you change it, the monitors will promote one of the standbys to active, and you can see here that we have two active metadata servers available for the file system.
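Assuming a file system simply named cephfs (the name is a placeholder), the command looks roughly like this:

    # Ask for two active MDS daemons; a standby is promoted to fill rank 1
    ceph fs set cephfs max_mds 2

    # Confirm that two ranks are now active
    ceph fs status cephfs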
What we found is that there were instances of imbalance resulting from the balancer; in particular, one metadata server would become overloaded while another metadata server would be doing almost nothing. We also observed volatility, that is, a subtree would be passed back and forth between metadata servers without ever settling down into any kind of stable distribution.
Some of the drawbacks: your MDSs can become unbalanced due to this pinning. If you have a subtree that you've pinned manually and it's overloading that particular MDS, the balancer is not going to help you with that; you'll have to resolve it yourself, either by splitting the subtree further or by undoing the pin and letting the balancer handle it dynamically. And then, of course, you're introducing the possibility of human operator error into your pinning policies.
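For reference, a pin is expressed as an extended attribute on the directory at the top of the subtree; the path and rank below are only illustrative:

    # Pin everything under /mnt/cephfs/projects to MDS rank 1
    setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/projects

    # Remove the pin and let the dynamic balancer manage the subtree again
    setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/projects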
In Luminous you can now have directories with more than 100,000 files. In Jewel we had backported a change which prevented a directory from exceeding 100,000 files, and this was to prevent performance anomalies: we saw performance degradation because an entire large directory would be stored in a single RADOS object, which would exceed the maximum object size and cause performance problems in your cluster, and so we created that limit. Now that Luminous fragments large directories across multiple objects, you don't have to worry about it anymore.
For the MDS, before this change you would provide a configuration variable, the MDS cache size, which counted the number of inodes the MDS was permitted to hold in cache. Unfortunately that's a poor proxy for memory usage, because you have to empirically determine how much memory a given number of inodes uses, and it's not applicable to all workloads, because some inodes, for example directories, might require more memory than, say, a regular file. So it might work for some workloads but end up using a lot more memory for others.
So what we did in this change was allow you to specify the amount of memory you want to limit the MDS cache to. Internally this uses C++ memory pools to track the memory the MDS cache is actually using. It's still a soft limit, so your MDS can go above the number of bytes that you set.
However, we also have the MDS health cache threshold, which specifies when the MDS should start complaining to the monitors and issuing cluster health warnings that it is using more memory for cache than its limit. The default is 50% more, and you'll start seeing notices in the cluster log saying that the MDS is having trouble trimming its cache.
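A minimal sketch of the two settings; the 16 GiB value and the mds.a daemon name are assumptions for illustration, not figures from the talk:

    # Soft limit of 16 GiB on MDS cache memory
    ceph tell mds.a injectargs '--mds_cache_memory_limit 17179869184'

    # Raise a health warning once usage exceeds 150% of that limit (the default)
    ceph tell mds.a injectargs '--mds_health_cache_threshold 1.5'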
In practice, we recommend allocating approximately twice the cache limit in RAM for the MDS, both to allow the MDS to go over its cache limit by some amount and to acknowledge that the MDS of course uses RAM for other things as well; so about two times the limit is what we recommend. If you want to read more about this, there's a blog post linked in the slide deck, which will be online, that you can read.
Snapshots are now considered stable; we've been getting asked about that a lot. This was largely due to work by Zheng Yan, who's in the audience. Snapshots in CephFS are done per directory, so you create a snapshot by doing a mkdir, as in the third line of that code, inside a hidden .snap directory that's present in all directories. You provide the name for the snapshot, and that's all you need to do; CephFS handles the details in the background.
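A minimal sketch of that workflow from a client mount; the directory and snapshot names are placeholders:

    cd /mnt/cephfs/mydata

    # Create a snapshot of this directory and everything beneath it
    mkdir .snap/before-upgrade

    # List existing snapshots, and remove one when it is no longer needed
    ls .snap
    rmdir .snap/before-upgrade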
Another popular feature of CephFS that we've completed is kernel quota support. This was a cooperative effort by Luis Henriques of SUSE and Zheng Yan. Similar to subtree pinning, you specify the quota by using the extended attribute interface. CephFS provides two different limits: you can set the maximum number of bytes for a given subtree, or the maximum number of files.
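Both limits are set with setfattr on the directory at the top of the subtree; the path and values are only examples:

    # Cap the subtree at roughly 100 GB of data
    setfattr -n ceph.quota.max_bytes -v 100000000000 /mnt/cephfs/home/alice

    # Cap the subtree at 10,000 files
    setfattr -n ceph.quota.max_files -v 10000 /mnt/cephfs/home/alice

    # Setting a limit back to 0 removes the quota
    setfattr -n ceph.quota.max_bytes -v 0 /mnt/cephfs/home/alice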
Next, we also improved the cache limiting by memory. There were some structural issues that we knew about when we were creating the change, namely that the items in the MDS cache weren't fully tracking the containers they used: the cache items, the directories and inodes, were using containers such as a C++ standard map whose allocated space we were not accounting for, so the memory pool tracking was off by a constant factor. We have fixed that, and you can see the two issues referenced in the slide deck. A backport of this fix is coming in Luminous 12.2.5, so you can expect it there as well. Just to give you a quick example that I'll run through quickly due to time.
Before, the MDS cache size was approximately 65% of the total RAM in use by the MDS; that is, the cache usage as understood by the MDS, based on its own tracking, was 65% of its actual use of RAM. Afterwards it's approximately 80%, so we're closer to the true RAM usage. It will never actually converge on the complete RAM usage, because the MDS uses RAM for other things, of course, not just the cache. And here's another look at the MDS with much larger cache sizes; this is actually where we noticed the issue.
We've also moved the client session timeouts into the FSMap, and this is so that we get a consistent view of how clients are evicted if they don't communicate with the MDS after a certain period of time. This was mostly necessary so that we have consistent behavior across multiple MDSs, because it was possible to configure only one MDS while the others behaved differently. In particular this was important for NFS Ganesha, which is able to export CephFS and issue delegations to its NFS clients.
If the MDS revokes capabilities that are held by NFS Ganesha, then Ganesha needs to revoke the delegations held by its clients, and it could easily run into these timeouts. So you're now able to set these timeouts based on your needs; for example, if you're doing NFS exports you might want to set them higher. NFS Ganesha is also able to observe these timeouts by accessing a copy of the FSMap.
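Because the timeouts now live in the FSMap they apply per file system; a sketch of what setting them looks like, with placeholder values and the assumption that the file system is named cephfs:

    # Allow clients 120 seconds of silence before their session is considered stale
    ceph fs set cephfs session_timeout 120

    # Automatically close unresponsive sessions after 600 seconds
    ceph fs set cephfs session_autoclose 600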
Finally, another planned feature for Mimic: we're trying to integrate the NFS gateway and have an integrated NFS gateway in Ceph for exporting CephFS. This serves as a third alternative client for accessing CephFS, through NFS. In the figure in the top right we have, for example, some virtual machine that's mounting an NFS server, Ganesha, in the middle, and then Ganesha forwards all of those NFS requests, turning them into the equivalent CephFS requests, which get passed on to the MDSs and OSDs.
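From the client's point of view this is just an ordinary NFS mount; the gateway hostname and export path below are placeholders:

    # Mount CephFS indirectly through an NFS-Ganesha gateway
    sudo mount -t nfs -o vers=4.1 ganesha-gw.example.com:/cephfs /mnt/nfs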
So Ganesha acts as a gateway for this. We also want to have a solution that can be applied in other situations: for example, if you don't want to use the FUSE client because it's too slow, or you can't use it because of the privileges it might require, or you can't update the kernel client, then this is another option for you.
The important aspect of this is that it also lets us set up high availability and scale-out of the NFS gateways, and do that consistently across different types of deployments. The way we're planning to do this for high availability is to have the Ganesha containers managed by Kubernetes, which handles the lifecycle of those containers.
The Ceph manager will be what actually creates and manages these containers in Kubernetes, and it has the option of creating multiple containers for a given share to handle scale. Then we'll take advantage of the Kubernetes load balancer, through the proxy service mechanism, so that multiple Ganesha containers can serve multiple clients behind a single IP address; that will allow you to do dynamic scaling. This is a big figure that I'm going to gloss over due to time; it works, for example, with Manila.