From YouTube: Ceph Month 2021: Evaluating CephFS Performance vs. Cost on High-Density Commodity Disk Servers
Description
Full schedule: https://pad.ceph.com/p/ceph-month-june-2021
Presented by: Dan van der Ster
CERN operates several PB-scale CephFS clusters to offer shared storage for IT apps and services. Our lab also developed EOS to deliver high-density, low-cost physics storage and support the APIs needed by the LHC community. This work evaluates CephFS+EOS: CephFS is a reliable low-cost EC backend and EOS is layered on top to add the missing functionalities. We present performance tradeoffs, networking overheads at scale, and optimized Ceph tunings, then conclude with ideas for the future.
A: So this is work that Andreas and I did over the last few months. Just some background first. On the physics side, at CERN we do our computing on this thing called the Worldwide LHC Computing Grid, and CERN forms the Tier 0: we have around 135 petabytes of disk, replicated twice, and almost 400 petabytes of tape, and for our scientific data taking we still add 50 petabytes per year.
In total this makes around a million CPU cores processing something like 2 million jobs per day. We have around one exabyte of storage globally, with around one terabit of total internal connectivity, and, for example, last year we transferred 300 petabytes around on our WLCG, our Worldwide LHC Computing Grid.
Now, this whole thing on the storage side is powered by open source storage software, all developed within our high energy physics (HEP) community. These are things like dCache, DPM, EOS, StoRM and XRootD; these are all site storage solutions. The protocols used are HTTP, XRootD and GSIFTP to transfer between sites. We have a thing called the File Transfer Service, or FTS, which does third-party transfers between sites and schedules those transfers according to network constraints, and then we have something called Rucio, which is what we call a data orchestrator: it places the data and interacts with the File Transfer Service according to different policies.
There are lots of sites in Europe, lots in North America, lots in Asia, lots in Australia, and also in Africa. Next slide.
So this brings us to what this work is about. In the next few years, High-Luminosity LHC data taking is going to increase our demands on storage, so we'll need to be taking something like 500 petabytes per year by 2028, and open source storage software like Ceph has compelling features and maturity. So it begs the question: what role will it play in future physics storage systems?
A
However,
it's
known
that
like
off-the-shelf
software,
misses
some
high-level
features
that
we
have
so
one
solution
would
be
to
layer
our
high
energy
physics,
specific
gateways
on
top
of
open
source
storage.
So
here
in
this
presentation,
I'm
presenting
a
novel
combination,
cfs
plus
eos
software,
written
at
cern.
A
So
I
don't
have
to
go
into
detail
on
what
ceph
is
it's
popular
part
of
the
open
infrastructure?
Stack
lots
of
sites
have
it
anyway,
so
maybe
it's
useful
to
put
some
small
thin
layer
on
top
to
then
be
able
to
expose
the
university's
infrastructure
to
the
to
the
lhc
computing
grid.
A
Cfs,
I
don't
have
to
go
into
detail
what
it
is.
It's
nfs
like
clustered
file
system
used
for
home
directories,
hpc
scratch
areas
or
shared
storage,
I'd
scale
out
it
uses
radios
underneath
it
can
do
replication
or
eraser
coding,
and
it's
also
read
after
write
consistent,
which
is
important
and
it
had
on
the
clients,
the
mds
delegate,
capabilities
to
the
clients
so
that
they
can
either
do
buffered
I
o
or
asynchronous
asynchronous
buffer.
I
o
or
synchronous
as
needed,
going
very
quickly
through
this
overview
stuff.
At
cern,
we've
had
cefs
in
production
since
2017.
We have OpenStack Manila used massively at CERN; this is with replication, one petabyte of usable capacity at the moment. And then we also use it for some on-prem groupware solutions, where we have Ceph OSDs co-located on OpenStack hypervisors and run some erasure-coded CephFS there.
A
These
and
more
than
30
petabytes
of
other
ceph
clusters
have
been
robust
and
performing
in
any
kind
of
disaster
scenario.
Everything
seems
to
work
after
infrastructure
outages.
The
data
is
still
fine,
our
users
are
are
not.
The
failures
are
basically
transparent
to
our
users
and
we've
also
been
through
three
procurement
cycles
now,
and
we
just
replace
and
rebalance
that
data
and
everything
works.
A
Great,
however,
cfs
like
I
mentioned,
misses
some
features
that
are
essential
to
high
energy
physics,
like
some
authentication
mechanisms
which,
like
sci
tokens,
x-509
cabaros,
also
some
very
feature-rich
quota
and
access
control,
things
that
are
required
in
high
energy
physics,
the
storage
protocols
like
x,
http,
x,
root
d
and
third
party
copy
I
mentioned
before,
and
also
in
this
use
case.
A
At
the
moment,
it's
implemented
in
a
storage
framework
called
x,
root
d.
Basically,
the
there's
a
name
space
which
is
present.
There's
a
namespace,
implemented
by
a
thing
called
quarkdb,
which
is
a
kind
of
consensus
distributed
raft
cluster
with
rocks
dbe
behind
fsts
are
like
the
osds.
They
store
data
either
locally
or
they
can
also
gateway,
remote
storage
and
then
mgm
is
like
the
mds.
A
It
caches
metadata
and
maps
file
names
to
inodes,
so
actually
it's
very
straightforward
for
us
to
use
cefs
behind
the
scenes
of
a
eos
cluster,
simply
by
tricking
eos
into
using
it
as
a
local
file
system.
So
all
the
redundancy
and
high
availability
is
delegated
to
this
ffs
layer
and
we
configure
our
eos
storage
to
store
with
just
a
single
copy.
So we did a proof of concept of this. We took eight very large new machines: dual Xeon, not very much RAM at 192 gigs, 100 gigabit Ethernet, 60 terabyte drives each and two one-terabyte SSDs. The RAM, as I said, is not very much: it's roughly three gigabytes per spinning disk. This is different from what we run in production right now.
A
We
run
actually
96
12
terabyte
drives
192
gigs
of
ram,
but
at
that
ratio
it's
really
getting
to
be
too
little
ram
for
for
this
fosts
everything
that
we
buy,
because
we
have
hundreds
of
petabytes,
it
has
to
be
optimized
by
price
per
per
price
per
terabyte.
A
The
cef
back
end
was
octopus
version
1528
and
we
configured
so
we
have
osd's
installed
on
this
hardware
for
the
mon
mgr
and
mds.
We
didn't
we
weren't,
particularly
benchmarking
metadata
performance,
so
we
just
put
them
onto
some
vm
somewhere
else
in
the
data
center,
and
we
have
the
metadata
pool
on
the
ssds.
Though,
and
then
we
have
a
few
different
cfs
data
pools,
testing,
three
different
erasure
coding,
layouts,
four
plus
two,
eight
plus
two
and
sixteen
plus
two,
and
to
try
to
make
things
fair.
A
We
had
a
number
of
placement
groups
so
that
the
number
of
placements
per
osd
was
roughly
equal,
so
like
50,
40
40
for
these
different
layouts,
and
then
we
used
the
up
map
balancer
to
make
sure
everything
was
was
balanced
in
the
test.
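As a rough illustration of that balancing step, the sketch below picks a power-of-two PG count per erasure-coded pool so that the number of PG shards per OSD comes out roughly equal across the 4+2, 8+2 and 16+2 layouts. The OSD count (8 hosts x 60 drives) and the ~50 shards-per-OSD target are assumptions for illustration, not the exact values used in the test.

```python
# Sketch: choose a power-of-two PG count per erasure-coded pool so that the
# number of PG shards per OSD is roughly equal across layouts. The OSD count
# (8 hosts x 60 drives) and the ~50 shards/OSD target are assumptions.

def pg_count(num_osds: int, target_shards_per_osd: int, k: int, m: int) -> int:
    """Return a power-of-two PG count for a k+m erasure-coded pool."""
    raw = num_osds * target_shards_per_osd / (k + m)
    power = 1
    while power * 2 <= raw:
        power *= 2
    # Round to whichever power of two is closer to the raw value.
    return power * 2 if (raw - power) > (power * 2 - raw) else power

if __name__ == "__main__":
    for k, m in [(4, 2), (8, 2), (16, 2)]:
        pgs = pg_count(num_osds=480, target_shards_per_osd=50, k=k, m=m)
        print(f"{k}+{m}: pg_num={pgs}, shards/OSD={pgs * (k + m) / 480:.0f}")
```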
Okay, this is just to explain how it works. Suppose an object is 16 megabytes: it is sent to the primary OSD, HDD 1 here, which then does the work to split it into the different data pieces and the erasure coding pieces.
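To make that fan-out concrete, here is a tiny arithmetic sketch of what the split looks like for the layouts tested: a k+m pool turns one object into k data chunks of object_size/k plus m parity chunks of the same size, so a 16 MB object in a 4+2 pool becomes six 4 MB chunks and 24 MB on disk. This is illustration only, not Ceph code.

```python
# Arithmetic sketch of the erasure-coding split described above: a k+m pool
# turns one object into k data chunks of object_size/k plus m parity chunks
# of the same size. Illustrative only; ignores padding of partial stripes.

def ec_chunks(object_size_mb: float, k: int, m: int) -> dict:
    chunk_mb = object_size_mb / k
    return {
        "layout": f"{k}+{m}",
        "chunks": k + m,
        "chunk_size_mb": chunk_mb,
        "bytes_on_disk_mb": chunk_mb * (k + m),   # 1.5x overhead for 4+2
    }

for k, m in [(4, 2), (8, 2), (16, 2)]:
    print(ec_chunks(16, k, m))
```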
In our tests we varied the object size to see what impact it has on performance, and then we did different kinds of tests. What we call the back-end test was just native CephFS benchmarking: on a separate set of nodes connected to the same switch, also with 100 gigabit networking, we just ran dd to see how quickly we could really pump files into this CephFS. Then, after layering our EOS software on top, we also did the same kind of benchmarks of this layered, indirect writing. In all cases each client node is running 10 transfers in parallel, we're always writing two-gigabyte files, and they just loop like this.
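A minimal sketch of that client loop, assuming a kernel-mounted CephFS at a placeholder path: ten parallel writers per node, each streaming 2 GB files in large blocks, roughly what dd in a loop does in the back-end test described here.

```python
# Minimal sketch of the client write loop used in the back-end test: ten
# parallel writers per node, each streaming 2 GB files in 16 MB blocks into a
# CephFS mount. The mount point and loop count are placeholder assumptions.

import os
from multiprocessing import Process

MOUNT = "/cephfs/bench"            # assumed kernel-client mount point
FILE_SIZE = 2 * 1024**3            # 2 GB files, as in the talk
BLOCK = 16 * 1024**2               # 16 MB writes, comparable to dd bs=16M
FILES_PER_WORKER = 5               # how many files each writer loops over

def writer(worker_id: int) -> None:
    buf = os.urandom(BLOCK)
    for n in range(FILES_PER_WORKER):
        path = os.path.join(MOUNT, f"writer{worker_id}-{n}.dat")
        with open(path, "wb") as f:
            written = 0
            while written < FILE_SIZE:
                f.write(buf)
                written += BLOCK

if __name__ == "__main__":
    procs = [Process(target=writer, args=(i,)) for i in range(10)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```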
So, some back-end streaming numbers on this cluster. Here on the left is the streaming read performance. We varied the number of client nodes, and up to three nodes we were getting a linear increase in read throughput: around four and a half gigabytes per second, then nine, then fourteen, but then it started to saturate, at around 20 gigabytes per second of read performance. Writing, we actually did better: around six gigabytes per second per client node added, until it saturated around 33 to 34 gigabytes per second.
A
We
noticed
something
interesting,
which
was
that
as
the
osd's
got
full,
the
performance
dropped
the
right
performance.
Here
we
showed
that
up
to
50
full
everything
was
was
working
very
well,
but
then,
as
as
we
got
to
75
and
then
90
full,
we
lo,
we
saw
up
to
a
30
percent
performance
cut
in
the
streaming
right
performance
it
correlated
with
increased
io
8
on
the
discs.
So
we
just
assumed
this
is
just
the
blue
store
allocators
spending
more
time
having
to
fit
these
blocks
onto
the
disk
and
lots
more
random
seeks
to
write.
A
Here.
We
varied
the
erasure
coding
layout.
We
went
from
four
plus
two
to
a
plus
two
to
sixteen
plus
two
to
see
how
that
impacts.
The
streaming
rate
performance
instead,
ffs
the
default,
is
four
megabytes,
but
on
this
particular
use
case
that
we
were
running,
this
didn't
give
us
the
optimal
performance
by
by
increasing
to
say,
64,
megabytes
and
then
doing
16
plus
2,
erasure
coding
or
128
megabytes
object,
sizes
and
doing
16,
plus
2
erasure
coding.
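The object size and data pool of a CephFS file layout can be changed per directory through the layout virtual xattrs, which is presumably how a test like this would be set up; the sketch below assumes a kernel mount at a placeholder path and a hypothetical 16+2 pool name, and the new layout only applies to files created afterwards.

```python
# Sketch: set a larger object size and a different EC data pool on a CephFS
# directory via the layout virtual xattrs, so files created there afterwards
# use that layout. The mount path and pool name are placeholders; the pool
# must already be added as a data pool of the file system.

import os

BENCH_DIR = "/cephfs/bench/ec16_2_64M"
os.makedirs(BENCH_DIR, exist_ok=True)

os.setxattr(BENCH_DIR, "ceph.dir.layout.object_size", b"67108864")   # 64 MiB objects
os.setxattr(BENCH_DIR, "ceph.dir.layout.stripe_unit", b"67108864")
os.setxattr(BENCH_DIR, "ceph.dir.layout.stripe_count", b"1")
os.setxattr(BENCH_DIR, "ceph.dir.layout.pool", b"cephfs_ec_16_2")    # hypothetical pool name

print(os.getxattr(BENCH_DIR, "ceph.dir.layout").decode())
```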
On the read side it was similar: the object size influenced the read performance we could get. For the small object sizes we could only get maybe 200 megabytes per second, and then with 128-megabyte objects, the large yellow plot here, we could get up to 380 or 400 megabytes per second. We were also varying the block size.
That's in the top left corner here. The mean, or average, time will be at the peak of a plot like this, but then you can measure the 99th percentile or the maximum time for the slowest transfer. For the small objects the mean, shown as the gray in this plot, was always quite reasonable, but the 99th or 100th percentile was maybe double the mean.
A
However,
when
we
got
to
the
64
megabyte
objects
and
128
8
megabyte
objects,
we
had
very
huge
long
tail
distributions,
so
we
were
waiting.
The
slowest
transfer
was
really
like,
like
maybe
10
times
the
10
times
the
average,
which
is
quite
poor
for
our
data,
taking
type
scenario
on
the
reading
side.
This
didn't
this
tail.
These
long
tails
were
not
so
apparent
and
even
with
the
long
with
the
largest
objects
and
the
largest
block
sizes,
the
the
tails
were
were
minimized.
A
So
for
reading
we
can
do
we
can
we
can
still
have
these
huge
objects,
huge
ios.
A
Now,
that's
all!
That's
all
like
back
end
performance,
cfs
alone.
Now
we
go
to
set
ffs
and
we
layer
our
gateways
on
top
our
high
ninja
physics
gateways
in
this
plot.
Okay,
we
start
at
the
left,
with
four
plus
two
erasure
coding,
four
megabyte
objects
and
we
have
a
certain
performance
which
is
the
gray.
Okay.
We
add
our
eos
front
end.
The
average
speed
takes
is
roughly
the
same.
Okay,
the
average
throughput
is
basically
the
same,
so
we
don't
have
a
performance
penalty
to
add
our
eos
front
end.
However,
the
tails
increase
substantially.
A
We
got
huge,
99th
and
and
max
transfer
times.
So
what
we
did
to
work
around
this
was
on
the
client
side.
We
started
throttling
the
bandwidth
so
by
throttling
down
to
26
gigabytes
per
second
total
or
which
was
325
megabytes
per
second
per
transfer.
We
could
bring
those
tails
back
down
to
almost
like
native
sffs
and
then,
if
we
increase
that
slightly,
we
got
started
increasing
the
tails
again.
A
So
we
see
that
was
like
a
sweet
spot
for
this,
for
this
particular
use
case
and
cluster,
but
we
really
need
this
client
side
throttling
to
to
protect
from
long
tails
in
read
performance.
Okay,
we
have
we
started
with
here.
We
start
with
native
ceph
cephalon,
with
four
plus
two
erasure
coding,
four
megabyte
objects,
one
megabyte
reads:
okay:
we
can
optimize
ffs
alone
by
increasing,
so
we
still
do
four
plus
two
erase
recording,
but
we
increase
to
16
megabyte
objects.
A
We
do
eight
megabyte
reads:
okay,
this
decreases
the
transfer
time
per
per
transfer,
so
this
showing
you
that,
by
playing
with
cfs
file
layouts,
you
can
gain
a
lot
of
performance
and
then,
by
we
add
our
eos
frontend.
On
top,
we
got
even
slightly
better
performance.
Okay.
This
is
due
to
the
ios
being
better
scheduled
somehow
by
being
by
being
shielded
by
by
the
eos
front.
End,
it's
not
a
huge
effect,
but
it
was
noticeable
and
there
were
no
long
tails
for
reading.
A
So
that
come
that's
the
end
of
the
of
the
the
raw
benchmarks
and
I'll
talk
a
little
bit
about
what
we
did
with
what
we
had
what
we
observed
on
the
cef
side,
so
on
this
large
cluster
or
this
relatively
large
cluster,
with
huge
boxes
and
very
fast
network,
even
just
while
rados
benching
this
cluster,
we
found
that
the
rados
clients
themselves
were
throttling
themselves
because
there's
something
in
there's
a
there's.
A
client
parameter
called
objector
in
flight
op
bytes
and
it's
limiting
to
100
megabytes
by
default.
A
But
on
this
cluster
with,
like
so
many
spinning
disks
and
so
much
network
throughput,
we
needed
to
increase
the
in-flight
bytes.
So
we
could
get
the
best
windows
bench
performance
by
increasing
this
to
one
gigabyte.
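For a user-space librados client, that throttle can be raised per client at connect time; a minimal sketch with the Python rados binding is below, assuming the standard config file location.

```python
# Sketch: raise the librados client throttle discussed above
# (objecter_inflight_op_bytes, 100 MB by default) to 1 GiB when connecting a
# user-space client. Standard config/keyring locations are assumed.

import rados

cluster = rados.Rados(
    conffile="/etc/ceph/ceph.conf",
    conf={"objecter_inflight_op_bytes": str(1024**3)},
)
cluster.connect()
print("connected to cluster", cluster.get_fsid())
cluster.shutdown()
```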
A
This
was,
of
course
only
for
like
user
mode
clients,
we
were
doing
some
fuse
tests
as
well,
and
some
rails
benches
on
the
side.
It
doesn't
apply
to
the
kernel
staff
of
s
that
that
that
does
this-
I
don't
know
what
it
actually
limits
to,
but
it's
something
larger
than
the
default
rates.
A
Client.
Now, something interesting that came up during this is that the EOS software has an internal fsck function where it's always scanning the files. It's always hammering the MDS, so the MDS is constantly having to load and then trim thousands of inodes from its cache to stay underneath the MDS cache memory limit.
A
We
found
just
by
observation
that
each
inode
is
consuming
around
3
kilobytes.
So
if
we
had
a
64
gigabyte
mds,
this
would
hold
around
21
million.
I
notes,
but
that's
we
need
we
need.
We
have
file
systems
with
something
like
a
billion
files,
so
it
doesn't
all
fit
in
memory,
and
actually
this
fsck
was
very.
A
So
this
was
this
contributes
to
something
like
one
gigabyte
per
second
of
inode
cash
growth
and
you
very
quickly
within
a
few
seconds,
your
your
mds
goes
out
of
memory,
so
this
was
all
fixable
by
changing
the
tuning
parameters
of
ceph,
some
caps
recall
tunings
and
the
we
worked
with
patrick
upstream
to
to
get
some
increased.
Some
increased,
lift
rate
of
caps
recall
and
there's
a
pr
there
linked,
and
this
actually
works
really
well.
A
The
numbers
that
are
now
the
default
in
ceph
actually
work
really
well
for
all
of
our
use
cases
and
there's
also
a
new
capabilities
acquisition.
A
caps
throttle
to
prevent
this,
maybe
even
without
tuning
these,
without
paying
so
much
attention
to
tuning
them
now.
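The kind of MDS tuning being described can be applied cluster-wide with ceph config set; the sketch below lists the relevant cache and caps-recall options with illustrative values only, since, as noted, recent Ceph defaults already include the upstream changes.

```python
# Sketch of the MDS cache and caps-recall tuning discussed above, applied with
# "ceph config set". The option names are real MDS settings; the values are
# only illustrative (recent Ceph defaults already include the upstream
# caps-recall improvements mentioned in the talk).

import subprocess

MDS_TUNING = {
    "mds_cache_memory_limit": str(64 * 1024**3),        # ~21M inodes at ~3 KB each
    "mds_recall_max_caps": "30000",                      # recall more caps per client per tick
    "mds_recall_max_decay_rate": "1.5",
    "mds_session_cap_acquisition_throttle": "500000",    # throttle scan-heavy clients (e.g. fsck)
}

for option, value in MDS_TUNING.items():
    subprocess.run(["ceph", "config", "set", "mds", option, value], check=True)
```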
Something that's unsolved: during our testing, one day, out of the blue, the write performance of the cluster dropped from something like 25 gigabytes per second, which was the norm, to under five gigabytes per second, and there were no changes to the cluster, nothing obvious.
A
We
confirmed
this
like
in
our
front-end
testing
and
also
with
raido's
bench,
and
then
the
root
cause
was
found
to
be
just
one
sick.
Spinning
disc
in
the
cluster,
maybe
it
had
a
poor
sata
connection,
but
we
could
observe
by
measuring
that
disc
itself
directly.
We
saw
something
like
two
seconds
of
latency
doing:
small
ios.
A
There
were
no.
I
o
errors,
no
smart
errors.
The
drive
was
just
slow,
so
a
very
quick
fix
was
simply
to
stop
the
osd.
Stop
the
system.
Ctl
stop
the
osd
process
immediately.
The
right
performance
went
back
up
to
25
gigabytes
per
second
and
then,
of
course,
the
data
was
backfilled
somewhere
else.
So
we
want
to
find
a
way
to
better
identify
these
kind
of
six
sick
drives.
I
guess
we
can
call
them.
We
have
lots
of
internal
metrics.
A
We
could
actually
find
this
drive
right
away
just
by
running
ceph,
osd
perf
and
sorting
by
the
op
commit
latency,
but
we
it
would
be.
We've
we're
working
ourselves
just
now
on
trying
to
find
how
what
what
is
worth,
which
kind
of
threshold
of
of
off
latency
is
worth
alarming
or
worth
warning
the
unit,
the
user.
You
know
in
seth
we
already
warn
about
high
network
latencies,
and
we
can.
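A small sketch of that "sort ceph osd perf by commit latency" check, parsing the JSON output and printing the slowest OSDs; the exact JSON nesting varies between releases, so the field handling here is an assumption to adapt.

```python
# Sketch: spot "slow but not failed" drives by sorting "ceph osd perf" output
# by commit latency. The JSON nesting differs between releases, so the field
# handling below is an assumption to adapt to your version.

import json
import subprocess

raw = subprocess.run(
    ["ceph", "osd", "perf", "--format", "json"],
    check=True, capture_output=True, text=True,
).stdout

doc = json.loads(raw)
infos = doc.get("osdstats", doc).get("osd_perf_infos", [])

for osd in sorted(infos, key=lambda o: o["perf_stats"]["commit_latency_ms"],
                  reverse=True)[:10]:
    stats = osd["perf_stats"]
    print(f"osd.{osd['id']}: commit {stats['commit_latency_ms']} ms, "
          f"apply {stats['apply_latency_ms']} ms")
```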
You know, in Ceph we already warn about high network latencies, and we monitor the SMART status and even do failure prediction, but I think we can also look at anomalous op commit latencies. So, coming to the end: this proof of concept demonstrated that we can get, per client node, up to four gigabytes per second reading and writing, and it works very well for our use case.
We have a performance cut-off that we observed at the RADOS level, probably caused by disk fragmentation; maybe using block.db on flash would help, and actually we didn't even use block.db on flash in this case. And then, of course, you have to reserve adequate spare capacity to handle any kind of failure, like one rack free or at least one host free.
A
We
want
to
make
sure
that
we're
using
the
network
we
found
that
right
performance
is
limited
by
the
network
connectivity,
so
we
didn't
see
any
cpu
or
disk.
I
o
bottlenecks.
Read
performance,
however,
was
probably
limited.
By
seeking
we
measured
that
with
this
basic
eraser
coding,
it
basically
doubles
the
network
throughput
based
on
what
the
user
is.
Actually
writing.
A
So
nine
gigabytes
per
second
inbound
translates
to
like
five
gigabytes
to
get
a
local
disk
output
and
five
gigabytes
sent
outbound
to
other
nodes
in
the
cluster.
We
could
afford
to
double
the
satur,
the
double
the
network
connectivity
on
these
nodes
to
thereby
saturate
all
of
the
available
disk.
I
o
in
these
particular
nodes,
so
we
could
use
public
and
cluster
network
isolation
which
we
didn't.
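A rough back-of-the-envelope model of that doubling (a sketch, not from the talk): for a k+m pool, each user byte becomes (k+m)/k bytes on disk, and the primary ships about (k+m-1)/k bytes to other hosts on top of the client's own traffic.

```python
# Rough model of the write amplification described above (a sketch, not from
# the talk): for a k+m erasure-coded pool, each user byte becomes (k+m)/k
# bytes on disk, and the primary OSD forwards about (k+m-1)/k bytes to other
# hosts in addition to the client's own traffic, so ingest roughly doubles on
# the cluster network.

def write_amplification(user_gb_per_s: float, k: int, m: int) -> dict:
    disk = user_gb_per_s * (k + m) / k
    inter_node = user_gb_per_s * (k + m - 1) / k
    return {
        "layout": f"{k}+{m}",
        "disk_write_gb_s": round(disk, 1),
        "inter_node_gb_s": round(inter_node, 1),
        "total_network_gb_s": round(user_gb_per_s + inter_node, 1),
    }

for k, m in [(4, 2), (8, 2), (16, 2)]:
    print(write_amplification(9.0, k, m))
```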
We also found that when we were doing concurrent writing and reading, the writes were taking priority, and this is actually what we want.
A
So
it's
okay!
If
we
have,
if
we're
doing
data
taking,
we
want
the
right
reads
to
be
de-prioritized,
but
if
you
leave
the
I
o
prioritization
just
up
to
ceph,
then
then
the
the
red
is
just
showing
that
in
these
various
in
these
various
tests,
okay,
the
rights
we're
taking
the
the
most
of
the
bandwidth.
A
I
asked
at
a
previous
meeting
like
it
would
be
interesting
if
we
could
actually
tune
this
directly
so
that
we
could,
we
could
specify
by
policy
how
we
want
the
ios
to
be
prioritized.
Of
course,
we
can
do
this
in
our
front
end
as
well.
So
maybe
that's
a
better.
A
That's
another
path,
our
front
end.
So
this
is
like
a
case.
If
anybody
needs
to
put
a
front
end
in
front
of
cfs
you
can
you
can
see
that
it
has
marginal
impact
on
the
overall
performance
compared
to
native
the
native
back
end
rados,
you
might
get
tails,
for
example,
like
we've
seen
and
going
forward,
we
we
might
want
to
like
co-locate
everything
on
the
same
boxes
rather
than
putting
the
our
gateways
on
separate
boxes
and
then
connecting
her
to
a
remote
cluster.
We
might
want
to
put
everything
locally,
however.
A
If
is
a
high
memory
pressure,
it
would
be
safe
to
use
a
fuse
mount
or
access
with
lips
ffs,
but
we
found
in
experiments
that
live
server,
vest
performs
quite
poorly,
and
I
think
that
this
was
mentioned
this
this
week
that
there's
a
a
global,
lock
and
probably
this
is
why
we
see
poor
libs,
fs
performance
so
coming
to
conclusions,
these
pieces
of
software,
seven
eos
are
easily
stackable
and
give
excellent
performance
on
the
high
density,
commodity,
disk
server
and
hundred
gig
network
softwas
is
extremely
reliable,
high
performance
and
flexible
with
tunable
qos,
and
it
has,
as
we
know,
a
large
and
active
user
community
beyond
our
physics
communities
and
then
in
this
stack.
A
Also,
like
fine-grained
resource
control,
find
green
quotas
according
to
our
user
communities,
and
then
you
can
also
we've
built
other
services
on
top
of
eos,
like
we
have
cern
box,
which
is
a
sink
and
share
thing,
and
also
we
have
a
new
open
source
tape.
Software
called
cern
tape
archive
which
is
linked
to
this,
so
we
can
think
of
putting
this
all
all
these
pieces
together.
A
What
are
we
doing
now,
so
I
won't
go
into
too
much
detail,
but
we're
doing
we're
now
testing
this
sort
of
thing
we
want
to
start.
We
will
start
testing
this
in
production
to
get
to
see
if
we
can
really
have
real-life
gains
in
usability
performer
performance
and
operations.
A
It
also
removes
some
limitations
that
we
have
on
the
eos
side,
and
then
we
have
on
the
on
the
like
thinking
about
how
this
can
be
implemented.
Even
optimized
implementation,
we're
considering
how
to
unify
the
name
spaces
and
localize
the
I
o,
so
that
when
we
use
one
name
space
between
cfs
and
eos,
but
also
do
the
I
o
like
so
that
the
clients
they
don't
have
to
go
through
a
a
special
eos
client.
They
could
just
use
the
native
s
client
on
the
on
our
large
batch
systems,
and
that's
it
thanks.
B: I had a quick question about the read versus write performance. I was a little bit surprised to see that your read throughput seemed to taper off before the writes did, and you mentioned the seek latency on the disks being the likely culprit. Yeah, so I think that's generally right, but that's only part of the story. Did you try playing with the read-ahead setting on the kernel client?

A: We didn't, we just... yeah.

B: Usually what happens is there are only a certain number of reads in flight.
B: It only reads so far ahead, so you are waiting for the arms to move around for whatever 100 megs or so is in front of your read position, and so there's some built-in latency there. But if you just extend the read-ahead, then it can fetch that data ahead of time and you can get much, much more.
A: I mean, that will help if things are laid out linearly according to how we're reading them. But we're thinking that read-ahead can be okay: when we write, things are going in sequentially, but when we read back, maybe we're not reading back in the exact same order. That's right. Okay, okay.
B: Yeah, so increasing the read-ahead just means that you can have more OSDs busy moving around and reading data at a time, so in theory your reads should be able to saturate your overall network capacity or whatever, so you should get more than your writes, if you have enough read-ahead.

A: Anyway, yeah, we'll talk about it, we'll try it.
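For reference, the kernel CephFS client's read-ahead window is the rasize mount option (in bytes); below is a hedged sketch of mounting with a larger window, with placeholder monitor addresses, mount point, client name and size.

```python
# Sketch of the read-ahead suggestion above: mount the kernel CephFS client
# with a larger rasize (read-ahead window, in bytes) so more OSDs can be kept
# busy prefetching. Monitor addresses, mount point, client name and the
# 128 MiB value are placeholders, not settings from the talk.

import subprocess

subprocess.run(
    ["mount", "-t", "ceph", "mon1,mon2,mon3:/", "/cephfs",
     "-o", "name=bench,rasize=134217728"],
    check=True,
)
```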
C
It's
do
you
see
the
use
case
for
cfs
snapshots
in
your
specific
environment
in
the
future.
A: We use snapshots to keep older versions of the files, you know, the analyses and work in progress, and we're midway through implementing this right now. Snapshots are a bit different from what you get with something like Dropbox; snapshots are more like... what's the macOS thing called?
A
I
forget
where
you
can
like
slide
the
snapshot
right
to
it,
where
everything
is
all
at
a
point
in
time,
but
for
the
machine
yeah
time
machine
so
but
for
synchro
you
want
per
file
versions
and
that's
where
that's
where
we
could
see
like
more
effective
use
of
cfs
snapshots
instead
of
having
we
kind
of
have
to
hack
file
versions
into
snapshots,
which
is
a
bit
weird.
We end up using a lot of indirect soft links to the files anyway. So yes, we will be using snapshots a lot in the upcoming use cases.