From YouTube: Ceph Performance Meeting 2021-05-13
B
Yeah, I hope so. I added another topic to the agenda, Mark, about a project that I'm thinking about for an intern starting in a couple weeks. Oh, excellent.
A
Yeah, I think so. We had an intern working on something that wasn't this, but it was kind of similar in a way, with CBT results. I think the trick to making something like this work well will be making it really easy to change the schema, or the data that's collected, and to regenerate whatever the central repository is from the raw data easily, without having it all be fragile and prone to falling apart when things change.
B
It actually has JSON support, so you can query different kinds of fields within it pretty easily.
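The database under discussion isn't named in the recording, so as a sketch of the idea (store the raw results as JSON, query fields out of them, and re-derive summaries whenever the schema changes), here is what it might look like with SQLite's JSON functions; the table and field names are hypothetical:

```python
import json
import sqlite3

# Hypothetical schema: each benchmark run is stored as a raw JSON blob, so the
# "schema" can change without migrating the table; summaries are simply
# re-derived from the raw documents.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (id INTEGER PRIMARY KEY, raw TEXT)")

runs = [
    {"version": "master", "workload": "omapbench", "iops": 51000},
    {"version": "luminous", "workload": "omapbench", "iops": 9000},
]
conn.executemany("INSERT INTO runs (raw) VALUES (?)",
                 [(json.dumps(r),) for r in runs])

# json_extract lets us query individual fields inside the blob directly.
rows = conn.execute(
    "SELECT json_extract(raw, '$.version'), json_extract(raw, '$.iops') "
    "FROM runs WHERE json_extract(raw, '$.workload') = 'omapbench' "
    "ORDER BY 2 DESC"
).fetchall()
print(rows)  # [('master', 51000), ('luminous', 9000)]
```

PostgreSQL's `jsonb` columns support the same pattern with the `->>` operator.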
A
Cool, that's fun, yeah! That sounds like a really good project. It'd be super interesting to see what... oh yeah, we've got a couple of perf counters it would be really interesting to see over time.
B
I think it's about, like, distributions of different parameters; maybe distributions of I/O sizes, or...
B
Maybe even things at the BlueStore level, like distributions of onode sizes, or the number of extents, xattrs, or omap key-value pairs you have associated with the objects. Yeah, to kind of try to get a view of the aggregate data set distribution and the workload distribution.
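The kind of aggregation being described here, bucketing raw per-object statistics (omap key counts, onode sizes, and so on) into distributions, might be sketched like this; the helper names and sample values are made up for illustration:

```python
from collections import Counter

def pow2_bin(value: int) -> int:
    """Return the power-of-two bucket floor for a value (0 stays 0)."""
    return 0 if value <= 0 else 1 << (value.bit_length() - 1)

def histogram(samples):
    """Aggregate raw per-object samples into a power-of-two histogram."""
    return dict(sorted(Counter(pow2_bin(s) for s in samples).items()))

# Hypothetical per-object omap key counts gathered from a cluster scan.
omap_keys_per_object = [0, 3, 5, 90, 100, 130, 1000, 1500]
print(histogram(omap_keys_per_object))
# {0: 1, 2: 1, 4: 1, 64: 2, 128: 1, 512: 1, 1024: 1}
```

Keeping only the binned counts per cluster makes the aggregate data set distribution cheap to merge across many clusters, while the raw samples can be re-binned if the bucketing scheme changes.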
A
If
we
get
the
yeah,
if
we
get
the
age
binning
in,
we
can
also
then
start
seeing
like
how
how
cold
the
the
o
node
versus
o
map
data
is
relative
to
each
other,
like
is,
is
that
that
could
be
really
interesting
to
see.
A
Yeah, we have hit rate type stuff for the BlueStore caches, and I think Adam has something that tells us hit rates for RocksDB, right? Is that what yours does, Adam? I think so, right?
C
Yes, that's what I shared with you yesterday; it just tells the hit rate for the block cache.
A
Yeah, I didn't get a chance to try it yesterday; I ended up following up on some other stuff I was doing. But today I'm going to try applying those to master and run some tests with them.
A
All right, well, let's get started here. Okay, PRs! I did not see anything performance related that was new this week, so if you made something, I apologize, I did not see it. I was quite tired and didn't have coffee yet when I did this, so that's what I'll blame it on. Closed: two PRs that I saw. We merged the initial OSD support for SeaStore, yay! So that's super exciting, and Sam gave an excellent presentation on some of his work.
A
I don't know if that's shared publicly or not, but he went over a lot of his thoughts on, and plans for, that. The other pull request that merged... oh sorry, it was closed: it was this "throttle requests sent to monitors for logging" one, which I think is being closed in favor of a new PR, so we should see something replacing that soon. A couple of updated PRs, though. Adam, let's talk about your _do_...
A
Small
right,
sorry
do
right
boom,
so
smart,
well,
smaller
yeah
yeah!
I
again
I
did
this
before
coffee.
I
think
I
just
kind
of
five
words.
Let's
talk
about
that
after
we
go
through
the
rest
of
the
prs,
because
I'd
like
to
understand
it
better.
Okay,
this
optimized
client,
oh,
comes
knows
these
client
requests
parallelism.
A
I
think
they're
they're
they're
discussing
it
in
any
event,
let's
see
rgw
d3n
cache
changes
that
got
rebased,
but
nothing
else.
I
think
the
work
to
set
container
memory
limits
in
cepheum.
A
That
also
was
rebased
and
I
think
sage,
maybe
updated
it
a
little
bit
and
then
oh,
this
actually
was
supposed
to
be
enclosed.
This
work
to
improve
the
efficiency
of
ordered
meth
listing
by
eric
that
also
merged
okay,
lots
of
stuff
in
the
no
movement
category,
but
I
don't
think
anything
real
interesting
to
discuss
right
now,
all
right,
any
anything
I
missed.
A
All right then, Adam. Josh, I think, explained to me last week why your PR to change things and only do direct I/O inside _do_write_small makes sense, and I thought at the time that I understood what Josh was saying, but I was wondering: could you explain it again? Because I need to remember.
C
If we use buffered-mode writes, then we will pollute the page cache, and it would be better not to eat additional memory, just so that we can give more space to BlueStore buffers. And in addition, that's the only place where we even make it possible to do AIO writes that are buffered; all the other cases are never buffered.
C
So
it
was
more
like
a
cleanup
from
my
my
my
thinking,
because
either
we
we
cache
it
in
all
conditions
or
we
never
make
a
possibility
to
double
cash
in
system
pages
system.
Red
pages
that
that
was
my
my
thinking-
and
there
is
no
really
more
logic
behind
that.
That's
it.
I
mean
there
is
a
logic.
I
wanted
to
make
some
simplification
on
aio
rights,
but
that
was
a
secondary
goal.
D
Well, there is one exception here: if a write request...
D
But other than that... well, actually, I'm not sure if this flag is actually used by any client, but other than that, by default we have this bluestore_default_buffered_write config parameter, which is set to false by default, and this makes all writes...
A
Sorry, so Adam's PR changes it so that when we do an AIO write, instead of using the wctx (the write context) buffered flag, we just set it to false, right? All right. So right now we default to having bluestore_default_buffered_write disabled.
C
Okay, but I'm, like, confused here. There are two things. One is our control over caching data in BlueStore when we do a write, and we have a flag for that; let me verify... bluestore_default_buffered_write, and this tells us to cache the data in BlueStore buffers when we write. That's one.
A
Yes, not always, but usually. So I think I understand now why you want to do this, right? Because if you already have bluestore_default_buffered_write disabled, like we do by default, then you already don't use either cache: you do a direct I/O write, and you don't use our cache. But when you enable it, then most of the time you should be doing both, with the current code.
A
Are we better off caching at the BlueStore layer or at the page cache layer? My instinct would be that we're better off at the BlueStore layer, that we are better off having our own cache there. But what we saw with RocksDB is that, in fact, buffered I/O was far more important than we realized, possibly due to a code bug; we don't know. I just want to make sure we don't end up repeating that same mistake.
C
Okay, that's correct, and I share that concern. That's true. But why would we leave double buffering, basically, only for small writes? Do we remember the logic behind that?
A
Let's just say that maybe it was based on a theory that having a secondary cache at the page cache layer would be better, so we'd have a primary cache at the BlueStore layer and kind of a secondary page cache layer.
A
But maybe if you had one hot OSD and a couple of others that weren't as hot, maybe RocksDB... or sorry, RGW indexes on one OSD or something, and there are tons of omap entries to be cached there, then maybe. But then again, that's RocksDB... the page cache, so their page cache. Okay, so...
A
I don't think you should necessarily close the PR; I don't know that this is wrong. I just think we should test it, but you may actually be right. I'd like us to move away from buffered I/O, especially with libaio; it's not really the way it's intended to be used.
C
So, okay, on the testing front: I can devise a series of small-write tests, I mean, tests that will be comprised of a high percentage of small writes, and try to toggle performance when I crank up buffering in BlueStore. I can do that. I will have some problem with limiting the Linux kernel from buffering my extra data; I don't know how to do that.
A
I can help you with that. You can use cgroups seemingly effectively to do it. That's how I was doing the omap testing, by changing my...
C
Memory limit. So maybe let's split that off to our one-on-one talk, and you will teach me how to use cgroups to limit caching memory, and I will finish and make the tests, just to show what the benefits and costs are. Cool.
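A rough sketch of the cgroup approach Mark mentions, assuming the cgroup v2 layout where `memory.max` bounds a process's memory including its page cache; real use needs root and `/sys/fs/cgroup`, so the example below exercises the helper against a scratch directory instead:

```python
import os

def set_cgroup_memory_limit(name: str, limit_bytes: int,
                            root: str = "/sys/fs/cgroup") -> str:
    """Create a cgroup (v2 layout assumed) and cap its memory.

    The kernel counts page cache against memory.max, so running an OSD or a
    benchmark inside this group bounds how much page cache it can hold.
    """
    path = os.path.join(root, name)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "memory.max"), "w") as f:
        f.write(str(limit_bytes))
    return path

# Exercised against a scratch directory; for real use, root would be
# /sys/fs/cgroup, and a process joins the group by writing its PID into
# the group's cgroup.procs file.
import tempfile
scratch = tempfile.mkdtemp()
p = set_cgroup_memory_limit("osd-bench", 4 * 1024**3, root=scratch)
print(open(os.path.join(p, "memory.max")).read())  # 4294967296
```

On cgroup v1 hosts the equivalent knob is `memory.limit_in_bytes` under the memory controller's hierarchy.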
D
...the page cache, this definitely makes sense. Maybe we should introduce an additional config parameter to control page cache...
C
Usage, yeah. That would be cool; that really makes sense. And then just put it everywhere.
D
Yeah, absolutely. And this way at least we would be able to benchmark the BlueStore cache against the page cache and decide which one is more efficient.
D
Yeah, so at this point I agree with those points, so it makes sense to have this patch in, and maybe, additionally, you might want an additional fix, an additional patch, to control the page cache.
C
Okay, so you think I should not reduce that logic, but extend it to provide an extra parameter, just to also allow caching in the page cache, so we could manipulate the parameters and get full control. All right, so I'm...
D
Complicated
so
well,
we
we
can
definitely
go
with
the
current
patch
and
maybe
additionally,
we
might
add
some
more
another.
One
parameter.
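One way to approximate the extra knob being discussed (keep buffered writes, but stop them from polluting the page cache) is to advise the kernel with `posix_fadvise(POSIX_FADV_DONTNEED)` after flushing. This is a Linux-only sketch of the idea, not BlueStore's actual code:

```python
import os

def write_without_polluting_cache(path: str, data: bytes) -> None:
    """Buffered write that then asks the kernel to drop the cached pages.

    Sketch of the "read benefit without write pollution" idea: the write goes
    through the page cache (no O_DIRECT alignment rules), but after fdatasync
    we advise the kernel the pages won't be reused, so they can be evicted
    instead of competing with BlueStore's own caches.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, data)
        os.fdatasync(fd)  # pages must be clean before DONTNEED can drop them
        os.posix_fadvise(fd, 0, len(data), os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

import tempfile
with tempfile.NamedTemporaryFile(delete=False) as t:
    tmp = t.name
write_without_polluting_cache(tmp, b"x" * 65536)
print(os.path.getsize(tmp))  # 65536
```

The advice is best-effort: the kernel may keep pages that are still dirty or shared, which is why the flush comes first.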
A
All right, well, I suppose I'll go into... I've got just kind of a small update on the omap bench testing that I was doing. I'll put the link to the data in the chat window here.
A
So what we saw previously was that there was a really big difference between Luminous and master; we're far faster in master, especially when using buffered I/O, and there was some concern about how much faster we were.
A
So
I
went
back
and
started
looking
at
running
omapbench
against
various
versions
of
the
historic
code
that
we
have
in
some
cases.
I
did
need
to
change
it
somewhat
to
make
it
work.
The
object,
store
interface
has
changed
since
luminous,
so
I
think
everything
was
basically
right,
though.
What
I
saw
is
that
I
don't
think
there
was
any
one
pr
that
necessarily
improved
performance
overall,
but
it
looks
like
there
were
a
number
of
pr's
that
maybe
did.
A
The trick is to actually get stuff to compile, especially on CentOS 8. But in the Mimic time frame, I thought that maybe it was going to be kind of around this PR 2177, where we changed the flushing behavior in BlueStore and also changed the ObjectStore interface. But it turns out that had very little effect.
A
It
didn't
change
anything
at
all,
really
what
it
it
was
and-
and
this
maybe
should
have
been
obvious
to
me
since
I
wrote
this
code,
but
it
was
when
we
introduced
the
osd
memory
auto
tuning.
That
was
what
seemed
to
make
one
of
the
biggest
differences
in
all
of
this
and
and
the
reasoning
for
it
is
kind
of
obvious.
Prior
to
this,
we
we
statically
assigned
the
caches
to
the
owner
cash
and
the
kv
cache.
It's
not
entirely
true.
A
We
had
some
capabilities
in
the
old
code
to
kind
of
try
to
like
rob
from
one
and
give
it
to
the
other,
but
it
was
is
pretty
limited
and
it
didn't
work
right,
so
it
it
never
really
kind
of
did
what
it
was
supposed
to
do.
I
think,
just
from
what
I
remember
of
looking
at
it
at
the
time.
A
So
when
we
introduced
the
osd
memory
auto
tuning,
it
allowed
a
blue
store
to
allocate
almost
all
of
its
pre-available
memory
for
caches
to
say
roxdb
to
the
block
cache.
So
you
could
really
aggressively
cache
omap
if
there
were
very
few
oh
nodes.
So
in
this
case
we
we
only
have
like.
Oh,
this
is
actually
not
right
there.
A
We
only
have
a
hundred
thousand
objects
and
we
have
100
omap
keys
per
object,
and
so
in
in
this
case,
we
actually
did
not
need
a
whole
lot
of
cash.
We
primarily
need
everything
to
be
an
omap
or
caching
omap
entries,
and
so
what
we're
really
seeing
in
this
test
is
actually
that
the
amount
of
memory
that's
available
for
omap
cache.
A
So
something
very
strange
is
going
on
here
because
clearly
giving
roxtv
more
memory
for
the
block
cache
when
in
buffered
io
mode
helps
dramatically,
but
it
doesn't
appear
to
when
we
we
set
a
bluester
buffered
io
to
disable
our
bluefest
buffered
io
to
disabled,
so
there's
still
something
very
strange
going
on,
but
that's
the
pr
that
really
made
the
big
difference
in
terms
of
a
lot
of
these
numbers
that
we
saw
there
have
there.
There
are
some
other
ones
I
mean
still
set.
A
Keys
is
like
twice
as
fast
in
master
as
it
was
back
in
mimic
when
we
merged
that,
so
we've
definitely
had
some
additional
improvements
since
then,
but
you
know
that
was
the
one
that
I
really.
I
saw
that
kind
of
made
the
the
big
difference
in
in
a
number
of
different
places.
A
Comments? All right, well then, if there are none: next, I'll try to take Adam's work. He developed a PR that will record the RocksDB block cache hit rate numbers in our perf counters.
A
So I'm going to try to work on applying his to master and really try to dig into why we see this dramatic difference between buffered I/O and direct I/O in a test that should be reading everything from the block cache. So hopefully next week I will have some numbers there, and we can maybe figure out how to fix it, so that it's performing like we expect it to. Okay, that's it. Josh, would you like to talk about the topic...
E
You added? Oh sorry. Yes, sorry, just a question. I know in previous PRs... so, this is Joshua Bergen from DigitalOcean. I think you've interacted a little bit with Alex Maragon, one of my colleagues, as well, in one of the previous PRs kind of related to this series of topics.
E
I know one of the traces that you had looked at, if I remember correctly, kind of showed that we were pre-fetching over and over and over again. Do you have a memory of that?
A
It looks like, in fact, Igor is the one that fixed it. Igor, maybe you want to talk about this? This is, I believe, your investigation into delete... iteration during delete, where we were only doing 30 objects at a time.
A
Range scans, basically, that were causing extremely slow performance, and I believe it was actually the thing that you fixed, where we were basically re-scanning every 30 objects during deletion, in PG deletion.
E
The reason I was asking about this specifically is that we are running buffered I/O right now, but we've run into a number of corner cases, even with buffered I/O, where omap performance gets really, really bad with RocksDB on BlueStore, to the point that we're actually considering moving back to FileStore for our indexes right now. Oh okay. So in particular, there have been a few cases where we've seen this happen, but a very easy way for us to reproduce the issues that we see is: say you have a big omap.
E
Let's
say
it's:
2
million
objects
or
something
or
2.2
million
keys
in
the
old
map,
and
then
you
have
a
loop
external
that
is
doing
a
list
and
then
deleting
a
range
of
keys
list
during
delete
range
of
keys
over
everything
repeat
and
what
we
find
is
we'll
start
off
deleting
at
say
a
thousand
keys
per
second,
and
that
rapidly
drops
off.
I
should
say
rapidly
over
time
drops
off
down
to
like
100
keys
per
second
with
this
osd
100
cpu
usage.
A
I have seen something that looks a lot like that: when we were doing delete-range, you ended up with a ton of tombstones that you were iterating over, but as soon as you compacted, everything got good again. Exactly, yeah. Is that what you're seeing?
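A toy model of the tombstone behavior being described (not RocksDB itself): deletes leave tombstones in the sorted keyspace, each list pass must step over every tombstone written so far, and only a compaction makes iteration cheap again:

```python
# Toy LSM: live keys and tombstones share one sorted keyspace, as in an
# uncompacted run of RocksDB SST files.
class ToyLSM:
    def __init__(self, keys):
        self.entries = {k: "value" for k in keys}  # None marks a tombstone

    def list_first(self, n):
        """Seek-to-first then iterate: must step over every tombstone."""
        out, scanned = [], 0
        for k in sorted(self.entries):
            scanned += 1
            if self.entries[k] is not None:
                out.append(k)
                if len(out) == n:
                    break
        return out, scanned

    def delete_range(self, keys):
        for k in keys:
            self.entries[k] = None  # a tombstone, not an actual removal

    def compact(self):
        self.entries = {k: v for k, v in self.entries.items() if v is not None}

db = ToyLSM(range(10_000))
scans = []
while True:
    batch, scanned = db.list_first(1000)
    scans.append(scanned)
    if not batch:
        break
    db.delete_range(batch)
# Each pass re-scans all earlier tombstones, so per-pass work keeps growing:
print(scans)  # [1000, 2000, ..., 10000, 10000]
db.compact()
print(db.list_first(1000)[1])  # 0 after compaction: the tombstones are gone
```

This reproduces the shape of the reported slowdown: constant-size delete batches, steadily growing scan cost, instant recovery after compaction.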
D
Yeah, and just to add to Mark's comment: perhaps it's not necessary to have range deletes to get this degraded RocksDB state. It seems to me that multiple removals impact RocksDB performance.
D
So
originally
we
were
thinking
that
we
just
get
this
degradation
after
deleted
range
deletes,
but
it
looks
like
every
every
delete
impacts
that
maybe
single
deletes
not
that
impact
that
not
that
badly.
But
if
you
apply
multiple
of
them,
they
still
impact
the
performance.
D
And yeah, well, in my second PR, which is still pending, I reworked the omap removal a bit, and it triggers range deletes followed by a ranged compaction, but this is applied to each removal.
E
A mixture of Nautilus and Luminous. Nautilus: a mixture of 14.2.8, 14.2.11, and 14.2.18.
D
Just in case, what drives are you using?
E
The new generation are Intel NVMe, two-terabyte, four-terabyte... why am I blanking on this? I'm not sure, but pretty fast; the drives themselves are pretty fast. When we run into issues, it's almost always CPU.
E
Until compaction happens, yeah. So, you're talking about, like, triggering compaction. My recollection (I walked the RocksDB code on the compaction path a couple months ago, so I'm going by memory here) is that they do take some level of tombstones into account in their heuristics, in terms of what files they should be compacting in their background compaction.
E
So they do try. One of the worries we've had is that, at least in Nautilus, the default configuration for RocksDB might be hurting the background compaction processes in a way that a FileStore configuration doesn't, in that it sets an older option for the number of threads to use for compaction, whereas the RocksDB version...
E
...has the capability to have, like, a set of flusher threads, a set of compaction threads, more concurrency, etc. And I'm pretty sure FileStore by default does not override any of those options, and so it might actually have better background compaction behavior; maybe that's what's compensating here as well. We haven't had enough time to really experiment with that, though.
A
I don't remember what versions we were worried about, but there was a point at which we were worried about a data corruption issue with multiple compaction threads and older versions of RocksDB, but it might have been pre-Nautilus.
A
We
were
trying
to
be
really
careful
about
not
not
doing
anything
that
was
gonna.
Potentially
you
know
have
problems.
F
Yeah, I think you're right, Mark. I remember in Nautilus we did update the RocksDB version, because those corruption issues were fixed, yeah. And also we did increase the number of compaction threads, if I remember correctly.
E
Honestly, we haven't had enough time to really experiment with that at scale in production. We did a major discovery on this, I want to say, about a month and a half back, but that was all in controlled lab environments.
A
In the testing that we did back in the Nautilus time frame, we did see that there's a sweet spot where, if you have too many compaction threads, it actually slows things down slightly. It's probably like four, between four and eight, I'm guessing. Two was pretty good; at least in our tests, it looked like the benefit after two started, you know, slowing down significantly, and then after you went above, like, six or eight or something, it kind of plateaued or even decreased.
E
Right, okay, yeah. I think we probably just have to do a little bit more experimentation and digging on our side to come up with more solid "this works, this doesn't" sort of information, yeah.
E
The other thing I want to comment on, and I don't want to take over if there's other stuff you want to discuss in this meeting, but the discussion earlier about whether or not to write through the OS cache: that's particularly interesting to me as well. We have... we have...
E
We have older clusters where we're basically running RocksDB right on the spinners; like, we don't have separate flash for the RocksDB portion. And what we're finding there is a really weird pathological behavior, where you write through the OS cache and then do a flush range, essentially, right across a bunch of buffers, and tell it to flush out to the back end. The OS is writing those as a whole...
E
...bunch of, I think, 512-byte, so like sector-block-size, writes down into the lower layers, which are supposed to gather them back up into big writes again via the I/O scheduler. What's happening is that the dm-crypt layer is slow enough that little bits of those big writes are actually leaking through and being scheduled to the disk, and so, instead of having, say, a 512-kilobyte write to the disk...
E
...that's supposed to be just one big chunk, it actually leaks through as five or six, and then we're missing rotations on the disk, which is causing a massive, well, I shouldn't say massive, but a large increase in write latency for those writes to the disk. So I actually started writing a patch, and I kind of abandoned it, I got a bit nervous about it, where I stopped writing anything through the OS cache in buffered I/O mode, so we would get the read benefit from buffered I/O but avoid using the cache on the write. Interesting.
E
...to know, though: now, with Igor's fix, like, the reason we had buffered I/O on was because of the PG deletion, so we're waiting to upgrade that system to 14.2.18, and then we'll just be disabling buffered I/O on the spinners at that point, because that's really the only thing we know was biting us there, the PG deletion, and then that should address the write latency problem for us as well. But I thought it might be interesting for you to know that.
A
Absolutely. All right, Adam, sounds like that's an endorsement of your PR.
A
All right, Josh, I know we don't have a ton of time, but did you want to try to talk about your topic, or should we move on?
B
Let's see... next week. That's good timing from YouTube. Okay.
A
All right, well then, anything else, guys, that we should talk about this week, or should we wrap?
E
Okay. I posted this to the dev mailing list, and I'm sorry to bug you on this again. We do have some really old clusters that are still using FileStore on spinners, and we are constantly plagued by the, oh no, the inode iteration issue in XFS. At one point Sage said, "hey, a newer kernel fixes this," but unfortunately he didn't specify which one. I was really hoping that maybe someone somehow remembered which kernel has better inode scanning during flush range, and if the answer is no, that's fine.
A
Yeah, no problem. One question for you, actually, on your FileStore clusters: have you done okay with multiple AGs on XFS with FileStore? That was the thing that really kind of made FileStore fall over a lot of times.
E
Oh, the... our FileStore, well, FileStore falls over all the time for us. Both systems are awful, and rolling them to BlueStore is difficult, because they're so slow that we can pretty much only do one backfill at a time per OSD in FileStore, and yeah, they're difficult. So we're trying to get to BlueStore as fast as we can on those, but "as fast as we can" is probably a one-year-plus project on those clusters.
A
I don't... I have no idea what your FileStore deployments actually look like, but if they're like any other FileStore deployments I've seen with lots of objects, your directories are just fragmented to heck and back, you know, really, really bad, across multiple allocation groups, assuming that you haven't tuned it down to one.
E
Yeah, I don't think we've done any custom tuning on those; I'd have to look at what we have. And they do have very, very high object counts, yeah. So...
A
So, I don't know, I posted some stuff about this years ago, but the basic idea is that when FileStore does a split, a directory split, it will then move objects into those directories. But they originally were in a directory that was assigned to one AG, and the new subdirectories are going to be in different AGs; probably, most likely, they're in different AGs. It's not guaranteed, but most of the time they will be. And so then the first set of objects that ends up in that AG, sorry, in that directory...
A
In
that
subdirectory
are
from
a
different
ag.
Then
new
objects
will
be
put
into
it.
So
you'll
end
up
over
time
as
the
directory
tree
gets
deeper
and
deeper
and
spread
wider
with
just
a
completely
random
mix
of
objects
at
different
offsets
on
the
disk
inside
one
directory,
and
so,
if
you're
going
to
scan
that
directory,
the
whole
thing
is
just
insanely
bad.
It's
just
random.
I
o.
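A toy simulation of the split behavior just described, with made-up split thresholds and AG counts: each new subdirectory is assigned its own allocation group, but moved objects keep the extents they were originally written to, so leaf directories end up holding objects scattered across many AGs:

```python
import random

random.seed(1)
NUM_AGS, SPLIT_AT, FANOUT = 8, 16, 4

class Dir:
    """Toy FileStore collection directory."""
    def __init__(self, ag):
        self.ag = ag        # AG that new files in this directory allocate from
        self.objs = []      # AG where each stored object's data actually lives
        self.subdirs = []

    def insert(self):
        if self.subdirs:
            random.choice(self.subdirs).insert()
            return
        self.objs.append(self.ag)  # new object allocated in this dir's AG
        if len(self.objs) >= SPLIT_AT:
            # Split: each new subdir lands in a (most likely different) AG,
            # but the moved objects keep the extents they already had.
            self.subdirs = [Dir(random.randrange(NUM_AGS))
                            for _ in range(FANOUT)]
            for ag in self.objs:
                random.choice(self.subdirs).objs.append(ag)
            self.objs = []

    def leaves(self):
        if not self.subdirs:
            return [self]
        return [l for d in self.subdirs for l in d.leaves()]

root = Dir(ag=0)
for _ in range(5000):
    root.insert()

mixed = [len(set(leaf.objs)) for leaf in root.leaves() if leaf.objs]
print(max(mixed))  # leaves end up holding objects from several different AGs
```

Scanning any one leaf directory then touches extents spread across the whole disk, which is the random-I/O effect described above.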