From YouTube: 2018-May-17 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
A: All right, I'm moving on. So let's see, this week for pull requests there's not a ton of new stuff. The only thing that I noticed in my list this morning was the PR I had actually submitted as kind of a first step toward doing priority-based caching in BlueStore and, I'm hoping, maybe eventually the OSD. The idea here is... well, maybe I'll get to that later if no one else has anything they want to talk about.

The basic idea is that it's really difficult in master right now for users to set ratios of how memory should be used for different things in the OSD, and specifically in BlueStore. Right now we let you specify roughly how much memory should go to metadata in BlueStore, data in BlueStore, and RocksDB's block cache, and even inside RocksDB's block cache you potentially have the option of specifying high-priority and low-priority pools, meaning how much memory you want to devote to guaranteeing that indexes and filters remain in cache. So we allow you to set at least some of those ratios, and beyond that we also allow a minimum value to be set for the key-value block cache. What we found out was that, essentially, when users try to do this it's really not clear at all how all of this interacts.
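To make that interaction a bit more concrete, here is a minimal sketch of how a fixed total cache might get carved up by such ratios, with a floor on the key-value block cache. This is only an illustrative model of the arithmetic being described, not BlueStore's actual code; the 3 GiB total, the 40/40/20 split, and the 512 MiB minimum are made-up example values.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>

int main() {
  // Example values only; in Ceph these come from config options.
  const uint64_t total_cache = 3ull << 30;    // 3 GiB per OSD
  const double   kv_ratio    = 0.40;          // RocksDB block cache
  const double   meta_ratio  = 0.40;          // BlueStore onode/metadata cache
  const double   data_ratio  = 0.20;          // BlueStore data (object) cache
  const uint64_t kv_min      = 512ull << 20;  // floor for the KV cache

  // Static carve-up: each cache gets its ratio, but the KV cache is
  // never allowed to drop below its configured minimum.
  uint64_t kv_bytes   = std::max<uint64_t>(total_cache * kv_ratio, kv_min);
  uint64_t meta_bytes = total_cache * meta_ratio;
  uint64_t data_bytes = total_cache * data_ratio;

  // If the KV minimum pushes the total over budget, the other caches
  // have to shrink proportionally. Interactions like this are part of
  // what makes the settings hard for users to reason about.
  uint64_t remaining = total_cache - kv_bytes;
  uint64_t others    = meta_bytes + data_bytes;
  if (others > remaining) {
    meta_bytes = remaining * (double(meta_bytes) / others);
    data_bytes = remaining - meta_bytes;
  }

  std::cout << "kv="   << (kv_bytes   >> 20) << " MiB, "
            << "meta=" << (meta_bytes >> 20) << " MiB, "
            << "data=" << (data_bytes >> 20) << " MiB\n";
}
```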
A
We
actually
had
have
currently
in
master
a
bug
where
we
tend
to
favor
the
metadata
cache
over
the
data
cache
kind
of
beyond
what
the
user
specifies.
It's
not
always
a
bad
thing,
because
actually
it
turns
out
that
favoring
may
take
a
cache
kind
of
helps
performance
in
a
lot
of
scenarios
and
usually
doesn't
hurt
it
even
when
you
might
expect
it
to
so
it
it's
it's
kind
of
inadvertently
doing
a
better
thing,
but
it's
not
doing
what
the
user
requests.
So
it
ends
up
being
really
really
confusing
why
things
are
set
internally.
A
Now
where
you
can
enable
Auto
tuning
and
then
kind
of
through
a
a
more
complicated
scheme
it
will,
it
will
try
to
do
make
better
decisions
about
where
memory
should
be
assigned
and
then
eventually
revert
back
to
those
user-defined
ratios.
If
it
can't,
if
it
doesn't
think
it
can
do
better.
So
that's
that's
it
in
a
nutshell.
This async messenger PR from Haomai that improves the locking behavior merged, I believe, and that's potentially really good for latency-sensitive scenarios. I don't know that we've done any kind of extensive testing on it, but at least the testing that was done there makes it look like it's a good PR. Then there was this other one, continued recovery optimization for overwrite ops, that did not merge; it was just closed by the author, and I'm not entirely sure why.

But I guess for whatever reason that was closed. A couple of PRs were updated: Igor's new bitmap allocator, which he presented last week. Sage reviewed it, and it looks like he really likes it and wants to replace the old bitmap allocator code with Igor's new implementation. So that's good; expect at some point soon that we will have a new bitmap allocator that works far better than the old one. That's exciting.
There is this librbd throttle PR. I don't really know too much about it, but I guess it must have gotten updated; I should probably figure out what's going on with that. Anyway, I don't know too much about that one. Then there's Radoslaw's work on the crypto SSL PR, which is going through testing and getting fixes, and it looks like it's making progress, so that's good; I think a user is involved in testing it. There's another one from Piotr about reducing bufferlist rebuilds during write-ahead log writes, Radek's huge pages PR, and more stuff from Adam (sorry, Kefu, I guess, in this case) regarding AES and crypto. That's about it; I don't see anything else really recent here. So that's it for PRs. Does anyone have anything that they would like to discuss this week?
Yeah, okay, cool. So I kind of gave an overview of the problem earlier on: right now it's really complicated for users to adjust all of our cache settings. It became clear when we went through some of this with one of our customers that, beyond just the fact that it's kind of confusing, it's really difficult, even if you know what all of the different ratios do, to figure out what they should be set to, and it changes depending on the workload. Everything that I have seen indicates that for, say, an RBD workload, if you can keep everything, all of the onodes for BlueStore, in the metadata cache, that is the number one priority. It's less clear what happens once you can't do that anymore.

If not all the onodes fit into cache, maybe you're actually better off doing a full swap over to RocksDB's block cache, because theoretically you can keep all of the onodes in the block cache in an encoded form, rather than reading them directly out of memory with BlueStore's onode cache. The trade-off is that when you're reading from RocksDB's cache you have a lot more work to do: you're copying memory around, you're doing an encode step to put them there, and you're doing a decode step to get them out.
So at least if you can keep everything in cache, that's what you want to do. There's kind of a bad in-between state where it looks like, if you don't have enough metadata cache and you are doing reads from disk, we end up double caching a lot of data: you end up putting the exact same data that is in BlueStore's metadata cache into RocksDB's block cache. It's not clear whether or not you get some added benefit by doing that; sometimes maybe you're hitting stuff in RocksDB's cache that you're not hitting in BlueStore's metadata cache. I haven't really gone through and done an exhaustive look at the hit rates to determine that, or even an onode-level trace of where it's hitting in which cache, but just from the performance results and from the behavior it's showing, it looks to me like, generally speaking, it doesn't work that way.

It looks like either you're hitting BlueStore's cache or you're fetching from disk, or at the very least the overhead of getting things out of RocksDB's cache is so high that fetching from RocksDB's cache is not really much better than just fetching from disk in the first place. So that is the conundrum right now: what do we do in all these scenarios?
The good news is that we can at least avoid some of it by saying: well, if we aren't using a lot of KV cache yet and we have memory available (actually, if we're not using any of these caches up to the ratio that we specify), then potentially we don't need to allocate that much memory to any of them, and we can see what memory we have left before we move on to the next set of priorities. So in that way we can say, for high-priority items: if we want more metadata cache beyond, say, the one third or the forty percent of the cache that we specified in the ratios, let's give it to the metadata cache before we give lower-priority items a shot at that cache. That's basically what this does; right now in this PR we don't really make use of those priorities very well.
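As a rough sketch of the balancing pass being described, the loop below gives every cache what it wants at the highest priority first (for example, its observed usage plus a little headroom) and then splits whatever is left according to the user-defined ratios, so the tuner falls back toward the requested split when it has nothing better to go on. The struct, the two-pass shape, and the numbers in main() are assumptions for illustration, not the code from the PR.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical model of one cache being balanced (KV, meta, data, ...).
struct CacheState {
  const char* name;
  double      user_ratio;    // ratio the user asked for
  uint64_t    pri0_request;  // bytes wanted at the highest priority,
                             // e.g. observed usage plus some headroom
  uint64_t    assigned = 0;  // what this balancing round gives it
};

// One balancing round over all caches sharing a single memory budget.
void balance(std::vector<CacheState>& caches, uint64_t total) {
  uint64_t left = total;

  // Pass 1: satisfy high-priority requests, capped by what is available.
  for (auto& c : caches) {
    uint64_t give = std::min(c.pri0_request, left);
    c.assigned = give;
    left -= give;
  }

  // Pass 2: split the remainder by the user-defined ratios. Anything
  // not handed out stays free as headroom for the next round.
  const uint64_t remainder = left;
  for (auto& c : caches) {
    uint64_t give = std::min<uint64_t>(remainder * c.user_ratio, left);
    c.assigned += give;
    left -= give;
  }
}

int main() {
  // 3 GiB total, 40/40/20 user ratios, with only the meta cache asking
  // for a large high-priority chunk at the moment.
  std::vector<CacheState> caches = {
      {"kv",   0.40,  256ull << 20},
      {"meta", 0.40, 1536ull << 20},
      {"data", 0.20,  128ull << 20},
  };
  balance(caches, 3ull << 30);
  for (const auto& c : caches)
    std::cout << c.name << ": " << (c.assigned >> 20) << " MiB\n";
}
```

In the PR, something like this would presumably run periodically rather than once, with the high-priority requests refreshed from live usage statistics each round.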
So let's take a look here. This is essentially the old behavior (well, the current behavior of master) under an RGW workload. You can see that we have a couple of different ratios set here: 40% KV, 40% meta, 20% data. So RocksDB's block cache is getting forty percent of the three-gigabyte total cache, the BlueStore metadata cache for onodes and onode-related stuff is also getting 40%, and the data cache for objects in BlueStore is getting twenty percent.

In reality, that's what we've allocated, but that's not actually what gets assigned. Right now in master, BlueStore tries to do some kind of auto-tuning itself, where you can see the yellow line for data actually spikes way up at the beginning and then drops back down, and at the beginning we start out without much meta cache, but it spikes up. If you notice, the amount of data cache that is allocated over the course of this run is actually larger than the amount of data cache used; that's the bug I was talking about, where we tend to not actually assign as much memory for the data cache as the user specified, and we over-allocate memory for the metadata cache versus what was requested.
Okay, in master... or sorry, with this PR, if we have the auto-tuner disabled... sorry, we're switching to RBD now. Maybe I'll actually go down and show the RGW behavior, because that was what we had just talked about, and then we'll go back to RBD. Okay, so for the RGW behavior with the auto-tuner disabled, it's really similar to master, except that we fix this issue where we're under-allocating data and over-allocating metadata. So again, you can see really similar behavior, and it's hitting the user-specified ratios. We are not doing any kind of auto-tuning at the beginning where we take some of the metadata cache, assign it to the data cache, and then have it shrink back down; that has now moved into this auto-tuner design, so that if you specify that you want auto-tuning it does that, but otherwise it just does exactly what the user requests and doesn't fool around with it.
One thing I will note, which isn't real apparent here yet but you'll see in the RBD results, is that we're actually double caching a lot of the same metadata and onode data in the meta cache and in RocksDB's block cache; a lot of the same data is basically populated in each. Okay, so now with the auto-tuner enabled you see really different behavior. Again, you see that the data cache spikes way up, but as opposed to master we're not just doing this for BlueStore's caches.

With a couple of changes to RocksDB's cache, we are able to expose some of the information that it has internally about the high-priority pool and the low-priority pool and how much memory is being used there. So now we can actually try to allocate just a little bit more memory than is used at any given point for these, and that's kind of what we do.
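For reference, the high-priority pool being referred to is a feature of RocksDB's LRU block cache: a fraction of the cache can be reserved so that index and filter blocks are evicted after ordinary data blocks. The snippet below shows the stock RocksDB options involved; how BlueStore actually wires this up, and the extra patches that expose the pool usage to the tuner, live in the Ceph tree and are not reproduced here, so treat the specific numbers as placeholders.

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

// Configure RocksDB so index/filter blocks live in the block cache's
// high-priority pool instead of separate table-reader memory.
rocksdb::Options make_options() {
  rocksdb::BlockBasedTableOptions table_opts;

  // Count index and filter blocks against the block cache budget...
  table_opts.cache_index_and_filter_blocks = true;
  // ...and insert them with high priority so data blocks are evicted first.
  table_opts.cache_index_and_filter_blocks_with_high_priority = true;
  table_opts.pin_l0_filter_and_index_blocks_in_cache = true;

  // 1 GiB LRU cache, default sharding, no strict capacity limit,
  // 20% of the capacity reserved as the high-priority pool.
  table_opts.block_cache =
      rocksdb::NewLRUCache(1ull << 30, /*num_shard_bits=*/-1,
                           /*strict_capacity_limit=*/false,
                           /*high_pri_pool_ratio=*/0.2);

  rocksdb::Options opts;
  opts.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
  return opts;
}
```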
You can see that for the key-value line here, the blue line, we're essentially always allocating just a little bit more than is used over time, until we get to the saturation point, at which point everything hits the allocations that we specified. We're also fully utilizing the cache much sooner than we did previously.

One of the things you'll notice too is the crossover point, around one hour in. Previous to that, we were assigning more metadata because we had it available: both data and metadata are able to use memory that was assigned to the key-value store, to RocksDB's cache, but because there wasn't that much memory being utilized there yet, we could give it to the metadata cache and the data cache.

So it's a little bit different, and it's not entirely clear that we want to favor the block cache in the key-value store beyond making sure that those indexes and filters are always cached. It might be that that's a really high-priority thing, but beyond making sure those are cached, we're still better off giving the rest of the memory to the BlueStore onode cache.
And you can see that too when we specify different ratios here: we're sort of keeping to the user-defined ratios once we hit cache saturation, except that, again, the RocksDB indexes and filters are getting the first shot at memory and growing over time, such that we're sort of sticking to our ratios but we're also letting the auto-tuner make sure that those things are cached with high priority.

In terms of performance, unfortunately none of this really affected RGW performance at all. There was a little bit of change, but it's such a small amount that it's possible it was just random variation. I'm curious as to whether or not we have enough other latency in the stack that the effect of reading from RocksDB's block cache, or from BlueStore's onode cache, is really not significantly different from just reading off of the NVMe drive (or sorry, in this case the SSD drive) directly.
B: [inaudible]

A: ...that's part of why we're reverting back to the user-specified defaults. The next step in this, I think, is going to be really interesting, where we now look at recency-based binning of the cache items. So, say, in the last five seconds we treat the KV cache and the metadata cache with high priority and try to assign everything that's wanted there, and then we do these steps where we go through and look at older and older stuff and try to divvy it up. I think that will be really important here, because then it will start telling us how much of the recent activity is KV (you know, block cache items) versus BlueStore metadata items, and maybe we'll start being able to divvy things up a little bit better. I think another really interesting thing to do would be to differentiate within the different block caches.
B: [inaudible]

A: You know, right now the auto-tuner is working at five-second resolution; every five seconds it goes through and rebalances the caches. The next step in this potentially could add some overhead to trimming the cache, which is a little more scary, but the auto-tuner itself, as long as you're not having it do work too often, is hopefully not going to impact things too badly. Especially with hard disks, since the amount of work being done is so much lower, you could potentially even do this more often and have it not really hurt things too badly.
Okay, let's look at RBD. This is kind of the more fun one. So, okay, RBD behavior with auto-tuning disabled, with this PR. I did not grab graphs of master, unfortunately; I thought I had, but I didn't grab the data about what it was doing.

So I don't have master, but essentially it's going to look a lot like this, except with that bug that we've mentioned, where we're not quite assigning as much data cache as we should and we're over-assigning metadata cache; otherwise it should look really similar to this. What you're seeing here at the beginning is a prefill stage, where we are filling up the RBD volume with four-megabyte objects just to make sure that there's data already allocated there.

That's that initial slope in the metadata used, and in the overall total used, at the beginning here. At some point, maybe seven hundred seconds in roughly, we actually hit the limit of how much cache we can assign to the metadata cache (how much memory you can assign to the metadata cache), and so now things are getting swapped out. That's where the meta-used line goes horizontal.
Once we've totally filled up the volume, then we start random writes, and all of a sudden you see that KV used spikes way up. That is because we are now doing reads from disk of some of the onodes, and it's populating both RocksDB's cache and BlueStore's onode cache, and so you see the total used spikes way up and now we're utilizing all of our memory. But in reality we're actually double caching a lot of stuff: there's a lot of data in RocksDB's KV cache which is basically just the exact same data that's in BlueStore's meta cache, and it's just not being accessed, because any reads are coming from the meta cache. Theoretically, maybe some of the reads can hit the RocksDB cache rather than going to disk. The data that's in the block cache should be smaller, because it's encoded using the varint encoding versus BlueStore's in-memory onode, but it doesn't seem to really have a positive effect.
The line there, around probably the 4,500 or 5,000 second mark, is when we're switching over to doing random reads. The amount of memory in the caches doesn't change dramatically at all; there's a little bit of a dip down, but nothing really there. All you can really infer from that is that that's where the random reads are starting. Okay.

So what happens, then, if we don't use the auto-tuner but instead specify that we want a lot of the cache devoted to metadata cache? That works really well. That's actually what we've kind of ended up doing in the last year or two, devoting most of the memory to metadata cache, because performance results were good when we did that. The reason for that appears to be that, at least in a lot of these tests that we've done so far, where we've got maybe 256 gigabytes or even 512 gigabytes of RBD data, we never get to the point where we start trimming data from the meta cache. We make it all the way up through the prefill stage and into the random write and random read stages, and the meta used remains constant. And the sign that will tell you that we're double caching above is that the KV cache in that scenario, the block cache in RocksDB, never grows; it stays really, really small, hovering right around 1% of the total cache.
We never actually use that space, though, that could have been assigned; we've over-allocated the meta cache. We only ended up needing around 51% of the full cache to cache onodes, but we overdid it in this particular scenario. If we had had 512 gigabytes of RBD blocks to cache, then we would have actually under-allocated a little bit, but at least in this exact scenario we've allocated more than enough and we're not utilizing any of that extra memory that we allocated.

Nevertheless, in this particular test (the graph that represents this test), we saw higher performance than the previous one, where we split the ratios more evenly, and we're using less memory to do it; but we're not really effectively using the memory that the user specified. All right, so now what happens with auto-tuning enabled? Well, with auto-tuning enabled we're making much, much better use of the cache right away.
The amount of metadata that's needed for the meta cache to cache onodes grows as the number of onodes grows. So as we write out blocks to RBD, that grows and then stays constant once we've saturated it. The amount of data cache spikes way up at the beginning, as we have plenty of available memory, and then drops back down, but it drops down to a ratio much higher than we specified: the user specified 20%, but the auto-tuner knows it can give it more.

So it actually ends up giving it around 42%, and KV doesn't grow because we never enter a state where we need to double cache. If we were writing out enough RBD data we would, and then we'd enter a scenario where we'd end up having to revert back to the user-specified defaults again, with the exception that we'd be caching indexes and filters at high priority. Right now that's the behavior, but we can do much better than this PR in the future.
Using the priority scheme, I believe (and you'll notice this in the next graph, where the user has specified very, very different ratios, again: 10% KV, 85% meta, and 5% data), the auto-tuner converges on the exact same solution. At the very beginning, if you look right around time 0 up to maybe time 100, the ratios are very, very different; that's kind of an initial idle state where we're just recording data about what the cluster is doing before anything is written out. Those ratios are the user-defined ratios; it's reverting back to those.

It's not really doing much differently; it's slightly off, but it's pretty close. In the previous graph those ratios again are roughly what the user defined, just slightly different but really close. But then, as soon as we actually start writing any data out, it quickly converges back to the same solution. So I'm excited about that; it's telling me that this is trying to be smart about making sure that the cache is well utilized.

So personally I like this; I really like seeing this behavior. And the RBD performance looks pretty good too. The worst-performing solution by far is when we have tried to divvy up the cache equally (well, not equally, but equally between the block cache and BlueStore meta, and then giving the data cache some, but not quite an equal, share versus the other two). It does not do particularly well.
Otherwise I don't think there's actually much difference here; it's probably just random variation. I suspect that all three probably perform roughly equally when onodes are always in the meta cache. The 99th percentile latency improves too; by far the biggest difference was when we were doing onode reads from disk, or from RocksDB's cache, versus when we weren't, and the auto-tuner does a good job of avoiding this. We avoid it, at least in this particular scenario. So that's kind of RBD. It looks like all of this is helping RBD more than it's helping RGW right now, but I think, especially as we improve RGW behavior with Beast and maybe look for other areas where we have bottlenecks, my hope is that this will start showing more of a performance improvement with RGW too, like it is with RBD.
B: For RGW, this kind of reminds me of when we were talking about caching in BlueStore in general before, and how you were able to get some results with local block device caching. We'd seen some good results from the Intel data caching layer, whatever it's called, but that was even with RGW. I'm wondering if that's because it was using a faster drive for the RGW objects, so there was enough, or small enough, metadata.
A: That's just a guess, but I have a feeling that probably plays at least a moderate part in why they saw those kinds of results. I'd be really curious how much their software helps BlueStore, especially if you're already putting onodes, or sorry, omap data, on the SSD, you know, with the DB on the SSD.
B: Yeah, that's maybe pretty interesting. I think when we discussed this before, we were talking about how perhaps the local block device caching wouldn't show the same effects once BlueStore's caching was more intelligent, because the major effect was from caching that metadata; if you do that explicitly, maybe a block device cache isn't effective anymore.
A: If you had kind of enough data cache, which is pretty common, right, to cache all the data that's being written or read for really hot things, maybe that would be a benefit. But I guess when you think about, say, RGW bucket indexes, that's already omap, right? So it's already going to end up in the DB, already on the SSD. So maybe.
B: I mean, the kernel interface typically will have a file system put on top of it, so it still ends up using the local page cache on the client there. Yeah, I think maybe there are some kinds of cases with RGW where you don't have any... I mean, you could put something in front of it, like in a CDN sort of situation, perhaps.
B: [inaudible]

A: So right now it over-allocates a little bit to give itself some leeway to grow things. That's why, in some of these graphs, it's not quite getting fully up to the hundred-percent ratio: it's always leaving itself a little bit of room to grow in case, say, the KV used starts going up; it always wants to stay ahead of that if it can. So that's why it's left a little bit of the cache unutilized for future growth. That's one caveat.

There's some really weird behavior in RocksDB where, by default, RocksDB will grow the block cache beyond the user-specified limits, especially during compaction. The alternative is that you can disable that, but then it can make operations fail, which I think we probably can't do currently. So the gist of it right now is that RocksDB may temporarily use more memory than it's been assigned, especially during compaction. What's a little odd is that when that happens, it appears to completely flush the high-priority pool, so all the indexes and filters get flushed.
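The trade-off being described corresponds to the strict_capacity_limit flag on RocksDB's LRU cache: left off, the cache can temporarily exceed its capacity (for example while compaction pins blocks); turned on, inserts that cannot make room fail instead, and every caller has to handle that. A minimal illustration with the stock API, assuming the same 1 GiB / 20% high-priority setup as above; whether Ceph could actually tolerate failing inserts is exactly the open question from the call.

```cpp
#include <rocksdb/cache.h>
#include <memory>

// Default behavior: the block cache may temporarily grow past its
// capacity, e.g. while compaction pins blocks, so memory use can
// exceed the configured budget.
std::shared_ptr<rocksdb::Cache> lenient_cache =
    rocksdb::NewLRUCache(1ull << 30, /*num_shard_bits=*/-1,
                         /*strict_capacity_limit=*/false,
                         /*high_pri_pool_ratio=*/0.2);

// Strict limit: the cache never exceeds its capacity, but inserts that
// cannot evict enough to make room return an error instead, which the
// caller (including RocksDB's own internals) must be prepared for.
std::shared_ptr<rocksdb::Cache> strict_cache =
    rocksdb::NewLRUCache(1ull << 30, /*num_shard_bits=*/-1,
                         /*strict_capacity_limit=*/true,
                         /*high_pri_pool_ratio=*/0.2);
```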
So, at any rate, that's what it appears to do. In this PR there's some code to try to work around that: if all of a sudden the RocksDB cache shrinks dramatically, especially if the usage in the high-priority pool shrinks dramatically, it doesn't just shrink it to the new value. Instead, the RocksDB priority cache layer in this will say: well, I recently saw usage this high, and there's some memory here, so let it regrow quickly if it needs to, and then we're back close to where we were before, once all the indexes and filters are populated in cache again. So we're working around it right now, but I think ultimately we need to understand why RocksDB does this, and whether it needs to do this or whether it's just kind of an esoteric behavior that doesn't really make any sense.
For future work, I've talked a little bit about this binned LRU approach. I've got kind of a prototype of it in the works that right now reports, or bins, the different usages. It was pretty easy to implement, so I'm playing around with that. It does not appear to have significantly high overhead; it does add some overhead to trimming the cache, but it's basically just incrementing and decrementing a couple of uint64 values with smart pointers to them.
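A very rough guess at the shape of that prototype: each cached item keeps a shared counter for the age bin it was last touched in, touch and trim just bump that counter, and a periodic pass (say, the tuner's five-second tick) ages everything by one bin. None of the names here are from the actual code; it is just meant to show why the trim-path cost is only an increment or decrement of a uint64 reached through a smart pointer.

```cpp
#include <atomic>
#include <cstdint>
#include <deque>
#include <memory>

// One age bin's worth of byte accounting, shared by the items in it.
struct BinCounter {
  std::atomic<uint64_t> bytes{0};
};

// Sketch of age-binned usage tracking for a single cache.
class BinnedUsage {
public:
  explicit BinnedUsage(size_t num_bins = 4) {
    for (size_t i = 0; i < num_bins; ++i)
      bins_.push_back(std::make_shared<BinCounter>());
  }

  // On insert/touch: account the item in the newest bin. The item keeps
  // the returned pointer, so trimming later decrements the right counter
  // even after the bins have rotated.
  std::shared_ptr<BinCounter> touch(uint64_t size) {
    bins_.front()->bytes += size;
    return bins_.front();
  }

  // On trim: give the bytes back to whichever bin the item last lived in.
  static void trim(const std::shared_ptr<BinCounter>& bin, uint64_t size) {
    bin->bytes -= size;
  }

  // Periodic pass: age everything by one bin. Whatever falls off the end
  // simply counts as "old"; the counter object stays alive for as long as
  // items still reference it.
  void rotate() {
    bins_.push_front(std::make_shared<BinCounter>());
    bins_.pop_back();
  }

  // Bytes touched within the newest bin, i.e. what the tuner would treat
  // as this cache's most recent (highest-priority) demand.
  uint64_t recent_bytes() const { return bins_.front()->bytes.load(); }

private:
  std::deque<std::shared_ptr<BinCounter>> bins_;
};
```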
We'll see. My hope is that once we implement this, it will allow the caches to rebalance based on the workload. Right now, if you imagine a scenario where you've got lots of heavy RGW writes and then later on do a bunch of RBD small random reads and writes on the same cluster, you still might end up in this double caching scenario, because RGW sort of forced you into it, and, without the ability to adapt, RBD will end up in that same scenario, with a bunch of old RGW data cached in the KV cache, and you can't get out of it. With this priority-based binning, I believe we'll be able to say: okay, there's lots of old RGW data, but we don't care about that anymore; we have much higher-priority onode data coming in right now, so let's cache that instead and fit all of the onodes into cache. All of the onode data that was in the KV cache starts going into lower-priority bins, that shrinks, and now we're back in our optimal scenario, where we're caching lots of onodes and not caching KV data that's irrelevant. That's the goal with it, and I think we can get there.
Other future work might be to do this at the OSD level. Instead of just looking at a user-defined BlueStore cache ratio, I think we want to instead say: here's how much memory we want to give this particular OSD in general, and have that account not only for these caches, but for things like the write-ahead log buffers in RocksDB, the PG log buffers that are in memory, and other stuff that's potentially in memory right now. The OSD can potentially use significantly more memory than what's defined here: you give it three gigabytes of BlueStore cache, and in reality, in RGW workloads, we might be using close to eight gigabytes of RSS memory on the OSD. It's very, very different, and we can't account for everything (there's tcmalloc fragmentation, there's lots of other stuff that's going to be really hard to account for), but at least we can get closer.
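Budgeting at the OSD level could look something like the sketch below: start from one overall memory target for the daemon, subtract estimates for the non-cache consumers mentioned, and hand only the remainder to the cache balancing. The knob name and every number here are hypothetical; this is just the shape of the idea, not an interface that exists in the PR.

```cpp
#include <cstdint>
#include <iostream>

int main() {
  // Hypothetical overall budget for the whole OSD process, rather than a
  // BlueStore-cache-only setting.
  const uint64_t osd_memory_target = 4ull << 30;  // 4 GiB

  // Rough, made-up estimates for the non-cache consumers mentioned in the
  // call. Real numbers vary by workload, and some costs (e.g. tcmalloc
  // fragmentation) can't really be accounted for at all.
  const uint64_t rocksdb_write_buffers = 256ull << 20;  // WAL / memtables
  const uint64_t pg_log_buffers        = 384ull << 20;  // in-memory PG logs
  const uint64_t misc_overhead         = 512ull << 20;  // messenger, osdmaps, ...

  const uint64_t reserved =
      rocksdb_write_buffers + pg_log_buffers + misc_overhead;

  // Whatever is left is what the cache auto-tuner would be allowed to
  // balance across the KV / meta / data caches.
  const uint64_t cache_budget =
      osd_memory_target > reserved ? osd_memory_target - reserved : 0;

  std::cout << "cache budget: " << (cache_budget >> 20) << " MiB\n";
}
```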
Let's see... oh, this is the age-based one. Right now, even if we did this kind of priority-based binning, I don't think we're going to totally get rid of double caching in RocksDB, and I don't know if we'd entirely want to, but definitely in cases like RGW I think we want the omap data to be prioritized relative to the double-cached BlueStore onode data. We really want, I think, the onode data primarily in BlueStore's cache, and we really want the omap data cached in RocksDB's block cache, or at least we want to prioritize it versus onode data in RocksDB's block cache. So potentially, in the future, we can either create multiple caches for each column family, or at least implement some scheme to give different priorities to different prefixes or something, so that might be another potential future area of work that I think could yield improvements.

So that's it; we're more or less out of time, but are there any questions on any of this?