From YouTube: Ceph Performance Meeting 2020-10-29
A
All right, looks like we've got folks trickling in now. That's good. All right, let's get this thing started. So I saw two new PRs this week. One was a performance improvement for creating mirror snapshots in RBD asynchronously; Jason did a review of that.
A
I think that's actually just a build change, but he reviewed it, so I'm not sure how much of a performance improvement that provides. But I picked it up this morning, so hopefully we'll at some point be able to track that. There are two PRs this week that closed. The first is the SSD RBD cache, I think; Jason finally thought that was good enough to merge, so that has been merged.
A
I believe that was quite a bit of code, so hopefully we'll be able to harden it and get to a point where folks can use it, but this is the initial PR, so good. What else? Oh, this bufferlist rebuild one — it looks like that has been replaced, and I did not catch the replacement in this, so I'll get that in here as well, but there's a new PR for that.
A
All right, updated. Adam, if you're here — other Adam — okay, we don't have core Adam. There's this BlueStore dynamic levels PR that has been there for a while, and they noticed that it was not behaving properly with Adam's column family sharding work, so I think they are requesting a review. Oh, Adam, great.
A
Hey Adam, I was just talking about this PR that changes — or provides an initial implementation for — doing dynamic leveling in RocksDB. They've requested a review from you because they said it's not working right with column family sharding, but they think it works right when they don't use it. So they requested a review from you on that.
A
Sure, yep — just if you have time; I know everyone's super busy. Okay, and then the other updated PR I have is just my very, very old cache binning code.
A
Most of that is actually in master already; it got broken up into multiple separate, smaller PRs, but the one remaining piece is the age binning code itself. All the other things that led up to it are in, so that is now being worked on in a separate branch.
A
I suppose I should probably include that here, but we'll talk about it a little bit more coming up, so I won't discuss it now. Let's see — otherwise there wasn't a whole lot. Adam, I think you've been reviewing Igor's changes, but it looks to me like it failed tests; otherwise I think you were really happy with it, right?
B
I'm very happy with Igor's improvements for deleting PGs, especially with the first part, which is really clean and gives a big improvement. I'm not so keen to go with the second part, which changes our format and only gives us a boost if our PGs have a lot of omap data. That's the only case where the second part really shows its performance boost.
B
But if you have a lot of omaps in your PG, then the deletion is like 10 times faster using the second part. But yeah, I would still prefer —
B
I think that when we change a format, we should extract the format change into a separate PR and then just make use of it with the additional code that takes advantage of it. Yeah, that's it. Okay, okay, cool.
A
Cool, okay. Let's see, what else? Okay, there's your code for making it so that we're more choosy about how we assign our cache to cache shards. I thought you tagged that do-not-merge — is that... yes, it's still —
B
It's still do-not-merge. Some of this, I think, should go into the final code, but I guess we should first see cache aging — cache binning — land before the actual algorithm for being more aggressive with allocating space to shards is revised. This is because with the current implementation, I feel like it will make it very slow for some previously under-allocated shards to exit that state, before they can, in subsequent iterations, get enough pressure to grow from a very low allocation.
A
Okay, yeah, that makes sense. Do you have a feeling one way or another about having the cache shards independently compete for age binning and priority cache?
B
Yes — I'm now considering that this is just making more elements, more caches, without any actual benefit for the process of properly allocating sizes to those entities.
A
Cool, all right. Well, I'm sure we'll have more to discuss on that one as we get more results from the testing we've been doing.
A
All right, well then, Josh, I see the PG autoscaler is the first thing on our list here. So do you want to take us away on it?
C
Yeah, sure. I wanted to talk about this first since it's getting late for Junior, so I wanted to get to it earlier. We've talked about this a bunch in the past and had various ideas, and we wanted to improve the out-of-the-box behavior, so that when you initially install stuff you can get good performance from the get-go. What we talked about doing in the past was using, like, the full budget of PGs —
C
— that we have, for all the pools, to start with, and then only scaling down the number of PGs we have when there's pressure from other pools that actually use them and need more space or parallelism.
C
So this essentially means changing the way the autoscaling algorithm works. Instead of going up by capacity, it starts with all the PGs allocated and shifts them around if there's a very large difference in utilization later, while keeping the behavior where it only makes a change to pg_num if it would differ by a factor of two, so we don't have any kind of oscillation or small changes — it's data movement, and that's pretty expensive.
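
As an illustration of the hysteresis rule just described, a minimal sketch (hypothetical code, not the actual mgr autoscaler): only propose a pg_num change when the ideal value differs from the current one by at least a factor of two, so small utilization shifts never trigger data movement.

    #include <cstdint>

    // Only adjust pg_num when the ideal value is at least 2x larger or
    // smaller than the current one; otherwise leave the pool alone so
    // small utilization changes never cause rebalancing.
    bool should_adjust_pg_num(uint32_t current_pg_num, uint32_t ideal_pg_num) {
      if (ideal_pg_num >= current_pg_num * 2) return true;   // grow
      if (ideal_pg_num * 2 <= current_pg_num) return true;   // shrink
      return false;
    }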
A
So, agreed 100% — I think that's a good thing. I still don't... well, I guess the thing that I keep coming back to is that it seemed like we could get some of the benefit by scaling the PG log length, rather than reverting back to having the autoscaler as the first line of attack.
A
Yeah, yeah, exactly — instead of first adjusting the number of PGs per pool to control the distribution of PGs across the OSDs, I wonder if it might be lower impact.
C
Are you describing — are you thinking of, like, having a higher default target for them, where the PG log length comes in? Because that's kind of like a secondary —
A
So I see two separate limitations that prevent us from having more PGs per OSD. One is the amount of memory that we consume for the PG log, and the other is the kind of overhead of just having too many PGs in the cluster in general. There's probably a third thing of, like, too many PGs per OSD, but I don't think we're actually anywhere close to that yet. I think it's those other two things that really limit us.
A
So it seems to me like we could eliminate some of the problems that we face by implementing a total, overall, cluster-wide PG count limit — and if we haven't hit that yet, then, you know, we're free to have more PGs — and then also scaling the PG log length on a per-pool basis, so that you could have more PGs overall on the OSD but still scale things back to control memory by scaling the PG log length. Right? You're not reducing the overall number of PG log entries; you're just changing the dynamics of whether you have long PG logs with few PGs in that pool, or more PGs in that pool with a shorter log length for that pool.
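
A rough sketch of the trade-off being described (hypothetical names and numbers, not actual Ceph code): hold the total number of in-memory PG log entries per OSD roughly constant by shortening the per-PG log as the PG count grows, with a floor for things like dup detection.

    #include <algorithm>
    #include <cstdint>

    // Keep (pgs_on_osd * per-PG log length) near a fixed per-OSD budget.
    uint32_t pg_log_length_for(uint32_t pgs_on_osd,
                               uint32_t entry_budget_per_osd = 300000, // assumed
                               uint32_t min_len = 250,    // floor, e.g. dup detection
                               uint32_t max_len = 10000) {
      if (pgs_on_osd == 0) return max_len;
      return std::clamp(entry_budget_per_osd / pgs_on_osd, min_len, max_len);
    }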
E
How do you deal with different kinds of pools, like ones that are supposed to consume more? Just as an example, how do you differentiate between a data pool and an index pool in RGW?
E
Do you start out with the same number of PGs and weight, or just the PG log length? I mean, technically you'll see different patterns in different pools, right?
C
So, like, for metadata pools we set this to, like, four, so today the autoscaler uses that to consider them as if they were four times their size. I think we could use that in the reverse direction — treating things with fewer PGs, where we say this pool would get, like, a quarter of the allocation that it would get if it were a data pool.
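
For reference, the knob being described here appears to be the pool's pg_autoscale_bias property, which can be set per pool with something like: ceph osd pool set <pool> pg_autoscale_bias 4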
A
My hope is that if we went down this route, we could have enough PGs per pool, for a reasonable number of pools, that concurrency would no longer be a problem — we'd be able to have a high enough minimum per pool that concurrency isn't really the problem anymore. It's more about making sure the data distribution is good, which we have the balancer for, so hopefully we don't have to worry about that either.
A
I think what we face right now is that if you were to create a large number of pools, either we have so few PGs in each one that we lack concurrency, or we have such a high minimum — say, 16 or 32 PGs per OSD in that pool — that with a large number of pools you end up with a ton of PGs, and then we end up in situations where, like with containers, the OSD could actually go out of memory.
C
That's kind of the important direction we're going, but we need to have some limits, right? I mean, even with a dynamic log length, there's still some minimum we need for dup detection, so we can't get rid of that memory use entirely.
A
Like, we're hard limited, right? Like you said — 10, you know, as an example — if we have 10 pools that each have a minimum of 32 PGs per OSD per pool, that's still like 320 PGs on the OSD, which, you know, yes, we can make work right now with four gigabytes; we can shrink the caches a little bit, and I don't remember what that would actually work out to be with our defaults for PG log length.
A
But I guess my thought was that if we kind of turn this around a little bit, we might be able to support a larger number of pools.
A
That, then, would have — I guess, yeah, like you said — shorter overall numbers of dup entries and PG log entries per pool.
A
Okay, yeah — as long as we really can just say 10 pools, you know, is, like, the max you can do per cluster. It just seems like it could hurt us in the long run. Maybe not.
C
You know, with other mechanisms — like namespaces for multi-tenancy, or other kinds of groupings for isolation — one of the things that Sam has talked about is kind of an overlay on top of CRUSH, where you have, for example, RBD images using a subset of the cluster, so they're not necessarily spread out: even though they're in a given pool, they're spread over a smaller area within the cluster, so they have a kind of reduced —
C
— potential fault area. This also helps with some other ideas, kind of respecting the total number of hosts that you need to talk to as well.
A
Oh, this will be my last attempt at this, and then I'll give up on it. But if we have, like, just completely cold pools — no one's been doing anything on them for a while; like, RBD hasn't been touched in months or whatever — do we still really need to keep those dup ops and other PG log entries in memory? Does it make sense, on a per-pool basis, to continue keeping those around, or could that memory be better used for other things?
C
I'm not sure that that's a very common problem, necessarily, but I think that's something that we could solve kind of orthogonally to what we've been talking about.
E
I mean, if those PGs that are on those pools are all active and clean, then probably, yeah, we don't need the PG logs as much. But in some scenarios there are these corner cases where there are unclean PGs, and there we definitely want to keep the PG log around, because whenever you want to use those pools again, you'd need them to recover, right?
E
Actually, going back to the initial discussion: let us assume for the moment that we don't have more than 10 to 15 pools. What is our current bottleneck, and what can we do to address it — like, whether this back-pressure mechanism is the only way to go about it, or is there anything else we need to do?
A
And just, like, create a pool with, like, 100 PGs per OSD starting out — you're great. It's when you've got, like, nine already and you're creating the tenth that it doesn't help as much.
C
Yeah, this still helps in that case — you still get that, assuming your cluster is large enough to handle that many pools.
A
You would presumably already be using some distribution of the available PGs for the other pools, and as soon as you add that tenth, it would then have to steal stuff from the other pools to add PGs for the new one.
C
So I think we wouldn't necessarily want to be stealing PGs; in many cases, maybe it's purely adding more.
C
I agree this isn't really that relevant for, like, a tiny cluster, where you're going to be constrained with each pool using the minimum, and it's going to get the full parallelism because there aren't that many OSDs — but for a larger cluster...
A
Yeah, let me make it clear: I'm not against any of this. This is actually — this is absolutely good. It's more just that the worst thing in all of this is the data movement, right?
C
Yeah, and you bring up some good points there. I think, orthogonal to this PG part, there's the way we store PG logs in memory; we could probably introduce ways to get rid of a lot of that memory usage when it's not necessary.
C
Like, make it a little more diversified — like having it use a different scheme.
A
Yeah — we really, someday soon, are going to need to be able to keep a lot more PG log entries and dup entries, I think.
C
Yeah, maybe with a very reduced subset in memory, just for the dup detection.
C
Yeah, I think one thing I'd like to do is test with kind of artificial maps that represent real clusters for different use cases, and see what the algorithm's behavior would be.
A
As you work on this — just, kind of like I said earlier — anything you can do to avoid data movement, I think, is going to be really, really useful and beneficial.
A
That's why the small movements are bad, right? Like, you know, adding a PG one at a time, slowly over time, constantly rebalancing stuff — a lot of people were complaining about that, like, a year or two ago when this was first implemented: they create a new pool, it gets like eight PGs or something, and then, you know, as they start filling it with data, it's constantly rebalancing and slowing down.
G
So I guess, like, if the target is, like, two times the amount of what we need to scale, we just have to — or we can — increase it to three times, I don't know, so that it avoids, like, repeating, as Joshua was saying.
C
Yeah, I think we can maybe, like, test out how it would function: if you had, like, a cluster that's set up originally with this many pools, and then you add a new pool and start filling it up, at what point would it start moving things around? And we can see whether that would be too much movement, and at that point, if we want to change that threshold, what the impact would be.
C
I think the nice thing about this is that the autoscaler is kind of relatively isolated — it doesn't need a whole lot of external stuff — so we can test this algorithm pretty much by itself, in a very interesting way. We don't need to, like, set up giant clusters to figure out what the behavior is.
A
It's pretty easy with the current code to see the impact when you create a new pool and just start, like, throwing data at it, too — you'll see it right away, with it on and off, the difference between them. So that's, like, you know, a good starting point to look at.
A
Oh, thank you. All right — well, let's move on to onode memory usage and cache age binning. Adam has been doing tons of work in this area.
A
He fixed a bug in the existing cache code that was preventing us from properly growing the data cache — the buffer cache, I believe — and also got the old implementation of the onode double-caching PR working with column families, and then on top of that implemented the old cache age binning PR, and has all of this kind of working again. Adam?
B
There is nothing new from my side since last week's performance meeting; I've been entirely dedicated to testing Igor's PR and fixing stuff with that. But I can replay what the result was. Basically, when I tested your bundle — meaning your original PR base, and also age binning, and also fixes — the limiting factor seems to be that the sizes of the KV caches, both the onode cache and the regular RocksDB block data cache, just kept growing, and they never yielded any data. And I made a detailed analysis —
B
— of what the actual elements in those caches are. I mean, not actually what the content is, but evidently RocksDB leaves entries in our caches that are just very old. They can be one or two hours old, even with intensive testing: only the new ones are being used, but the old ones are just lying around, and they eat up our space. So that's most of the new interesting stuff I can say about the caches.
A
And that was the exact same thing that I saw, Adam. Let me share my screen. So what we have here is a comparison of the different intermediate steps that Adam and I were taking, going through some of these PRs — older code and newer code that we've been working on. This first graph is master, and there are some problems in it, the biggest one being that this was before Adam's fix for the data cache.
A
The data cache was not properly growing, but also we get into a state where basically the meta cache and the KV cache are being distributed equally, even though that's maybe not ideal — you can see that the yellow line and the orange line are almost exactly the same.
A
As we move on through the different code that we did, you can start seeing some changes in this. I'm going to skip over graph number one and just go to graph number two here. That's with Adam's fixes for the data cache — so now it's actually getting more memory, as it should — and this is also including the onode double-caching fix.
A
So we see that the KV cache is now actually much smaller than the meta cache — it's getting much less memory; the yellow line is much lower than the orange line.
A
Number three is where cache age binning is first introduced. In this case we actually see that the green line and the yellow line — this is the KV onode cache and the KV cache, basically RocksDB's versions of these caches — are growing, like Adam said, kind of without bound, and it turns out that's all in the first priority level, which is for RocksDB indexes and filters. Not great.
A
It's not good, but it turns out that it looks like RocksDB doesn't actually delete these things; it just waits for them to be expired from the cache. And because we're prioritizing caching these things, they end up getting even more priority than BlueStore's own onode cache — they're kind of like the highest-priority thing in this implementation of the code — and so they just keep growing, because they're never deleted.
A
So if we keep scrolling down — I'm actually going to go all the way down to graph six here.
A
If we change it so that we don't prioritize RocksDB's indexes and filters in pri 0, and instead have them compete with all the other KV entries in RocksDB's caches, we can eliminate this. So instead of seeing, in graph four here, this kind of green and yellow line growing and growing and growing, when we have those compete with the other KV cache entries their share stays relatively small, and instead we give a much greater percentage of the cache to other things.
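
For context, a sketch of the RocksDB knobs involved (an assumed standalone setup, not BlueStore's actual wiring): caching index and filter blocks in the block cache with the high-priority flag set is what lets them crowd everything else out, and clearing that flag makes them compete with ordinary cache entries, as described above.

    #include <rocksdb/cache.h>
    #include <rocksdb/table.h>

    rocksdb::BlockBasedTableOptions make_table_options() {
      rocksdb::BlockBasedTableOptions opts;
      opts.block_cache = rocksdb::NewLRUCache(
          512 << 20 /* capacity */, -1 /* num_shard_bits */,
          false /* strict_capacity_limit */, 0.0 /* high_pri_pool_ratio */);
      opts.cache_index_and_filter_blocks = true;
      // false = indexes/filters compete with other entries instead of
      // sitting in a protected high-priority pool.
      opts.cache_index_and_filter_blocks_with_high_priority = false;
      return opts;
    }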
A
The available memory goes to the onode cache, which is this orange line. Before, the orange line never exceeded one gigabyte; in the new version, where the indexes and filters compete with other KV entries and the overall KV cache competes with the onode cache —
A
— then we see that the onode cache gets much more memory and is actually working a little bit better. Also, in these graphs you can see that everything's kind of working properly: initially we don't have a lot of onodes, so the data cache gets a lot of memory — that's the teal line.
A
The light blue line initially ramps up quickly during the pre-fill stage, but then, as we add onodes and grow the onode cache, we start seeing that the data cache shrinks and goes down. From some of Adam's testing, I believe he was actually seeing that if he alternated between workloads that had lots of onodes and few onodes, we would see cache memory being given back to the data cache, which is kind of exactly what we want in all of this.
A
So I'm hopeful that the code is actually doing what we want it to do; it's just a question of figuring out how to ensure that all of this works well. One thing we do still see in all of this is that tcmalloc very quickly fully fragments memory with all these small objects that we deal with — the creation and deletion, especially in separate threads, of all these small objects. The autotuning code more or less works.
A
But as long as we have all this fragmentation, we see that as soon as the 4K random write workload starts after the pre-fill stage, the amount of memory available for the cache just plummets down to a very, very low value compared to what it used to be, and then we kind of regrow it here, as tcmalloc works things out and maybe helps get rid of some of the fragmentation.
A
But still, we fragment memory terribly in the OSD with all the objects we create and delete. So if we want to improve this, we probably need to figure out how to avoid the kind of behavior that we currently have.
B
Mark, maybe I will rephrase, because I think that's not very clear, I'm sorry. The problem is that when we had pressure from the data cache that forced the metadata cache to drop some old entries, we did actually see the metadata cache properly drop some onodes — that was very good — but unfortunately we couldn't get the data cache to get the memory, because the memory left behind after being used for metadata was so fragmented.
A
Yes, yes — this is our problem. So, Adam, I'm thinking, if we can use Radek's implementation of the opportunistic ring buffer for allocating some of these things, maybe we can improve the situation. We can't fix it, but maybe we can sort of make it a little —
B
— better. Unfortunately, I am skeptical, because as I understand it, this allocating in a ring buffer works very well when we have short-lived objects that we can push into the ring buffer and then reuse it anew; but for the cache it's just totally the opposite — we expect to hold items there for a long time, and that's more problematic.
A
I propose that we take Radek's work and make a modified version of it that allocates the ring in chunks, and then, if we have a chunk that still has some old entries in it, we set it aside, and then over time we look at the old chunks and compact them back into the ring if they only have a couple of entries left in them.
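
A very rough sketch of that chunked-ring idea (hypothetical code, not the actual PR): bump-allocate from the newest chunk; when the ring wraps, a chunk that still holds live entries is parked on a retired list instead of blocking reuse, and drained retired chunks are reclaimed later.

    #include <cstddef>
    #include <list>
    #include <memory>

    class ChunkedRing {
      struct Chunk {
        static constexpr size_t kSize = 64 * 1024;
        std::unique_ptr<char[]> data{new char[kSize]};
        size_t used = 0;  // bump-allocation offset
        size_t live = 0;  // bytes still referenced by surviving entries
      };
      std::list<Chunk> ring_;     // chunks in allocation order
      std::list<Chunk> retired_;  // wrapped chunks still holding live data

     public:
      void* alloc(size_t n) {  // assumes n <= Chunk::kSize
        if (ring_.empty() || ring_.back().used + n > Chunk::kSize) {
          if (!ring_.empty()) {
            if (ring_.front().live > 0)
              retired_.splice(retired_.end(), ring_, ring_.begin());  // park it
            else
              ring_.pop_front();  // fully dead: safe to drop
          }
          ring_.emplace_back();
        }
        Chunk& c = ring_.back();
        void* p = c.data.get() + c.used;
        c.used += n;
        c.live += n;
        return p;
      }

      // Entry release (decrementing the owning chunk's `live`) is elided.
      // Reclaim retired chunks that have fully drained; compacting sparse
      // ones back into the ring would mean copying the survivors, which
      // needs cooperation from the cached objects themselves.
      void reclaim() {
        retired_.remove_if([](const Chunk& c) { return c.live == 0; });
      }
    };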
H
I think the concept people need to agree on here is that if you know that you can do that many IOPS, it means the amount of items in the system is pretty much constant. I mean, at your biggest size — think about a machine that could do, say, a hundred thousand IOPS: you don't need more than a hundred thousand objects, so you pre-allocate a hundred thousand objects, and then you don't need to reshuffle them. Now, you don't have to pre-allocate the full hundred thousand.
H
Maybe you do 50,000, which is going to come from the recycled pool, and if you grow beyond that, then you might need to do some different allocation.
H
So if we could assign them to pools, and every I/O could allocate stuff from the pools, then you have some close estimate of how many objects you need of each type.
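
A minimal sketch of that pre-allocated pool idea (hypothetical): if a machine tops out at roughly N concurrent operations, pre-allocate about N objects of each hot type once and recycle them through a free list, so steady-state I/O never touches the general-purpose allocator.

    #include <cstddef>
    #include <vector>

    template <typename T>
    class FixedPool {
      std::vector<T> storage_;  // allocated once, up front
      std::vector<T*> free_;    // recycled slots
     public:
      explicit FixedPool(size_t n) : storage_(n) {
        free_.reserve(n);
        for (auto& slot : storage_) free_.push_back(&slot);
      }
      // nullptr signals "grew past the estimate": fall back to a slower path.
      T* get() {
        if (free_.empty()) return nullptr;
        T* t = free_.back();
        free_.pop_back();
        return t;
      }
      void put(T* t) { free_.push_back(t); }  // recycle instead of delete
    };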
A
I agree with you regarding things like in-flight objects for IOPS, for getting those through. I agree with Adam, though, that I don't think we can use the ring as-is for long-term storage of LRU cache onode entries.
H
If the constant part all arrives from a common pool, then you could just keep recycling them, and the dynamic part you could allocate from something like you described — some kind of a buddy allocation system, where you can allocate stuff in power-of-two-sized objects, so when you stop using them, you can aggregate them back later.
B
Let me explain why I think this is the problem. When we store an onode representation in BlueStore's onode cache, it's a set of C++ objects which are linked together by pointers; some of them are stored in maps and some in intrusive lists.
B
At this point, I imagine they do the proper thing, and their bins are related to their respective sizes.
B
So if I have multiple of those objects loaded consecutively, then I have fully intermixed data from different objects in very close memory regions, and this is why I think we end up with fragmentation: when we arbitrarily delete some objects from the cache, we just create a lot of holes, but never contiguous regions, because we never had a chance.
B
So the easiest implementation — okay, the easiest implementation would be to have some mechanism that forces all allocations belonging to a single object to fall into a single bin.
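
A sketch of that single-bin idea (hypothetical): give each onode its own small arena, so the maps, lists, and buffers hanging off one onode land in one contiguous region, and evicting the onode releases that region whole instead of punching small holes across the heap.

    #include <cstddef>
    #include <memory>
    #include <vector>

    class OnodeArena {
      static constexpr size_t kBlock = 4096;
      std::vector<std::unique_ptr<char[]>> blocks_;
      size_t offset_ = kBlock;  // forces a block on first alloc
     public:
      void* alloc(size_t n) {  // assumes n <= kBlock (onode pieces are small)
        if (offset_ + n > kBlock) {
          blocks_.emplace_back(new char[kBlock]);
          offset_ = 0;
        }
        void* p = blocks_.back().get() + offset_;
        offset_ += n;
        return p;
      }
      // Destroying the arena frees every allocation of this onode at once.
    };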
A
Yeah, yeah, exactly. Maybe we say that these entries that were long-lived — they survived the rotation of the ring, so that means they weren't short-lived entries — we put them into a different memory space: because now we know that they're long-lived, we put them in long-lived storage rather than short-lived storage.
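
A sketch of that promotion idea (hypothetical names): an entry that survives a full rotation of the short-lived ring is copied into a separate long-lived region, so known-long-lived data stops pinning ring chunks.

    #include <deque>
    #include <string>

    struct Entry { std::string key; /* ... cached payload ... */ };

    class LongLivedArena {
      std::deque<Entry> slots_;  // stable addresses, grouped storage
     public:
      Entry* adopt(const Entry& e) { slots_.push_back(e); return &slots_.back(); }
    };

    // Called when the ring wraps and finds `e` still live in the oldest
    // chunk; the caller re-points references at the returned copy.
    Entry* promote_survivor(const Entry& e, LongLivedArena& arena) {
      return arena.adopt(e);
    }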
C
Mark, with all these allocations, do you have an idea of which kinds of objects those are, primarily?
C
I think it'd be really interesting if we could map those back to, like, which objects are using them, and see if it would make sense to do some kind of slab allocation for those kinds of objects, perhaps.
C
— that aren't allocated in bufferlists. So I mean, the buffer recycling thing makes sense for some parts, some objects, but not others, right?
A
Yeah, yeah. Some of the things that we probably should try to zero in on are any objects that are created in one thread and deleted in another. I don't have a really great understanding of where we actually do that, but I know it's been talked about in the past.
A
I mean, I would assume that it's related to stuff we create, like, in a tp_osd_tp thread and maybe delete in the kv sync thread — or, you know, maybe there are other threads that can do things like that.
A
Adam, I think the fact that you saw that — that when we do all these onode creations and then later have fragmented memory, so that we can't give it back to buffers — that's key. That was a really good insight, because that's kind of what I suspected but didn't have evidence of. That's really good.
C
Yeah, and we've been discussing several different aspects of it too: there's the caching aspect, the fragmentation aspect, and, yeah, the different threads and the use of, like, mempools — so in some ways independent projects, but with related effects.
C
I think it would be interesting to see how this works with Seastar and Crimson, when we use the Seastar allocator instead of tcmalloc.
A
Yeah, yeah — I need to understand better what Kefu and Sam and the other folks have been doing regarding how messages come in on the wire and then how that gets distributed to different worker threads. That might play a large part in what we see.
A
I'd like to — yeah, I know there have been different proposals over time regarding how that works. Maybe we can spend another whole performance meeting talking about Crimson and what you guys have been doing with it.
A
We only have a couple of minutes left, so let's quickly move on. So, io_uring — this has been a topic that's come up really recently. I went back and got kernel 5.9 installed on some of our test nodes and tried to build our code with it. When we include the liburing support in Ceph, it doesn't build; it looks like we're trying to use some of the syscalls through a C interface that doesn't exist. Maybe it used to — I don't know.
A
But anyway, if we switch those to syscalls, it works okay. So I don't trust these results that I've gotten yet — I'm actually not even quite sure we're invoking liburing correctly — but the end result is that I don't see any difference, at least in these tests.
A
What I did see, though, when I was doing this, is that the kv sync thread in these tests is like 90 to 92 percent utilized, and — again, like we've talked about in the past — I'm seeing that we are spending CPU time and wall-clock time doing key comparisons to keep the memtables ordered. So, just more circumstantial evidence, perhaps, that we're spending time in this particular thread doing that, and that may be a limitation we're actually hitting before we hit...