From YouTube: 2018-Jun-7 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
A: All right, maybe we should just get started here. Oh good, there's Josh. All right. I actually don't have a whole lot this time; I have to apologize, I was trying to do like three things at once this morning and forgot to go through the PRs, so next week I guess we'll have two weeks' worth. But there are a couple of different things going on right now. Maybe I'll highlight Igor's continued work on the bitmap allocator, and I think that Adam has been reviewing that, which is good.
C: In the profiling work, I can see that we have a lot of memory allocations made in a way that is not friendly to tcmalloc. The issue is that we are enqueueing an op in the messenger while dequeuing in the op work queue. This means that at times the deallocation path in tcmalloc needs to employ its slow path to shuffle memory between per-thread caches.

The reason for that is that the weighted priority queue is implemented in a way that calls allocate many times for enqueueing a single op, and moreover we do that under a shard lock. So it has even more coarse granularity, even in comparison to the PG lock.
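To illustrate the allocation problem being described, here is a minimal sketch (not the actual Ceph WeightedPriorityQueue; the `Op` and `ShardQueue` names are invented) of how an intrusive list keeps enqueue under the shard lock allocation-free, because the link hook lives inside the op itself:

```cpp
#include <boost/intrusive/list.hpp>
#include <mutex>

struct Op : boost::intrusive::list_base_hook<> {
  unsigned priority = 0;
  // ... payload ...
};

class ShardQueue {
  std::mutex shard_lock;
  boost::intrusive::list<Op> q;  // does not own or allocate nodes
public:
  void enqueue(Op& op) {
    std::lock_guard<std::mutex> l(shard_lock);
    q.push_back(op);             // pointer splice only; no malloc under the lock
  }
  Op* dequeue() {
    std::lock_guard<std::mutex> l(shard_lock);
    if (q.empty()) return nullptr;
    Op& op = q.front();
    q.pop_front();
    return &op;
  }
};
```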
C: It means that if the userspace part of the mutex is contended, then you are calling the futex syscall. glibc offers some kind of adaptive mutexes, but you need to tell it explicitly that you want to use them; by default, at least in the two versions I took a look at, they are not spinning.
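For reference, a minimal sketch of opting in to glibc's adaptive mutex explicitly, as described above (PTHREAD_MUTEX_ADAPTIVE_NP is a real glibc extension; the helper name here is invented):

```cpp
#include <pthread.h>

// glibc will not spin by default; the adaptive type has to be requested
// explicitly when the mutex is initialized. PTHREAD_MUTEX_ADAPTIVE_NP is a
// GNU extension (g++ defines _GNU_SOURCE by default).
void init_adaptive_mutex(pthread_mutex_t* m) {
  pthread_mutexattr_t attr;
  pthread_mutexattr_init(&attr);
  pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);  // spin briefly before futex
  pthread_mutex_init(m, &attr);
  pthread_mutexattr_destroy(&attr);
}
```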
C: Yes, I see two ways. First of all is just to prepare and initialize the mutex differently; it could be done in such a way that we only touch our abstraction of mutexes. The second thing is to change or alter the way we take, the way we manage, the lock: we could just replace some unique_locks, or our lock guards, with a try-lock guard.
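A sketch of that second option, replacing an unconditional lock with std::try_to_lock (the surrounding function and fallback are hypothetical):

```cpp
#include <mutex>

std::mutex shard_lock;

void submit_op(/* Op& op */) {
  std::unique_lock<std::mutex> l(shard_lock, std::try_to_lock);
  if (l.owns_lock()) {
    // fast path: acquired without blocking, enqueue directly
  } else {
    // contended: e.g. stash the op in a thread-local batch and retry later,
    // rather than sleeping in the futex slow path
  }
}
```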
C: Another thing, another degree of freedom, is support for Intel TSX transactional memory. In glibc, enabling that for mutexes is something the distro vendor can turn on by default. However, for the read/write locks of glibc, of pthread, the TSX code is enabled by default, and this can actually have some impact.
C: It depends on the size of the critical section, and also on the pattern of memory accesses over the critical section. If you are touching things that are not logically related but sitting on the same cache line, I guess you can get unnecessary cancellations of the transaction; basically false sharing, right.
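A small, generic illustration of that cache-line point (not Ceph code): two logically unrelated counters sharing a 64-byte line will abort each other's TSX transactions, and ping-pong the line even without TSX; padding each to its own line avoids the false sharing.

```cpp
#include <atomic>
#include <cstdint>

struct Counters {
  alignas(64) std::atomic<uint64_t> reads{0};   // own cache line
  alignas(64) std::atomic<uint64_t> writes{0};  // own cache line; no false
                                                // sharing with 'reads'
};
```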
D: It relates to what he's calling an IO engine, or AIO engine, for making it simpler to interface the Seastar and non-Seastar code, so that we can start converting some pieces that aren't performance-critical, like the MonClient or the objecter or the [unclear] cache, to continue running in non-Seastar threads as well, while we have the main OSD IO path running in Seastar.
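A conceptual sketch of that bridging idea, not the actual Seastar/crimson API (all names here are invented for illustration): legacy code keeps running in an ordinary thread and hands completions to the reactor side through a queue the reactor polls.

```cpp
#include <deque>
#include <functional>
#include <mutex>

class ReactorBridge {
  std::mutex mtx;
  std::deque<std::function<void()>> completions;  // drained by the reactor
public:
  // called from the legacy (non-reactor) thread
  void post_completion(std::function<void()> fn) {
    std::lock_guard<std::mutex> l(mtx);
    completions.push_back(std::move(fn));
  }
  // called from the reactor's poll loop
  void drain() {
    std::deque<std::function<void()>> batch;
    {
      std::lock_guard<std::mutex> l(mtx);
      batch.swap(completions);
    }
    for (auto& fn : batch) fn();
  }
};
```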
A: All right, I'm gonna talk a little bit about this then; this is stuff I've been working on and I'm pretty excited about it. So the first thing I will mention is that there are PRs, both for RocksDB and for Ceph, for the first piece of this, which is implementing the priority-based scheme for assigning memory to the caches. The RocksDB one is in review, but movement there has slowed; we do have it in a branch in our own fork of RocksDB that we can target.
A: The Ceph PR is basically there. There was a bug that we found in a corner case where, basically, we didn't create a RocksDB cache if the cache size was set to zero, so we just had a null pointer, and my code was expecting that it would be there; that was easily fixed. So that's not a problem, but there are a number of other issues that have prevented it from merging. In my own defense, those don't appear to be related to the PR.
A: There was a bug in the BlueStore cache implementation related to a uint64_t to int conversion that wasn't safe; that is fixed in the PR now, there was an added commit for it. And then there's a long-standing bug that our QA suite only picks up maybe one out of 30 times running a particular test, and I just happened to hit it when testing this.
A: So that's actually about a year old in master; we haven't yet figured out what it is, but it does not appear to be caused by my PR. It just happened to be picked up in the objectstore test run that I did. So hopefully that will merge soon; I think it's hopefully safe at this point. But the real guts of this is in this other commit, on a separate branch, for age-based binning of the caches.
A: So the idea here is we basically have a circular buffer of counters, where each counter represents an interval of time. By default this is five seconds, and we're keeping 720 of them, so one hour's worth, and that's separated out into six different priorities, one through six. Zero is kind of a special case. But the idea here is that the current implementation has a five-second bin for super-high-priority stuff, then between five seconds and 30 seconds, 30 seconds and five minutes, and five minutes and 60 minutes.
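A minimal sketch of the structure being described, with illustrative names and the cutoffs mentioned (5 s, 30 s, 5 min, 60 min); the real implementation lives in the BlueStore cache branch:

```cpp
#include <array>
#include <cstdint>

constexpr int kBinSeconds = 5;    // default interval length
constexpr int kNumBins    = 720;  // 720 * 5 s = one hour of history

// Circular buffer of per-interval counters.
struct AgeBins {
  std::array<uint64_t, kNumBins> bytes{};  // bytes touched in each interval
  int head = 0;                            // bin for the current interval

  void tick() {                            // advance once per interval
    head = (head + 1) % kNumBins;
    bytes[head] = 0;
  }
  void add(uint64_t n) { bytes[head] += n; }
};

// Age -> priority with the cutoffs described; the data cache is shifted one
// priority lower (larger number) than the onode/KV metadata caches.
int priority_for_age(int age_sec, bool is_data_cache) {
  int pri;
  if      (age_sec <= 5)    pri = 1;  // super high priority
  else if (age_sec <= 30)   pri = 2;
  else if (age_sec <= 300)  pri = 3;
  else if (age_sec <= 3600) pri = 4;
  else                      pri = 5;
  // e.g. 5-second data lands at priority 2, hour-old data at priority 6
  return is_data_cache ? pri + 1 : pri;
}
```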
A: And then, if it's the data cache, it's actually offset one priority lower. So five seconds is actually at priority two instead of priority one, and 60 minutes is at priority six instead of priority five, the idea being that, generally speaking, we want to prioritize onodes and the KV cache over the data cache.
A: So the kind of neat thing here is that in this prototype implementation, the effect of doing the priority-based allocation of memory to different caches, combined with this age-binning scheme, makes the balancer dynamic, in that it will shift memory around based on what's currently happening on the cluster.
A: So in some very, very initial tests of this, when filling up an RBD volume full of 4-megabyte writes for preallocation of the volume, it was keeping the data cache to around 75 percent of the total memory available and assigning about 25 percent of the cache to metadata, so about 50% of the onodes were being cached by the time I finished. It was a 256-gigabyte volume.
A: It got to the point, maybe halfway through, where it hit the total amount of memory available for cache as it was balancing. Oh no, I'm sorry, it was much earlier than that. The data cache actually spiked way up, but over time it slowly gave metadata more. It didn't give it everything, just because it was seeing such a large ingest of recent data, but soon after it finished the four-megabyte sequential writes, it started to shift.
A: The KV cache never exceeded about one to two percent, because it was never really needed. There was no real need for the KV cache, because the only thing that was really being accessed was onodes, which were already in the BlueStore cache.

So this I'm super excited about, because it's doing exactly what I was hoping it would do, which is targeting the current use, kind of optimizing for whatever the current use case is, right?
A: But once we have that, and then once we start accounting for other things, like the PG log data usage, the write-ahead log buffers in RocksDB, the row cache in RocksDB, and potentially a couple of other things, we should be able to get a really good idea of where we're using memory everywhere, and control where we're using memory everywhere. And especially if we're watching the assigned memory in tcmalloc, we should be able to then start dynamically changing the amount of memory we have available to work with.
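A hedged sketch of that watching-and-adjusting loop (the property strings are real gperftools MallocExtension properties; the budget heuristic and function names are invented for illustration):

```cpp
#include <gperftools/malloc_extension.h>
#include <cstddef>

// How much memory tcmalloc has really mapped for the process heap.
size_t mapped_bytes() {
  size_t heap = 0, unmapped = 0;
  MallocExtension::instance()->GetNumericProperty("generic.heap_size", &heap);
  MallocExtension::instance()->GetNumericProperty(
      "tcmalloc.pageheap_unmapped_bytes", &unmapped);
  return heap - unmapped;
}

// Run once per balance interval: nudge the aggregate cache budget so the
// OSD converges toward (but cannot hard-guarantee) the user's memory target.
size_t adjust_cache_budget(size_t cur_budget, size_t user_memory_target) {
  if (mapped_bytes() > user_memory_target)
    return cur_budget - cur_budget / 16;  // over target: shrink the caches
  else
    return cur_budget + cur_budget / 64;  // under target: grow back slowly
}
```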
A: And we'd try to, probably not, we won't be able to guarantee it, but try to keep the OSD memory usage within some boundary that the user sets. So my goal with all of this is to make it so that we are keeping the OSD to a user-assigned memory value, and then not requiring the user to define anything else; we just automatically assign memory for everything. I think it's doable.
A: These results look really encouraging to me in terms of being able to have the OSD make smart decisions in real time about where memory should go, so I'm super excited about it. I know it's hard to be excited when you're not, you know, kind of in the midst of it like this, but I think it has a lot of promise.
A: And it's almost kind of false, though, to be honest, because really what it's doing is just giving RocksDB, or actually BlueStore, memory for onodes, because that's the thing with RBD: once you are not caching onodes, performance can go way down, right? So, you know, kind of
a test where you're not doing a random read over the entire volume or the entire disk, but you're just doing hot reads, like, you know, a what's-it-called distribution, a Zipfian. (Yes, yep, exactly, yep.) I suspect that's where it will really shine, because then you'll see most of the hot data residing in the data cache and you can do those reads just from cache.
D: Something just popped into my head: I had some RBD on EC pools. I know this is still gonna be, probably not, I'm guessing this is not going to be as effective there, just because the bottleneck there is mostly going to be over the network, having to do the read-modify-write cycle a lot of the time. But maybe if we can get some of that data cache going in, maybe it would help there as well. The Zipf distribution would be interesting to try out, though.
D: Yeah, essentially. Not necessarily; you could still have the objects themselves being the same size. I guess you're saying that there are more shards of the object, so that there are more... yeah, okay, yep.
D: Do people ever do that? I don't actually know. I've heard of people doing that occasionally; one guy, for example, was doing like 32-meg objects. Okay, wow. But for the EC case, it's probably more important there that we consider what an optimal size would be for a given encoding as well.
A: The workload on the key/value database, that's kind of where, for a small random I/O workload, that's the thing that you always see really getting hit hard. And it would be really interesting to see, if you start changing that kind of thing, how it changes the amount of data that's necessary to cache, like KV data versus onode data in BlueStore.
A: I don't think that we can ever really know in advance, or guess in advance, what these kinds of ratios should be set to, to cover all cases; I very much believe that. But I think we can make something that's relatively smart about doing exactly this, balancing it out dynamically in real time. Yeah, I'm very, very convinced this is the right way to go, but we'll see.
F: Yeah, so I had a question. As I remember, I was looking at it, I think it was in the pull request description or an email you sent or something, where you were showing you were binning things by time in the different caches, and you were skewing the time series, so that shorter, more recent data in RocksDB was the same priority as older data in BlueStore.
A: I remember thinking about that; I spent about a morning, a couple of hours, thinking about it, and I wish I would have written it down, because there was something I was really, really worried about. I felt like I was worried it was not gonna work well if we just did that.
F: See, you have to wait for things to fall off the cache; it'll be slower to respond because things have to fall off the cache, I guess. But the thing I worry about is the prototype we built: it's basically having all these additional allocations for the tracking objects that are attached to everything that's in the cache. It's just adding all these allocations and refcounts to every object, and I'm just worried about the overhead that that's going to cause. Sure.
A: So say we did the trivial thing. I'm trying to work through the logic I was thinking of when I was going through this; maybe let's just try to retrace it now and see if we can come up with what the behavior would be. So, okay, we wait for it to fall off, and that gives you kind of the maximum age of the cache, right?
F: Some ratio, like, if you think that the data cache is less important, then you could target, say, one tenth the age of the data cache for the metadata cache, or whatever. Because if you're doing the binning thing and you're skewing the bins slightly, then at the end of the day, what falls off the cache is like off by one bin, right? Which is like, if you have ten bins, it's like ninety percent or 110 percent. But however you look at it, it feels like the same thing. So...
A: So, like, in this particular case, the really interesting behavior here was in, like, priority one and priority two. That was where things rapidly changed, like the switchover from a four-megabyte workload to a 4 KB random-write workload. All of a sudden, the number of onodes that were in, like, priority 2 or even priority 3 all immediately shot into priority 1, and the amount of, you know...
A: Now, as well, they're hard-coded intervals. So, like, you know, currently the interval length is 5 seconds, so the cache balancer will run every interval, and that's user-defined; and then these, right now, are hard-coded, but, you know, that can be changed, that's not a big deal. So, like, priority 1 for the KV cache and for the meta cache, or sorry, the meta cache and the KV cache, is just one interval, whereas the, you know... so the first bin and...
A: So, like, when you're doing four-megabyte writes, you'll have lots of onodes in priority one, two, and three; they'll be spread across them, because you're not getting at onodes fast enough that they all stay at a high priority. But once you switch over to, like, 4 KB random writes, all of a sudden you're accessing onodes constantly and they all shoot up into priority one.
A: You end up with, like, a ton of super-hot data now. Because you're doing 4 KB random writes, you end up with some hot data, but the amount of data that's in the cache ends up kind of migrating down to 2, 3, 4, whatever; but the onodes are all super hot, so the stuff switches around. But if we just looked at, like, the maximum age of the cache, it wouldn't really...
F: The thing is, we don't actually... none of this matters until you start trimming things, right? Like, it doesn't matter that you have a lot of priority-two onodes and you don't have a lot of priority-two data until you actually start trimming onodes, or trimming data that could be used. So it doesn't actually matter until you start... it's something if you start trimming onodes that are, like, less than 5 minutes old; that's when you start to worry, right?
F: None of that matters until we actually get to the point where we would have trimmed an onode that is recent, that we shouldn't have trimmed, right? Like, the only decision any of this ever affects is the actual thing that you're trimming at the very end. (Mm-hmm.) And so we notice that, because the thing that we trimmed right before it is, like: oh, it's only five minutes old instead of 30 minutes old; I need to, like, dump a bunch more memory over there to get outside that window, and start pruning my data cache sooner.
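A sketch of the signal F is describing, with invented names: rather than inspecting the priority bins, compare the age of the entry a cache just evicted against that cache's target age; evicting entries that are still young means the cache is under pressure and should get more of the budget.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

struct CacheStats {
  Clock::time_point last_evicted_atime;  // access time of last evicted entry
};

bool wants_more_memory(const CacheStats& c, std::chrono::seconds target_age) {
  auto evicted_age = Clock::now() - c.last_evicted_atime;
  // Trimming entries younger than the target (e.g. five minutes old when we
  // wanted thirty) is the signal to shift budget toward this cache.
  return evicted_age < target_age;
}
```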
F: So the hypothesis is that it's the same regardless of what your cache is. The only actual effect of changing the cache sizes is whether you trim an entry or not; that's actually the whole effect of all this. And so it's whether this entry that's N minutes old gets trimmed or doesn't get trimmed, and you can base that decision either
on this weird curve of priorities, or you could also just look at the age of the thing right before it, which is going to be approximately the same age as the thing trimmed. And so my thinking is that you could make the same decision based on that. Because by the time you get... let's say your priority two blows up because you're blowing out a bunch of onodes; suddenly the things that are falling off the onode cache, instead of being, you know, four minutes old, are three and a half, then three, then two and a half, then two, then one and a half, right? (Mm-hmm.) That ramps up; you're like, oh my gosh, I've got to start giving them more memory, and then, you know, it equalizes at two and a half and you, like, actually just stop trimming them entirely. I mean, even if they were at five, as soon as they go to...