From YouTube: 2016-OCT-12 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
For full notes and video recording archive visit:
http://pad.ceph.com/p/performance_weekly
B: There are just two new, kind of interesting ones this week, but both are good. One is that Ramesh has been working on the MemDB store interface, basically storing our key/value data in an in-memory structure, and he's switching from a b-tree implementation to generic maps, just because the b-tree implementation was actually not performing very well. So it'll be interesting to see how much that improves things, but yeah, he's working on that.
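As a rough illustration of the kind of switch being described (a hypothetical sketch, not the actual Ceph MemDB code), an ordered std::map already provides the sorted iteration a key/value store backend needs, which is what makes a "generic map" a drop-in alternative to a custom b-tree:

```cpp
// Minimal in-memory ordered key/value store backed by std::map, the
// "generic map" alternative to a custom b-tree implementation.
// Hypothetical illustration only; not the actual Ceph MemDB interface.
#include <map>
#include <optional>
#include <string>

class MemStore {
  std::map<std::string, std::string> kv_;  // red-black tree, keys kept sorted
public:
  void set(const std::string& k, const std::string& v) { kv_[k] = v; }

  std::optional<std::string> get(const std::string& k) const {
    auto it = kv_.find(k);
    return it == kv_.end() ? std::nullopt
                           : std::optional<std::string>(it->second);
  }

  // Sorted range scan: the operation that makes an ordered map (rather
  // than a hash map) the natural fit for a KV-store backend.
  template <typename Fn>
  void scan(const std::string& lo, const std::string& hi, Fn fn) const {
    for (auto it = kv_.lower_bound(lo); it != kv_.end() && it->first < hi; ++it)
      fn(it->first, it->second);
  }
};
```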
B: The other one is that Dan Lambright has a first attempt at using RCU for, I think, the PG info (he was doing the PG map), and that's really exciting, because I think once he gets that working, there are a number of places we might be able to use it. The big thing that I'm hoping for is that we'll be able to speed up fast dispatch with this. But I guess we'll see.
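For context, RCU (read-copy-update) lets many readers walk shared state without taking locks while a writer publishes a fresh copy. Here is a minimal sketch of the pattern using std::shared_ptr atomics; the names are hypothetical and this is not Dan's actual patch (real RCU libraries such as liburcu also manage grace periods explicitly):

```cpp
// Read-copy-update in miniature: readers take a snapshot with no lock;
// the writer copies, modifies, and atomically publishes a new version.
// Illustrative sketch only, not the Ceph implementation.
#include <atomic>
#include <map>
#include <memory>
#include <string>

struct PGInfoTable {
  std::map<int, std::string> info;  // stand-in for per-PG metadata
};

std::shared_ptr<const PGInfoTable> g_table =
    std::make_shared<const PGInfoTable>();

// Reader: one atomic load, then lock-free access to an immutable snapshot.
std::shared_ptr<const PGInfoTable> read_snapshot() {
  return std::atomic_load(&g_table);
}

// Writer: copy-on-write update published with an atomic store. Old
// snapshots stay valid until the last reader drops its shared_ptr.
void update_pg(int pgid, std::string v) {
  auto cur = std::atomic_load(&g_table);
  auto next = std::make_shared<PGInfoTable>(*cur);
  next->info[pgid] = std::move(v);
  std::atomic_store(&g_table,
                    std::shared_ptr<const PGInfoTable>(std::move(next)));
}
```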
B: That's 11393, if anyone is interested in playing around or testing it, although I'm not sure if it actually works yet or not. So those are kind of the two new interesting ones that are there this week. A couple of different things closed; most of it was BlueStore related. There's been a lot of work on, excuse me, reducing the size of the in-memory structures for blobs and extents and other things like this. This can be a really important thing for us to be able to do before kind of unleashing BlueStore on the masses, just because it's pretty tough right now, based on kind of the variable size that, you know, these data structures can have, depending on things like the min alloc size. So yeah, just lots of work in that area.
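To make the memory concern concrete, here is a hedged sketch of why per-extent bookkeeping multiplies: with a small allocation unit, a single 4 MB object can carry a thousand or more extent records, so even a couple of dozen bytes per record adds up. The struct below is hypothetical, not BlueStore's real onode/blob/extent types:

```cpp
// Hypothetical illustration: shrink the allocation unit and the number of
// in-memory extent records per object grows inversely, so per-record size
// matters. Not BlueStore's actual data structures.
#include <cstdint>
#include <cstdio>

struct ExtentRec {        // one mapped range of an object
  uint64_t logical_off;
  uint64_t disk_off;
  uint32_t length;
  uint32_t flags;
};                        // 24 bytes before any container overhead

int main() {
  const uint64_t object_size = 4ull << 20;      // a 4 MB object
  for (uint32_t alloc : {4096u, 16384u, 65536u}) {
    uint64_t nrec = object_size / alloc;        // worst case: fully fragmented
    uint64_t bytes = nrec * sizeof(ExtentRec);
    std::printf("alloc unit %6u -> %5llu extent recs, ~%llu KB of RAM\n",
                alloc, (unsigned long long)nrec,
                (unsigned long long)(bytes / 1024));
  }
  return 0;
}
```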
B: Let's see, the other one to look at for updates is 11213; I'm hoping that we can get that merged soon. That's really nice: it reduces the size of PG info, or at least reduces the amount of data that we write every time, by, I guess not lazily, but more seldomly updating a lot of fields that don't really matter that much. So that's really good for BlueStore; that actually improves random small write performance.
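A hedged sketch of the general technique being described: split metadata into hot fields persisted on every update and cold fields flushed only occasionally, so the common-case write is small. All names and the flush threshold are hypothetical, not the actual pg_info_t change:

```cpp
// Pattern sketch: persist frequently-changing "hot" fields every time, but
// flush rarely-needed "cold" fields only occasionally, shrinking the write
// issued on the common path. Not the real Ceph pg_info_t encoding.
#include <cstddef>
#include <cstdint>
#include <cstdio>

struct HotInfo  { uint64_t last_update; uint64_t log_tail; };
struct ColdInfo { uint64_t stats_seq;   uint64_t history_epoch; };

struct FakeKV {                                  // stand-in KV backend that
  uint64_t bytes_written = 0;                    // just counts bytes
  void put(const void*, std::size_t len) { bytes_written += len; }
};

struct PgRecord {
  HotInfo hot{};
  ColdInfo cold{};
  uint32_t since_cold_flush = 0;

  void persist(FakeKV& kv) {
    kv.put(&hot, sizeof(hot));                   // small write, every update
    if (++since_cold_flush >= 64) {              // cold fields written rarely
      kv.put(&cold, sizeof(cold));               // (threshold is illustrative)
      since_cold_flush = 0;
    }
  }
};

int main() {
  FakeKV kv;
  PgRecord pg;
  for (uint64_t i = 0; i < 1000; ++i) { pg.hot.last_update = i; pg.persist(kv); }
  std::printf("bytes written for 1000 updates: %llu\n",
              (unsigned long long)kv.bytes_written);
  return 0;
}
```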
B: Yes, just... okay, good deal, though. The biggest issue I see right now with BlueStore, in terms of how we're doing relative to FileStore, is with sequential reads, small sequential reads. And you know, this is one of those things where hopefully you don't see it that often: you know, if programs are doing small sequential reads now, they're probably not that well written. But it looks bad from the standpoint that, you know, if someone does benchmarking, you can look a lot worse.
B: So here, this is kind of the percentile difference graph I've got, showing kind of where FileStore was at Jewel, where BlueStore was at Jewel, where each of them are now, and then what happens when we change the min alloc size. This is with NVMe devices, but when we increase it... and you can see here that we're not doing so great with BlueStore, and we're doing quite a bit worse, actually, than we were with Jewel, in master. If we increase the min alloc size, it does help pretty dramatically.
B: The problem is... well, that's exactly it, right? I mean, if it's fragmentation, then it might be that just increasing the min alloc size will help this and make it better. And one thing in BlueStore in Jewel is that the default min alloc size for SSDs was 64K. So that might explain why, in Jewel, we were seeing higher sequential read throughput in this test.
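For anyone who wants to test the fragmentation theory themselves, the allocation unit is tunable in ceph.conf. A minimal sketch, assuming the BlueStore option names of this era (the values just mirror the numbers discussed above, they are not recommendations, and the setting only applies to newly created OSDs):

```ini
# ceph.conf sketch: raise BlueStore's minimum allocation unit on SSDs back
# toward the old 64K Jewel-era default to test the fragmentation theory.
[osd]
bluestore_min_alloc_size_ssd = 65536   # bytes; 4K/16K were the values under discussion
```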
B: It's on the list of things to test again, to go back and look at that. But I think that might make... if that's the case, right, then that means it's probably fragmentation. But, you know, potentially there might be other things going on too, so I don't know. What were your thoughts, Alan?
A: Well, I think that the history of storage devices suggests that we're going to end up having to go tackle compaction, or, you know, defragmentation, whatever you call it, eventually. Yeah, you know, I mean, even with XFS, which is a pretty advanced file system, if you talk to the users, they'll tell you that you really end up having to defragment the volume over time, and I suspect that we're going to have the same issue with BlueStore at the end of the day, you know, whatever we do with the allocation size.
B: Agreed, but at the same time, right now it's looking like with a small min alloc size we're generating so much metadata, even with all of the improvements that we've made, that we're just hammering RocksDB, kind of to the point where doing the extra write-ahead log write is actually cheaper than dealing with all the metadata. So, you know, there are multiple aspects to this, right; there are multiple trade-offs that we have to think of. Well, do we know that that's what's actually happening?
B: What we've seen in the past is that, with RocksDB, when you have a small min alloc size, there are so many blobs and so many extents getting shoved into it that the write amplification and compaction overhead just kind of start killing things. There are lots of reads on the DB partition; there are lots of writes on the DB partition.
E: And basically I am seeing around 2.5x more writes if I have, like, a min alloc size of 4K versus a min alloc size of 16K. We have, like, one giant chunk write of like 4K that is going to the WAL, so if you don't consider that, what is leaking into the SST files... the amount of writes going to the SST files is really smallish, around 500 bytes, five or six hundred bytes, versus probably 1.5K.
B: Yeah, well, I mean, to be honest... it might be that we need to consider key/value stores that have less write amplification and less compaction overhead, and that might be what we'll see. Maybe we can tweak RocksDB in ways that we don't understand yet to avoid some of this, and maybe we can continue to shrink the amount of data that is in the onodes, right?
B: You know, maybe, with all of these things together, we can get to the point where we're actually better off not doing the write-ahead log write, as opposed to, you know, the increase in metadata. But it doesn't seem like we're there yet; just based on kind of what I've seen, it looks like we're actually better off eating the extra write-ahead log write. But...
A: That's one way of doing it, but clever usage, you know, of allocation strategies might do the same for you, in the sense that if you have multiple streams where you're fetching or where you're storing data, if you sort of leave space, so to speak, before them, under the assumption that that's what was going on, you would achieve the same thing.
A: Yeah... I mean, again, it's another trade-off. If what you do is you say, okay, I'm going to spread the writes around and leave a lot of space around them, in the hope that whoever wrote this is going to come by later on and do another sequential write and I'll have a place to put it, so that I don't have to expand the extent structure, that will dramatically improve... this is running on flash.
A: I think, you know, there are other potential solutions for this problem. So, for example, what you could do would be to go ahead and allocate large chunks of space, do direct writing, and keep track of the extras. And, you know, if the extra space wasn't consumed in some appropriate interval, you'd go back and reclaim it. There are all sorts of strategies you can think of, yeah.
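Here is a hedged sketch of that idea: over-allocate for each write stream, let a later sequential write extend in place, and reclaim whatever is still unused once the stream goes idle. All names are hypothetical; this is a strategy illustration, not the BlueStore allocator:

```cpp
// Strategy sketch: reserve slack after each stream's extent so a later
// sequential write can extend in place; reclaim reservations that go stale.
// Hypothetical illustration, not the real BlueStore allocator.
#include <cstdint>
#include <map>

struct Reservation {
  uint64_t start, used, reserved;   // offsets/lengths in bytes
  uint64_t last_write_tick;
};

class StreamAllocator {
  std::map<uint64_t, Reservation> by_stream_;
  uint64_t next_free_ = 0;          // bump allocator stands in for a real one
public:
  // Place `len` bytes for `stream`, holding back `slack` extra bytes.
  uint64_t write(uint64_t stream, uint64_t len, uint64_t slack, uint64_t tick) {
    auto it = by_stream_.find(stream);
    if (it != by_stream_.end() &&
        it->second.used + len <= it->second.reserved) {
      Reservation& r = it->second;        // sequential case: extend in place,
      uint64_t off = r.start + r.used;    // no new extent record needed
      r.used += len;
      r.last_write_tick = tick;
      return off;
    }
    uint64_t off = next_free_;            // fresh extent plus slack for later
    next_free_ += len + slack;
    by_stream_[stream] = Reservation{off, len, len + slack, tick};
    return off;
  }

  // Give back reserved-but-unused space from streams idle past a deadline.
  void reclaim(uint64_t now, uint64_t idle_ticks) {
    for (auto& entry : by_stream_) {
      Reservation& r = entry.second;
      if (now - r.last_write_tick > idle_ticks) r.reserved = r.used;
    }
  }
};
```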
A: I like that; I mean, that's kind of the inverse of the write-ahead log, which is, sort of, you know, instead of leaving the data there, you leave an annotation that "I've allocated more space than I need for this." And, you know, if you limited the number of those, you could go back and clean that up.
B: I guess the thought I have, though, is that, as we progress here with storage technology, we're starting to see things like NVRAM become more accessible to people, and, you know, other really fast, small amounts of memory that persist. So is the write-ahead log really a bad solution if you potentially have a small pool of really fast persistent memory you can target for it?
B: That's fair enough! Well, let's continue through some of these numbers so that you can actually see some of this other stuff. So, for sequential writes, we're doing beautifully right now, at least in the tests I've seen, I mean both large block writes and small sequential writes; basically all sequential writes with recent BlueStore, in all the tests I've done, look good.
B: Okay, yeah, no, no worries. So yeah, sequential writes, we're doing really, really good. In Jewel we were bad for various reasons, but now we're better than FileStore, and for large sequential writes we've always been better, just because that's kind of the whole benefit of the way BlueStore writes data.
A: But, you know, you should be getting pretty damn close if your writes are large enough. If you do four-megabyte writes, and in the end the metadata overhead is, you know, three or four 4-kilobyte chunks, you're dealing with 0.3 percent, right? Yeah.
A: I wouldn't be at all surprised to find out that there were serializations that occur because of replication that are indirectly preventing you from obtaining enough parallelism to saturate the CPU. You know, do you have enough queue depth, or do you have enough parallelism at the various stages of the OSD? And if that's what... you know, if the single-OSD experiment shows that it gets to the 99 percent that you would expect, and the other one doesn't, but you're not network- or CPU-limited, then we've got code to go fix.
A: Okay, you know, I mean, realistically, in the competitive world nobody's going to care about how you do relative to FileStore; they're going to care how much out of the hardware you extract. Yeah, I'm sure, you know. And what these graphs tell me is there's still some kind of plumbing problem internally in the OSD. It might not be BlueStore, it might be elsewhere, yeah, okay, but there's some kind of plumbing problem, because, you know, the shape of this graph... you know, if you think about it, as the block sizes increase, it should...
A: I'm afraid so. Instead, you know, if you're just going to count the IOs, and you're going to say that the small random ones count more, if they're two or 3x the cost, which is not unreasonable, okay, now you're at 95 percent; but you're struggling here to get to sixty and seventy percent.
B: Just Jewel, you know, the Jewel release, and this is the work-in-progress release based on a recent master from last week. So basically the WIP version of FileStore here is using the async messenger, whereas the Jewel one is not, and there's probably a bunch of other random stuff that's in the work-in-progress one, although the async messenger is probably the biggest difference between the two.
B: Alright, let's move on then to random reads. So in the random read case, what you're seeing here, this big dramatic drop, is entirely due to async messenger. It happens both with FileStore and with BlueStore, so BlueStore itself is not really causing much of this. We saw in the Jewel era that BlueStore was doing a little worse, maybe, than FileStore, but in fact with BlueStore... well, there again, this might be async messenger... we're actually doing a little bit better at like 128K to maybe like 2048K reads.
B: The good news is that this can be mitigated quite a bit by setting send-inline to false in async messenger; that helps quite a bit. Probably the other thing that needs to be done here is speeding up fast dispatch. So if nothing else is done, we're probably, even with send-inline false, going to be seeing maybe a ten percent performance regression for random reads at small IO sizes.
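For reference, the knob being described is an async messenger option in ceph.conf. A minimal sketch, assuming the option names as they existed around this release (verify them against your version before relying on this):

```ini
# ceph.conf sketch: queue sends to the async messenger's event loop instead
# of sending inline, the mitigation for the random read drop discussed above.
[global]
ms_type = async              # opt in to the async messenger
ms_async_send_inline = false
```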
B: It's not going to be as bad as what you're seeing here, but we may still see one. We still may see a bump here at like 128K-plus IO sizes, though, so it's a little offset. But it'd sure be nice if we didn't see that small random read regression, so we'll see, we'll see if anything else can be done. In the meantime, Dan is working on the RCU, this RCU...
B: That was kind of the old... so, like, in the sequential read, starting out up here, there's a...
D: The better, yeah; it either goes in the better direction or the worse direction at 32, right? If you see sequential read and random write, from 32 you start to increase, right, but for random read, from 32 you decrease, and then for random write, from 32, you go down...
B: In fact, it's hard to remember exactly everything that's gone in since Jewel, but kind of the goal in the last couple of months here has been to eliminate problems like this one that you're seeing with the random write performance, where we're dipping way down and then maybe we're improving, but maybe not. There have actually been some intermediate releases here where the random write performance actually looked really, really bad. But the good news here is that, especially with random writes, if you look kind of here at the IOPS graph, you can see that for BlueStore, no matter if we're using like a 4K min allocation, like in this green line, or using a 16K min allocation, which is kind of our proposed...
B: ...maybe the idea for the release here, we're actually consistently above FileStore. There's kind of our dip at 8K here, but overall we're doing better. That's good, and that's what we were hoping to do over the last couple of months here, which is basically to really shore up the random write performance. The things that we really need to look at here now, going forward, are going to be...
B: Otherwise, we may want to think about the trade-off that we're taking versus simple messenger, or at least instruct our users to think carefully about that. And then also the sequential read issue; that's the one that looks the worst, right? And even though small sequential reads probably aren't something that you'd hope a lot of applications are doing...
B: ...that's the one that, you know, if you're benchmarking, really sticks out. So, for the kind of last two here, the sequential mixed read/write graphs, it kind of tends to follow the first one, the read performance: the sequential read performance regression for small IOs just kind of sticks out here. It's not as pronounced, but it still happens; but our sequential write benefit is helping at wider IO sizes. So that's why it's...
B: This kind of looks almost like a mix of the two: we see the large writes really helping out at large sizes, and then we see the small sequential read performance really hurting us at small sizes here. And random read/write, again, kind of follows the others: because of the issue with async messenger, we're seeing small IO kind of dropping down, with the writes helping pull us out. So anyway, that's kind of where we're at with it. My personal take on this is that we need to be looking carefully at the allocation size that we use.
B: We need to figure out if there's some kind of, you know, async support for pulling in multiple extents at once from the metadata structure, or doing it as a series; in general, I'm not sure what Sage exactly wants to do there, but I think he's got ideas. So that's another big thing, and then kind of anything that we can do with fast dispatch or async messenger to kind of gain back that last little bit, I think that would be very good. So those are kind of the priorities I'd see.
B: The other thing I guess I had to talk about, then, was that we are also looking at how to tune RocksDB's write-ahead log. Unfortunately, our test lab is sort of broken right now, since Dreamhost has been having some problems that have affected the DNS server and some other things, but I did get some initial results, and here's kind of what I'm seeing so far.
B: The trend I'm seeing is that basically just increasing either the number of logs or the size of the write-ahead logs in RocksDB seems to help kind of stabilize the long-term performance in the tests I did. Now, these are only one-hour tests, as opposed to like the 10-hour tests that Somnath has been doing, so it may be that we need to make sure that doesn't hurt us in really long tests.
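A hedged sketch of the kind of tuning being described, passed through BlueStore's RocksDB options string; the option names are standard RocksDB, but the values here are only illustrative, not the ones from these tests:

```ini
# ceph.conf sketch: give RocksDB more and larger write-ahead-log headroom.
# RocksDB option names are real; the values are illustrative only.
[osd]
bluestore_rocksdb_options = write_buffer_size=268435456,max_write_buffer_number=4,min_write_buffer_number_to_merge=2,max_total_wal_size=1073741824
```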
B: And actually RocksDB will record its own writes, and so you do get that. I think the trick here, though, is that when we are doing write-ahead log writes, based on kind of some of these parameters that you have tuned, you may potentially be leaking them down into level zero, at which point they may even get compacted into level one, depending on kind of how short-lived versus how long-lived they are, and whether, potentially...
B: ...you know, the moon and stars line up regarding compaction. So that's something that I think we need to understand better: when you have a small, like a 4K, min alloc size, right, you're not really doing WAL writes anymore for, you know, most of these kinds of IOs anyway, so you're not going to have any leakage, but you now have much bigger metadata structures to deal with. You have many, many more onodes to deal with... well, not more onodes, but many more extents to deal with. So...
A: ...log entries. I mean, I agree with you; I mean, if we are not diligent at pulling write-ahead log entries out of RocksDB, there's a chance that it will merge them into its SST files, and you'll see, you know, an increase in write amplification. There's no question about that. But I suspect that... we know what we're putting in the front and we can see what comes out the back; if you're leaking these write-ahead...
B: The good news is it hasn't been entirely fruitless, because, I mean, the trend that seems to be showing up is that we do better with larger write-ahead logs, which kind of makes intuitive sense, right? The trade-off, then, is the question of, you know... that uses more memory. So, you know, do we trade the memory for having a larger log and the better, more consistent performance?
A: You say you'll do better... you say you do better, but in which test cases? I mean, random... yeah, so, like, primarily random writes, yeah. So I would question that: as long as you assume that the write-ahead log entries are purged from RocksDB reasonably quickly, you know, in well less than the size of a log file, then I don't understand why there's any correlation between the size of the log file and the performance.
B: They do get larger and longer. But if you are to the point where you're in the middle of a flush, and you have filled up any log files that are not... oh, anything that's not flushing, right. If you've gotten to the point where anything that was available to you is now full, and you're now flushing all the other ones, now you've blocked writes, and that's the behavior that we see: we see these weird stalls.
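For anyone digging into those stalls, RocksDB exposes the thresholds that gate when foreground writes get slowed or stopped. A minimal C++ sketch using real rocksdb::Options fields (the values are illustrative, not what BlueStore ships):

```cpp
// Sketch of the RocksDB knobs that decide when foreground writes stall.
// Field names are real rocksdb::Options members; values are illustrative.
#include <rocksdb/options.h>

rocksdb::Options make_options() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  opts.write_buffer_size = 256 << 20;         // each memtable: 256 MB
  opts.max_write_buffer_number = 4;           // memtables that may pile up
  opts.min_write_buffer_number_to_merge = 2;  // flush once two are full
  // When every memtable is full while flushes are still in flight, or L0
  // files back up past these triggers, RocksDB first slows and then blocks
  // incoming writes: the stalls described above.
  opts.level0_slowdown_writes_trigger = 20;
  opts.level0_stop_writes_trigger = 36;
  return opts;
}
```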
A: Well, certainly the SSDs could be garbage collecting on you; you might get some excursions there, and again, by looking at the physical latencies we should be able to see that. But the theory with RocksDB is that it never stalls the front end for the back end. Okay, yeah... and that's not true, right? It potentially can; you have to run out of something, okay, and that's its behavior. You know, we may be driving it to that state.
A: I suspect they're going to find, like with a lot of these things, that the knobs that are there, and the documentation of them, are subject to interpretation, and just because it's never happened with any of their stuff... you know, I'm going to grant them the same discretion, okay. You're going to find that until you have a deep understanding of what these knobs do, you really don't understand how it works at all. Yeah, I...
B: Yeah, well, maybe, maybe that'll be the next thing that we can focus on here, then: eventually getting this stuff instrumented. The good news, though, is that, that aside, the work that you and Sage did getting all of that memory usage instrumented in mempools is going to be really, really useful. I mean, that's fantastic. So we are making progress in this regard; it's just... maybe, maybe we could do a...