From YouTube: 2017-MAR-08 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
For full notes and video recording archive visit:
http://pad.ceph.com/p/performance_weekly
B: All right, let's start with pull requests. Let's see, there's one that changes the RocksDB interface to use the RocksDB range-delete operator, which would help BlueStore in a number of ways. There are lots of places where we have to delete a bunch of keys; we enumerate them and range-delete them, and apparently RocksDB has a sort of remove-range primitive. But there's a big fat warning in the header about how it's slower in some cases.
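For reference, a minimal sketch of what the range-delete path looks like against the RocksDB API (the key names are illustrative, not BlueStore's actual schema):

    #include <rocksdb/db.h>
    #include <rocksdb/write_batch.h>

    // Instead of enumerating keys and issuing one Delete per key, write a
    // single range tombstone covering [begin, end).
    rocksdb::Status delete_range(rocksdb::DB* db,
                                 const std::string& begin,
                                 const std::string& end) {
      rocksdb::WriteBatch batch;
      batch.DeleteRange(db->DefaultColumnFamily(), begin, end);
      return db->Write(rocksdb::WriteOptions(), &batch);
    }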
B: Some of those have gone in, and the other ones will probably follow soon; there's just some cleanup left there. That's good; that's all for BlueStore. Also related to BlueStore, there was a pull request that enables processor-accelerated CRC32 for RocksDB, which wasn't getting enabled in the build. Ceph's own CRC is optimized, but RocksDB's wasn't using the best Intel instructions. That's fixed now, which speeds up the RocksDB stuff quite a bit, apparently. So that's also good, for reference.
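For reference, the hardware CRC32C on x86 comes from the SSE 4.2 crc32 instruction; a sketch of the kind of loop the accelerated path uses (illustrative, not RocksDB's actual implementation):

    #include <nmmintrin.h>  // SSE 4.2 intrinsics; compile with -msse4.2
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    uint32_t crc32c_hw(uint32_t crc, const uint8_t* data, size_t len) {
      uint64_t c = crc;
      while (len >= 8) {             // 8 bytes per instruction
        uint64_t word;
        std::memcpy(&word, data, 8);
        c = _mm_crc32_u64(c, word);
        data += 8; len -= 8;
      }
      while (len--)                  // byte-at-a-time tail
        c = _mm_crc32_u8(static_cast<uint32_t>(c), *data++);
      return static_cast<uint32_t>(c);
    }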
D: Speaking of that, I guess there is another possible reason why someone could see such unfair results: the fast CRC can fall back to the slow CRC, which is strikingly slow in comparison. The fast function is basically a little dispatch function that calls the slow one on some paths, and I'm worried about whether the compiler is inlining it. That could be the reason.
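A sketch of that inlining concern, with illustrative names: if the thin dispatcher below is not inlined at call sites, every call pays an extra indirection even on the fast path.

    #include <cstddef>
    #include <cstdint>

    uint32_t crc32c_hw(uint32_t c, const uint8_t* p, size_t n);  // fast path
    uint32_t crc32c_sw(uint32_t c, const uint8_t* p, size_t n);  // slow path
    bool cpu_has_crc32();                                        // CPUID probe

    // If this wrapper fails to inline, the function-call overhead is paid
    // on every invocation regardless of which path is taken.
    inline uint32_t crc32c(uint32_t c, const uint8_t* p, size_t n) {
      static const bool fast = cpu_has_crc32();   // probed once
      return fast ? crc32c_hw(c, p, n) : crc32c_sw(c, p, n);
    }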
D: At the moment the bitmap allocator leaf aggregates bitmap zones indirectly, using std::vector. I think that's a tricky part that could potentially disable, or at least impede, the functioning of the CPU prefetcher. To optimize that, I want to embed the bitmap zones directly in the array, and after that put them on huge pages to relieve the TLB. That's where the other part will go.
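A minimal sketch of the layout change being described, with illustrative types: one contiguous, hugepage-backed block for all zone bitmaps instead of per-zone std::vector indirection.

    #include <sys/mman.h>   // madvise, MADV_HUGEPAGE (Linux)
    #include <cstdint>
    #include <cstdlib>

    struct BitmapArea {
      uint64_t* zones;   // all zone bitmaps embedded in one flat block
      size_t    nwords;
    };

    BitmapArea make_area(size_t nwords) {
      const size_t huge = 2u * 1024 * 1024;   // 2 MiB huge-page size
      size_t bytes = ((nwords * sizeof(uint64_t)) + huge - 1) / huge * huge;
      void* p = std::aligned_alloc(huge, bytes);
      if (!p) return BitmapArea{nullptr, 0};
      madvise(p, bytes, MADV_HUGEPAGE);   // fewer TLB misses on hot scans
      return BitmapArea{static_cast<uint64_t*>(p), nwords};
    }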
B: Awesome, sounds great. Okay.
B: The other thing that I've noticed is that the bitmap allocator is just slow to start, because it has to initialize all of its zones, touching every bit in them. When you run it on a small dev setup it's not that big a deal, but when you're running it on an actual disk that's, you know, five or ten terabytes, it takes many seconds to actually start up and initialize everything. So that path could probably be optimized too, I'm guessing.
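One way that startup path could be trimmed, as a hedged sketch rather than what the allocator actually does, is to initialize zone bitmaps lazily on first use instead of touching every bit up front:

    #include <cstdint>
    #include <vector>

    struct Zone {
      std::vector<uint64_t> bits;   // empty until the zone is first touched
      void ensure_init(size_t nwords) {
        if (bits.empty())
          bits.assign(nwords, 0);   // pay the cost on first use, not startup
      }
    };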
B: Awesome, yep; good to see that getting some attention. Let's see, other stuff: the fast-CRC stuff was merged a while ago, actually that might have been before last week, so that's good; it seems to be working fine. We didn't actually measure much of a difference, but that code path is so much simpler and cleaner that I'd have a hard time imagining it isn't helping. And let's see, there was also a pull request trying to optimize the allocator's reserve and release map paths.
B: What else? There are some other already-made changes trickling in. Let's see, there's a new pull request that adds an interface so you can explicitly control the recovery priority for PGs on a per-PG basis. This was sort of meant as a last-resort interface, in case the automated recovery prioritization isn't doing what it's supposed to do or something, and it gives a little bit more control over the system. So that will likely go in.
B: Then there are the changes, the experiment, that aren't ready to merge, but they would basically batch up those deferred writes until you have a bunch of them and then submit them all in parallel, and that quadrupled performance on the spinning disk. So that's pretty encouraging. But in the process of doing that I was cleaning up a bunch of other things in BlueStore, and realized that there are some sort of deeper issues with the deferred-write WAL stuff.
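A hedged sketch of the batching experiment as described (the names and threshold are illustrative, not BlueStore's actual code):

    #include <unistd.h>   // pwrite
    #include <cstdint>
    #include <vector>

    struct DeferredWrite { uint64_t offset; std::vector<uint8_t> data; };

    class DeferredQueue {
      std::vector<DeferredWrite> pending_;
      size_t batch_size_;
      int fd_;
     public:
      DeferredQueue(int fd, size_t batch_size)
          : batch_size_(batch_size), fd_(fd) {}
      void queue(DeferredWrite w) {
        pending_.push_back(std::move(w));
        if (pending_.size() >= batch_size_)  // wait until we have a bunch
          flush();
      }
      void flush() {
        // Issued back to back, neighboring writes can be merged by the IO
        // scheduler and serviced in one pass of the disk arm.
        for (const auto& w : pending_)
          pwrite(fd_, w.data.data(), w.data.size(),
                 static_cast<off_t>(w.offset));
        pending_.clear();
      }
    };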
B: I don't know, maybe. I also tested it with the journal on an SSD and using the disk for data, and saw a corresponding speedup there with just the change I have now; it's doing more like 400 IOPS, where you're basically doing the journal on SSD and all the data writes are going to the disk.
B: I was running writes to separate objects, so they're all going to the journal, and then asynchronously they're being written to the disk and getting batched up; basically the IO scheduler is saying: oh, these are all writes to adjacent blocks, I'm just going to do one IO, or they're in the same track, or whatever.

A: How big were the writes?
B: This was just rados bench with 4k writes, so they're effectively sequential ones, actually; so it's sort of a best case in that sense. If you're doing random 4k overwrites it'll be a little bit worse, but even so you'll see it'll be better, because the deeper the queue, the more the disk arm is going to be able to get to.
B: Yeah, yep, all right. But I think the main thing that we can do there is basically, whenever we have those deferred writes, if we just batch them up a little bit on a spinning disk it will help, because the disk arm can stay by the journal, and then when it goes and does all the other stuff it can do it all at once instead of going back and forth. Yeah, that seems to make a big difference.
B: It will take a little bit of tuning to figure out what the right batch size is and when to trigger a batch flush and all that stuff. But when I rewrite the deferred-writes code I'm going to make sure I structure it so that we can have sort of flexible batching, so you can either flush the batch immediately, or batch things up indefinitely, or whatever.
B
The
code
will
be
structured
so
that
to
enable
that
sort
of
policy
so
yeah
that
belt
thats
come
later
a
little
bit
later,
that's
yeah,
oh
I'm,
planning
to
spend
the
rest
of
the
week,
hopefully
dealing
with
the
deferred
right
stuff,
the
store.
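A sketch of what that flexible batching could look like as a pluggable policy (the knobs here are purely illustrative):

    #include <cstddef>
    #include <cstdint>

    enum class FlushPolicy { Immediate, SizeThreshold, Deadline };

    // Decide whether the deferred-write batch should be flushed now.
    bool should_flush(FlushPolicy p, size_t queued, uint64_t age_ms) {
      switch (p) {
        case FlushPolicy::Immediate:     return true;
        case FlushPolicy::SizeThreshold: return queued >= 16;  // tunable
        case FlushPolicy::Deadline:      return age_ms >= 50;  // tunable
      }
      return true;
    }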
B: My guess is that it's because the current form has a loop in it, and usually the compiler won't inline functions with loops. There was a pull request forever ago that just unrolled the loop, basically, into a bunch of conditionals. It might be that doing something like that will help. It never actually merged; it's still open, and if you look at the pull requests tagged for BlueStore it's one of the oldest ones. Okay.
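For illustration, the unrolling trick being referred to: compilers are often reluctant to inline a function containing a loop, so the loop is replaced with a short chain of conditionals (a toy example, not the actual patch):

    #include <cstddef>
    #include <cstdint>

    // Loop form: many compilers will decline to inline this at call sites.
    inline uint32_t mix_loop(const uint8_t* p, size_t n) {
      uint32_t h = 0;
      for (size_t i = 0; i < n; ++i) h = h * 31 + p[i];
      return h;
    }

    // Unrolled form for small n: straight-line conditionals, which inline
    // cleanly when callers pass a small constant length.
    inline uint32_t mix_unrolled4(const uint8_t* p, size_t n) {
      uint32_t h = 0;
      if (n > 0) h = h * 31 + p[0];
      if (n > 1) h = h * 31 + p[1];
      if (n > 2) h = h * 31 + p[2];
      if (n > 3) h = h * 31 + p[3];
      return h;
    }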
D: I think that if we eradicate those misses, we'll be able to move on to another bottleneck. I guess we'll be submitting IO requests one by one, in other words one per call, basically, because we don't support batching in the block device, right? On the other hand, we are already working on that: we have a preliminary, work-in-progress pull request that brings batching across the whole BlueStore pipeline.
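For reference, batching at the libaio layer means one io_submit() call carries many iocbs instead of one apiece; a minimal sketch:

    #include <libaio.h>
    #include <vector>

    // Submit a pre-filled set of iocbs (e.g. prepared with io_prep_pwrite)
    // in a single syscall rather than one io_submit per IO.
    int submit_batch(io_context_t ctx, std::vector<iocb>& ios) {
      std::vector<iocb*> ptrs;
      ptrs.reserve(ios.size());
      for (auto& io : ios) ptrs.push_back(&io);
      return io_submit(ctx, static_cast<long>(ptrs.size()), ptrs.data());
    }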
B: It's all that sort of overhead work that's been killing us. But that's good news, that's good news! Okay, well, I'm delighted to see that you're doing this detailed profiling; this is awesome, finding all kinds of good stuff. Oh nice. Thank you.
D: Hello; once again my machine just crashed, sorry, so I'm using Adam's computer to run the meeting. So, it looks like the BlueStore data structures, I mean especially the extent map, blobs, etc., are scattered all over memory. They are linked together using boost intrusive pointers. By the way, in a development build (not in a release one, but definitely in development builds) we have an assertion in the increment operator of the boost intrusive pointer. One assert is not so big a deal.
D: However, we have millions to billions of places where we use it. That's one problem, but I guess it's quite easy to deal with. The second, and the worst case, is that we have data structures spread all over memory. I'm working on that; for example, I started using boost object_pool to have the extent maps placed in a contiguous region of memory. It's already partly done, and I hope I will publish it soon.
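A minimal sketch of the object_pool approach described (the Extent type is illustrative): nodes come from the pool's contiguous chunks instead of individual heap allocations.

    #include <boost/pool/object_pool.hpp>
    #include <cstdint>

    struct Extent { uint64_t logical_off, blob_off, length; };

    int main() {
      boost::object_pool<Extent> pool;   // allocates in contiguous chunks
      Extent* e = pool.construct(Extent{0, 0, 4096});
      pool.destroy(e);   // or let the pool release everything at once
    }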
B: The cache ref? Right, the write context... that's gone; there used to be a reference in the transaction context, but that's been removed, so I think we're good, yeah. It's really just the onode and the extent, so you could switch this to, like, a unique_ptr or something, or a raw pointer. Yeah, I think that would be fine.
B: We're still waiting on that, so there are two possibilities. One is that those reads are from compaction itself, where it's sequentially scanning, or supposedly sequentially scanning, SST files in order to generate a new SST; those IOs should be sequential and hopefully not too painful on the device. Maybe we're not doing readahead; I guess that's one possibility, and if so we could fix it. That's sort of scenario A. Scenario B is that it's the cache invalidation that happens in RocksDB that results in a bunch of cache misses post-compaction, and then these are random read IOs to the new SSTs, which is what this write buffer would sort of alleviate. So, yeah, I guess we're still waiting to find out whether it's going to help or not. It might be, Mark, that we need to just reproduce this workload and instrument it carefully, end to end.
B: That's all good. Oh, it might be worth asking: does anybody else have anything they want to talk about? I have one other thing to mention. We have a periodic call with some folks at UC Santa Cruz who are working on tiering, sort of explicit tiering. They want to build an application that knows what data is going to be accessed ahead of time, so it can stage certain columns on SSD before their computation runs. As a result of that, they're talking about extending the RADOS interface so that you can pass an explicit hint that says: this omap key, or this prefix for omap keys, should be on the fast tier or the slow tier. And then there's the question of how BlueStore would actually do that, where RocksDB or the OSD would have to put those on a different device. They have a rough plan for how to do it, but right now it's sort of pre-research.
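To make the shape of the idea concrete (and to be clear, nothing like this exists in librados; every name below is hypothetical), the hint might boil down to a per-prefix tier registry along these lines:

    #include <map>
    #include <string>

    enum class Tier { Fast, Slow };

    // Hypothetical registry: remember which omap key prefixes were hinted
    // hot, so a backend could place them on the fast device.
    class TierHints {
      std::map<std::string, Tier> by_prefix_;
     public:
      void set(const std::string& prefix, Tier t) { by_prefix_[prefix] = t; }
      Tier lookup(const std::string& key) const {
        auto it = by_prefix_.upper_bound(key);
        while (it != by_prefix_.begin()) {
          --it;
          if (key.compare(0, it->first.size(), it->first) == 0)
            return it->second;
        }
        return Tier::Slow;   // default: cold tier
      }
    };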
B: I think it's, yeah, I think it's sort of a wrong assumption, because there's a tendency now to conflate APIs with use cases, and therefore with the storage back end; so, like, object is assumed to be big and slow, and block is small and fast, or whatever. But that's not really true, and as the APIs become more broadly adopted they'll be used, and abused, in different ways. So you can imagine having an object store that has zillions of objects that are super cold, or object stores that are very high performance where everything would be on solid state, contrary to the assumption that some things are slow and some things are fast.
B: You don't necessarily need all those indexes on SSD, though that's sort of a limited example, because the index is going to be a minority of the data set. In their particular use case, most of their data would be in omaps, because it's a database, and they would have column groups effectively stored in an omap piece, yeah. The whole point of this is that they would take certain columns; so they might have, like, a hundred-dimensional, or hundred-column, data set.
B: So that's half of it, the omap pieces; the other half would be doing the same thing on the object, so you could say: this range of the object I know is hot, and I want to hint that it be staged on SSD rather than HDD. And the mechanism we're looking at for doing that is actually just making it so that the IOs BlueStore generates through the libaio interface can be tagged; we want a kernel interface to tag individual AIOs as hot or cold.
B: That requires some kernel changes, because the AIO interface doesn't quite do it, and the caching layers aren't looking at per-IO hints; they're only looking at per-process hints. So there are some pieces that have to be done, but once all that infrastructure is in place, we'll have this general mechanism, and we can have BlueStore specifically hinting that this should go on flash and this should not, which would be useful in sort of all of the cases involving any hybrid array.
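For context, the per-IO tagging discussed here did need kernel work at the time; mainline Linux later (4.13, mid-2017) added per-file write-lifetime hints, which is the closest existing mechanism. A hedged sketch of its use:

    #include <fcntl.h>   // fcntl()
    #include <cstdint>

    #ifndef F_SET_RW_HINT          // from linux/fcntl.h, kernels >= 4.13
    #define F_SET_RW_HINT 1036
    #endif
    #ifndef RWH_WRITE_LIFE_SHORT
    #define RWH_WRITE_LIFE_SHORT 2
    #endif

    // Tag everything written through fd as short-lived ("hot") data.
    // Note: this is per-inode; the per-IO tagging discussed above would
    // still need further kernel work.
    int tag_hot(int fd) {
      uint64_t hint = RWH_WRITE_LIFE_SHORT;
      return fcntl(fd, F_SET_RW_HINT, &hint);
    }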
B: This might actually be of interest to the Intel team, because their CAS caching product, as far as I know, has some pretty sophisticated hinting that they sort of kludged together, and I think it basically implements the hinting functionality that we need, except for the AIO interface gap. But that's what we want: to basically get that hint through.
A: Those are the kind of experimental things I'm curious about. I've wondered for a while whether it would make sense to have some kind of OSD that could do some kind of smart distribution of data across different disks. So, instead of using CRUSH to go all the way down to the disk level, you use it to go to a specific OSD, but then that OSD makes more real-time choices about which disks are busy or not.
B: Yep, even pretty simple scenarios, like a pair of disks or three disks where you chain them together and use one of them for, like, sequential journaling and one of them for the random writes; so you're just doing some fairly basic heuristics about directing different types of IO to different devices, and getting pretty substantial wins, yeah. So this...
B: There's... sorry, go ahead.
B: I think there's something to be said for having local redundancy as well as the sort of multi- or inter-OSD redundancy, yep. So you can do local repair and recovery, and that's a win, yeah. And you can even make the failure domain sort of reasonable too: you could imagine an OSD that has multiple disks, where it tries to keep PGs isolated on a single disk; so if you have a spindle failure you only lose certain PGs, and the others are still intact. You could do things like that too.
B: I don't want to think about it yet, okay? But it could certainly be done there. I'm a little bit hesitant to pile too much into the same layer, so if we can stack on top of, you know, a DM block target that does this, and have a rich enough set of hints that we can communicate what we need to communicate, that would be preferable, I think, because otherwise you're creating an increasingly complex monolith, yeah.
B: So you get a continually near-perfect distribution of placement groups across those OSDs without very much overhead. We're hoping to get the infrastructure for that at least in place in Luminous, even if we don't actually have all the parts to actually generate the distribution; but maybe we can have some things that will let them do it, and then we'll see. But the variance between OSDs is a big source of performance loss, because you just don't have an even distribution, and so you're underutilizing some of them.
E: Sorry, I didn't quite hear that part about the OSD distribution function. Is there something written down about it, or...?
B: Okay; there's a branch that implements sort of some of it, but it's not complete yet. The nice thing is, you know, typically people increase the number of placement groups to even out the distribution, which means that having a sort of explicit mapping is expensive, because you have a lot of PGs to enumerate. But if you have this mechanism you actually don't need to do that; you can get by with a much smaller number of PGs, in which case the exception table can be quite small.
A: Hey, regarding that: where would you see a layer living that would try to automatically re-create exceptions when the topology changes? Where would that live?
B: The way the remapping is structured, you can make those sort of explicit mappings conditional on the existing mapping, so you basically say that for PG X, remap OSD 1 to OSD 4, because I know that 4 is underutilized and 1 is overutilized. And if that PG's mapping changes, because there's a big topology change and CRUSH maps it differently, then the exception doesn't apply anymore and is just ignored: if the PG no longer maps to OSD 1, it does nothing.
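A toy sketch of the exception semantics just described (illustrative, not the actual OSDMap code): the remap applies only while CRUSH's result still contains the source OSD, and silently does nothing otherwise.

    #include <algorithm>
    #include <vector>

    struct Remap { int from_osd, to_osd; };

    std::vector<int> apply_exception(std::vector<int> crush_up, Remap r) {
      auto it = std::find(crush_up.begin(), crush_up.end(), r.from_osd);
      if (it != crush_up.end())
        *it = r.to_osd;    // exception still valid: swap 1 -> 4, say
      return crush_up;     // otherwise the raw CRUSH mapping stands
    }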
B: So it sort of gracefully reverts to whatever CRUSH is doing, which to a first approximation should be pretty good, yeah. So I think it'll be, I think it'll be okay. Okay, well, we'll find out, I guess. We could also do this sort of synchronously in the monitor, but that's not going to scale very well to large clusters, so we'll have to be careful.
B: Like we already do with priming pg_temp, so you could do a similar thing here, okay, if we did that. And it would kind of live in the monitor; but whatever the code is that does it, I think it's fine to factor it out and make it callable, so that it can be run wherever appropriate. We don't want to, I don't think we should, limit ourselves to doing it in the monitor, because for large clusters that might not be feasible.
B: But the thing is, the concern is that it might be a computationally expensive thing to do that analysis, yeah. So it's not so much where, I guess; it's less where and more when, and doing it synchronously in the monitor during an OSDMap update might not be feasible, yeah. Maybe if the PG counts are small it might be fine, but...