From YouTube: 2018-May-10 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
A
And I think I'm just gonna get started on pull requests here. There's not a whole lot to go over yet, but let's see, Peter, you've got a new one here that I think you may have entered. Do you want to talk about it a little?
B
A
B
A
Cool, all right. Well, the other new PR we've got here is Igor's; he has a whole presentation about it, so we'll move on and discuss that in a little bit here. For closed PRs there also wasn't a whole lot going on. I think Kefu closed both of these here, the ones that merged.
A
I didn't really look too carefully at these, to be honest; I think neither of them is real big. And then there was another one that got closed, which I'm not entirely sure why, but I assume it must be related to the RGW rework that's going on. Maybe this was no longer necessary or no longer made sense. And yeah, that's about it.
A
The only other thing I guess I'll mention is there's this PR by Adam that Sage was able to review, and it's a little tricky because it involves changing some different things that are a little scary. So I guess the thought right now is maybe to do a little bit more investigation on it and figure out whether it makes sense or not, so I think Adam and the user who is testing this are going to try to investigate a little bit more. Beyond that, yeah, not a whole lot going on. Well, anyway, so yeah, that's basically it for pull requests. I will have another one coming in soon, hopefully, for cache balancing in BlueStore, which I think is going to be really exciting, but we'll maybe talk about that next week. So, Igor.
D
D
D
D
B
C
D
D
D
Also, it attempts to operate on 64-bit values wherever possible, and this tree-like data structure enables much faster lookup, and potentially we can benefit from concurrent allocation request handling. Right now the stupid allocator, while it performs the lookup, locks the whole data structure, so it handles just one request at a given time. Actually, the new bitmap allocator is also handling just one request at a time, but potentially we could change that with this design.
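A minimal sketch of the locking pattern being described, with made-up class and method names rather than the actual Ceph allocator interface: a single mutex guards the whole free-extent structure, so the lookup and the update both happen under the lock and concurrent requests serialize.

    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <mutex>
    #include <optional>

    // Illustrative only: one lock for the whole structure, so only one
    // allocation request is ever in flight, no matter how many threads call in.
    class CoarseLockedAllocator {
      std::mutex lock;                        // single lock for everything
      std::map<uint64_t, uint64_t> free_map;  // offset -> length of free extents

    public:
      void init_add_free(uint64_t off, uint64_t len) {
        std::lock_guard<std::mutex> l(lock);
        free_map[off] = len;
      }

      // First-fit allocation of 'want' bytes; empty result means "no space".
      std::optional<uint64_t> allocate(uint64_t want) {
        std::lock_guard<std::mutex> l(lock);  // whole lookup runs under the lock
        for (auto it = free_map.begin(); it != free_map.end(); ++it) {
          if (it->second >= want) {
            uint64_t off = it->first, len = it->second;
            free_map.erase(it);
            if (len > want)
              free_map[off + want] = len - want;  // keep the remainder free
            return off;
          }
        }
        return std::nullopt;
      }
    };

    int main() {
      CoarseLockedAllocator a;
      a.init_add_free(0, 1ull << 20);  // 1 MiB of free space
      if (auto off = a.allocate(4096))
        std::cout << "allocated 4 KiB at offset " << *off << "\n";
    }

Finer-grained locking per region of the tree, or atomic operations on the 64-bit words themselves, is what would let several allocation requests proceed concurrently, which is the potential benefit being described.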
D
So, well, from one side we can predict how much space this data structure will take; from another side, this memory consumption can be pretty high. Here are some estimates, actually the actual results of what I observed: it's about 130 megabytes for a 4-terabyte store with a 4K allocation unit. But unexpectedly, I observed that the stupid allocator's memory consumption might be pretty high as well.
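That 130 MB figure lines up with a simple back-of-the-envelope, assuming roughly one bit of state per allocation unit and only a small overhead for the levels above the leaf bitmap:

    4 TiB / 4 KiB per allocation unit = 2^42 B / 2^12 B = 2^30 units (about a billion)
    2^30 units x 1 bit each = 2^30 bits = 128 MiB of leaf bitmap
    (a 64 KiB allocation unit would need 16x less, roughly 8 MiB)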
D
So actually that's not that huge an issue, I suppose. And the last item to mention is that currently there are virtually two modes of allocation that we need in BlueStore. The first one, which is used most often, is scattered extent allocation, where we don't need contiguous space for storing user data.
D
And that's the thing that the new bitmap allocator does best, while contiguous allocation is required for BlueFS during rebalancing of space between BlueFS and raw data, and actually that's a rare case compared to regular write operations. Actually, I haven't done any extensive benchmarking of this mode, but I can expect that in some, in the worst cases, this implementation isn't effective, because it might need to enumerate the whole bitmap, well, again in the worst case, when the space is pretty fragmented.
D
Okay, so here is an overview of the test cases. The first subset of test cases is direct allocator usage; it's just a sort of unit test, written based on the existing unit-test infrastructure, that calls both the stupid and bitmap allocators in a specific way in a loop, without any multi-threading, multiple concurrent access, etc. So the simplest case is just sequential 4K block allocations, followed by a sequential release, trying to allocate the whole space, which is one terabyte for this set of cases, with 4K allocation units.
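A sketch of what that simplest case boils down to, written here against a hypothetical allocator interface; the real tests exercise the actual allocator classes through the existing unit-test infrastructure.

    #include <cstdint>
    #include <vector>

    // Hypothetical interface standing in for the allocator under test.
    struct AllocatorUnderTest {
      // Returns the number of bytes allocated (<= 0 on failure) and the offset.
      virtual int64_t allocate(uint64_t want, uint64_t alloc_unit, uint64_t* offset) = 0;
      virtual void release(uint64_t offset, uint64_t length) = 0;
      virtual ~AllocatorUnderTest() = default;
    };

    // Simplest case described above: single-threaded, sequentially allocate the
    // whole 1 TiB space in 4 KiB blocks, then release every block sequentially.
    void run_sequential_case(AllocatorUnderTest& a) {
      const uint64_t capacity = 1ull << 40;  // 1 TiB
      const uint64_t au = 4096;              // 4 KiB allocation unit
      std::vector<uint64_t> offsets;
      for (uint64_t done = 0; done < capacity; done += au) {
        uint64_t off = 0;
        if (a.allocate(au, au, &off) <= 0)
          break;                             // out of space or error
        offsets.push_back(off);
      }
      for (uint64_t off : offsets)
        a.release(off, au);
    }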
D
The second one, the mixed one, is a bit more advanced test case that allocates blocks ranging from 4K to 2 megabytes in size, selected randomly, mixed with randomly selected free requests that are shorter, drawn from a shorter range of sizes, just up to half a meg. This way the memory consumption keeps growing, and the test case completes when we have covered the whole space.
D
The third test case is to allocate 50% of the space sequentially and then do a mixture of random allocations and frees in the same size ranges. Well, actually, that doesn't take up all the remaining space, because we have balanced allocations and releases, but the total amount of data that...
D
D
D
D
That's pretty unexpected, at least this huge difference, for me. That's what I have so far, and the second slide is the same set of test cases, but it covers memory consumption. So you can see that for the bitmap allocator the memory consumption is fixed, and it's high.
D
For the sequential cases, the first of them, the stupid allocator is much, much better, but eventually, for the last three test cases, the memory consumption of the stupid allocator grows significantly, and that was pretty unexpected for me, these large numbers; maybe it makes sense to analyze why it behaves this way one day.
A
D
D
D
And here are the numbers when I apply an fio benchmark. The first column is for 4K random writes, and the difference isn't very much, which is probably expected, but the second column again is pretty unexpected for me: it's when we do random writes with varying block sizes, from 4K up to much larger sizes, and the difference is dramatic.
D
D
The fourth chart is about RAM usage again. The stupid allocator's memory is pretty high here, while the bitmap one isn't that high; it's just two and a half megabytes, and that's because the store that I was using is pretty small, and the bitmap allocator's memory consumption depends directly on the actual store size.
A
D
Well, as I said, I provided some estimates on the memory consumption for a four-terabyte drive; it's about 128 megabytes, well, 130 megabytes. Here we can see that the stupid allocator takes tens of megabytes on even smaller, much smaller stores. I am not sure if it grows linearly with the store size.
D
Unfortunately, I don't have the data for different store sizes, but I don't think it grows linearly, the stupid one. But at least the current memory consumption is comparable, in the tens of megabytes, yeah.
A
D
D
D
D
Sixteen, sixteen times less, down from 128 megabytes; that isn't a large amount of RAM, but, well, definitely that's something which should be tested more elaborately. These are just some initial results I have, and they look pretty interesting.
C
E
D
D
The first subset of test cases just simulates one terabyte, so there are no actual writes here, just a virtual space that we are splitting into chunks. The second subset was run against pretty small stores: the space actually allocated by the objects is about 48 gigabytes, and the stores themselves were about 80 gigabytes for the block device and 70 gigabytes for the database. Okay, okay, but as I said, the memory consumption for the bitmap allocator is completely predictable.
D
C
A
D
Well, and finally, some observations about the existing stupid allocator. I haven't investigated that much, but just to mention: the first one is that it looks like the hint functionality the stupid allocator provides at the moment might have a pretty negative impact on performance, especially when the hint misses on allocation; at least I saw performance drop about three times. Maybe that's the root cause of that huge difference.
D
F
G
D
Well, actually, we can move all that code from reserve into allocate directly and return from allocate immediately if we don't have enough space, because if you look at the code, all these reserve and allocate calls come in pairs, following each other. There is no case where we reserve some space and postpone the allocation.
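A minimal sketch of that suggestion, with made-up names rather than the real Allocator interface: the free-space check that the separate reserve() used to do moves to the top of allocate(), which simply returns right away when there isn't enough space.

    #include <cstdint>
    #include <mutex>

    class SimpleAllocator {
      std::mutex lock;
      uint64_t num_free = 0;     // bytes currently free
      uint64_t next_offset = 0;  // toy bump allocation, just to keep the sketch short

    public:
      explicit SimpleAllocator(uint64_t capacity) : num_free(capacity) {}

      // Before: a separate reserve(need) had to be called first and could fail
      // with ENOSPC, and allocate() assumed the space was already accounted for.
      // After: the same check lives at the top of allocate(), which works because,
      // as noted above, every reserve() in the code is immediately followed by the
      // matching allocate().
      int64_t allocate(uint64_t want, uint64_t* offset) {
        std::lock_guard<std::mutex> l(lock);
        if (want > num_free)
          return -1;             // "ENOSPC": bail out immediately, no prior reserve()
        num_free -= want;
        *offset = next_offset;
        next_offset += want;
        return static_cast<int64_t>(want);
      }

      void release(uint64_t /*offset*/, uint64_t length) {
        std::lock_guard<std::mutex> l(lock);
        num_free += length;
      }
    };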
D
F
A
D
D
C
D
D
D
D
F
G
F
F
I think, I think that's the reason the stupid allocator is also doing a bit worse, because we try to kind of arrange the free extents in those bins and keep them in the interval trees, and when it gets fragmented we will consume more memory there compared to the bitmap. The bitmap is optimal on space, on memory usage, but it might consume more CPU, I mean.
A
One of the things that we've seen in the past with the stupid allocator, and I think the bitmap allocator may be even worse, was exactly what you were just talking about, fragmentation. After doing a bunch of small random writes, you could see an impact from fragmentation on further large read and write tests, yeah.
F
A
G
A
D
H
Mark, I've just got a little topic I'd like to discuss. (Sure, go ahead, Nick.) It's our old favorite, the pg_num problem. I've just been thinking about this recently, you know, with the recent release of 14-terabyte drives, and of massive flash NVMe drives as well. I'm sort of just trying to understand the thought behind the one to two hundred PGs per OSD, because obviously, although that worked historically, now we have fourteen-terabyte drives.
H
You start ending up having a lot of data per PG, which is obviously going to affect recovery speed and stuff like that, but also then, with NVMe disks, you're going to be having a lot of I/O going to a single PG and getting contention and so on. How does that work going forward? I know there's going to be some work on automating it, but how does that work going forward in terms of either large-capacity drives or high-I/O drives?
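Rough numbers behind that concern, purely illustrative, using the conventional 100-200 PGs-per-OSD target:

    14 TB OSD / 100 PGs ≈ 140 GB of data per PG
    14 TB OSD / 200 PGs ≈  70 GB of data per PG
    (a 2 TB OSD at 100 PGs holds only ~20 GB per PG by comparison)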
A
It's really tricky, at least in my opinion. I mean, many other people have thoughts on this, but we already have a lot of work being done just given the number of PGs that we already have, when we look at, yeah, the PG log and everything else that's going on. So figuring out how to do that... there are so many factors that go into this, right?
A
We have the balancing aspect of it, and that's kind of now being taken care of by some of the work that, like, Louie and Sage have done on PG balancing, but that doesn't get you away from the other problems with contention, like you mentioned, when you have a huge amount of data relative to the number of PGs. And then, yeah, increasing the number of PGs increases the workload in other areas that I think we're already struggling with, so yeah.
A
A
H
E
E
I mean, I think with new clusters, in terms of balance, it doesn't really matter; the larger the PGs, the statistical properties don't change much. The recovery time per PG that you mention is interesting and not something I think anyone's thought about very much, but it's just sort of a general consequence of having a larger storage device without a correspondingly larger throughput.
E
So yeah, I don't think anyone's running experiments on increasing the numbers to deal with the recovery time, but leaving that aside, with a new cluster I would probably just stick with the existing PG recommendations. If you're going to put it in an existing cluster, then yeah, it's going to get more PGs, so you should probably weight the machine accordingly, okay.
G
H
E
H
F
So yeah, we have a different problem. We have bigger disks, but the IOPS are not high, because we are talking about a hard drive often, right. So these are getting to around 14 terabytes and all, but it is not fast enough like an SSD, right. It can hold more PGs, but at 10 to 100 MB/s, maybe much less than that in the case of random writes. So in that case, I don't know, so what?
F
C
F
Yeah, so I mean, until we have a mix and match... suppose you have an SSD kind of cluster and then have the hard drives as well, with a cache tier maybe also enabled; that's where we have this challenge. So recovery goes faster on the SSD OSDs and then it is very slow on the hard drives, and most of the controls to throttle them are kind of global, some of them.
H
Yes, I think the same thing is that with the larger disk, because you've got the same number of PGs, you're not spreading the recovery; you can't spread it. You know, if you had a thousand-disk cluster, you can't spread the recovery over a thousand disks; with a hundred PGs, roughly, you're only spreading it across 100 disks, which means the difference going from, you know, a one to a ten terabyte disk over, I guess, the lifetime of the project has meant that you can't spread recovery any further.
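A rough illustration of that recovery-spread point, with made-up throughput figures:

    a failed OSD holding ~100 PGs fans its recovery out to on the order of 100 peer
    disks, regardless of how many disks the cluster has
    14 TB to rebuild / (100 peers x ~100 MB/s each) ≈ 1,400 s, about 23 minutes at best
    the same 100-way spread on a 1 TB disk would be ~100 s, so capacity growth translates
    directly into longer recovery unless PG counts or per-disk throughput grow with it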
H
H
E
E
If you lose one disk, yes, one PG will take longer to recover, but you're still going to have all the OSDs doing recovery of their PGs, and it's just going to take a while, because there's a lot of data and not a lot of IOPS. But I'm not sure the densities actually change that much, yeah.
H
E
E
H
H
A
H
H
No, I might be able to start collecting some if I can get some free time from some of the projects I'm working on at the moment, but yeah, I have seen stuff. Certainly, if you're doing anything that's sort of more synchronous I/O, you definitely start seeing it build up, because, you know, especially like using XFS on one, like an NFS server on it: every write has to update the XFS journal, which is of course sitting, the whole thing, probably in a couple of RBD objects.
A
E
C
H
C
E
E
A
E
That said, I mean, you know, it's not here yet, and it's not going to be a lot of incremental gains that just appear one day, but the Seastar work and things are designed to deal with a lot of the PG overhead and contention, like the locking issues. So that's our current plan for dealing with it.
C
C
C
Is there really anything that would help particularly well for the parallelism aspect of it?
C
A
This doesn't really help in the case where it's lock contention on really fast drives, but on hard drives, if we did some kind of internal caching on the SSD of really hot blocks, then you could have the XFS journals for RBD, you know, the objects that those things represent, on the SSD or NVMe drive or something.
A
A
H
I'd say because, I mean, it's not just so much the internals of Ceph itself that, you know, cause the problems around the contention; it's because it's also a distributed system. You've got network latency and all, you know, the whole pipeline latency building up in there, so that if you are waiting on the PG, you've got to wait for stuff to shift around everywhere before the next client can do something. So it's not purely just down to making the OSD faster; that does help a lot, but there are sort of other aspects as well.
F
I think in that case we just need to go and implement the backoff on all the clients as well, right, so that if an OSD is actually not able to serve a lot of I/Os due to recovery or something... right now we have this backoff, the backoff logic, right; I mean, it should be implemented so that the clients are able to handle it on a PG or something which actually takes more time to recover.
A
But I think we've exhausted our time, guys, but that was interesting to hear, something I think for us to think about more. Cool, yeah, all right. Well, thank you, Nick, and it looks like Igor's gone, but thanks to Igor as well for the good presentation, and we'll meet again next week. Have a good week, everyone. See you.