From YouTube: 2018-08-09 Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
E
It doesn't, it doesn't... I think I'm a bit afraid about the relationship between setup costs and the actual speedup, because the most prominent user of our crypto abstraction is cephx, and we are going there with very, very small chunks, 52 or 48 bytes, and unfortunately, because of the messenger's restrictions, we cannot cache...
E
...the EVP context. It appears that its creation cost is, I would say, pretty high, and moreover it varies from OpenSSL version to version; it can even vary within the same OpenSSL version because of FIPS certification, okay.
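For context on the cost Radek is describing: it is the per-operation allocation and initialization of an OpenSSL EVP cipher context. Below is a minimal C++ sketch of the difference between paying that setup for every 48-52 byte chunk and reusing a cached context, the kind of reuse the messenger restriction currently rules out. The cipher choice, key handling, and buffer sizes are illustrative assumptions, not Ceph's actual code.

```cpp
#include <openssl/evp.h>

// Naive path: build and tear down an EVP context for every small chunk.
// For 48-52 byte cephx-style payloads, EVP_CIPHER_CTX_new/free plus
// EVP_EncryptInit_ex can cost more than the AES work itself.
// 'out' must have room for in_len plus one cipher block of padding.
void encrypt_per_call(const unsigned char* key, const unsigned char* iv,
                      const unsigned char* in, int in_len, unsigned char* out) {
  EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();        // heap alloc + init, every call
  EVP_EncryptInit_ex(ctx, EVP_aes_128_cbc(), nullptr, key, iv);
  int len = 0, fin = 0;
  EVP_EncryptUpdate(ctx, out, &len, in, in_len);
  EVP_EncryptFinal_ex(ctx, out + len, &fin);
  EVP_CIPHER_CTX_free(ctx);                          // teardown, every call
}

// Cached path: keep one context per connection/thread and only re-key it.
struct CachedCipher {
  EVP_CIPHER_CTX* ctx = EVP_CIPHER_CTX_new();        // setup cost paid once
  ~CachedCipher() { EVP_CIPHER_CTX_free(ctx); }

  void encrypt(const unsigned char* key, const unsigned char* iv,
               const unsigned char* in, int in_len, unsigned char* out) {
    // Re-initialize key/IV on the existing context instead of reallocating it.
    EVP_EncryptInit_ex(ctx, EVP_aes_128_cbc(), nullptr, key, iv);
    int len = 0, fin = 0;
    EVP_EncryptUpdate(ctx, out, &len, in, in_len);
    EVP_EncryptFinal_ex(ctx, out + len, &fin);
  }
};
```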
D
Let's come back to that in a minute, because I think this is a bigger topic. Sure. Let's see: okay, there's a tiny-appends pull request from Igor; we're going to talk about that in a few minutes too. There's this new EC partial stripe reads one; I don't think I've seen this one. Actually, it looks like Greg's on it.
D
Okay, that needs a closer look. There's one doing snap rollback; Jason reviewed that. All right, so Radek's pull requests were merged. The async recovery one, there was a whole discussion that moved to the list, although I'm still confused about what...
D
...on that, through the testing. Okay, let's see: someone's working on MDS balancer stuff; there's another tracker pull request there. Maybe this is the one that you were just talking about. There are Mark's pull requests, two of them actually, that put a cap on the OSD memory, or manage the cache more smartly, and then Shawn Peng is working on one in BlueStore that does the shard completions in the OP worker thread.
D
Okay, that's amazing! All right, let's go to the discussion topics. Let's talk about the EVP thing first, just because we already touched on it, Radek. It sounds to me like we sort of avoid the issue if we keep the cephx signature checks using the low-level API but then change the more general abstractions to use a higher-level API. But I don't remember what the users are.
D
The other thing to keep in mind is that the cephx signature checks are for the current version of cephx, the current messenger v1 protocol, and that's going to change drastically in the next couple of months with messenger v2, so I wouldn't bother worrying about that. I would just, maybe this is the question, look at whether there's an opportunity to improve our RGW crypto performance or not, look at it from that angle, and just take a little bit of care not to break the messenger v1 crypto checks in the process.
E
To provide data from clusters running RBD: at the moment all we have is just a micro-benchmark. I made some very preliminary tests using similar conditions to what we had in the case of OpenSSL, by which I mean the all-ones scenario: a one-gig RBD image fitting entirely in cache, one client, everything set to one, and I'm getting no difference, or overall a regression of around one and a half percent.
B
Oh, I'm the instigator of this, I guess. Maybe six months ago I had the opportunity to work with an actual user who was trying to improve their BlueStore performance, and that interaction is actually kind of what led to all this work on trying to make BlueStore's cache settings easier, because they were really, really confused about a lot of different things. But one of the big things that came up was that they had defined...
B
Well, it wasn't overriding... I think they had forgotten that they had even set an SSD setting somewhere, and then they were wondering why, when they were changing the bluestore cache size setting, it wasn't doing anything. That makes sense; yeah, that's actually how it's supposed to work.
B
Now that we've got the auto-tuning stuff for the cache sizes, the only thing hopefully left for the user to set is the amount of memory they want the OSD to try to keep itself to. But we're now again left with the question: well, do we want to have separate SSD and HDD defaults for that? And it kind of makes sense.
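The model Mark is describing, one user-set memory target from which the cache sizes are then derived automatically, can be sketched roughly like this. This is purely illustrative, not Ceph's actual tuning code; the `tune()` policy and its constants are assumptions.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative autotuner: converge the whole process on a single user-set
// memory target by shrinking or growing the aggregate cache budget.
class CacheAutotuner {
  uint64_t memory_target;  // the one knob the user still sets
  uint64_t cache_min;      // floor so the caches never collapse to zero
  uint64_t cache_bytes;    // current aggregate cache budget

public:
  CacheAutotuner(uint64_t target, uint64_t min_cache)
      : memory_target(target), cache_min(min_cache), cache_bytes(min_cache) {}

  // Called periodically with the process's measured memory usage.
  uint64_t tune(uint64_t mapped_bytes) {
    if (mapped_bytes > memory_target) {
      // Over target: shrink the caches by the overage, down to the floor.
      uint64_t over = mapped_bytes - memory_target;
      cache_bytes -= std::min(cache_bytes - cache_min, over);
    } else {
      // Under target: grow by a fraction of the headroom to avoid oscillating.
      cache_bytes += (memory_target - mapped_bytes) / 4;
    }
    return cache_bytes;  // then split among onode/buffer/rocksdb caches
  }
};
```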
D
...the settings, the HDD and SSD ones. The reason why we broke out the defaults is so that users never have to touch them; it will just do the right thing. For example, the thread count or whatever: nobody should be worrying about that unless you're a power user; nobody should be thinking about it. So just having a default, I think, makes sense. I think this option is a little different, because this is actually something that is, by its nature, a user decision, not a sort of magic hands-off option.
D
...kind of thing anyway. So I'm not so sure about this template idea. I would prefer to keep all these other options in the category of things that users shouldn't really touch unless they really are thinking about it, and the HDD/SSD thing is just a way to make sure our default is sort of the right one given what we know.
D
But it's worth noting that for the memory thing you can't actually set different defaults cluster-wide in the cluster config, at least; but even then I'm not even sure that makes sense. I think what's actually going to happen is that people are going to have a bunch of old nodes, in old chassis, old servers from the early days of the cluster, that have less memory, and they're going to...
D
...deploy new nodes that are newer and faster and better and have more memory, and they're going to want to apply the settings to those. And unfortunately there isn't really a way to tag config settings based on what revision of the chassis they happen to be in, unless that also maps to, like, what rack they're in; then they could map their memory per rack.
B
It almost feels like you want a template for a node, right? You want to be able to say: I've got this class of node, I've got these other classes of nodes, and they have this many disks and this much memory and whatever, and this is how much memory I want each OSD to use, and I want these disks in each chassis to be used.
D
Yep, I mean, that's kind of what the device classes are meant to be. The system automatically puts you in hdd and ssd classes, and maybe an nvme class; I don't remember if that was a real thing or not, or whether it ever merged. So you can make a class of OSD that's, you know, gen 2 or gen 3 or whatever you want to call it, yeah, that way. So I think we kind of have the tools there. I guess, long story short, I think you're right.
C
With these tiny writes, originally I used the O prefix, which is the same as the onode one; that makes all the records be kept in the same namespace. This improved performance, but it might make it harder to cache them and probably makes caching less effective, and another suggestion was to create another namespace and put such records there. So here the O-prefix approach is the original one, the K prefix is the new namespace, and I'm trying to compare the approaches, as well as against the original model. Right? Yeah, Mark.
C
Yeah, that's another record, and potentially even multiple records; actually, for this test case, just one record. But the policy that enables these tiny writes at the moment is an append happening at unallocated space aligned with the allocation granularity. So in fact, if you perform multiple appends aligned properly, it will create multiple records. I'm not sure if that's the best strategy; I was just trying to implement the thing that allows...
C
...that allows this procedure for small writes that are performed using just a single write, so very small objects written in just one write. But unfortunately we don't have any flag saying that no more writes are expected for this object or something like that. So currently it's a more complicated procedure which might create multiple tiny records, but actually that's a bit of a different story.
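A rough sketch of the policy Igor describes, a hypothetical reconstruction rather than the PR's actual structure: a small single write landing on unallocated, allocation-aligned space becomes a RocksDB record under a dedicated prefix instead of consuming a whole min_alloc_size extent.

```cpp
#include <cstdint>
#include <string>

// Hypothetical tiny-write policy: payloads below a threshold that append
// into unallocated space, aligned to the allocation granularity, go to the
// KV store instead of allocating a full min_alloc_size extent on disk.
struct TinyWritePolicy {
  uint64_t min_alloc_size;  // e.g. 16 KiB on this device class
  uint64_t tiny_threshold;  // max payload size eligible for the KV path

  bool use_kv_path(uint64_t offset, uint64_t length, bool extent_allocated) const {
    return !extent_allocated &&             // writing into a hole
           length <= tiny_threshold &&      // payload is small enough
           offset % min_alloc_size == 0;    // aligned with allocation granularity
  }
};

// Key sketch for the separate-namespace variant: a dedicated "K" prefix keeps
// tiny payloads out of the onode ("O") records, at the cost of a second
// lookup on read (endianness of the binary key ignored for brevity).
std::string tiny_record_key(uint64_t object_id, uint64_t offset) {
  std::string key = "K";
  key.append(reinterpret_cast<const char*>(&object_id), sizeof(object_id));
  key.append(reinterpret_cast<const char*>(&offset), sizeof(offset));
  return key;
}
```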
C
The first rows are about the SATA drive, which is rather slow, but the difference in numbers is pretty significant here. You can see that for writing we have two and a half thousand IOPS for the original approach, which is writing to the store with the default write procedure, while the new writes get about seven and a half thousand IOPS, and actually the numbers are pretty comparable for both the K and O prefixes.
C
I'd suggest that's rather compaction, because one of the things that I also monitored was the DB space usage during each read, and for the second and the third read it's pretty stable, both the current size and the maximum size of the 8 GB DB volume, while for the first read I can see that the maximum column might go up and then stabilize. Actually, the last column here is an aggregate.
C
So instead of publishing all three or even four columns for these numbers, I just wrote here, well, the stable DB volume size after performance stabilized; from the files I can see no compaction is happening, and the maximum size mostly matches it, even during writes and the first read.
D
Okay, this looks super promising. It's somewhat similar to what the original NewStore code was doing forever ago, where I was putting some number of writes in the KV store; they were called lazy writes or something, I forget, but we eventually ripped it out at some point because it didn't seem to be helping. But now it clearly does. I think it makes...
B
So that concerns me, just given that usually it's the bstore_kv_sync thread that's the bottleneck for, you know, random writes for us. I don't know if it would affect your read numbers much, but at least in the write tests I think you might see different numbers, so it might be worth looking at.
D
I think if I just went with my gut, I would focus on the K namespace, just based on what we've been thinking about with the rest of the trade-offs involved. But I think the big question in my mind, and the big thing to sort of eliminate as a concern, is the point that Greg brought up in the chat, which is that if we're putting more data in the SSTs for these small writes, then RocksDB is going to have a higher compaction load, and that can be significant.
D
So my sense is that we need to do kind of a worst-case-scenario test where a lot of data is going into these tiny writes, maybe all of it, the OSD is basically filled with it, and see what the steady-state performance is with that behavior, with all this RocksDB compaction of all these writes going on in the background.
D
...a uniform random workload over them, and seeing what the IOPS are then, when you have sort of sustained compaction going, and comparing that to sort of the same situation where... though you might not actually be able to fill it with as much data, because the min_alloc_size is, is it 16K now? What is our min_alloc_size these days, Mark, do you remember?
D
16K, so I actually would only be able to fill it with 1/16 as much data. But maybe I only have to fill it with the same number of objects, or still fill it to 80 percent, because users could potentially do that. But either way, try to figure out how to do it so we have some reasoning or some data about what that heavy compaction impact would be. The other thing is that 8 gigabytes is kind of a small data set for a big...
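The 1/16 falls straight out of the allocation arithmetic; taking 1 KiB payloads purely as an illustration (the call doesn't pin down the tiny-write size):

```latex
\[
\frac{\text{payload stored at device-full}}{\text{raw capacity}}
  \approx \frac{s}{\text{min\_alloc\_size}}
  = \frac{1\,\mathrm{KiB}}{16\,\mathrm{KiB}}
  = \frac{1}{16}
\]
```

That is, with each tiny object burning a full 16K extent, the device reports full after only 1/16 as much user data; the KV path stores roughly the payload bytes themselves, so reaching the same raw utilization takes about sixteen times more payload, or, as suggested, the same object count.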
F
Nope, no, that wasn't directed specifically... well, I guess in a way it was. I mean, the only piece of it that was related to Igor's proposal was whether storing data like this in RocksDB makes spillover more likely, and do we need to provision differently if we're using this, do you think?
D
...last time, whatever. The possibility of pull requests like this, that change BlueStore's behavior and how it uses the DB, is partly why I don't want to be specific and prescriptive in how to size the DB; it should just be as much as you can afford. I think we should be providing high-level guidance, and I think we should keep the spillover boogeyman in context.
D
When spillover happens, it means that you're putting storage on the hard disk instead of the SSD; it means that it's not as fast anymore. But having a small amount of flash and getting some spillover is always going to be better than having no flash and having everything, you know, spilled over by default, or whatever. So it's important to understand what is happening when spillover happens, but I'm not sure that it...
F
We had a discussion about that this morning in the hall, where somebody mentioned the idea that if you used LVM instead of partitions, maybe you could expand the RocksDB volume dynamically. I don't know how BlueStore would react to that. Would that be a disaster, or would that be something that...
C
Not that specific case, but yeah; well, actually the basic functionality for that is already present in the code base: you are able to resize the database volume offline. And I'm trying to extend this functionality to be able to migrate between volumes, and to simplify this; right now it's several steps, as far as I remember.
D
Actually, one more thing... I know, I know. Right now the ceph-volume batch thing is just carving it up into N pieces, so it's assuming that the advertised size is the size that you want to use, which is fine as a first starting point, but yeah, again, if we get sophisticated, we'll want to make it...
B
And I don't know how much it ties into the PR that you linked, but, I don't know, I think I maybe sent this to a couple folks before, but in the chat there's a link to a document that, well, it's very much just a small set of all the possibilities of what people could do. But this is for 4K writes to RBD, or 4K...
F
Yeah, it was useful. The problem I think people had was that it was so complicated, because you needed to know what workload you were dealing with, and then, you know, try to estimate the ratio of objects you were going to have within that workload, and, oh yeah, the probability of getting it right...
F
...the first time was pretty low. So the fallback position was: let's just allocate some percentage of the total HDD space as RocksDB partition space, and that's a formula that, you know, you can be very generous with, and try to just make it less likely that this spillover occurs, even if we waste a little space in some cases.
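A worked instance of that fallback formula; the 4% fraction and 12 TB drive are illustrative choices, not numbers from the call:

```latex
\[
\text{db\_size} = f \times \text{hdd\_size},
\qquad
f = 0.04:\quad 0.04 \times 12\,\mathrm{TB} = 480\,\mathrm{GB}
\]
```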
B
I see what you're saying; it might be worth trying. I would suggest at least definitely making sure that the assumption is right, but you might not need to carve up your SSD in that kind of weird way when FileStore's not being used. Yeah, I gotta look into that; you're probably right. And then you can go back to, yeah...
D
...them if you wanted to. All right, I tuned out, but, and maybe this is for another conversation, I thought that in the normal case you have an NVMe and you have like eight hard disks or whatever, and you divvy it up into eight pieces; and I thought the case that Ben was suggesting is that you might have an NVMe, you take half of it and create a dedicated OSD, and you take the other half and you divvy it up for the hard disks, and you have...
D
Probably not, if you just... yes, but it does mean that one advantage of doing it that way is that you still have a dedicated NVMe OSD, you have it separate, cool, and then you can scale it independently. So then you can add a bunch of, like, pure NVMe OSDs if you need more index capacity, or fewer; whereas if you have these two on the same device, you're sharing the same disk.
D
Yeah, I think the thing that's annoying here is that there's a whole bunch of complexity you have to invest in, though: the Ansible stuff, tooling to provision them that way, and the end result is just complicated. And so even if it's a little bit better, it might just be simpler to say: deploy BlueStore uniformly with an NVMe and some hard disks, and just bank on the fact that your omap is going to end up on the SSD. All right, that's good, but it's a little bit...
D
So you want to have some confidence that it's going to be small enough, and I actually don't have any numbers on how much data is actually in the omap pools compared to data in the data pools. It would be helpful to have that: if it's usually 1%, then we're safe, but if it's like 20%, then we're not so safe, and we have to be really careful about how we balance them, I think.
D
I mean, assuming that the contention between the two sharing the same NVMe, the OSDs or whatever, isn't an issue, then everything should be fine until you have so much data that it spills over; and the equivalent, if you had a four-way partition, would just be that your omap-pool OSDs fill up. So there's always a high-level thing where the user has to deploy more SSDs in order to make it work. I'll just throw that in there...
B
It's always a decision process, though, right? Because how do you know? Maybe it's better to have your RGW omap data for bucket indexes spill over to a hard disk than to have, like, your RBD metadata or whatever spill over. How do you decide which of those things is more important? We've got all these things that all say their metadata needs to be on SSD...
D
Thanks again, Igor. I think the tiny-write thing looks super awesome, and also a little bit of encouragement on the BlueStore tooling stuff to resize those volumes: I think that's also going to be pretty useful.