From YouTube: 2018-MAR-01 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
For full notes and video recording archive visit:
http://pad.ceph.com/p/performance_weekly
B: All right, let's go through these quickly, I guess. There's the PG log trim max 2K PR, which is passing tests now; it just needs to go through QA, and we need to make a call on what the default setting should be. I can come back to that when we talk about your results. Same thing with BlueFS bluefs_buffered_io = true: it sounds like that is, and actually it's a bit of an issue because...
A: It seems like we benefit from it, but if we could figure out how to cache things optimally without the page cache, maybe then it goes away; maybe then it's better. I think we're going to need to design something much more sophisticated than what we have now to make the advantage go away. Yeah, all right.
B: Okay, the dout thing merged; no, hold on, I think it merged: the micro-optimization to make the dout condition checks faster. So that's good. And the op history thing from Peter merged; I think we'll backport that, but let's let it sit in master a little bit just to make sure there are no unexpected issues.
B: Okay, okay, okay. Let's see, the BlueStore cached onode removal: I think I need to change the logging level of the cache trim stuff. Every shard logs something every 50 milliseconds at level 10, which just fills up the log even when the OSD is idle, so I'm going to turn that up. And then readahead was enabled for Jewel; that's merged, so that's good.
B: We can come back to that later. Okay, there's an improvement to assert to make it a little bit lower overhead; sounds fine, get it reviewed, I guess.
A: You know, "what hardware should I buy", that kind of thing. And the other part of this is that, even beyond just how much metadata we have in RocksDB, there's also a question of caches: how much space should be devoted to the onode cache, how much space should be devoted to the block cache, and, within the block cache, how much of that space is being used for indexes and filters?
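For context, the RocksDB knobs that control whether indexes and filters live inside the block cache look roughly like this; a minimal sketch against stock RocksDB APIs, not the exact options Ceph sets:

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

// Sketch: make index and filter blocks compete for block-cache space
// instead of living in unbounded heap outside the cache.
rocksdb::Options make_options(size_t block_cache_bytes) {
  rocksdb::BlockBasedTableOptions table_opts;
  table_opts.block_cache = rocksdb::NewLRUCache(block_cache_bytes);
  table_opts.cache_index_and_filter_blocks = true;  // charge them to the cache
  table_opts.cache_index_and_filter_blocks_with_high_priority = true;
  table_opts.pin_l0_filter_and_index_blocks_in_cache = true;  // keep hot levels resident

  rocksdb::Options opts;
  opts.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
  return opts;
}
```

With those options set, "how much of the block cache is indexes and filters" becomes a question the cache itself can answer, which is what the rest of this discussion circles around.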
A: So this is really just a high-level overview of what I wanted to talk about today; I didn't even link any of the current graphs or data in there. But the very high-level gist of it is that how much metadata we have is intimately tied to the number of objects we have, which probably makes sense, but it's not something we like to talk about.
A: We don't want it to be tied to the number of objects, but it really is, and how many objects you can store is intimately tied to the min_alloc size in BlueStore. So if you have a high min_alloc size and you are writing out, like, 4K objects in RGW, you're going to use a heck of a lot more space than you would if you used, like, a 4K min_alloc size.
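The space math behind that is just rounding every object up to the allocation unit; a quick illustrative sketch (16K matches the bluestore_min_alloc_size_ssd default of that era, but treat the numbers as illustrative):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Back-of-envelope sketch: every object occupies at least one
// min_alloc-sized extent, so tiny objects get padded up.
int main() {
  const uint64_t obj = 4096;  // 4K RGW object
  for (uint64_t min_alloc : {uint64_t(4096), uint64_t(16384), uint64_t(65536)}) {
    uint64_t stored = std::max(obj, min_alloc);
    std::printf("min_alloc=%-6llu -> %2llux space per 4K object\n",
                (unsigned long long)min_alloc,
                (unsigned long long)(stored / obj));
  }
  return 0;
}
```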
A: So, having said that, the difference between using RBD with, like, four-megabyte objects and doing 4K objects in RGW is vast. If you're doing small writes, like 4K writes, to an RBD volume: for a 256-gigabyte RBD volume we're using maybe one gigabyte of metadata. It's really small, and the indexes and filters are tiny, like maybe 25 megabytes of indexes and filters. The per-object metadata is higher.
A: So if your workload is tiny objects in RGW, you need a heck of a lot of DB space to be able to cache everything without it rolling over to the block device, and the indexes and filters are large too. I think it's 1.2 gigabytes of indexes and 800 megabytes of filters, which definitely exceeds our current defaults.
A: So for a 64K RGW workload with the default 16K min_alloc on an SSD, we were talking about roughly six and a half gigs of metadata on a 256-gig object corpus. So, not bad, but still significant, right? And if you've got, like, eight-terabyte SSDs, which some people are now starting to look at: 20 gigs, almost.
A: Well, maybe. I mean, it depends on how full you're making your nodes. But that plus the BlueStore cache means that if you want those indexes and filters cached, and you want, you know, a decent amount of onode cache, we might be talking like 10 gigs of RSS memory for a single OSD process. Yeah.
B: So if we think of this, and if we assume that we only have a simple solution where we have some fixed balance between BlueStore and RocksDB, then we want to prevent the worst case, presumably, right? And that 50% split would mean that you'd have to dedicate 10 gigs of RAM to the OSD in order to effectively cache a one-terabyte SSD that's full of tiny RGW objects, which seems doable.
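A rough reconstruction of where a figure like 10 gigs comes from, scaling the measured 256 GB numbers up to a 1 TB drive; the onode-cache allowance here is an assumption for illustration, not a quoted measurement:

```cpp
#include <cstdio>

// Illustrative sketch of the sizing argument: scale the measured
// 256 GB tiny-object figures (1.2 GB indexes + 0.8 GB filters) up to a
// 1 TB drive and add an assumed onode/omap cache allowance.
int main() {
  const double measured_corpus_gb = 256.0;
  const double index_filter_gb    = 1.2 + 0.8;   // measured above
  const double scaled = index_filter_gb * (1024.0 / measured_corpus_gb);
  const double onode_cache_gb = 2.0;             // assumption, not measured
  std::printf("indexes+filters: %.0f GB, plus onode cache: ~%.0f GB total\n",
              scaled, scaled + onode_cache_gb);
  return 0;
}
```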
A: Yeah. You see the last point I make here, and in this whole thing: I think the priority that we want is indexes and filters first, onodes second, maybe tied with omap depending on the workload, and then SSTs for compaction last. Which kind of gives us this weird, you know, if-this-then-this-then-this pattern going between BlueStore and RocksDB.
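That ordering amounts to a waterfall over the memory budget; a hypothetical sketch (the tier names and helper are invented for illustration, this is not Ceph code):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical waterfall allocator: satisfy each tier's demand in
// priority order until the budget runs out; later tiers get the rest.
std::vector<std::pair<std::string, uint64_t>> allocate(
    uint64_t budget,
    std::vector<std::pair<std::string, uint64_t>> wants) {
  for (auto& [name, want] : wants) {
    uint64_t give = std::min(want, budget);
    want = give;      // record what this tier actually receives
    budget -= give;
  }
  return wants;
}

// Usage mirroring the ordering above:
//   allocate(total_bytes, {{"index+filter", a},
//                          {"onode+omap",  b},
//                          {"sst_data",    c}});
```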
E: Here's my concern. Mark, if I read your email right, it sounds like if we do what you're suggesting we're going to have three caches going on: the Linux buffer cache, the RocksDB cache and the onode cache. And I think what you were suggesting is that if you give it all to RocksDB, somehow that lessens the evil. But it's not an ideal situation.
A: Yeah, either that or we ditch the concept of using a key-value store that requires compaction; I mean, that's the other trade-off, right? You know, there are multiple directions you could head, but yeah. We just don't have a good candidate for that exactly, so we could write one, or we could...
E: I had a related question about compaction, because that seems to be the root of the issue here. So with this RBD workload that Micron, well, that partner, was doing: it's a common workload for RBD to pre-populate the volumes. So you would think that if you're doing random writes to pre-populated volumes, there would be no metadata change; therefore there would be no need for compaction, because the thing...
E: Sorry; sure, sure, no problem. So my thinking is: let's say you've got a set of RBD volumes and you fully populated them, so there's no allocation going on in those volumes, and then you do writes, correct? And let's assume that we're doing 4 KB writes and the min_alloc size is bigger than 4 KB, so we're going to be writing in place.
B: It does change, though, for two reasons. The first is that the data has a different checksum, so the checksums change. The second reason is that every time we update an object, we update the attribute on the object, whatever it is, that has the object info: the version, the log entry, the timestamp and all our random crap. So there's metadata churn as you do writes. But I think, as Mark mentioned, in the RBD case it's small, so it's not a lot of metadata. So there will be compaction, yes, but you're...
A: One of the things I think might be triggering all this compaction that we're seeing: there's a link in there about read/write compaction. It happened that I had some data lying around that I was able to go back and analyze, and there is a lot of time spent in compaction there. I really wonder if it's all of the PG log and dup ops coming in that are making it just thrash compaction, because it's not getting the tombstones out fast enough. I could be wrong, but I...
B: If you're not logging them at all, yes. I'm worried that might have skewed it; that might scale the overhead we see from compaction, because you're going faster. Or, yeah.
B: So, coming back to the main point here: it feels to me like I see sort of two points in the solution space. One is what's in the branch right now, which is basically a 50/50 split between BlueStore and RocksDB. That's clearly not perfect, but it's better than what we have right now, and it means that if you know that your cluster is all block, or that it's all KV-heavy, then you can tune that knob if you want to.
B: You have, like: I have 12 PGs that are all data, and I have 12 PGs that are 50% KV, and I have 4 PGs that are a hundred percent KV, or whatever it is, and you can come up with a blend, an overall blend, based on this: I think that 80% of my memory should go to RocksDB because there's more KV, or 20% of my memory should go to BlueStore because it's more block.
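A hypothetical sketch of deriving such a blend from per-PG stats; the helper and its inputs are invented for illustration:

```cpp
#include <vector>

// Hypothetical helper: average the per-PG KV fractions into one memory
// split (0.0 = all BlueStore cache, 1.0 = all RocksDB).
double kv_share(const std::vector<double>& pg_kv_fraction) {
  if (pg_kv_fraction.empty()) return 0.5;  // fall back to the 50/50 split
  double sum = 0.0;
  for (double f : pg_kv_fraction) sum += f;
  return sum / pg_kv_fraction.size();
}

// e.g. 12 PGs at 0.0, 12 at 0.5 and 4 at 1.0 average to 10/28, about
// 0.36, so roughly a third of the memory would go to RocksDB.
```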
A: Now, with our current settings, we appear to be good up to around 16 million RGW objects on one OSD, and even after that we're probably still not bad; but that's the point at which the indexes and filters are going to start getting paged out for SST files. So I guess the question in my mind is: do we have any idea, for most of our users, how many objects per OSD we end up seeing, even on big clusters? Because, I mean, 16 million objects on one OSD is a fair amount.
A: FileStore doesn't even handle that; FileStore falls over when we get up to about that many objects, because of the way splitting works. So I wonder: do we really have users that are doing significantly more objects per OSD than that right now? I'm sure we will as we get bigger, but it was falling over.
A: They were mistuned; they had, no question, lots of compaction. It was not indexes and filters, and I'm almost a hundred percent sure of that. I could be wrong, but everything we've seen so far indicates this is not an issue with the indexes and filters being paged in and out of memory.
A: So for 64K RGW writes, we had like 6,750, 6,760 megabytes of metadata, and then we had 62.8 plus 42.1 megabytes of indexes and filters. So that ends up working out to about 105 out of 6,760, which is like 1.6 percent. So you're close to two percent; call it 1.6 percent. So...
B: ...of objects per pool. But again, it's a little bit weird, because in the omap object case we don't know how many omap keys there are, and it could be, you know, objects that have four key-value pairs, or it could be objects that have ten thousand key-value pairs, like bucket indexes and things. So I don't know that that's good enough. You can choose a middle-of-the-road value, but I'm...
A: It seems like, as long as we're just doing puts and gets, then regardless of the object size or the min_alloc size and all this stuff, we end up with a relatively, not exact but relatively, tight cluster of the average number of keys per object in the database. It does change some, and I think it's because there's, you know, a set of existing keys. So there's just some static set.
B: I don't know about that; yeah, right. But as far as, like, return on complexity investment... but how...
B: So, okay, my proposal is still to start with the current KV split branch, because it's better than what we have. And then I think the question is: if the end goal is that we're watching the hit rates on RocksDB and on the BlueStore cache and sort of dynamically adjusting based on that, do we want a midpoint, something that's not quite auto-tuning but is sort of a guess on the way there? Sure.
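A hypothetical sketch of what that hit-rate-driven end state could look like; the names and step size are invented, and a real tuner would need damping and hysteresis:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch of the auto-tuning end state: nudge the division
// point toward whichever side is missing more, in small damped steps.
struct CacheStats { uint64_t hits = 0, misses = 0; };

double adjust_split(double kv_share, const CacheStats& kv,
                    const CacheStats& bluestore) {
  auto miss_rate = [](const CacheStats& s) {
    uint64_t n = s.hits + s.misses;
    return n ? double(s.misses) / double(n) : 0.0;
  };
  const double step = 0.01;
  if (miss_rate(kv) > miss_rate(bluestore))      kv_share += step;
  else if (miss_rate(bluestore) > miss_rate(kv)) kv_share -= step;
  return std::clamp(kv_share, 0.1, 0.9);  // never starve either cache
}
```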
A: Let me publish the data that I've got, because that might inform how good a guess would be versus looking at hit rates. Right? Like, you know, how tightly clustered the number of keys per object and the size of the data blob are, per kind of workload, or per... you could even get down to, like, per object, I mean.
B: Why do we need it, I guess, is the question I'm asking. Because at the end of the day it doesn't matter how many objects there are; that's just a proxy for how much metadata we have. So what really matters, looking at your stuff here on the pad, is the 124 gigabytes of metadata; that's the important part. That's what determines how big the indexes and filters are, roughly, right? Whereas if...
A: And we can change the block cache size dynamically. I don't know how to do it, but I saw someone reference it once, in a bug somewhere: hey, we can just dynamically resize the cache. So someone claims that you can do it. I've never seen how to do it, but someone claimed it.
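For what it's worth, stock RocksDB does expose this: a live rocksdb::Cache can be resized with SetCapacity. A minimal sketch:

```cpp
#include <memory>
#include <rocksdb/cache.h>

// rocksdb::Cache can be resized while the DB is open: keep the handle
// passed to BlockBasedTableOptions::block_cache and call SetCapacity.
void resize_block_cache(const std::shared_ptr<rocksdb::Cache>& cache,
                        size_t new_bytes) {
  cache->SetCapacity(new_bytes);  // evicts immediately if shrinking
}
```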
B: My hope would be that, within each of those halves, we could use the memory most effectively, and that it isn't necessary to move the division point that much. So, for example, if you have a bunch of hot RocksDB data and mostly cold RBD data, we'd have enough in the RocksDB part of the pool to fit all the indexes and filters in there. But if half the store is fully idle, then they'll get booted out and we'll be caching...
B: Well, so: you wrote this cache priority list, which I think makes sense. But it seems that if we can use the RocksDB interface to query the size of the indexes and filters, then we can do exactly that. We can put priority one as number one, and then two and three are sort of 50/50; that would be trivial to do, right? And then, if we wanted to get fancy, we could balance two and three by looking at the hit rates.
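Stock RocksDB properties can approximate that query; whether they map cleanly onto what BlueStore needs here is the open question, but a sketch would be:

```cpp
#include <cstdint>
#include <rocksdb/db.h>

// Approximate the index/filter footprint from stock DB properties.
uint64_t index_filter_bytes(rocksdb::DB* db) {
  uint64_t readers = 0, pinned = 0;
  // index/filter blocks held outside the block cache:
  db->GetIntProperty("rocksdb.estimate-table-readers-mem", &readers);
  // blocks pinned inside the block cache (indexes/filters, when
  // cache_index_and_filter_blocks and pinning are enabled):
  db->GetIntProperty("rocksdb.block-cache-pinned-usage", &pinned);
  return readers + pinned;
}
```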
A: Ben's comment is making me wonder, though; maybe he's right. Maybe an index and filter read isn't that bad if it doesn't happen very often, compared to the prospect of doing more onode reads. Maybe the BlueStore onode cache doing a miss there, maybe that's worse than the prospect of doing, like, an index and filter miss.
A: So then, I guess, there's the balancing act of: what's your likelihood of getting a filter miss? That means looking at each level, and the potential at each level of missing. You know, if your level-0 and level-1 filters are more likely to be cached than your level-5 filters, what's your likelihood of actually missing, and how does that balance out versus the...
H: I've done some additional profiling. I was curious why, on the write path on my development machine, I'm seeing a lot of impact from instruction caches, from data caches, from TLBs, from branch predictor buffers; from everything, literally everything in the CPU that requires a warm-up to work effectively. So I've switched to investigating things that could affect all of them, like syscalls.
H: However, the opposite situation can also appear, because futexes are also used to implement condition variables. So if you have a long pipeline divided into stages, with the synchronization between stages implemented on top of condition variables, you will face the load from the kernel side of the futexes as well. So...
H: In the default configuration for SSDs we have two worker threads, two workers per shard, and when we are picking a PG item to work on, we are locking it. We are not doing a try-lock and, if it's contended, getting something unrelated to do; no, we are sleeping, we are waiting, we get scheduled out.
D: I like this whole line of investigation. You know, in our Seastar OSD stuff we threw around swapping a simple lock for a set of lock-free primitives, and I know they're going in on Seastar anyway, but it would be useful to look seriously, up front, at what we could do in a short timeframe, even if we don't worry about...
D: They are a source of high latency; they're a source of scheduling, of wicked scheduling or something, and that's what scares us here. I'm very interested in this analysis because of that, but I think the solution is to try an experiment. The thing I will do next is experiment with ways to take those blocks and sleeps out.
C: Yeah, and I'd like to remind you that during the work on this op history optimization I also ran into problems working with condition variables, which simply made things slower than before I even tried to use them. So I think removing condition variables from at least some of the paths might be beneficial. I'm not sure how beneficial; it depends on the...
B: Yeah, so maybe there are opportunities there, but I'm trying to think of where the key contexts are. One is in the messenger; there's stuff in there that I think is not something we can very easily address. There's the one that Radoslaw mentioned, when we dequeue a PG, where we could choose another PG to work on.
B: Instead, maybe it's worth doing a simple proof of concept where you hack up one of the schedulers to give you the next event and re-queue the one that you didn't successfully lock, and just see if it makes a difference. The other one, though, is that the IO completions are put on a finisher, and those...
H: Still, I'm not sure, but I guess it could bring a really, really significant benefit. It would allow us to implement batching in a very, very transparent way, if we have three operations of the same kind and each of them can be divided into some stages.
H: Let's say we have operations A, B and C, each divisible into stages one, two and three. Then we could try to trade some latency for throughput: instead of going synchronously, I mean A1, A2, A3, and so on, we could get some reuse of CPU locality, like the instruction cache, by handling A1, B1, C1 and then switching to the other stage.
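A hypothetical sketch of that stage-major batching idea, with invented types; the point is only the loop structure:

```cpp
#include <functional>
#include <vector>

// Hypothetical sketch of stage-major batching: run stage 1 for every
// queued op, then stage 2, so each stage's code stays hot in the
// instruction cache instead of alternating stages per op.
struct Op {
  std::function<void()> stage1, stage2, stage3;
};

void run_stage_major(std::vector<Op>& batch) {
  for (auto& op : batch) op.stage1();  // A1, B1, C1, ...
  for (auto& op : batch) op.stage2();  // A2, B2, C2, ...
  for (auto& op : batch) op.stage3();  // A3, B3, C3, ...
}
```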
B: Well, coming back to your first point, I think the main opportunity that I would see for trying a try-lock and avoiding the futexes would be that dequeuing of a PG, so I think that's the proof of concept that I would try. And maybe, if your try-lock fails, just dequeue the next one and unconditionally do a blocking lock and wait on that one, and just move forward; just something simple. That should work most of the time, so start from there, or whatever makes sense.
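A hypothetical sketch of that proof of concept using plain std::mutex::try_lock; the types and queue structure are invented for illustration:

```cpp
#include <cstddef>
#include <deque>
#include <mutex>

struct PG { std::mutex lock; /* queued work ... */ };

// Hypothetical proof of concept: skip contended PGs via try_lock and
// only fall back to one unconditional blocking lock if all are busy.
PG* dequeue(std::deque<PG*>& q) {
  for (std::size_t i = 0, n = q.size(); i < n; ++i) {
    PG* pg = q.front(); q.pop_front();
    if (pg->lock.try_lock()) return pg;  // uncontended: take it
    q.push_back(pg);                     // contended: re-queue, try next
  }
  if (q.empty()) return nullptr;
  PG* pg = q.front(); q.pop_front();
  pg->lock.lock();                       // blocking fallback, as described
  return pg;                             // caller unlocks when done
}
```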
H: From a very different bucket, a quick question I do have: what about the policy for backporting performance-related stuff? I have some pull requests that I back-ported to Luminous, but the branches are still waiting for some acknowledgment. I don't know whether we should... and what about tracking? We are lacking that.
B: As long as they're low-risk, then we should do it; we should definitely backport those things. So I think maybe just send an email with links to the pull requests, so we can make sure that they get reviewed and then merged for the next point release. That's fine; we're not shipping immediately, so they get plenty of testing in the Luminous branch. Okay.