From YouTube: 2018-Apr-05 :: Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
B
Yeah, let's see: some fuse changes, the ostream thing. Oh yes, good, that finally merged. And in my Chrome tabs there's this async lock thing, which is kind of encouraging; I was just moving the message put basically outside of the mutex critical section, since we were seeing some contention there.

B
The recovery optimization, that's a big one that is failing sort of spectacularly in QA, right. That's only... it's more work for Josh, yeah, exactly. All right, and then there's the QAT one, which keeps coming up. That one has been in flight for a long time; it would be great to get it in, but I don't know how much progress they're making. Anyway, let's see, there's a BlueFS huge pages one, huge pages for the write buffer. That sounds good.

B
And let's see, discard: there's some issues with that too, on the performance side. I think the issue with discard is that we added the synchronous discard code already, but when we turn it on everything goes slower, which is sort of not surprising given how annoying a lot of the discard implementations on SSDs seem to be.

B
Just look at what the performance would be like, and then, assuming that looks good, we have even more rewriting to do for compatibility with the old one anyway, tons of work. And last I heard, the sort of upper bound on the benefit that we can get from this, from just commenting out sending the omap and not writing the PG logs at all, was tiny. So is this worth pursuing at all?

B
I still just want to see the current master with the omap stuff turned off, so we're just not writing the omap; no, we're still accumulating that thing in memory, and then just look at what the delta is, because that's basically an upper bound on what we're going to get, right. We're not writing it to a new place, we're just not writing it, so the old place will get no additional compaction overhead from RocksDB.

B
It still feels like we need to repeat the test from six months ago where you just stop writing them at all, because eventually that's going to be replaced by writing it somewhere else, but for now just not even write it, just remove it, and then see what the change is. Yeah, that has to be the beginning, because if it's less than five percent, I think we have bigger fish to fry.

A
I think it's really going to depend on whether or not the behavior that we invoke in RocksDB is a bottleneck overall, and that depends on the hardware, both the CPU and the device that you're on. You know, if you've got unlimited CPU, maybe recalculating the bloom filters constantly because you've got all these extra keys coming in doesn't matter.

B
The complexity is roughly in the size of the write buffer; that's where the CPU gets burned. And write buffers are big because otherwise you push more stuff to L0, so that just sort of factors into the complexity. I guess I don't know; there are a million variables, but it seems like until we can show that the best-case scenario win is good, this feels like a big distraction.

A
Yeah, it's not just write amp, right. It's that you're now also having RocksDB do a lot of extra work for things that don't matter, like recomputing bloom filters and doing other things on keys where none of this is going to matter. So that's, in my mind, kind of the craziness here.

F
At a certain level, it takes the data and rewrites the table a number of different times; that's the complexity I mean. Maybe I'm missing something, but this short-lived data doesn't need to be treated the same way as long-term data, and so confusing those time domains doesn't feel fundamentally correct.

C
So if we could run that test with a heavy write workload, like an RGW index emulation, we'd probably see the biggest impact on that kind of workload, compared to an RBD one, which just has log entries going into omap, where it doesn't matter as much, right.

A
I also really want to highlight that this isn't just a performance issue, right. I mean, in the tests I ran, if you look at the difference in the amount of data flushed, this is an issue with how much wear we're causing on SSDs. In one one-hour test it went from like sixteen gigabytes flushed to eight gigabytes flushed; it's like half the amount of data.

B
I'm just saying that I think we need more information to know whether this is something that we should address in BlueStore, because that amplification behavior is going to be different with the new store, and so if it doesn't make that much of a difference and we're targeting something different anyway, I'm not sure it's a useful tangent. But it's hard to say without knowing if it's actually going to make it faster or not, and by how much. So let's do that next and then we can discuss it next week.

C
So, okay, did you want to talk more about the creating-PGs stuff, as opposed to the rest of the list? Oh yeah, sure, I had some questions. I guess, if we did go with option D, which is essentially creating one PG and then splitting it out many times...

B
That would be possible, but I think, in reality, I don't think it really matters.

B
But it basically means that we can make sure that the PG is successfully created.

B
Before pg_num changes, which means that all these race conditions in the OSD don't have to be considered. I have to check; I think I have to make sure that we can't reach the same race condition with a peering message from another OSD that would trigger a PG create on an OSD, but that's easier to deal with, because those messages expire at an interval change and the mon create ones don't. That's sort of the annoying difference and why option B would make sense.

B
Yeah, anyway, I don't think it's actually that hard to change to make the merge work. The trick is adding something like a pg_num target field; there are only a couple of places at the bottom where we actually set those values, in create pool and then with set_pg_num and get_pg_num, and some of the pg_num comparisons, not even those ones. And so the only sort of tricky bit is that it needs to decide where the...

B
...where the thing that actually changes it lives, whether that's the manager or the monitor. I think the manager makes more sense because it would fit in with the auto-tuning stuff that John is doing, but it actually feels like this is one layer down from that, where the stuff that John's working on will make sort of the policy decision of what we want it to be. Yeah.

A
I think the plan right now is to kind of take what we've got and assign memory to RocksDB's high-priority cache, which unfortunately means only using RocksDB's LRU cache. This also requires some changes in RocksDB itself, but then, after that, we basically allow memory to be assigned up to the ratios that are configured for each of the different kinds of caches we have, so the onode cache and buffer cache, and potentially other things.

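For context, the RocksDB side of this maps to the LRU block cache's high-priority pool. A rough sketch of how that knob is exposed through the public RocksDB options follows; the values are illustrative and this is not the actual BlueStore patch being discussed:

```cpp
#include <rocksdb/cache.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

// Illustrative only: reserve part of a shared LRU block cache as RocksDB's
// high-priority pool, so index/filter blocks are kept preferentially.
rocksdb::Options make_options(size_t cache_bytes) {
  rocksdb::BlockBasedTableOptions table;
  table.block_cache = rocksdb::NewLRUCache(
      cache_bytes,
      /*num_shard_bits=*/-1,
      /*strict_capacity_limit=*/false,
      /*high_pri_pool_ratio=*/0.5);   // fraction reserved for high-pri entries
  table.cache_index_and_filter_blocks = true;
  table.cache_index_and_filter_blocks_with_high_priority = true;

  rocksdb::Options opts;
  opts.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table));
  return opts;
}
```
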
A
Then, if something doesn't need that memory, if one of those caches says hey, I don't even need this much, you know there aren't that many onodes or whatever, then it can be assigned to one of the other things. This was, I think, more or less your idea, Sage, right? I mean, that's kind of what I think we were getting at when we finished that meeting yesterday.

B
That's the next chunk, and then any memory left over we distribute proportionally, based on however you configure it, right. Unless one of the caches is underutilized: like, if we say 50/50, and so the BlueStore cache has 500 megs and it's only using 100 and doesn't seem to want any more, then we can give 400, or maybe 300, of it to RocksDB, until it looks like BlueStore is actually trying to use what it has. Yep.

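A minimal sketch of the rebalancing rule just described, with made-up structure and field names (not Ceph's actual cache code): each cache gets its configured ratio, but if a cache is not using its share, the unused portion is lent proportionally to the caches that still want memory.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical names for illustration only.
struct CacheShare {
  double ratio;          // configured fraction, e.g. 0.5 for a 50/50 split
  uint64_t in_use;       // bytes the cache is actually using right now
  bool hungry = false;   // wants at least its configured share
  uint64_t assigned = 0; // result of the rebalance
};

void rebalance(std::vector<CacheShare>& caches, uint64_t total_bytes) {
  uint64_t leftover = 0;
  double hungry_ratio = 0;
  for (auto& c : caches) {
    const auto share = static_cast<uint64_t>(c.ratio * total_bytes);
    if (c.in_use < share) {        // underutilized: keep only what it uses
      c.assigned = c.in_use;
      leftover += share - c.in_use;
    } else {                       // wants its full share (or more)
      c.assigned = share;
      c.hungry = true;
      hungry_ratio += c.ratio;
    }
  }
  // Lend the unused memory to the hungry caches, proportionally to their
  // ratios, until the underutilized cache starts asking for it again.
  for (auto& c : caches) {
    if (c.hungry && hungry_ratio > 0)
      c.assigned += static_cast<uint64_t>(leftover * (c.ratio / hungry_ratio));
  }
}
```
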
A
So what I wanted to do is implement exactly that; that's kind of the first step. But then what I want to do is make that step right there, that last piece we just talked about, repeatable for different tiers of memory. So the idea would then be that you do that step for high-priority things in all of the caches, maybe onodes first, omap second, data third, but you're looking at maybe only the last five minutes or ten minutes, or whatever we decide, of cache data.

D
The real-time coarse clock can be serialized and shared between machines. The monotonic clock won't work between reboots; that's the big thing, you don't want to serialize it anywhere, but otherwise it should be fine. Yeah, the coarse real-time one for wall time, but if it's in-memory, coarse monotonic should do fine, or if you don't even care about cross-CPU you can just grab the TSC.

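For reference, a small sketch of the clock options being weighed here (Linux/x86 assumptions; these are the standard interfaces, not Ceph's clock wrappers):

```cpp
#include <chrono>
#include <ctime>
#include <x86intrin.h>   // __rdtsc(), x86 only

int main() {
  // Wall clock: meaningful across machines and reboots, safe to serialize,
  // but it can jump (NTP, admin changes).
  auto wall = std::chrono::system_clock::now();

  // Monotonic clock: never goes backwards, good for in-memory age tracking,
  // but its epoch is arbitrary, so it must not be serialized or compared
  // across reboots or between machines.
  auto mono = std::chrono::steady_clock::now();

  // Coarse monotonic clock (Linux): much cheaper to read, millisecond-ish
  // resolution, plenty for cache-aging decisions.
  timespec coarse;
  clock_gettime(CLOCK_MONOTONIC_COARSE, &coarse);

  // Raw TSC: cheapest of all, but per-CPU; only usable when cross-CPU or
  // cross-host comparisons don't matter.
  unsigned long long tsc = __rdtsc();

  (void)wall; (void)mono; (void)coarse; (void)tsc;
  return 0;
}
```
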
A
Maybe it's moderately more, I don't know. I don't think you would have a whole lot of extra overhead with it, but maybe it'd be worse than the timestamps, I don't know. I mean, if you had, say, a vector of intrusive lists, do you actually think that iterating through that vector of smaller intrusive lists, rather than one bigger one, would really be...?

A
Oh yeah, we can try the timestamp; I mean, it'd be easy to add, right. The only thing with the marker scheme I was thinking is that right now iterator_to is constant time, if I understand correctly the way that they do that. So if you can get an iterator to a marker, you can just get it, rather than searching through timestamps to find kind of when something was. Yeah.

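A tiny sketch of that marker idea with boost::intrusive, whose iterator_to() is constant time; the names and structure here are made up for illustration, not taken from Ceph's cache code:

```cpp
#include <boost/intrusive/list.hpp>
#include <cstdint>
#include <iterator>

namespace bi = boost::intrusive;

// Dummy marker entries are linked into the same intrusive LRU list as real
// items; iterator_to() finds a marker's position in O(1), so no timestamp
// search is needed to know where a tier boundary sits.
struct Entry : public bi::list_base_hook<> {
  uint64_t id = 0;
  bool is_marker = false;
};

using Lru = bi::list<Entry>;

int main() {
  Lru lru;                   // front = most recent, back = oldest
  Entry a, b, c, marker;
  marker.is_marker = true;

  lru.push_front(a);
  lru.push_front(marker);    // everything pushed after this is "newer"
  lru.push_front(b);
  lru.push_front(c);

  // O(1): jump straight to the marker's position in the list...
  auto it = lru.iterator_to(marker);
  // ...and trim everything older than it (here: just 'a').
  lru.erase(std::next(it), lru.end());

  lru.clear();               // unlink remaining entries before destruction
  return 0;
}
```
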
B
It seems like the question is: can you get away with just looking at the end of the cache and how old it is, and is that enough information to make a step, or do you need to know, like, what the timestamp is at the 75% position? Or can you just know what the oldest is, and make it so that it sort of walks towards a better solution, or whatever?

B
I think just looking at the tail doesn't necessarily fit into the idea of thinking in terms of tiers of priority, high priority and low priority, but it might be enough just to say that, you know, the oldest onode is two days old and the oldest thing in RocksDB is only, say, two hours old, so therefore maybe you...

B
I'm still not completely convinced that we know what we should do. Say the oldest RocksDB data is two hours old and the oldest onode is two days old.

A
My very coarse-grained look at this right now is that I assume a lot of people are probably going to have just idle data on their cluster that we don't care about. So really, the thing I'm more interested in right now is not devoting memory to cold onodes, and instead letting that free up for data.

A
It seems to me that we see really dramatically different behavior depending on whether we prioritize onodes versus omap for different kinds of workloads. Like, RBD really benefits from making sure that onodes are heavily cached, whereas RGW really benefits from making sure that omap is heavily cached. That was kind of... yeah.

B
I did a back-of-the-envelope calculation, I wrote it down, but it was for omap: if you had a 10 terabyte SSD that is full of big-omap RGW objects, how many objects would it actually be? I can't remember; I'd have to redo the calculation, but my recollection was that it was not that much, like they would all fit.

A
First things first, RocksDB also has this kind of other, separate issue where the block cache memory spikes up during compaction, and we might have to treat that differently, because giving RocksDB more cache might help with that kind of odd scenario, which we don't deal with in our other caches.

E
One thing: the size of the allocations, the disk allocations for BlueFS. At the moment it's one meg; in the pull request related to jumbo pages I just increased it to two megs, only because two megs is the default value, the typical value, of a huge page. Is that okay, is the increment okay, or do you want me to introduce some extra complexity, basically BlueFS complexity, to divide one huge page into two one-meg buffers?

B
That sounds great. I mean, at the end of the day it probably doesn't even matter that much whether this matches perfectly; it should match, but it doesn't matter if it doesn't perfectly match the page size, the huge page size, because there aren't that many files that BlueFS has open at a time, and so even if you wasted one meg for each open file, that's not that much. We might as well make it match, though.

E
Linux offers two flavors of jumbo pages. The first one is the well-known explicit huge page mechanism, where the operator is obliged to preallocate, to set the number of huge pages that the kernel takes to form the pool of huge pages available. The second mechanism is transparent huge pages, which is best-effort; actually, there is no guarantee you are getting the large page.

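For reference, a minimal sketch of those two mechanisms (Linux-specific; it assumes 2 MiB huge pages and, for the explicit case, pages reserved by the operator beforehand, e.g. via /proc/sys/vm/nr_hugepages):

```cpp
#include <sys/mman.h>
#include <cstdio>

int main() {
  const size_t len = 2 * 1024 * 1024;   // one 2 MiB huge page

  // 1) Explicit huge pages: drawn from the pre-reserved pool; mmap fails
  //    outright if no huge pages are free.
  void* explicit_hp = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (explicit_hp == MAP_FAILED)
    perror("mmap(MAP_HUGETLB)");

  // 2) Transparent huge pages: a normal anonymous mapping plus a hint; the
  //    kernel promotes it to a huge page on a best-effort basis, no guarantee.
  void* thp = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (thp != MAP_FAILED)
    madvise(thp, len, MADV_HUGEPAGE);

  return 0;
}
```
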
E
Okay, what is also interesting is that I can see cycles spent on the trampolines in the PLT, the trampolines for functions related to bufferlist. I guess it's because we are compiling our stuff with -fPIC, which means that all calls need to go through the PLT.

E
In the perf profile I can see that this thing, the PLT trampolines related solely to bufferlist, consumes around five and a half percent of cycles. Not much, it's really a thin thing, and I guess much more important will be that the compiler is not able at the moment to make some optimizations, like eliminating repeated code and stuff like that.

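A small illustration of the -fPIC/PLT point, using hypothetical functions rather than the actual bufferlist code: with default symbol visibility in a shared object, cross-translation-unit calls remain interposable and may be routed through a PLT trampoline, while restricting visibility (or building with -fno-plt or LTO) lets the compiler bind and inline them directly.

```cpp
// Build as part of a shared object, e.g.: g++ -O2 -fPIC -shared ...

// Default visibility: the symbol can be interposed at load time, so calls
// from other translation units may go through the PLT trampoline that shows
// up in the profile.
void bl_append_default(char c);

// Hidden visibility: the symbol cannot be interposed, so the compiler and
// linker can emit a direct call (and, with LTO, inline it), avoiding the
// trampoline.
__attribute__((visibility("hidden")))
void bl_append_hidden(char c);

void bl_append_default(char c) { (void)c; /* ... */ }
void bl_append_hidden(char c)  { (void)c; /* ... */ }

void caller() {
  bl_append_default('x');  // may go through the PLT under -fPIC
  bl_append_hidden('x');   // direct call
}
```
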
B
Cool, yeah. I mean, I think the trick is that ideally you need the history, the historical log, just to verify by checking whether it's a disk space issue also.