From YouTube: 2018-Jun-21 :: Ceph Performance Weekly
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
Let's just take a quick look. I don't actually have to run in, like, ten minutes to go pick up a daemon, so I won't cut this short. See, it's an RGW one — that sounds fine. That's from Doug doing DFG stuff; let's see, that was for a fragmentation calculation. I think Adam is gonna take a look at that, or was supposed to. No, that's the aging tests.
Like, he's been doggedly tracking this, like, super obscure peering bug. I'm just curious whether it works.
But I think it's getting close. The last thing on the huge pages, Radek, is trying to figure out what page size to use.
Fortunately, the default huge page size looks reasonable for current processors. Basically, I guess it's quite rare to get something different than 2 megs. Maybe if someone is running an odd system with PSE-36, instead of physical address extension (PAE) or AMD64.
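As an aside, a minimal sketch (not from the call) of checking that default on a Linux box — assuming the usual /proc/meminfo source, which typically reports 2048 kB on x86-64:

    // Print the kernel's default huge page size; on most x86-64 systems this
    // shows "Hugepagesize: 2048 kB", i.e. the 2 MiB default discussed above.
    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
      std::ifstream meminfo("/proc/meminfo");
      for (std::string line; std::getline(meminfo, line); ) {
        if (line.rfind("Hugepagesize:", 0) == 0)
          std::cout << line << '\n';
      }
    }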
Okay. The only — my only concern with that throttle one is just making sure that we don't introduce a new wait race condition, where you end up with somebody who's waiting even though the throttle got dropped down. So we need to make sure there's no race between somebody trying to get the throttle and somebody putting the throttle, with something getting stuck. But I had one theory about where there might be an issue that I wasn't sure about — I don't know if you fixed it or not. Okay.
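A minimal sketch of the invariant being asked for here (my illustration, not the patch under discussion): waiters re-check their condition under the lock on every wakeup, and every put or limit change notifies them, so nobody can be left waiting after the throttle drops.

    // Waiters always re-evaluate the predicate under the mutex, so a racing
    // put() or set_limit() can never strand a waiter.
    #include <condition_variable>
    #include <cstdint>
    #include <mutex>

    class SimpleThrottle {
      std::mutex m;
      std::condition_variable cv;
      uint64_t limit;
      uint64_t current = 0;
    public:
      explicit SimpleThrottle(uint64_t l) : limit(l) {}
      void get(uint64_t c) {
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&] { return current + c <= limit; });
        current += c;
      }
      void put(uint64_t c) {
        std::lock_guard<std::mutex> l(m);
        current -= c;
        cv.notify_all();  // wake every waiter; each re-checks its predicate
      }
      void set_limit(uint64_t l2) {
        std::lock_guard<std::mutex> l(m);
        limit = l2;
        cv.notify_all();  // a raised limit may unblock waiters immediately
      }
    };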
Sure, as long as — as long as I don't have to say a whole lot. I guess maybe I'll say something quick, at least. I did all the work on this, and it appears to more or less do what it says, which is good, but it won't work in its current form — it's... what?
There's no coalescing of anything, so we're just, like, packing this stuff into 4K chunks even though they're, like, 1K in size. And then there's also, you know, tons of 4K random I/O happening to all these different logs on disk, rather than, you know, appending these things in the transaction in the write-ahead log in RocksDB. So, at least in my test setup, it's looking like it's kind of a wash between, you know, doing it in RocksDB versus doing this in these random I/Os all over the place. Well...
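To make that overhead concrete, a worked sketch using the approximate numbers from the call (a ~1 KiB log entry padded out to a 4 KiB block):

    #include <cstdint>
    #include <iostream>

    int main() {
      const uint64_t block = 4096;
      const uint64_t entry = 1024;  // approximate PG log entry size from the call
      uint64_t padded = (entry + block - 1) / block * block;  // round up to 4 KiB
      std::cout << "padded write: " << padded << " bytes, overhead: "
                << (padded - entry) << " bytes ("
                << 100.0 * (padded - entry) / padded << "% of each I/O)\n";
      // prints: padded write: 4096 bytes, overhead: 3072 bytes (75% of each I/O)
    }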
The other thing I'm wondering, though — this leads me back to the question of, okay: it looks to me, just based on this, like getting all of this stuff out of RocksDB's database is beneficial, but we're paying a big penalty for having it. At least on this kind of hardware, we're paying a big penalty for having these tiny little PG log writes that aren't even 4K. Maybe it'd be better if it was, like, a 512-byte sector size and we were actually, like, you know, using multiple sectors. Yeah.
How big is that — like 128 bytes? Okay, that's nowhere near it; we're not gonna be batching up enough of them. That's nice if it writes every time, any...
Would something like this alternate scheme work, where you have a single log — a current active log — but then you kind of mark it immutable at some point, and you compact — you know, I guess in this case, if it was not RocksDB's other data, if it was just this stuff — then you just mark it immutable, and then once all the references to it have gone away, then you can delete it. You eat the space amplification, but never, never rewrite. I wonder — I wonder how something like that would do.
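A rough sketch of that scheme as I read it (the names are mine, not actual Ceph code): an append-only segment gets sealed, and is unlinked only once the last live reference to it is trimmed — space amplification in exchange for never rewriting the data.

    #include <cassert>
    #include <cstdio>

    struct LogSegment {
      int live_refs = 0;    // entries in this segment still referenced
      bool sealed = false;  // no further appends once marked immutable

      void append() { assert(!sealed); ++live_refs; }
      void seal()   { sealed = true; }

      // Called when a log entry in this segment is trimmed.
      void release() {
        if (--live_refs == 0 && sealed)
          std::puts("last reference gone; whole segment can be deleted");
      }
    };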
Yeah, we're worried about introducing other kinds of, like, asynchronous background work like that, because you always end up running into some threshold where you have to clean up at least as fast as you're writing. If you're doing the cleanup, maybe you're doing it constantly, with that constant overhead — which is kind of similar to doing it inline — and if you're not doing the cleanup, you're gonna have more variable latency later.
Yeah — I mean, just from the latency perspective, I think it might be worth doing this, even if it's a wash in terms of overall throughput for all these workloads. The other case where it might help would be RGW on hard disks.
So to me, it seems like the area where — like, the place where this could potentially really, really be good, right, would be if you've got a small amount of, like, ridiculously fast persistent storage — you know, NVDIMMs or, you know, Optane or whatever it is — especially if it's, like, reasonable to do cache-line-sized writes to it. Right, then, you know, the waste shouldn't be too bad if it's, like, 128 bytes or whatever, and then presumably the random nature of all...
I mean, even if it's padded out to 4K, I don't think it would matter that much if four random objects can fit in a 4K block. I think the issue is the code complexity, and I think we need to have both, because for hard disks we still want to put it in RocksDB — because of the extra I/O — and then, is that worth it?
What's interesting is the tail latency RocksDB has because of the compaction — it has much higher tail latency on writes. At least it was something like — or was it, like, 25 versus 9? Yeah, yeah — 25,000 versus 9,000 microseconds.
The other thing I'm wondering about here, too, is — okay, so, I mean, right now it's really suboptimal, right? You're shoving, like, 1K of data into a 4K write, whereas on something like — maybe on Optane — maybe you can really do, like, byte-addressable small writes faster. I don't know, I'm just asserting that, but maybe you can. Maybe you don't need to actually do, like, a padded 4K write.
The other thing I was wondering, too, is whether or not you might see a better situation if you did something where you're still coalescing the writes into one log — maybe with, or maybe without, the other RocksDB data — but then just marking those old logs as being immutable until all references to them have gone away, and then not compacting anything that you've marked as being, well, short-lived.
...RocksDB, or you have a custom BlueStore — or whatever-store — write-ahead log that is smart about — yeah, like, it's moved on, and what gets left is space amplification. Potentially bad space amplification, if you end up, like, in a pathological case where you end up with, like, one entry per log that's sticking around, or something ridiculous. But is it in practice that bad? Yeah — if it really did that in practice, you could compact; like, you could say, you know, every hour, compact all this ridiculous stuff that's sitting around. Yeah.
...if the thing's full, you just, like, don't write it and keep the logs around, because it's small, and then you try again the next time. Benefits...
Could it be something as simple, though, as: for certain classes of data — you have them flagged — you avoid compacting them in the first round, but then you keep those files around and you compact them in a second round that maybe is much longer-lasting, right? So for the first, default —
Yeah: default data, you compact once you've hit, you know, however many log files have filled up, in their current settings — the ones they have for saying, you know, you start compacting after two logs or three logs or whatever have filled up — and then for this other class, you start compacting after ten logs have filled up.
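A hedged sketch of that two-tier trigger (the names and numbers here are placeholders from the discussion, not real RocksDB or Ceph options): short-lived data simply gets a much lazier compaction trigger than default data.

    #include <cstdint>

    enum class DataClass { Default, ShortLived };

    struct CompactionPolicy {
      uint32_t default_trigger     = 3;   // compact default data after ~3 full logs
      uint32_t short_lived_trigger = 10;  // let short-lived data pile up longer

      bool should_compact(DataClass c, uint32_t full_logs) const {
        return full_logs >= (c == DataClass::Default ? default_trigger
                                                     : short_lived_trigger);
      }
    };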
...can make it — does that look better? Yeah? All right. So, the gist of this is that when you set BlueStore's cache size to something — let's say 3 gigabytes — you don't get really consistent memory usage for the OSD as reported by top or ps or anything that measures RSS memory. In some of the tests that I've done, like an RBD workload...
...you might end up with, like, roughly four and a half to five gigabytes of RSS memory usage; with an RGW workload that does, like, small object writes, you might end up with something closer to, like, seven and a half. I spent some time looking into madvise and tcmalloc and what it's doing, and when you release memory, all it's really doing is marking it madvise MADV_DONTNEED. So you end up with a bunch of memory that's unmapped, but there's no — as far as I can tell — and I do not claim...
...that this is right; I still am very, very confused as to what the kernel actually does. But it appears to me that the kernel may or may not reclaim those pages, and it may actually do it opportunistically when there's memory pressure, but otherwise just kind of leave them alone, sitting around. Again, I don't know that that's totally right. It definitely seems like it varies by platform — OS X may do something different than Linux. Linux, you know — theoretically I thought they were supposed to be reclaimed right away, but I don't know that that's necessarily true.
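For reference, a small sketch of the mechanism being described, using the gperftools MallocExtension API (my example, not the branch under discussion): ReleaseFreeMemory() madvise(MADV_DONTNEED)s free spans, and those bytes then show up as "unmapped" rather than leaving the process immediately.

    #include <gperftools/malloc_extension.h>
    #include <cstdio>

    int main() {
      MallocExtension* me = MallocExtension::instance();
      me->ReleaseFreeMemory();  // internally madvise(MADV_DONTNEED)s free pages

      size_t heap = 0, unmapped = 0;
      me->GetNumericProperty("generic.heap_size", &heap);
      me->GetNumericProperty("tcmalloc.pageheap_unmapped_bytes", &unmapped);

      // "Mapped" memory as defined later in the call: heap size minus the
      // pages we've told the kernel it may reclaim.
      std::printf("heap=%zu unmapped=%zu mapped=%zu\n",
                  heap, unmapped, heap - unmapped);
      return 0;
    }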
So anyway, the goal of this is to try to control the memory usage of the OSD by tuning the cache size in BlueStore based on some target. So in this case there's an option that's been added — I think, in the branch I have right now, it's called, like, osd memory soft cap or something like that, but maybe "target" is a better name, I don't know — and we try to then...
...tune the cache size so the overall memory usage of the OSD is around that target. Memory usage in this case I've defined as the amount of mapped memory: so that is basically the heap size of the process minus the unmapped memory, which is what you get once you do tcmalloc's release-memory call or whatever. So the branch appears to more or less be working. This is an RBD workload; the RGW numbers in the other tab aren't quite done yet.
I'm not finished pasting them in, but you can see, basically, at first, before any work is really being done — when it's just kind of doing this pre-fill or whatever — we aren't using very much memory yet. So this auto-tuning thing will set the — we started out, actually, at the very beginning, using the default flash cache size from BlueStore, which is around three gigs, but very quickly, since the mapped memory usage is low, we push that all the way up to the OSD target.
So, potentially, in a theoretical world, you could have the entire amount of memory be devoted to — the entire mapped memory going to — the BlueStore cache, which isn't real; but until we start using lots of memory, that's what it's setting it to, as kind of an upper bound. Once we start doing writes — like, real writes; this is, like, a pre-fill stage where they're four-megabyte writes — in this case the data caching in BlueStore is enabled, and so, very quickly...
That's what I'm calling it, but — that's, yeah, that's what it is: the heap size minus unmapped pages. Okay — if there's a better term for that, I'm happy to use it; that's just what I came up with in, like, you know, two seconds when I named it. And the actual RSS memory usage is, like, kind of variable between the amount of mapped memory and the heap size. So that's, to me, indicating that the kernel is, like, opportunistically reclaiming these things, but not guaranteed to do so.
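Putting the pieces together, a minimal sketch of the feedback loop being described (my illustration, not the actual branch): sample mapped memory each interval and nudge the BlueStore cache toward the target, clamped at the hard-coded floor mentioned later in the call.

    #include <algorithm>
    #include <cstdint>

    struct AutoTuner {
      uint64_t target;                 // e.g. the osd memory target, in bytes
      uint64_t cache_min = 128 << 20;  // hard-coded 128 MiB floor (see below)
      uint64_t cache = 3ull << 30;     // start from the configured default

      // mapped = heap size minus unmapped pages, sampled each interval.
      void tick(uint64_t mapped) {
        int64_t err = (int64_t)target - (int64_t)mapped;
        // Move the cache a fraction of the error each tick so it converges
        // instead of oscillating.
        int64_t next = (int64_t)cache + err / 4;
        cache = std::max<int64_t>(next, (int64_t)cache_min);
      }
    };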
Yeah — really, it'd be nice if the user was just saying, okay, here's how much memory I want to target for the OSD, and it just goes and tunes itself. I don't want to think about ratios; I don't want to think about what cache BlueStore has. Maybe — I don't know — maybe I want reporting on it, I don't know. Yeah.
And it'd be nice to know that, like, you know, 10% of it is being used for BlueStore, 90% for PG logs — that might be useful information for, like, understanding other things — but we can report all that stuff out. So I guess there are just a couple things that come to mind here. One is that right now, at the beginning, it immediately boosts the cache all the way up to the target. We could probably build in, like, a baseline that's a little bit more conservative, because probably that's not the case...
...here now. But yeah, you're right — I mean, it can overshoot a little bit. But, interestingly, at the beginning it doesn't overshoot as badly as it does during normal operation. Yeah — like, during normal operation, once you get into it, you've got RocksDB: when it compacts, it uses way more memory than it normally does, unless you set, like, the hard cap, which then can block writes. So RocksDB can, when it's compacting, read tons of stuff into the block cache and overshoot the block cache target.
Because what happens, if you target RSS, is the RSS usage doesn't go down — it remains, like, flat — but you keep decreasing the cache in an attempt to compensate for it, and then you end up with, like, no cache and a huge amount of unmapped pages, and it's super irritating. That's why I tried it first, so...
...of the target versus the RSS, to adjust what your factor is, and so on. You know, you would do this maybe, like, five times over the course of this whole time span: you, like, slightly adjust what your ratio of your mapped target — or your target to your actual RSS — is, because RSS is eventually tracking somewhat, right? Well...
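A sketch of that suggestion as I understand it (my interpretation, with made-up names): rather than chasing RSS directly, occasionally correct the mapped-memory target by the observed target-to-RSS ratio, damped so one noisy sample doesn't swing the cache.

    #include <cstdint>

    struct RssCorrection {
      double ratio = 1.0;  // mapped-target / observed-RSS, learned slowly

      uint64_t corrected_target(uint64_t user_target, uint64_t observed_rss,
                                uint64_t mapped_target) {
        // Update only a handful of times per run, and blend with the old
        // value, since RSS tracks mapped memory only loosely.
        double observed = (double)mapped_target / (double)observed_rss;
        ratio = 0.8 * ratio + 0.2 * observed;
        return (uint64_t)(user_target * ratio);
      }
    };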
The problem seems to — well, it seems to be dependent on memory pressure. Like, that's the feeling I'm getting right now: if there's no memory pressure, the kernel's just, like, whatever, you know, and you might end up with, like, a ton of RSS memory usage; whereas if it's under memory pressure, like, this number might actually be way, way closer, if my hunch...
This was on my dev box, so I can do that pretty easily, since I don't have a ridiculous amount of memory. I think the harder part, though, is we don't know what the high memory watermark is going to be. Like, if RocksDB ends up, like, way, way overshooting its cache size, the high-water mark could be, you know, maybe six gigs or something. If there's no memory pressure, then the RSS value...
Exactly, yeah — I think you're right. I think what we probably have to do is say that this will approximate your RSS memory usage under memory pressure, with a big asterisk, and have, like, an FAQ-type thing: why is my RSS higher than my target memory? Well, because there's no memory pressure on your system, and if you do something like X so that you have memory pressure, the kernel should reclaim it, and it'll converge towards what the target is. And...
This is everything except for the RSS memory usage with an RGW workload, and in this case, instead of using, like, seven and a half gigs or whatever of memory, we've tuned the cache size a little bit lower than it was on RBD. So on RBD the cache size was closer to, like, 2.8 gigs; on this one the cache size was down around, like, you know, 2.4 instead. But, interestingly, that seemed to improve things.
Right now I've got, like, a hard-coded minimum of, like, 128 megabytes or something ridiculous like that. But, you know, we can decide what we want that to be and make it an option; it's not a big deal. Sorry — for the cache size.
Okay, so, just going back to the beginning — the one user-facing question: should we just make it so that if you set osd target memory — if you set that option — then the BlueStore cache size is just completely ignored, and then we have two new options to go with it, osd target memory ssd and hdd, kind of like we do with the BlueStore cache size, and we just set a new sort of default memory footprint? I...
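A hedged sketch of the option fallback being proposed (the option names here are placeholders from the discussion, not settled Ceph option names): an explicit target wins and the legacy cache-size knob is ignored; otherwise the per-device-class default applies.

    #include <cstdint>
    #include <optional>

    struct OsdMemOpts {
      std::optional<uint64_t> osd_target_memory;    // user-facing knob
      uint64_t osd_target_memory_ssd = 4ull << 30;  // per-device-class defaults
      uint64_t osd_target_memory_hdd = 2ull << 30;  // (illustrative values)
      uint64_t bluestore_cache_size  = 3ull << 30;  // legacy knob

      uint64_t effective_target(bool is_hdd) const {
        if (osd_target_memory)           // explicit target wins; the BlueStore
          return *osd_target_memory;     // cache size is then ignored entirely
        return is_hdd ? osd_target_memory_hdd : osd_target_memory_ssd;
      }
    };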
I'd say let's — let's make it consistent with all the other stuff that auto-tunes. Right now the auto-tuner will take the default ratios and start out at them, but then it will change them. So if we're doing that for those, I think we should do the same thing here. But if we want to just ignore it completely, then we should have, like, everything ignored completely, and just, like, start out with — I don't know — some...
I mean, I think the ratios still make sense within the BlueStore bucket, because it's trying to figure out what the starting point is — or how to divide the memory; we don't know that yet. Eventually, maybe those will become obsolete, because the cost-based cache auto-tuning stuff that you're talking about in that other discussion, or whatever, would work.
But I think the goal should be that this is the only option that they set, right, and everything else is just using the default ratios, or is auto-tuned. And in this case, the main thing that's being traded off against, like, BlueStore is just, like, PG logs — and we have options for PG logs, but they're not changed at all; they're, like, fixed, and we're not changing that immediately. So for now, it just doesn't make sense for the BlueStore cache size to do anything.
I think that's the main thing I would change: make it so that until — until we, like, have reached our full memory usage, I would, like, have a conservative bound of, like, only eighty percent goes to BlueStore. Or you can look at the mempools and, like, see whether they're all — they're fully populated. If the mempools are only at 10 percent of their configured capacity — the PG log, say — then we should also be conservative about cranking up BlueStore, because we know we're gonna have to claw it back anyway.
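A sketch of that conservative-baseline idea (illustrative only; the 80% and 90% figures come from the discussion or are my stand-ins): cap BlueStore's share of the budget while the mempools are still mostly empty, since memory handed to the cache now has to be clawed back later.

    #include <cstdint>

    uint64_t conservative_cache_target(uint64_t osd_target,
                                       uint64_t mempool_bytes,
                                       uint64_t mempool_expected) {
      double fullness = (double)mempool_bytes / (double)mempool_expected;
      if (fullness < 0.9) {
        // Mempools (PG logs etc.) haven't reached steady state yet, so only
        // let BlueStore take, say, 80% of the remaining budget.
        return (uint64_t)(0.8 * (osd_target - mempool_expected));
      }
      return osd_target - mempool_bytes;  // steady state: cache gets the rest
    }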
Yeah, whatever — it could just be something like that, I guess, or even, like, 25% of it or something. Like you said, if you just happen to, like, hit the cache before the first iteration, right — when the thread is starting up, the mempool thread in BlueStore — you know, as soon as you enter that while loop, it's gonna, like, adjust it and change it. But, you know, if, between the time when that happens, the cache, like, overshoots, they could — I guess you could end up — and...