From YouTube: Ceph Performance Meeting 2018-12-06
A: Some infrastructure related solely to debugging, to buffer tracking, to tracking c_str() calls, to tracking allocations down to their locations, unfortunately isn't cost-free. It appears we are spending quite a bit; just take a deeper look at the output from the profiler. The c_str() tracking in buffer::ptr is just one example. There are at least three things: tracking c_str() calls, tracking allocations, and tracking CRC-related stuff. All of those use atomic accounting and, of course, they are guarded with a conditional jump. But that's the situation we have. First of all, we have some branch mispredictions, and maybe the CPU is trying to speculatively execute those atomic instructions; of course they cannot retire, but I'm curious what the microarchitectural details might be. Does that mean we have cache-line ping-pong while the CPU is speculating?
D: So it just runs, you know; put it on whatever, yeah, anyway. That's a very good question; you can take a look.
It reminds me that I think we should consider killing the CRC caching too.
D: Yeah, maybe; I don't know if it would be usable. The intention is that we would avoid a double CRC when we first see things coming off the wire and then do a CRC check again on the data when we're journaling it in FileStore. That's the only time when you're seeing the same buffer twice with the same seed, and I think we care increasingly less about that FileStore case; it's just a bunch of crap in the code.
D: Yes, but it's done on a per-blob basis, so it's slices of the overall data, and so the granularity is usually 4K chunks, or in RGW cases bigger chunks; but I don't think it's actually hitting the cache in most cases, only exceptionally.
D: A very small write, if it's less than whatever the blob size is, which is like 512K or something. So if you put a small object, then it might hit the cache and get the same CRC, but that's a pretty narrow case, and I think it's also a small buffer to CRC in that case, so the impact is probably lower. Okay.
B: So, for 25421, there's a proposal to add a hard cap, a minimal OSD memory target, and then also to manipulate the cache max and the target value based on the mapped memory and fragmentation. I don't think we should actually do either of those, Sage, even though I know why they're in there.
B
I,
don't
think
so.
What
that
does
is
basically
increases
the
target
based
on
the
memory
base,
but
the
whole
idea
behind
those
two
things.
B
The
fragmentation
in
the
base
was
to
reduce
things
when
you're
near
the
upper
boundary
you
you
want
to
keep
the
memory
kind
of
below
the
target
at
all
times,
if
possible,
rather
than
letting
it
cache
shoot
over.
That's
kind
of
how
that
works.
What
his
PR
does
from
what
just
the
my
quick
look
at
it
makes
it
look
like
what
we're
doing
is
saying
well
if
the
targets
too
low,
let's
increase
the
target,
so
that's
kind
of
up
around
where
the
the
minimum
is
based
on
the
base
value
and
the
fragmentation.
B: Yeah, I'd like to know too. I mean, the one exception to it is setting the OSD memory cache min, right? You say: okay, we don't want the caches, in aggregate, to shrink below this value, because it will, you know, break things if it does, right? That's the exception that we make. I don't think we want to muck around with the target, because that's kind of what the user tells us they want.
B: I still wonder if we should also be looking carefully at the difference between sharding over column families versus sharding over databases, just based on other people that have done this kind of thing and have said that, at least on NVMe, sharding over databases is the way to go. But I'll defer to the people that are working on it.
D: Really interesting, sure. This has turned into basically a rewrite of the original module that John wrote, but it basically adds a couple of new pool properties. There's a pg_num_min that's optional, and there are target_size_bytes and target_size_ratio, also optional, that let the user basically tell the cluster how big it thinks the pool is going to get, so that even though the pool is empty, we can scale PGs based on what the eventual size is going to be.
D: It's sort of raising pg_num as we go; but in the absence of that, or incorporating that information, basically the system will look at each hierarchy of the CRUSH map that we're distributing data over, add up all the OSDs, and look at the target number of PGs per OSD to figure out how many PGs should be on those OSDs; and then it'll look at all the pools that are consuming those OSDs.
D: That's either off, which means it doesn't do any of that; or on, which means it will automatically adjust pg_num; or warn, in which case it will raise a health warning if it's off, if it wants to do something that it isn't doing. And then there's a new command that you run that basically prints a little chart showing all that information: the sizes, the target size if it's set, the ratio, the total capacity, the target ratio, the current pg_num, the target pg_num.
D
If
it
wants
to
change
it
in
the
mode,
so
you
can
just
dump
one
thing
you
can
see
sort
of
what
the
cluster
thinks
you
should
do.
You
can
either
do
it
or
don't
do
it.
You
change
the
mode,
that's
it.
So, the interesting thing I think from this perspective: I did a little bit of back-of-the-envelope math. Basically, if you create a pool with, like, one PG or whatever, you don't give it any information, and then you fill it up and write, say, a petabyte.
D: The question I wanted to answer was: how many times is it going to split, basically, and how much data is going to move when it splits?
What's the overhead of not tuning anything, versus telling it exactly how big it's going to get so it doesn't have to do any splitting or merging? And it turns out it doesn't actually really matter how many times it splits or makes adjustments.
D: That's sort of at the margin, because what's actually happening is that the amount of data movement is a series: 1/2 + 1/4 + 1/8 + 1/16 + ..., depending on how many times it splits, and in the limit that approaches 1. Basically, all data will move approximately once, which means that if you were to write a petabyte, it's going to write 3 petabytes, because it's triple-replicated, and then it's also going to move 3 petabytes, because every object is going to move approximately once if it's automatically managing everything.
D: So the nice result is that it is bounded. And if you want to do better than that, you can tell the system ahead of time how big the pool is going to be, and then we can avoid doing those adjustments.
D: Yeah, that's it, I think. The only thing to point out is also that when this makes adjustments, it just sets pg_num to what it should be, and there's already a piece of code in the manager that will basically make small adjustments to the pg_num; and it's throttled based on the percentage of degraded objects. They set a global threshold that you want no more than 5% of your cluster to be, not degraded, but misplaced; sorry, max misplaced, basically. That's a global setting.
C: Yes, I responded to some feedback, and thanks to Casey for pointing out that libfmt is already available; we've got that available now in RGW, and basically all the kind of gnarly append calls that I wasn't very happy about are now fmt::format, or rather libfmt. So it's ready for another look, if anyone has free time and wants to have some fun. That's all.
C: Where most of them were... and basically, you know, tracking them all down is just a bigger project, and I kind of picked this one partly because I had the blessing to do it and partly because I thought it would be a fair case study of what the typical usages and stuff look like. And sure enough, most of them are things we can probably replace with std::string, and for the most part the rest of them can be replaced with vector.
C: There was, I think, one case I ran into where I did, and Joelle pointed it out: I replaced a VLA with a vector, and I think the reason for that, as I recall, was that I had just written some string transform function in a way that wouldn't take that array. I can revisit that or leave it as is; it's really a 16-byte buffer in this case, so I don't think it'll matter.
C: It's passing whatever unit tests we have, and it's certainly ready for review. The kind of error that I would be suspicious of in this: it's possible I have introduced off-by-ones or something like that, just because juggling the null handling, especially when you're converting to a string, is a little tricky. I think I got it, but please keep an eye open for that.
B: Anything else we should look at here? I would again just mention, Sage, the shared, persistent, read-only RBD cache. We got new benchmarks where they tested cases where they didn't have as much cache as the aggregate volume size, and it was consistently better. Sometimes not nearly to the extent that the other tests showed, or it was just slightly better; but it doesn't look like it was ever worse in any of the tests that they did.
A: I would like to ask for review of the branch with the append buffer. Basically, it works, and it plays, I hope it will play, very nicely with the hypercombined bufferlist, because killing append_buffer means that I would expect the average value of the nref counter across the buffer::ptrs per buffer::raw to drop dramatically. And that's quite important, because hypercombined buffers have only one slot for raw content.
A: Also, there is another branch I just marked: I just removed the work-in-progress prefix and put the performance label on it. It's about optimizing atomic operations for buffer::raws; it can be useful for the case where a buffer isn't shared, I mean something like using a bufferlist instance just to do an encode or something like that.
D: I guess, yeah, I mean, yeah; but that seems like that's the default path. We should probably just remove XioMessenger itself from the tree at the same time. But I guess the question is: is this concept going to be useful elsewhere?
A: I'm proposing bufferlist as a kind of scatter-gather list implementation with shallow copy, and everything's fine. But at the moment we are paying: we are doing a lot of atomic operations even where there is no possibility of sharing. If somebody creates a bufferlist and calls append on it, it needs to create a new buffer::raw and the initial buffer::ptr owning that raw, etc. Unfortunately, even when they're freshly fabricated, buffer::raws are atomically refcounted by the owning ptr.
A: Well, bumping up the nref counter is done atomically, and it costs us dearly. That's the reason for introducing, for conveying, for adding ownership information to buffer::ptr's type system: meaning something like unique_ptr, but without its managing behavior.
A: The first commit is an optimization for ensuring that there is no buffer requiring copy-on-share, like the deep copy we do while copy-constructing a new instance of bufferlist: we were iterating over all our buffer::ptrs to verify whether that cloning is necessary or not, whether it was made with make_shareable. I moved some of those bits into the cloning, into the cloner of buffer::ptr; note we got that into the hypercombined thing, into the hypercombiner.
D: Anything else? I'll go ahead and review that first, but I mean, optimizing the reference counting seems like kind of a no-brainer. My only reservation is just maintaining the ptr-conversion stuff, or the shareable stuff, but we don't have to drop it now; we can always come back to that later. The other one, the logger one, the append-bench-and-buffer one, also looks good.
A: The idea is to make the assembly of the encode path short enough that the compiler wouldn't be prohibited from aggressive inlining. At the moment that's not the case: we have a lot of calls, even to append of buffers, that cannot be inlined because they live in the .cc file. Moreover, we are spending a lot of instructions, a lot of code, just on the call invocation: basically on preparing, on putting the arguments into registers, calling append, etc.
A: It's in the development branch. It requires a dependency on the linked append buffer to make this possible, yeah. Okay, and the same with the deduplication of zeros during append_zero: it's implemented; it is in the development branch, but the branch is pretty big, and I would love reviews.