From YouTube: 2018-07-19 Ceph Performance Weekly
Description
Weekly collaboration call of all community members working on Ceph performance.
http://ceph.com/performance
A: People are still trickling in; hopefully we'll get a couple more, but it's been small lately. I think we've had a drop-off in community folks showing up, and I'm not totally sure if that's just because it's summer and people are away a lot, or if maybe folks are happy with the performance now and don't bother showing up. I don't know.
A: So, having said that, let's look at PRs. This one, Greg, if you want a throwback to looking at the MDS: there's a PR here to optimize the way the max export size is enforced. I don't know anything about MDS optimization stuff, so I'm not sure if that's interesting or not, but it's there.
A: And then Neha's PR to limit the PG log length merged; that is fantastic. I should try to get that cherry-picked into my branch here so that I can test with it, because it might affect the defaults I'm using for the memory limit stuff. If we don't end up peaking as high, I might have to change those a little bit, but yeah, glad to see that got merged.
A: Yeah, and I haven't even tested recovery yet; I've still just been playing around with normal workloads. But looking at recovery scenarios was on my list of things to do, so getting that merged, or at least cherry-picked, before I do that probably makes sense.
A: Let's see, what else. There's this EC stripe cache one; I don't think anyone's looked at that yet, if anyone wants to review it.
A: There's the one that adds a perf counter for recording the latency of kv finalize. That hopefully isn't that big of a PR, I don't think, but it could be kind of interesting to see. I haven't seen that particular kv finalize thread being super busy lately, but maybe in certain circumstances it might be, so it would be interesting to see. My PRs are all stuck on this libstdc++ runtime issue on 16.04.
A
If
who's
gonna
try
to
fix
it,
which,
if
he
was
here
I'd
thank
him
again
profusely
because
that
I
don't
understand
why
it's
exactly
happening
or
how
to
fix
it.
But
hopefully
he'll
he'll
have
some
make
some
progress
on
that.
A: Yeah, and then there's this remove-async-recovery one, and that's gotten a lot of discussion. I think Sage is also in the middle of discussing that with the people who have been intimately involved in it. The issue there seems kind of subtle to me; I haven't looked real closely at it, but for people that are interested it might be worth looking at, and I think there's been some progress on understanding what exactly the issue that's come up there is.
B: [reply not transcribed]

A: I didn't even understand that. So the proposal is an optimization that makes it unsafe, basically? Yeah. Okay, all right, now I understand. Well, okay, good luck. What else? Adam is on vacation right now, but before he went on vacation he was looking at implementing some kind of tool for trying to model fragmentation of the different BlueStore allocators.
A: I don't think I've seen any results from that yet, but it looks like there are at least some outstanding PRs here for it. Hopefully once he gets back he'll get a chance to get that merged; it would be very interesting to know. Jason has been doing some testing with RBD workloads in one of our labs and saw some interesting behavior with the stupid and bitmap allocators, where the bitmap allocator actually...
A: Sorry, the bitmap allocator may actually be a little bit smarter about using the largest chunks of free space, which turns that random I/O into more sequential-looking I/O. Whether or not that's actually a good idea may be a point of contention, though: maybe you don't want to use your large free contiguous chunks of space for random I/O, even if it makes it kind of sequential.
A: Maybe you want to save those for big objects or big writes if they come in, and do your 64k random I/O in 64k, or close to 64k, contiguous free regions on the disk. So anyway, he definitely sees some differences in behavior, so anything we can do to get a better understanding of what the fragmentation looks like, and what kind of free space we have under different kinds of workloads of different sizes, would be useful.
A: Let's see, what else. I'm sort of surprised that the RGW thread pool size to 512 change has not merged yet; I can't imagine that it's a particularly complicated change to do, but I don't think it has actually merged. Let me just double check to make sure I didn't miss it. Nope, not yet. So anyway, what else: aging tests again for allocator stuff.
A: Yeah, I think a lot of that stuff is pretty old; not a whole lot new right now. So, all right, moving on to discussion topics. I've been continuing to work on the cache balancing stuff, refactoring and cleaning up and trying to figure out reasonable defaults. In the current state it needs some cleanup, but it's doing pretty well.
A: I did go and increase the number of priorities: there are currently ten user-defined priorities that you can set for kv, meta and data, which could potentially have resulted in 30 different config options if you actually defined every single one independently. Others convinced me to just shove this into a space-delimited string, so now it's three options; you can kind of see the defaults that I'm testing right now there.
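Purely as an illustration of the space-delimited approach being described, the three options might look something like the sketch below in ceph.conf. The option names and values here are hypothetical, invented for this example, and are not the actual settings from the branch:

    [osd]
    # hypothetical: one space-delimited list of per-priority weights for each
    # of the kv, meta and data caches, instead of 30 separate options
    osd_memory_cache_kv_priorities   = 5 3 1
    osd_memory_cache_meta_priorities = 4 3 2
    osd_memory_cache_data_priorities = 1 1 1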
A: It's basically five-second intervals and then the number of intervals, so if you're looking at the etherpad you can kind of work the delay out there. That's just a very rough idea on my part of what might make sense here. It seems to be working pretty well: this setup means that for small cache sizes the kv cache ends up using the majority of the cache, and then at large cache sizes that all transitions over to the meta cache.
A: Instead of the kv cache, because you have meta hits happening really rapidly, and data that's in the kv cache ends up degrading, getting old, and then falling off and not being prioritized relative to the meta cache. So it kind of does the behavior we want, where for small cache values everything's in the meta cache... sorry, everything's in the kv cache, and for large cache values all the onodes end up in the meta cache instead, and then the data cache sits somewhere below all of that. So anyway, still testing going on with that, but I think it's doing what it's supposed to do, which is good, and the performance numbers right now, interestingly, I think are probably going to be higher than just using static ratios, even when the static ratios are for an aggregate cache value that's larger than what the auto-tuner has available to it. All right.
A: So that's going well. One question I proposed earlier, which I think everyone here already knows about or has responded to, was looking at malloc plus placement new for allocating memory for a struct all at once up front, rather than allocating memory for the struct and then allocating memory for, say, a char* array inside of it.
A: This all came out of RocksDB doing things that were kind of evil, so we might be able to still get some of the benefit of what they're doing without quite having to be as evil as they were. Anyway, I'm hoping to look at that a little bit more too and see if there are any applicable places we can do some of that, maybe inside BlueStore, for limiting or reducing the number of new, or I guess malloc, and free or delete calls that we make. So anyway.
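Since the malloc-plus-placement-new idea may not be obvious, here is a minimal sketch of the pattern being described. The Record type and function names are made up for illustration; this is not the actual BlueStore or RocksDB code:

    #include <cstdlib>
    #include <cstring>
    #include <new>

    // Hypothetical record type with a variable-length name stored inline.
    struct Record {
      size_t name_len;
      char   name[1];          // really name_len bytes, allocated just below
    };

    // One malloc sized for the struct plus its character data, then placement
    // new to construct the Record in that buffer, instead of two allocations
    // (one for the struct, one for a separate char* array).
    Record* make_record(const char* name) {
      size_t len = std::strlen(name) + 1;
      void*  raw = std::malloc(sizeof(Record) + len - 1);
      if (!raw) return nullptr;
      Record* r = new (raw) Record;   // placement new: no second allocation
      r->name_len = len;
      std::memcpy(r->name, name, len);
      return r;
    }

    // A single free matches the single malloc.
    void free_record(Record* r) {
      r->~Record();
      std::free(r);
    }

The point is that the struct and its character data share one allocation and one free, which is roughly the kind of thing RocksDB does internally and what is being suggested here in a less evil form.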
D: So we had this parameter called osd_max_pg_log_entries, which was there earlier, but it didn't really act like a max, because during recovery or backfill we were just extending the log to whatever was easier for log-based recovery. What my PR does is: whatever value you give this parameter, it's just going to stick to it no matter what, whether it's recovery, backfill, or a regular scenario; that's the maximum upper bound for the PG log now. Okay.
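For reference, the knobs in question are the existing osd_min_pg_log_entries and osd_max_pg_log_entries options; with the PR described above, the max is treated as a hard upper bound even during recovery or backfill, as I understand it. A ceph.conf sketch with placeholder values, not recommendations:

    [osd]
    # placeholder values for illustration only
    osd_min_pg_log_entries = 1500
    osd_max_pg_log_entries = 3000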
A: Well, that's really good; that will absolutely be helpful. And then, Josh, do you remember when we were walking through the code? I thought there was something where we were either copying these into some kind of temporary data structure, or there was something where I thought we were using extra memory we didn't necessarily need to, I think.
D: That reminds me that we did talk about the rollback information being irrelevant in one case, and I think the size of the object name also came up, because we were discussing RGW or something in particular. So these two things definitely came up, yeah.
B: But I think it would be good to take an empirical look at exactly how big these entries are and where that space is being used; sometimes looking at the code is a little bit misleading.
A
Though,
in
the
the
mempool,
just
in
some
very
very
you
know,
just
random
tests
I've
been
running
recently
well,
I
guess:
they're,
not
random.
It's
for
K,
random,
writes
and
an
RB
d
with
a
256
gig
volume
and
I
think
it's
256,
P
geez,
I'm,
1,
OS
d
I
can
verify
that
but
I
think
or
PG
log
entries
in
the
mempool
we're
using
around
400
megabytes
of
memory.
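For anyone who wants to check the same thing on their own cluster, the per-OSD mempool accounting, including the PG log pool, can be read from the OSD's admin socket; osd.0 below is just an example id:

    # dump per-pool memory accounting; the osd_pglog entry is the
    # PG log usage being discussed here
    ceph daemon osd.0 dump_mempools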
A: The other thing I have noticed in those tests is that, and I don't know if it's fragmentation or if it's more that we have spans that can't be released by TCMalloc, it's a combination of a lot of different things, I think, but when the auto-tuning code is rapidly shrinking or growing the RocksDB cache versus the onode cache in BlueStore, it seems to get worse.
A
I
think
it's
actually
true,
maybe
also
for
the
PG
log
entries
that
are
stored
in
memory,
the
the
more
kind
of
different
size
of
things
that
we
have
a
memory
and
the
more
we
kind
of
remove
those
and
add
new
ones
in
the
worse
that
seems
to
get-
and
this
is
just
kind
of
speculation
and
observation
on
my
part,
but
it
seems
like
when
we
have
lots
of
different
sized
memory
allocations.
We
can
kind
of
confuse
TC
Malik
pretty
pretty
easily.
That
sort.
A: But beyond that, mempools have trade-offs involved in allocation speed and whether you're doing lots of deletes or not, and I suspect, given how much we are creating and deleting stuff, that a mempool might be better for some things, at least if we have roughly equal-sized objects that we're allocating.
B: Well, I think, with the object name there's a fair bit of variance there. If we had an allocator that was able to accept, say, the largest size we ever expect, or maybe the 90th percentile, that kind of thing, that might be sufficient to avoid confusing TCMalloc so much, maybe.
A: Well, in any event, lots of questions here. I think, though, just based on the stuff I've been playing around with recently, it seems like focusing on this area would probably be good for us, if we can do things to help the memory allocator. You know, certainly when we switched over from SimpleMessenger to AsyncMessenger that helped dramatically, but it still seems like there's more going on, based on some of these behaviors I'm seeing where we try to keep the memory size limited.
A
It's
it's
really
interesting
to
watch
TC,
Malik
struggling
to
do
so
and
kind
of
the
amount
of
space
that
ends
up
available.
When
we
do
that,
I
suspect
that
we'll
see
both
we
could
improve,
have
the
the
memory
fragmentation
effects
and
then
also
probably
improve
performance
quite
a
bit
if
we,
if
we
kind
of
approach
this
with
kind
of
a
a
goal
of
making
things
friendly
so
anyway,
maybe
when
I
get
back,
I'll
I'll
try
to
look
at
it
a
little
bit
more,
but
that
I
suspect
that
there's
gains
that
we
could
make
there.
G: Kind of, yeah. All right: we're running into some problems with the PG overdose protection; has anybody in the Ceph community had a similar experience? We're trying to install OpenStack, which creates a bunch of different storage pools, and most of them aren't that big, but what we're seeing is that it's pretty easy to hit that limit.
G: It was the mon max PG per OSD limit, and that's causing a lot of heartburn. I'm wondering if we kind of went to the opposite extreme there: before, there was not much of a limit on how many PGs you could create, and that got us into certain kinds of problems, but you know, it's hard; you can overdo it the other way too, yeah.
B: I agree. I think we have seen other folks run into this as well, and so relatively recently, I guess, we bumped the default hard limit to be something like 600 PGs per OSD instead of 400.
G: Yeah, that's the hard limit; I agree with that. But this is the soft limit. I think the soft limit seems to assume that you're only creating, you know, one or two big pools, and I'm not sure that's true for everyone. If we could loosen that up a little bit, you might prevent a lot of people from getting frustrated.
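For reference, the soft limit being discussed is the mon_max_pg_per_osd option; the hard limit, as I understand it, is derived from it via osd_max_pg_per_osd_hard_ratio. A hedged example of loosening it, with a placeholder value rather than a recommendation:

    [global]
    # placeholder value; raise the per-OSD PG soft limit if overdose
    # protection rejects pool creation at install time
    mon_max_pg_per_osd = 300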
B: The tools could give folks good guidance there; that should probably stay with Josh.
G: You're absolutely spot-on; that was the first problem that was hit. They tried to create the pools before they created the OSDs, and that clearly didn't work. So they put in a check in ceph-ansible to say: wait until all the OSDs are up before you create the storage pools, and that helps. But even then the check may not be enough, because when I do the arithmetic, they may even have reasonable PG counts set for their pools.
G: So here's the thing: you basically have this PG calculator that recommends certain settings, right? And what I want to do is make sure that if the PG calculator is recommending things, the overdose protection limit isn't then saying no, you can't do that; that just creates a hairy situation. Like, he was trying to create a pool with 1024 PGs on twenty OSDs; I don't think that's... yeah.
G: I understand that, but I mean, so the case was 1024 PGs for this VM storage pool, and they had a few other pools that were smaller. And when you do the arithmetic, you only have 4,000 PG instances to work with, because at 200 PGs per OSD times 20 OSDs, that's 4,000; and then in the calculation it takes the 1024, multiplies it by three for replication, and so it basically starts to add up when you add all these pools together.
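Roughly the arithmetic being described, assuming 3x replication and a 200 PG-per-OSD soft limit:

    budget:   200 PGs/OSD x 20 OSDs  = 4000 PG instances
    VM pool:  1024 PGs x 3 replicas  = 3072 PG instances
    leftover: 4000 - 3072            =  928 instances for all the other pools

so once the other OpenStack pools, also multiplied by their replica count, are added in, the total goes past the limit and pool creation is refused.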
A: I do want to draw attention to the fact, though, that this is going to increase memory usage on the OSD as soon as you write to any of these pools, right? Unless I'm quite mistaken. (It will, yeah.) And it's not insignificant; it's enough that, if we want to keep the OSDs to a certain memory ratio or memory limit, we might be blowing all of our available memory on the PG log and not on other caches if we increase this dramatically.
G: Well, I thought that was sort of a different but related issue, which is that there's the PG log max and min, or whatever they're called, and the difference between them was causing it to consume a lot of memory; and there might be other factors as well, like what Neha is working on. But this is a much more urgent thing, because you can't even get the cluster running; you can't even get to the memory issues, yeah.
A: So the two primary reasons I can think of, and add to them or correct me if I'm wrong, that we recommend so many PGs per OSD are, one, the random distribution nature of it, and two, the locking behavior inside the OSD for the number of PGs that you have. The first one, I think, hopefully Sage's PG balancing code helps with dramatically; the locking behavior, no. But I wonder if we really need as many PGs per OSD now as we used to.
B: Yeah, potentially. I think the main thing there is the distribution, and it's tough with the smaller pools at least; I guess they don't matter as much as the I/O-intensive ones, but...
B: It's certainly balancing things in terms of data usage right now, but I... oh, okay.
A: They are, yeah. I couldn't remember if it was Luminous or Mimic that it finally merged for, but it's there. I don't know how to even enable it or use it, but it's there, yeah.
A: You can have locking issues potentially if you're on really fast devices, so keep that in mind. Can you explain the locking a little bit? So there's a PG lock: if you have fewer PGs, and you do a wall-clock profile on the OSD, you might actually see contention for that lock, especially if you have really, really fast devices like NVMe. So you just have to be careful to watch for that and see.
F: That applies to a situation where you have more than one worker thread per OSD work queue shard, I would add. However, that is our default configuration anyway: we have shards with two worker threads per shard, so the lower the number of PGs you have, the more likely you will have a problem with PG lock contention.
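The sharding being referred to is controlled by OSD options along these lines; the values shown are placeholders for illustration, not a statement of the shipped defaults:

    [osd]
    # work-queue shards and worker threads per shard; with only a few PGs
    # per OSD, multiple threads are more likely to contend on one PG's lock
    osd_op_num_shards = 8
    osd_op_num_threads_per_shard = 2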
A: You're kind of effectively reducing the length of that by shrinking your number of PGs, and reducing your average contention, right? Exactly; all these things tie into each other, and this is one of the things that's complicated for the user: by tweaking certain things you affect other things.