From YouTube: Ceph Performance Meeting 2021-08-05
A: All right, as we wait for the core folks, I think I'm just gonna get started here. So this week we had, well, actually just one new PR that came in, but this is kind of an interesting one. This is a change for the MDS lock to switch to a fair mutex.
A: So potentially this could be really, really good. I noticed in the MDS, when I've been doing some benchmarking for CephFS, that this kind of thing cropped up a lot. So potentially this could be really interesting. It looks like Kefu has primarily reviewed it.
A: I don't know if Patrick has reviewed it or not; not yet, it looks like, but yeah, that's an interesting one to have on the radar. Let's see, for closed PRs there's only one that closed, and Kefu merged it. This is a PG log change, related to rollback info trimming.

A: I don't actually remember too much about this. I know it's been there for a little bit.

A: Yeah, we didn't document this too well in the pull request, but in any event, that one got closed and Kefu merged it. I think Sam and Neha maybe had taken a look at it and thought it was fine, so yeah, that merged, that's good. For updated PRs this week we've got the RGW tracing work. Casey most recently has done some review on that. I think he had a couple of comments that he wanted to address, so that's being actively reviewed and worked on.
A: Radek has a bufferlist PR that is being reviewed by Kefu, who approved it, but then Ilya also reviewed it; I think he had a couple of comments, so that's also being actively worked on. There's kind of a long-standing PR here for a TTL cache implementation for the manager, and that's had a lot of work done. There's performance graphs and a bunch of ongoing discussion and updates; that also is being actively worked on.
A: I'm not sure what the criteria are for merging that, but it looks good. And then there's an old PR that Igor has had in the works for a long time, about optimizing PG removal, that unfortunately failed tests again recently. Kefu had done a review and tried running it through the test suite and it failed, so that looks like it's going to need more review and work to get it passing tests.
A: Let's see, beyond that, I've got a refactoring cleanup of MemStore. It doesn't really improve performance, but there is an additional object class based on vectors that's included in it. About two or three years ago that actually was a pretty significant win versus the bufferlist object implementation that we have now, but it looks like, due to a lot of work that Radek has done, the bufferlist implementation is now actually pretty much on par, maybe even in some cases a little faster than the vector implementation.

A: So this does include that, but the bigger thing is that it just makes it a little easier in MemStore to implement different object-based implementations, and to do a little bit better job of protecting locks and other private data structures. So there's that.
A: I don't think there's a whole lot right now to talk about. If Adam and Igor do show up, we will probably want to talk about trimming in the BlueStore caches again at some point, but for now not necessarily. Okay, that's PRs. Did I miss anything, guys?
A: All right, well, we are still waiting for the core standup folks, it looks like, but I will get started on discussion topics and they can come in and join in when they get here. So the big one that I've got this week is that the internal work group folks at Red Hat went through and did a bunch of work looking at the performance of kind of our internal next-gen Red Hat storage stuff that's based on Pacific.

A: This is kind of like the release candidate for that, versus the old one which was based on Nautilus. So essentially, we can think of this as basically Nautilus versus Pacific, and things did look really good overall in the final version of their testing.
A: But one thing that did crop up that was concerning is that we saw significantly higher write amplification on the NVMe-based DB and WAL partition in the OSD. So while performance was actually faster overall, we saw that the SSD drives were being worked much harder than they were previously in Nautilus. And actually, the reason this came up was because they were noticing that the OSD logs were much bigger than they had been previously, and they were like, well, this isn't good, it's taking a lot more log space than it did. Well, it turns out that was because there was a whole lot more logging being done for compaction and/or write-ahead log flushing than previously.
A: So the end result of this is that this isn't really a good thing. It's going to wear out SSDs faster when they're used for the DB and WAL, and it's also actually significantly slower in some cases, if you have enough back-end throughput and you're really hammering those SSD or NVMe drives. So, okay, I went through and tried to replicate this in-house on our internal test cluster, and I was able to do so.

A: These are machines that are actually using NVMe drives, not hard drives, but I basically changed it so that I was looking at the equivalent prefer deferred size that we use on hard drives, and the results are in this spreadsheet. I will paste it into the chat window here.
A: What we're seeing is that... okay, let me back up before I say what we're seeing, and explain how it's supposed to work in BlueStore.

A: The way it's supposed to work is that we have a threshold for object sizes, or sorry, write sizes, where, if a write is below that threshold, let's say it's 64 KB on hard drives, then we'll defer the write: we'll write it to the write-ahead log in RocksDB, but we won't immediately write it out to the block device. Instead we'll defer it and hope that we can combine writes and do some tricks to make the actual write faster and avoid seeks. You don't need to do this on NVMe drives, so on NVMe we instead will not defer, we'll just do the write immediately. Previously that was fine; for large writes we did not do this, we did not defer, we just wrote them out immediately, and because they're large writes the seeks didn't matter as much, and that worked great. Now, for whatever reason that's not immediately apparent, when the BlueStore prefer deferred size is equal to or larger than the blob size in BlueStore, we see a huge amount of deferred I/O traffic going into the write-ahead log and into the memtable buffers, and that ultimately leads to this write amplification and, in some cases, really poor performance.
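(For illustration: a minimal Python sketch of the deferral decision described above. This is not the actual BlueStore C++ code; the function name and values are assumptions made up for the example.)

```python
def should_defer(write_bytes: int, prefer_deferred_size: int) -> bool:
    """Simplified model of the behavior described above: writes smaller than
    the prefer-deferred threshold go to the RocksDB WAL first (deferred),
    while larger writes go straight to the block device."""
    # prefer_deferred_size is typically non-zero for HDDs and small or zero
    # for NVMe, where deferring buys nothing since there are no seeks to avoid.
    return write_bytes < prefer_deferred_size

# Example: with a 64 KB threshold (HDD-like), a 16 KB write is deferred,
# while a 256 KB write is written out immediately.
assert should_defer(16 * 1024, 64 * 1024) is True
assert should_defer(256 * 1024, 64 * 1024) is False
```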
A
This
is
really
different
than
the
way
that
blue
store
used
to
work
so
right
now,
I'm
trying
to
understand
why
this
is
the
case
and
it
does
look
like
we
got
folks
in
from
core
now.
So
sorry,
man,
that
was
a
very
interesting
discussion
happening
I
I
know
I
knew
it
was
going
to
be
that
way.
I
was
like.
Oh
no.
A: Yeah, yeah. So I went through PRs, and I just gave an intro to the data that I mentioned briefly yesterday and posted a spreadsheet in the chat. Very briefly recapping, the gist of it is that previously in BlueStore, if you had a large I/O, a large write, come in, you would often not see that, or you'd never see that deferred, if you had a min_alloc size that was significantly smaller, right?
A: All of this combined creates a massive storm of write amplification and additional work being done by RocksDB, especially during compaction, but also just the general flushing from the memtables into level 0. So I'm trying to understand this behavior: why is this happening? Adam and Igor?
A: You both know this code better than anyone else does, I think, so I wanted to bring this to you. The data is in the chat window in the spreadsheet, so please feel free to look and comment.
C: That's basically all I have to say about that issue, and my thinking is that in the past we had a much smaller deferred write size and we never met the blob size. But then we noticed how useful it is, it's very useful for HDDs especially, and when we increased the size we got more performance. So maybe that's why we never saw that before, and now we can have that issue. That's all from my side.
B: Well, a couple of cents from my side. Yeah, Mark, I've seen your email and I'm working on preparing a response at the moment, but it looks like this logic is pretty complicated and...
B: Well, I would expect that large I/Os produce some deferred operations in case your input blocks are not aligned to the block size. So when you write 128k, I suppose the alignment is still 4k. Is that correct?
B: Yeah, and in that case you at least have two chunks, head and tail, which need to be put into a deferred operation.
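(A rough illustration of the head/tail case being described: an unaligned large write produces small partial-block chunks at its edges. The numbers and the helper below are hypothetical, illustrative Python only.)

```python
BLOCK = 4 * 1024  # assumed 4 KiB alignment, as discussed above

def head_tail(offset: int, length: int, block: int = BLOCK):
    """Return the sizes of the partial leading/trailing chunks of a write
    that does not start or end on a block boundary."""
    head = (block - offset % block) % block   # bytes up to the next boundary
    tail = (offset + length) % block          # bytes past the last boundary
    head = min(head, length)
    tail = 0 if head >= length else tail
    return head, tail

# A 128 KiB write starting 1 KiB into a block: the 3 KiB head and 1 KiB tail
# are small unaligned pieces that could take the deferred path, while the
# aligned middle can be written out directly.
print(head_tail(offset=1024, length=128 * 1024))   # -> (3072, 1024)
```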
B: Unaligned head and tail. I'm still curious if that wasn't the case before; I need to double-check, but well, I'm still investigating that. Anyway, just a couple of ideas to troubleshoot that on your side, if you like: first of all, I suggest using just a simple rados put.
B: I'm planning to check what happened in earlier versions, and we will respond, but yeah. So just two ideas for now: try a simple put at some offset, and try to analyze the performance counters.
B: Maybe I need to double-check, but you know, I think that before Pacific this max blob size didn't impact deferred operations.
B: Yeah, at least, just to introduce one term here: we should distinguish small and big writes. Given that min_alloc size is 4k, every write below this 4k is considered small, and...
B: Small writes got deferred operations from the beginning, and big writes, which are longer than that, got deferred operations in Pacific, if I remember correctly. And it looks like this new functionality might behave differently; the new deferred path was trying to duplicate the small-write deferred operations from before Pacific, which tended to work on 64k.
B: Well, actually, that new deferred functionality for big writes is an attempt to reproduce the previous deferred write implementation for small writes. So I realized that when we downgraded the min_alloc size, we...
B: But maybe something is broken in this simulation; maybe it performs more deferred operations than before, but this...
A: I might be misunderstanding or misinterpreting Adam's explanation, but it seems like it makes sense that if you're breaking the write I/O down into different blobs, and then you're deciding whether or not to defer based on the blob size, then if your prefer deferred size is equal to the blob size, and those blobs are typically smaller than or equal to that limit, then those would end up being deferred even if it's a large I/O, right?
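(A sketch of the interaction being hypothesized here: a large write split into blob-sized chunks, with each chunk compared against the prefer-deferred threshold. Illustrative Python with made-up names, not the actual BlueStore logic.)

```python
def deferred_bytes_per_blob(write_bytes: int, blob_size: int,
                            prefer_deferred_size: int) -> int:
    """Model of the hypothesis above: the write is split into blob-sized
    chunks, and a chunk whose size is <= the prefer-deferred threshold is
    deferred. Returns how many bytes end up taking the deferred path."""
    deferred = 0
    remaining = write_bytes
    while remaining > 0:
        chunk = min(blob_size, remaining)
        if chunk <= prefer_deferred_size:
            deferred += chunk          # goes to the RocksDB WAL first
        remaining -= chunk
    return deferred

# With blob size and threshold both 64 KiB, every chunk of a 4 MiB write is
# deferred; with a 32 KiB threshold, none of the 64 KiB chunks are.
print(deferred_bytes_per_blob(4 * 1024 * 1024, 64 * 1024, 64 * 1024))  # 4194304
print(deferred_bytes_per_blob(4 * 1024 * 1024, 64 * 1024, 32 * 1024))  # 0
```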
B: Right, so the size of the block itself is not the only factor in deciding whether we need to apply deferred operations or not. The alignment of this block is also taken into account, but if...
B: So even if you have this 128-kilobyte block starting at a 32k offset, the original spinner behavior was to apply deferred operations.
A: I don't think I have anything else to really ask about now. Other than that, I would just say there's more testing that needs to be done, I think.
B: Well, one more comment. At this point I would rather be interested in the difference between the earlier versions, how it worked before Pacific, and the new one. Then...
B: So it's expected that deferred operations on spinners might produce pretty significant traffic to the database, and they did that before. The question is whether it becomes even larger or not, or maybe we changed something in RocksDB, some settings in the new version, which handles that traffic in a worse way.
A: So, Igor, this actually did come out of a study inside Red Hat that isn't public, that was looking at the old version of Red Hat Ceph Storage versus kind of a new prototype version based on Pacific. Essentially, you can think of it as Nautilus versus Pacific, really, and this behavior was not seen in Nautilus with the standard min_alloc size and blob size and prefer deferred size.
A: At that point we did not see this, but if I remember correctly, I think that the max blob size was much larger as well, and, of course, the code has changed dramatically, so the defaults at that point didn't do this.
A: I think that's definitely a yes. When I looked through the logs, I saw specifically that in Nautilus we had far fewer flushes coming in from the write-ahead log and memtables into level 0 than we did in Pacific.
B: Okay, but well, then the next step would rather be to simplify the benchmarking, or maybe come up with some simple scenarios which show the difference, and then analyze what changed.
A: Adam, you've been quiet. What are your thoughts?

A: Okay, okay. Based on what you had said earlier, when we first started today, does this behavior make sense, based on your understanding of the code?
C: I remember that at some point, I'm not really sure if it was Pacific or maybe even earlier, we added code to also do deferred writes for big writes. There were some cases when we could do that, and since that time we are able to also do deferred writes from big writes in BlueStore, and that, I think, triggered the case that with some specific values of blob size and deferred write size, we just always trigger the condition that we want to defer.
C: But I just analyzed it then and didn't really bother with it, because I got it only when I used very atypical sizes for blobs, and basically I ignored the drop in performance, because there was not too much of it, and that's it. I mean, my estimation is that if we put everything through deferred, we will have exactly the results that we have: very much database data, frequent compactions, and very much write amplification.
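(Rough back-of-the-envelope arithmetic for why routing everything through the deferred path multiplies device writes, as described above. The factors below are illustrative assumptions, not measured values from the spreadsheet.)

```python
# Illustrative only: rough write-amplification estimate if all data takes the
# deferred path. Assumed factors, not measurements.
data_gb = 100                      # client data written

direct_path = data_gb              # written once to the block device
deferred_path = (
    data_gb                        # 1. written into the RocksDB WAL
    + data_gb                      # 2. flushed from memtables into L0
    + data_gb * 2                  # 3. rewritten by compaction (assumed ~2x)
    + data_gb                      # 4. finally written to its block location
)

print(deferred_path / direct_path)  # ~5x more device writes in this toy model
```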
B: That's actually the question. So you shared the numbers for an all-flash cluster, as far as I understand, correct? It could be worse there, but actually all this stuff is intended primarily for spinners, and so the problem...
A: Igor, the problem, though, is that the write amplification was so huge on the drive that, one, it was actually making the logs significantly bigger than they used to be, because of all the compaction and flushing and everything. But then, you know, you imagine that a 10x or 20x increase in write amplification, even for hard drives, that's a lot of wear on SSDs if you have like five or six OSDs sharing one SSD as, you know, the log and database device.
B: This might be the case where we face a sort of trade-off between space savings, performance, and write amplification to the disk, and there are no perfect answers here, so we might improve one thing but make another one worse, and...
A: One question for you: what if we just changed the deferred size to 32 KB? In the test that I did, for large I/Os the behavior was much better when we did that. Do you think that would be bad, then, in other circumstances?
B: Well, definitely, downsizing the prefer deferred size would produce less traffic to the database, but...
B: Then another question would be how it impacts performance, and yeah, it's hard to tell without real investigation. So when choosing these parameters for Pacific, I was trying, again, to reproduce what we had before with this deferred stuff.
B: But well, I paid attention mostly to performance, so perhaps we missed something in terms of the write amplification and things like that, and that's why it's interesting what it was before, what the performance difference is.
A: In that case this might be ideal, but if someone had a much lower write endurance or a slower SSD for their DB and WAL partitions, and they had lots of hard drive OSDs being used per SSD, then it might actually be significantly slower, because going from a 32k to a 64k prefer deferred size is like a 10x write amplification increase.
A: Maybe. It seems like there's a boundary there, though, right? Like when you have the blob size and the prefer deferred size equivalent, that's when you see this huge increase in write amplification. But my question is: what's the trade-off? Is there a good reason to have those equal, or, you know, are there benefits to that, or is it just kind of a downside?
C: For me, the most difficult part of the answer is that if we use deferred, then, Igor, please verify if I'm correct, I think that we might overwrite the same allocation. So if we use deferred writes, we can overwrite objects without fragmenting them. But if we use the typical write, then we need to allocate extra space, thus increasing fragmentation. So that's an additional reason why deferred might help preserve high-speed operation for clusters.
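(A toy model of the fragmentation point being made here: a deferred write can rewrite the existing extent in place, while a non-deferred overwrite goes to newly allocated space, splitting the object across more extents. Illustrative Python with made-up numbers, not BlueStore's actual allocator logic.)

```python
def overwrite_middle(in_place: bool):
    """Toy model: an object stored as one contiguous 64 KiB extent gets a
    4 KiB overwrite in its middle."""
    obj = [("disk@0", 0, 65536)]                 # (location, logical_off, len)
    if in_place:
        # deferred write: data is rewritten into the same allocated space,
        # so the object stays one contiguous extent
        return obj
    # non-deferred overwrite: new space is allocated for the 4 KiB, so the
    # object now maps to three extents (head, relocated middle, tail)
    return [("disk@0", 0, 32768),
            ("disk@1M", 32768, 4096),
            ("disk@0+36K", 36864, 28672)]

print(len(overwrite_middle(in_place=True)))    # 1 extent
print(len(overwrite_middle(in_place=False)))   # 3 extents
```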
B: It's hard to say. So depending on the disks, depending on the existing block layout, it might be beneficial to write to the existing blocks or, if they are sparse, to instead allocate a single long block and write there. So again, it looks like there is no single approach here; it might be hard to say which way is beneficial. It depends on what layout is on the disk and things like that.
A: Sure, sure. Mostly this was just to replicate what was seen by the QE team at Red Hat, because I wanted to verify that in our cluster we saw something that looked like what they were seeing. At this point, I agree with you; I think it's good to go down and see if we can understand the specific behavior.
A: I did post in the chat window what our defaults were in Nautilus and what they are now in Pacific.

A: One other thing before we wrap up that I wanted to point out in these results, which was really interesting and which I don't know if it's related or not, is that in these 128k random reads, when the prefer deferred size increased from zero up to 16k, it doubled the 128k random read performance. I have no idea why the behavior manifested the way it did, but it seemed real.
A: In these tests, the pre-fill is with four-megabyte writes to fill the drive, and then it goes through and does sequential reads in order by I/O size, then sequential writes in order by I/O size, then random reads in order by I/O size, from smallest to largest. So prior to the random read test at 128k, there were four-megabyte writes hitting the disk to fill it.
A: Then there were sequential reads, which shouldn't matter, and then there were 4k, 128k, and four-megabyte writes hitting the blocks in this RBD test prior to the random reads happening.
B: I presume that you wouldn't see this effect if no random writes happened before the reads, but it's actually the real writes which change the layout of blocks on the disk, which might affect the performance.
B: Okay, but okay, again, I'm curious whether the read behavior differs if no smaller writes happen before the reads... after that.
A: All right, well, Adam and Igor, thank you both very much for looking at this. I suspect this will be kind of a hot topic as we try to do our release, and then I'd imagine for SUSE as well this will be good to figure out. So thank you both for looking at it, and I will also continue to look at it next week. So thank you all and have a wonderful week.