From YouTube: Ceph Performance Meeting 2021-10-21
Description
Find more Ceph meetings videos: https://ceph.io/en/community/meetups/
A: So one new PR this week that I saw is actually really interesting, for people who are interested in backfill and recovery performance. I don't know how to say this user's name. They're basically looking at a very narrow situation where you have a new OSD that requires a full copy from a primary OSD, and if that primary OSD's PG log entry count is smaller than osd_min_pg_log_entries, then it thinks that the new OSD can recover by PG log. It turns out that it's much faster if you have it handled via backfill rather than via recovery, just because of all the extra work that goes into recovery. So it's super interesting. They provide a whole lot of data and analysis in there.
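The decision being discussed can be sketched roughly like this. This is a toy model of the described heuristic, not the actual Ceph peering code; the function names and the `target_is_empty` flag are illustrative:

```cpp
#include <cassert>
#include <cstdint>

enum class Strategy { LogRecovery, Backfill };

// As described above: when the primary's PG log has fewer entries than
// osd_min_pg_log_entries, peering decides the new OSD can be brought up
// to date via log-based recovery.
Strategy current_choice(uint64_t primary_log_entries,
                        uint64_t osd_min_pg_log_entries) {
    if (primary_log_entries < osd_min_pg_log_entries)
        return Strategy::LogRecovery;
    return Strategy::Backfill;
}

// The PR's observation: a brand-new OSD needs full object copies either
// way, and backfill does that with much less per-object overhead, so an
// empty target should prefer backfill even when the log is short.
Strategy proposed_choice(uint64_t primary_log_entries,
                         uint64_t osd_min_pg_log_entries,
                         bool target_is_empty) {
    if (target_is_empty)
        return Strategy::Backfill;
    return current_choice(primary_log_entries, osd_min_pg_log_entries);
}
```

The point of the PR, as summarized here, is exactly the gap between the two functions: a short log makes the current logic pick the slower path for a target that is going to get full copies anyway.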
A: So I think Mia is going to take a look at it, but yeah, it's pretty neat. Let's see, there are four that I've got that were closed. I think this TTL cache for the manager got merged into another feature, which I think I need to go track down, but that was the only one that was closed and merged into something else.
A: This OSD compression bypass in favor of RGW compression got merged by Casey, and then Adam's BlueFS fine-grained locking PR got merged by Kefu, but it looks like it's causing lockdep failures in testing. I think Adam said in another PR that it was like a weird interaction or something that isn't maybe really a failure. I don't remember the details, but he's working on it. Okay, let's see, three updated PRs.
A: Let's see, for the MDS, there was a request to provide kind of an overview of that subtree removal PR, kind of like high-level documentation, and that was just provided recently, so that's there now. I think there's just some concern about how big that PR is. There are two librbd optimization PRs that were recently updated.
A: I think those both just needed a rebase, and there was at least one bug fix that went in there too, I think. So basically we'll have to re-review those, and that's about it. Did I miss any PRs, guys?
A: All right, the only things I've got today are a quick note that Jeff Layton submitted some fixes for the kernel client that we think may resolve the three-gigabyte-per-second bottleneck that we've observed previously. It looks like we were grabbing a mutex that we didn't really need to grab, and he's worked it out so that he can use spinlocks instead.
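The pattern described, replacing a sleeping mutex with a spinlock around a very short critical section, can be sketched in user space like this. The real change is in the kernel CephFS client, in C; this is purely illustrative:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Minimal spinlock: busy-waits instead of putting the thread to sleep.
// For critical sections that are only a few loads/stores, this avoids
// the scheduler round-trip a contended mutex can incur.
class SpinLock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock()   { while (flag_.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag_.clear(std::memory_order_release); }
};

struct Counter {
    SpinLock lock;
    uint64_t value = 0;
    void bump() {            // short critical section: the spinlock sweet spot
        lock.lock();
        ++value;
        lock.unlock();
    }
};
```

The trade-off is the usual one: a spinlock wins only while the hold time stays shorter than a sleep/wake cycle, which is why it fits the "mutex we didn't really need" case described here.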
A: So at some point Ilya is going to review it, and I'm going to try to do some testing on it. That should hopefully provide a nice performance boost, and we'll see if it gets us up to like the eight gigabytes per second that we see with RBD and libcephfs. Not totally sure if it will or not, but that'll be the goal.
A: So there's that, and then the other thing I've got is that gdbpmp has been useful for a long time, but it's not quite as useful as it used to be. It's now causing, and has been for a little while, our classic OSDs to basically brick when using it periodically with high thread counts.
A
You
can
use
it
fine
if
you've
only
got
like
one
tpos
dtp
thread,
but
once
you
have
more,
which
we
do
by
default,
it
can
cause
the
classic
osd
to
break
pretty
quickly.
So
it's
it's
value
has
diminished
somewhat
still
works,
fine
for
crimson,
actually,
interestingly,
but
not
for
classic
osd,
so
adam
a
while
back
had
made
a
version
or
a
similar
wall
clock
profiler
that
used
live
unwind.
A: He did some really clever things with parasitic code injection that actually make it very, very fast, but the code is pretty complicated, and I don't know, Adam, if I remember correctly, at one point you were having some issues. I don't know if those have all been resolved, but it sounds like at some point it was still kind of a little flaky.
B: It was problematic at some point in time. Then I fixed the ptrace attach procedure and it seemed to work fine. From then on I didn't see any problem with it, except of course the architectural and philosophical problem that there is this very clever, in a bad sense, injection into the other process, yeah.
A: Yeah, yeah. Well, I will say, though, that your version is much faster than anything I've come up with so far, but I'm working on trying to have the best of both worlds. So I started working on basically a port of gdbpmp, replacing the gdb part with libunwind as well, and it's both faster and works better than the gdb version, but it's not as fast as Adam's.
A: So I started digging into libunwind, since of course one of the first things I did when I got it working was to profile the profiler, and it turns out that almost all the time is spent getting a procedure name. libunwind is rereading stuff from disk over and over again. That was partially mitigated by utilizing some caching that they have built in, but not fully.
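The mitigation being described, avoiding repeated disk reads by caching the expensive name lookup, amounts to a memo table keyed by instruction pointer. A minimal sketch, where `ProcNameCache` and the `slow_lookup` hook are illustrative names and not libunwind's API:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Memoize an expensive resolver (a stand-in for procedure-name lookup,
// which in the scenario above was rereading symbol data from disk) so
// that repeated samples of the same frame hit the in-memory cache.
struct ProcNameCache {
    std::unordered_map<uint64_t, std::string> cache;
    uint64_t misses = 0;

    template <class SlowLookup>
    const std::string& resolve(uint64_t ip, SlowLookup slow_lookup) {
        auto it = cache.find(ip);
        if (it != cache.end())
            return it->second;                       // cheap path: no I/O
        ++misses;                                    // expensive path
        return cache.emplace(ip, slow_lookup(ip)).first->second;
    }
};
```

Since a sampling profiler sees the same hot frames over and over, even a simple cache like this collapses almost all lookups onto the cheap path.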
A: So I started digging through the libunwind code, identified the code where it was happening, and even tried fixing it to some extent. I've got a branch of libunwind where I go in and try to cache parts of the code that weren't utilizing the cache previously, but it's not really helping that much, and as far as I can tell, libunwind is now a dead project. On the mailing list, no one got back to me.
A
The
project
maintainer
didn't
get
back
to
me
after
a
week.
No
one
is
submitting
code.
Really.
I
think
the
last
commit
was
maybe
in
the
spring.
So
it's
it's
pretty.
If
it's
not
dead,
it's
it's
definitely
back
burnered.
A: So we have people at Red Hat, actually, that are working on elfutils, and they have their own method for unwinding things, libdw. So I abstracted the backend for this thing, and now we'll be able to use libunwind and, hopefully soon, libdw as an alternative, which is both supposed to be faster and is better maintained, by people at Red Hat and also other people as well.
A: So that might be the right way to go, but the good news is that this is actually working now, and it's better than gdbpmp was, so it's certainly an improvement already. I think if we can get the sampling fast enough, we can start doing some really interesting things, like maybe looking at trying to match sample periods with things that are happening in the OSD.
A: Maybe if, say, RocksDB compaction kicks in, or scrub kicks in, we'll be able to look at sampling periods that correspond with those events, but that's out in the future. Right now I'm just trying to get this working as fast as I can, and that's really all I've got. This is maybe gonna be a fast meeting, but I do see other people showed up, so I'll open it up. Anyone have anything they want to talk about this week?
A: I might pick on you a little bit, since you were talking to me earlier about the BlueFS locking things. Were you able to confirm that it was kind of a false positive?
B: Yes, and I fixed that, really. That was always because our lockdep feature is actually a symbolic feature, meaning if you have two different objects that basically have mutexes with the same name, they are interpreted as the same mutex. So if you take two locks on different objects in the same thread, it will always complain, and that was exactly what was happening.
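A minimal model of that false positive: if lock ordering is tracked by mutex *name* rather than by instance, two distinct objects whose mutexes share a name look like one lock. This is a toy checker to illustrate the behavior described, not Ceph's actual lockdep implementation:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

// Toy by-name lock checker: remembers pairwise acquisition order between
// lock *names* and flags (a) re-acquiring a held name and (b) acquiring
// names in an order inverted relative to what was seen before.
struct NamedLockdep {
    std::set<std::pair<std::string, std::string>> observed_order;
    std::set<std::string> held;

    // Returns true if acquiring `name` would be reported as a violation.
    bool acquire(const std::string& name) {
        bool flagged = held.count(name) > 0;   // same *name* already held
        for (const auto& h : held) {
            observed_order.insert({h, name});
            if (observed_order.count({name, h}))
                flagged = true;                // order inversion by name
        }
        held.insert(name);
        return flagged;
    }
    void release(const std::string& name) { held.erase(name); }
};
```

With instance-based tracking, locking `objA.lock` then `objB.lock` is fine; here, because both collapse to one name, the second acquire is flagged even though no real deadlock is possible.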
B: It just meant switching the order in compaction, and in addition I had to exclude compaction from another lock, which just didn't make sense, but it was not a problem, so I was fine with that. So that's why I was saying that maybe in the future it will be difficult to actually get that working with different locking schemes.
A: And Gabi, I'm going to pick on you a little bit too. It looks like maybe there's been good progress on figuring out what was going on with the allocator changes.
C: Sorry, what was the question?

A: Oh, I was saying I was picking on you a little bit. It looks like maybe you've figured out how to fix the issue with the allocator changes.
C: Yeah, yeah, so there was a problem with the hybrid allocator, caused probably by my code. I found a problem which could explain this corruption. Unfortunately, it's now impossible to recreate it, because there seems to have been some change in the code somewhere else, so this issue doesn't appear anymore. The problem was that when I save the allocation information, if there was a corruption in the file, but you could still open the file without failure, then I would start loading the allocations into the allocator.
C: I fixed this issue: now I'm using a temporary allocator when I'm reading from the file, and only if and when the whole allocation is found to be in good shape do I load it into the real allocator. So that should solve the problem, but in the last few weeks this issue has disappeared; we cannot recreate it.
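The fix described, decode into a temporary allocator and only commit once the whole file reads cleanly, can be sketched like this. The types and names are illustrative stand-ins, not the actual BlueStore code:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

struct Extent { uint64_t offset; uint64_t length; };

// Stand-in for the on-disk record stream; an empty optional marks a
// decode failure (corruption detected partway through the file).
using Record = std::optional<Extent>;

// Parse every record into a *temporary* structure first; the live
// allocator is touched only if the entire file decoded successfully.
bool load_allocations(const std::vector<Record>& file,
                      std::vector<Extent>& live_allocator) {
    std::vector<Extent> tmp;               // temporary allocator
    for (const auto& rec : file) {
        if (!rec)
            return false;                  // corruption: live state untouched
        tmp.push_back(*rec);
    }
    live_allocator.swap(tmp);              // commit only on full success
    return true;
}
```

The point is atomicity of the load: before the fix, a half-decoded file could leave the real allocator populated with a partial, inconsistent view.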
C: I suspect it could be one of two things. Either the tests or the OSDs changed somehow, and they are not as aggressive as before, so this corruption or this failure is not happening; or the BlueStore code was modified and it's now able to better recognize corruption in a file. In theory, it should never happen: BlueStore should recognize if a file is corrupted and not give it to me.
C: But everything here is just speculation, you know.
B: So I'm interested in actually trying to replay that problem, because having caught an error in BlueFS file consistency would be great news for me. That would at least give some hope that errors we see in the field regarding reading from corrupted SST files could be attributed to this error.
C: A few other things. The error that my fix addresses is the file somehow getting internal corruption. It could mean that in the middle of a write, if I open the file to write, and the write is long enough, and partway through we kill the system, that would happen. But if you change the test, for example, and the allocation information were shorter, such that a single write would include the whole information, you would never see this problem.
C
You
need
to
have
big
enough
systems,
so
I
don't
know
if
anybody
changed
the
test,
for
example,
and
if
before
you
would
have,
I
don't
know
hundreds
of
thousands
of
location
information,
and
now
you
got
only
like
1
000
of
them,
which
could
fit
in
a
single
right.
C: I think I'm writing up to 4K of allocation information in a single write, so you get it all or nothing. But if you do more than 4K, then you could succeed at the first one and fail at the second one. It's like a race issue, but it could happen. So if the test was modified and the size of the system was shrunk, that could explain it.
C: Actually, sorry, I take it back, Adam. I send a sync request at the end of my write.
C: If the information fits in a single extent, in a single chunk, then I do a single write and sync immediately, and I get all or nothing. But if I do multiple writes, then if I'm crossing the sync point, the system might start flushing the data, and eventually, when I get to the end, I'm going to miss it. Yeah, it requires inspection. Again, everything is just speculation.
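The torn-write scenario being speculated about can be modelled simply: writes up to one chunk land atomically, but a record spanning several writes can be cut off by a crash between them. A toy model under those assumptions, not BlueFS's actual write path:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Each write of up to `chunk` bytes is atomic; a crash between the
// chunk writes of a larger record leaves a torn (partial) record.
struct TornWriteModel {
    size_t chunk;
    std::vector<size_t> durable;   // sizes of writes that reached disk

    // Write `total` bytes, simulating a kill -9 after `crash_after`
    // chunk writes (SIZE_MAX = no crash). Returns true if the whole
    // record became durable.
    bool write_record(size_t total, size_t crash_after) {
        size_t written = 0, n = 0;
        while (written < total) {
            if (n == crash_after)
                return false;              // crash mid-record: torn write
            size_t w = std::min(chunk, total - written);
            durable.push_back(w);
            written += w;
            ++n;
        }
        return true;
    }
};
```

This matches the reasoning above: shrink the allocation information so it fits in one chunk and the failure mode vanishes, which is one way a test change could make the corruption stop reproducing.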
C: We could recreate this issue once. Did you commit your change, the one about the sanity check that you had enabled in debug mode? Is it committed?
B: Oh yes, that's fixed. There's a PR; it's not committed yet. It's still on that lockdep issue from the BlueFS fine-grained locking.
C: Okay, so I can write some internal code on my box. I've got this huge allocation information, with like 500 million extents, so I could insert some assert, or some forced abort, or send myself signal 9.
C
After
writing,
say
hundred
million
extends,
and
then
we
could
see
if
bluefs
is
going
to
reject
the
file
on
startup.
A: Cool, all right. Anyone have anything else they want to talk about this week?
A
If
there's
nothing
else,
then
we'll
end
a
little
early
and
can
get
ready
for
their
their
long
weekend.
Here
at
least
people
at
red
hat.