From YouTube: Ceph Performance Meeting 2022-06-09
Description
Join us weekly for the Ceph Performance meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
A: So, okay, I don't actually have a lot to talk about with the PRs, so we'll go quick. I didn't see many new or updated ones this week; I think everybody's taking a little bit of a break after all the Quincy work that happened. But the one that did close was this tracing PR. I approved it last week; the new numbers that came out looked really good, so that has now merged, which is really good. Otherwise, I have a bunch of stuff I need to go through and figure out what to do with, a bunch of old PRs. I still need to look at the CBT teuthology perf-testing thing. I don't know why it's apparently sometimes not giving results, but there's a lot of glue code in all of this, so it's probably something to do with that.
A: Let's see, tcmalloc: enabling tcmalloc with Seastar. We have to decide what we're going to do about the errors that were cropping up with tcmalloc. I don't think it's our fault; we just have to decide if we care about them or if we're just going to whitelist them. So, what else?
A: Let's see, we don't have Adam here. I was gonna say, this one: allowing control of the tcmalloc thread cache size with a Ceph configuration option. That's still, actually, I think a really good idea, but I don't know, not necessarily super high priority.
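As a rough illustration, here's a minimal sketch of what such an option could drive, assuming the gperftools tcmalloc is in use. The property string is real gperftools API; the function and the idea of feeding it from a Ceph config option are assumptions for illustration.

```cpp
// Hedged sketch: apply a configurable tcmalloc thread cache limit.
// "tcmalloc.max_total_thread_cache_bytes" is a real gperftools property;
// the function below and the Ceph-option wiring are illustrative.
#include <gperftools/malloc_extension.h>

#include <cstddef>
#include <iostream>

void apply_thread_cache_limit(size_t bytes) {
  // SetNumericProperty() returns false if the property is unknown.
  if (!MallocExtension::instance()->SetNumericProperty(
          "tcmalloc.max_total_thread_cache_bytes", bytes)) {
    std::cerr << "failed to set tcmalloc thread cache size\n";
  }
}
```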
A
Gabby
related
thing
we
were
just
talking
about
so
in
the
ether
pad.
There's
this
pr,
which
adds
some
old
map
benchmarking
capabilities
to
the
google
test,
suite
store
test,
but
it
adds
a
lot
of
extra
stuff
in
store
tests
like
it
takes
longer
to
run,
but
it's
it
was
actually
really
useful
when
we
were
playing
with
it.
This
may
help
give
an
idea
of
like
what
omap
and
roxdb
overhead
looks
like
like
how
fast
we
can
do
that.
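As a rough sketch of the shape of such a microbenchmark (not the PR's actual code), a harness like this could time the omap write and iteration passes; the closures passed in stand in for the real ObjectStore work.

```cpp
// Rough sketch of an omap microbenchmark harness; the closures passed in
// would do the actual ObjectStore omap writes and iteration.
#include <chrono>
#include <cstdio>
#include <functional>

void bench_omap(const std::function<void()>& write_keys,
                const std::function<void()>& iterate_keys) {
  using clock = std::chrono::steady_clock;
  const auto t0 = clock::now();
  write_keys();    // e.g. set N omap keys via a transaction
  const auto t1 = clock::now();
  iterate_keys();  // e.g. a full omap iterator walk
  const auto t2 = clock::now();
  const auto ms = [](auto d) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
  };
  std::printf("omap write: %lld ms, iterate: %lld ms\n",
              static_cast<long long>(ms(t1 - t0)),
              static_cast<long long>(ms(t2 - t1)));
}
```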
C: Are you saying there's a benchmark for snap removal?
A: Yep. And I think, did we determine that iteration behavior was still a big deal?
D: We are not sure, but whatever the PR... actually, sorry, there is no real iteration in the current code. What it is doing is iterating with a size of two, so it's unlikely that iteration is a serious cost, but...
D: We hope to change it to do, I don't know, maybe 100 objects each time, so iteration would become more of an issue. But another thing we think might be possible to do there is to change the data structures. At the moment, the way I understand it, SnapMapper is a single global object holding all the snap information.
D: So if you remove a snap object, it's only going to affect the... sorry, you're going to remove the snap on every PG, but a PG is a standalone thing, so I don't see what kind of dependency there would be between two PGs, though I might be missing something. And then, if we broke it down further and made it so that every PG would have a table of snaps, and only there would it hold the objects, then when you remove it, it's an object that nobody else is accessing. So you don't even need locks.
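A toy sketch of that proposed layout, using placeholder types rather than Ceph's actual ones: each PG owns its own snap table, so trimming a snap touches no state shared with other PGs.

```cpp
// Toy model: per-PG snap tables instead of one global SnapMapper object.
#include <cstdint>
#include <map>
#include <set>
#include <string>

using snapid = uint64_t;
using object_id = std::string;  // placeholder for hobject_t

struct PgSnapTable {
  // snap -> objects in *this PG* belonging to that snap
  std::map<snapid, std::set<object_id>> snap_to_objects;

  // Removing a snap is purely PG-local: no other PG reads this table,
  // so no cross-PG lock is needed.
  void remove_snap(snapid s) { snap_to_objects.erase(s); }
};

struct Pg {
  PgSnapTable snaps;  // one independent table per PG
};
```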
A: That piece that you just described, I think, was the only thing really holding us back from doing per-PG sharding in BlueStore. Oh, I mean, there's other stuff, but nothing really important like that was. That was the thing that always came up as one of the reasons why we can't do independent hashing across PGs, or, you know, PGs across shards, right: that you had this global SnapMapper thing.
A: Josh D., do you remember? Because I don't remember there ever being any real reason why it looks the way it does, but every time I've asked about it, it's always kind of been a hand wave; you don't really remember.
D: The only reason I could think of is that somebody was trying to provide an easy mechanism to create a snap across multiple PGs, because that's a single synchronization object. But I think that could be done even if you had multiple snapmappers, one per PG; you could still synchronize the creation by grabbing multiple locks.
A: So in the chat window I posted a link. I think this was the original commit of the SnapMapper. So, yeah, what this appears to be is the state that it came in, like what the original intent or thought here was. Things were a little, yeah, less documented back then, I think. But in any event, so your proposal, Gabi, I mean, the gist of it is that you want to do the same thing with the SnapMapper that you did with the allocation data, right? Yeah.
D: We need to delete them all, and every deletion is actually translated into two deletions, so there's a burst of tombstone creations. We suspect that might be one of the reasons for the RBD mirroring pain, because RBD mirroring is creating a snap every 15 minutes, and once it does, it iterates over the snap to copy the data to the remote side.
A: And right now, the iteration that you mentioned, when you iterate: is that before or after the deletion? Is it also iterating during deletion?
D: For deletion you do an iteration, but you just start the iteration, take the first two objects, and you are done, and then you're going to start another iteration. But if you do a copy, I would imagine you do another iteration; so in RBD mirroring they probably have to iterate over the snap to know which objects to copy to the remote side. So I'm sure there is iteration there, you see.
D
Let's
just
delete
this
later.
I
just
want
to
first
introduce
the
first
concept
yeah.
D
So
the
first
concept
saying
we
don't
need
ropes
to
be
to
give
us
protection
for
the
snap
mapper
at
shutdown
it's
going
to
be
staged
to
a
file.
It's
going
to
cost
us
hundreds
of
milliseconds
in
worst
case
scenario,
no
big
deal
on
startup.
We
going
to
read
it
from
the
file
into
the
memory
object,
and-
and
here
we
actually
going
to
save
time
because
reading
from
a
flat
file
is
much
faster
than
reading
from
works.
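A minimal sketch of the stage-to-a-file idea, assuming a simple text encoding; a real implementation would presumably use Ceph's own encode/decode machinery, as the allocation-map work did.

```cpp
// Hedged sketch: stage an in-memory map to a flat file at shutdown and
// reload it at startup. Plain key\tvalue text here; Ceph would encode
// into a bufferlist instead.
#include <fstream>
#include <map>
#include <string>

using SnapMapperState = std::map<std::string, std::string>;

bool store_to_file(const SnapMapperState& m, const std::string& path) {
  std::ofstream out(path, std::ios::trunc);
  for (const auto& [k, v] : m)
    out << k << '\t' << v << '\n';
  return bool(out);  // one sequential write: hundreds of ms at worst
}

bool restore_from_file(SnapMapperState& m, const std::string& path) {
  std::ifstream in(path);
  if (!in) return false;  // fall back to rebuilding from RocksDB
  std::string k, v;
  while (in >> k >> v)    // a flat-file scan beats a RocksDB walk
    m.emplace(std::move(k), std::move(v));
  return true;
}
```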
D
To
be
my
experience
with
that
location
map
seems
to
indicate
we
could
actually
save,
maybe
one
second
on
a
again
big
system
if
it's
small
system
none
of
this
matter.
But
if
it's
a
big
system,
I
I
I've
seen
that
reading
the
allocation
from
roxy
b
can
take
a
few
seconds
again
very
big
system,
hundreds
of
million
of
objects
here
with
snapmap.
I
don't
think
it's
ever
going
to
grow
to
100
million
of
object.
D: So all this work, and all the pain we experienced, is going to benefit us now, because we have the exact entry points to call for store and restore. So that part should be relatively easy. It's essentially serialization of the object and doing a write to disk, and then reading it back from the disk and populating the object; it should be a very simple process, and this time it should be safe, because we know what we're doing and when, so it should be safe to do that. So that's part one.
D
One
part
two
is,
of
course,
what
should
be
done
in
case
of
a
disaster,
because
roxdb
used
to
give
us
protection
from
disaster
by
using
right
ahead
logs
and
all
the
availability
concept
that
you
got
from
works
to
be
we're
going
to
keep
rocks
db
so
in
disaster,
we're
going
to
lose
that
the
snap
marker
the
same
way.
D: But I think this time we're not going to add a lot of extra cost, because we have to iterate over the RocksDB objects anyway. This is the thing we do for the allocation map, and that was the biggest part of the cost.
D: It makes sense, and I ran it by George Targan and got some feedback from Matan, and the question was whether the onode has enough information to rebuild the data, because that's the real problem: how do you rebuild the snap map from the onode? And there was agreement that the information there is sufficient for rebuilding. That's project number one, independent of this project.
D: And then we delete it. The deletion that is done today: we iterate over the SnapMapper and take two objects each time. I don't know why we chose the number two, but never mind; even that number isn't really utilized, because we then process each object from the two that we took on its own. The only thing the two buys us is making the iteration a bit more efficient. And then, for each object, we act as if this was a write, meaning we need to create a replication job and a PG log entry to represent the fact that we are going to delete the object.
D
So
every
object
node
that
we're
going
to
delete
is
going
to
generate
the
pidgey
log
is
going
to
do
communication
with
the
replicas
and
it's
going
of
course,
to
go
to
roxdb.
So
first
thing
we
go
to
roxdb.
We
delete
the
object
node
from
the
snap.
We
also
delete
the
two
objects
on
the
snap
mapper,
which,
if
we
do
the
first
project,
then
that
part
is
going
to
go
away,
make
it
cheaper
and
we
create
a
pretty
long
entry
to
describe
it.
D: The number we are thinking about is maybe 100, but of course we could go 50 or 200, I don't know, just find a good number, and then we're going to generate one PG log entry with the list of all the entries that we're going to delete. It's not... all right, I think the problem is that we simply took the code we had for a client write, because a client write is something we cannot anticipate.
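A sketch of the batching idea with illustrative names (the real change would live in the OSD's snap-trim path): collect roughly 100 objects, then describe them in one PG log entry and one replicated operation.

```cpp
// Hedged sketch of trim batching; names are illustrative, not the OSD's.
// Instead of one PG-log entry plus one replica round-trip per trimmed
// object, collect ~100 objects and describe them in a single entry.
#include <string>
#include <vector>

struct TrimBatcher {
  size_t batch_size = 100;           // 50..200 all plausible, needs tuning
  std::vector<std::string> pending;  // objects waiting to be trimmed

  // Returns true with a full batch in *out when one is ready; the caller
  // turns it into a single PG-log entry and one replicated transaction.
  bool add(std::string obj, std::vector<std::string>* out) {
    pending.push_back(std::move(obj));
    if (pending.size() < batch_size)
      return false;
    out->swap(pending);
    pending.clear();
    return true;
  }
};
```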
D: The pain is going to be amortized over maybe 100 objects, so in fact it means the PG log is going to cost us 1% of what it used to cost us; and, of course, we're going to save the TCP communication with the replicas, and we can do everything else somewhat faster. So that's project number two.
A: On project number two, Gabi, I was wondering: where is the code that governs the number of deletions that is happening at the same time?
D: Oh, you actually can see it; I will share my screen. Sure. Okay, so this is where we start: PrimaryLogPG, AwaitAsyncWork::react. This thing is being called externally.
D: Yes, pushing them here, but the number is really just two. Even that is no big deal, because let's assume you change it and make it 100: then you ask the SnapMapper to give us the next objects to trim, and if everything went fine and there were no errors, we reach here. Now we iterate over the objects that we got; the trim gives us the list of objects belonging to the snap which need to be trimmed, and then we call the PG's trim_object.
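A simplified paraphrase of the flow just described, not the real PrimaryLogPG code: fetch up to max objects still mapped to the snap being trimmed, then trim each one individually.

```cpp
// Simplified paraphrase of the snap-trim flow (not the real code).
#include <string>
#include <vector>

struct SnapMapperIface {
  // Return up to `max` objects still mapped to the snap being trimmed;
  // negative on error, 0 on success.
  virtual int get_next_objects_to_trim(unsigned max,
                                       std::vector<std::string>* objs) = 0;
  virtual ~SnapMapperIface() = default;
};

// Stand-in for the per-object trim: the real code queues a replicated
// delete plus a PG-log entry for each object.
static void trim_object(const std::string& obj) { (void)obj; }

void do_snap_work(SnapMapperIface& mapper, unsigned max /* currently 2 */) {
  std::vector<std::string> to_trim;
  if (mapper.get_next_objects_to_trim(max, &to_trim) < 0)
    return;  // nothing left (or an error): leave the trimming state
  for (const auto& obj : to_trim)
    trim_object(obj);  // one replicated op per object today
}
```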
A: Maybe while you were talking I was simultaneously looking at how we kind of ended up with some of that code, I think. I'll be honest, it doesn't totally make sense to me, because I was only half paying attention, but I was also looking at this and I wanted to get your opinion on it. This was part of the erasure coding work, I think; this PR, 11701, is, I believe, when erasure coding was introduced.
A: Yeah, I confess I'm just trying to do some archaeology to understand some of the rationale here inside the SnapMapper and the snap trimmer, to grab N at a time, and that N is two. Is that right?
A: So I see there they added unsigned max = g_conf osd_pg_max_concurrent_snap_trims. That should be in the YAML. Yeah, but I don't see it... sorry, back then, sorry, they could have changed that; at the time it would have been in the config options. Maybe it already existed, because I do see in the original code there was a reference to this, so we must have already had osd_pg_max_concurrent_snap_trims, but...
A: Well, and in any event, yeah, that's the setting that's being read.
A
But
okay,
but
then
what
does
it
mean
if
it
the
this
says
here
that
we're
changing
it
to
n
if
we
were
already
using
that
somewhere,
unless
that
was
part
of
this
erasure
coding
pr
in
general
and
it
it
was
done
in
a
previous,
commit
in
in
the
pr
as
possible.
A: Sorry, yeah, it should be in the config options in the source. Yeah, it's okay.
D: So we have it: osd_pg_max_concurrent_snap_trims, type unsigned integer, level advanced, default two, minimum one, with_legacy true.
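A toy model of the option as read out (the real definition lives in Ceph's option tables): an unsigned integer with default 2 and minimum 1.

```cpp
// Toy model of the option just read out; the real definition lives in
// Ceph's option tables, not in code like this.
#include <algorithm>
#include <cstdint>

struct UintOption {
  uint64_t def;  // default value
  uint64_t min;  // smallest accepted value
  uint64_t clamp(uint64_t requested) const {
    return std::max(requested, min);  // anything below min is raised to min
  }
};

const UintOption osd_pg_max_concurrent_snap_trims{/*def=*/2, /*min=*/1};
```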
D: But once you've got this vector, let's say you got 100 entries (at the moment you just get two, but let's say you got 100 entries), you're now going to loop over each one of them and create a replication operation for each one, with its own PG log entry.
D: Thinking about it, I think the first change is going to make a much larger impact, because, first, it's going to affect any interaction of the PG log with snaps, not just trimming, and even for the deletion it's going to eliminate two tombstones per object. Because at the moment, when we delete an object, we are generating four tombstones: one for the object, two for the snap mapper, and one for the PG log.
D: So you're going to have, I don't know, if we have 100 PGs and we've got 10 active snaps, then you're going to end up with 1000 small snap mappers, each of which could be accessed using its own set of locks.
D
So
when
you
delete
a
snap,
mapper
you're
not
going
to
impact
anybody
else
when
you're
deleting
a
snap,
so
that's
the
third
change
we
suggest
I
expect
its
impact
is
going
to
be
much
smaller,
but
it
might
help
with
reducing
lock
collisions.
I
don't
know
how
much
of
this
we
got.
How
many
collisions
we
see.
A
Gabby,
I'm
still
trying
to
see
if
I
can
find
where
that
was
introduced,
specifically
that
both
of
those
those
options,
but
I
haven't,
haven't
gotten
it
figured
out
yet
so.
In
any
event,
my
my
reading
on
all
of
this
is
that
you
should
trust
your
gut.
A: You know, I think you're right that there's probably a lot here that just kind of ended up in the code base without a whole lot of thought or testing in terms of how fast we could do this or what the RocksDB behavior would be like. So I think especially your first proposal is a good idea; the second one I'm not sure about, but I think it's a very fruitful area to be looking at in general.
A: Yeah, I agree, I think getting more info from Sam would be a really good idea. Yeah, all right!
A: Well, just a couple minutes left here, so I'll just quickly say that I'm looking at our ShardedOpWQ in the OSD. We talked about this a little bit last week, but the gist of it is that when you have lots of shards and one messenger thread, everything's really, really slow, and I think I figured out the reason for that this week: it's when we can't keep the per-shard queue full. We basically let the threads just, you know, wait on the condition variable, and then, if we detect that the queue is empty but there's now something coming in, we wake everything up; we do a notify_all. And I think that is the reason why this is really, really slow when we create lots of shards and don't have enough data coming in to fill them all up: waking those threads up is really expensive. That's what it looks like is going on.
A: So my guess is that any time we get into a state like that, where we don't have enough data to keep the per-shard queues full, we're seeing a lot of overhead and a lot of performance degradation.
A
If
you
can
keep
the
queues
full,
maybe
you
do
okay,
but
if
you
can't
consistently
do
that,
then
we
might
start
seeing
performance
and
efficiency
go
down.
So
I'm
I'm
thinking
about
different
ways
to
try
to
change
this.
The
one
I'm
kind
of
focusing
on
right
now
is:
maybe
we
can
have
an
alternate
method
to
wake
up
these
threads.
Maybe
we
instead
keep
track
of
the
threads
that
that
we
should
be
waking
up
and
we
we
do
so
individually
in
a
loop
rather
than
a
global
notifial
based
on
one.
We.
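A sketch of that idea in isolation, illustrative only (the OSD's ShardedOpWQ is organized differently): track which shard threads are idle and wake each with its own notify_one instead of broadcasting notify_all.

```cpp
// Hedged sketch: targeted per-shard wakeups instead of a broadcast.
#include <condition_variable>
#include <mutex>
#include <vector>

struct Shard {
  std::mutex m;
  std::condition_variable cv;
  bool has_work = false;
};

// Instead of notify_all() on one shared condvar, track which shards are
// idle and poke each one individually.
void wake_idle_shards(std::vector<Shard*>& idle) {
  for (Shard* s : idle) {
    {
      std::lock_guard<std::mutex> l(s->m);
      s->has_work = true;  // set the predicate under the shard's lock
    }
    s->cv.notify_one();    // targeted wakeup, no thundering herd
  }
  idle.clear();
}
```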
B
Well,
I
think
that
the
problem
would
be
that
you
need
to
you
need
to
wake
the
proper
spread
for
all
of
them
and
then
check
yes,
just
necessary
number
of
ophress.
You
will
be
interested.
You
need
to
pinpoint
exact
ones.
A: I think, Radek, what I'm a little concerned about is that we can essentially end up in a situation where the messenger threads can't keep the shards full. That's when we hit it, right? And maybe that's not common if you have a huge amount of work coming in, but I wonder...
A: ...if, with our default configuration variables, it might often be that you see those queues become empty.
B
Okay,
this
basically
means
that
an
alternative
could
be
increasing
the
default
number
of
messenger
threads.
A
Maybe
maybe
you
know
three
definitely
is
much
better
than
that.
A
Yeah
we
could,
we
could
both
simultaneously
reduce
shards
and
increase
messenger
threads,
but
I
think
what
I
I'm
a
little
afraid
of
is
okay.
So
one
question
is
how
good
and
how
even
is
our
hashing
over
over
the
shards
like
are?
We
are
we
ever
seeing
situations
where
the
clumpiness,
due
to
the
random
distribution
with
the
hashing
means
that
that
you
know
one
particular
shard
might
be
less
less
activated
than
others?
Well,.
B
You
mean
I,
I
bet
you
mean
sharding
by
selecting
the
the
shards
for
selecting,
basically
tp
or
dtp.
The
thread
for
selecting
the
pg
will
will
be
handled
by.
A
Yes,
I
agree
with
you
on
that.
I
guess
what
I'm
I'm
wondering
is
you
know?
How
often
can
we
end
up
in
a
situation
where
any
particular
shard
ends
up
calling
notify
all.
A: Like, do we have cases where it's easier? Certainly, if we have fewer shards it's a much better situation, right, because it's much less likely that any individual queue becomes empty. But then we have to increase the number of threads per shard, and that's not optimal either, it seems.
B
Yeah
well
how
about
thinking
that
on
that
next
week?
Well,
tomorrow
is
tomorrow's
good
day.
A
We
should
wrap
this
up,
but
I
agree
with
you
and
then
I
can
also
show
some
of
the
the
locking
behavior
too,
because
I
think
we
do
see
contention,
especially
on
the
guard
lock,
but
also
we
see
contention
on
the
the
weight
lock
as
well.
So
it's,
I
think
it's
a
balancing
act,
a
lot
in
this.
How
to
to
manage
all
these
things.
B
Well,
we
could
improve
the
messenger
to
tp
passing
at
the
price
at
let's,
let's
say
more
contention
around
pd,
lock
and
guard
items.
A
Okay,
I
was
even
just
just
to
think
about
well
well
you
over
the
week,
but
perhaps
maybe
you
have
a
messenger
thread
that
that
well
could
we
re
work
this
in
some
ways
that
each
messenger
thread
is
per
shard
and
can
pull
off
data
off
the
of
the
the
network
layer?
I
don't
know
I'm
not
exactly
sure
how
that
works,
but
things
to
think
about
how
how
all
of
these
different
threads
interact
with
each
other
in
the
system.
B
Yeah,
oh,
it's!
So
you
really
need
to
wrap
up
okay,
cool
thanks
for
thanks
for
beginning
the
discussion.
We
will
get
back
we'll
get
back
to
you.
A
Okay
sounds
good
all
right.
I
think
it's
probably
good
time
to
wrap
up
then
too
so
have
a
good
day.
Everyone
thank
you
for
coming
and
we'll
continue.
Maybe
that
discussion
next
week.