From YouTube: CDS Reef: Performance
Description
The Ceph Developer Summit for Reef is a series of planning meetings around the next release and some community planning.
Schedule: https://ceph.io/en/news/blog/2022/ceph-developer-summit-reef/
A: This week for pull requests, I saw two new ones. The first is Corey's excellent investigative work and subsequent PR for setting upper and lower bounds on RocksDB omap iterators. Casey is reviewing that and offered some really good suggestions there for avoiding copies and memory leaks. So that's good. Corey, maybe we can talk about that after we go through pull requests, since it deserves a lot of time.
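The idea behind bounded iterators can be modeled outside RocksDB. This is only an illustrative sketch over a `std::map`, not the actual PR; the real change sets `rocksdb::ReadOptions::iterate_lower_bound` and `iterate_upper_bound` so RocksDB can seek straight to the range and stop at its end instead of scanning (and prefix-checking) keys, and potentially tombstones, that lie outside it:

```cpp
#include <map>
#include <string>
#include <vector>

// Model of a bounded range scan: seek directly to the lower bound and
// stop as soon as the key reaches the upper bound. Keys outside
// [lo, hi) are never touched at all.
std::vector<std::string>
range_scan(const std::map<std::string, std::string>& kv,
           const std::string& lo, const std::string& hi) {
  std::vector<std::string> out;
  for (auto it = kv.lower_bound(lo); it != kv.end() && it->first < hi; ++it)
    out.push_back(it->first);
  return out;
}
```

The same shape applies to omap listings: the caller already knows the object's key prefix, so both bounds are known up front.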
A: The other new one is only tangentially performance related: a PR came in to allow RGW to use DAOS as a back end. For folks that don't know, DAOS is kind of an offshoot of Lustre; they actually used Lustre as some inspiration when they started writing it. It's an object-based store that is really focused on high performance, tied to Optane drives to some extent, but it's still fairly experimental. From what I can tell, they only barely have replication at this point, but it is there, and it is really high performance in some sense.
A: So it looks like someone is working on trying to make a storage abstraction layer for DAOS in RGW. That's really interesting. It'd absolutely be interesting to see what that does, how it works, and how it compares to our own OSDs, and maybe we'll learn something from it. So definitely interesting work there.
A: There were two closed PRs this week. One was from me, so I'll talk about it a little bit later on as well, but this is basically the minimal fix that we can implement right now for what we did in a previous change to the AVL allocator.
A: The problem is that if we gave up on a search, we wouldn't update the cursor position, so every subsequent allocation request would continue from that same position and fail. This is kind of the minimal PR to change that behavior, and it fixes some issues that we saw in Quincy.
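The stuck-cursor failure mode can be sketched in a few lines. This is a hypothetical toy allocator, not Ceph's AVL allocator code; it just shows why failing to rewind the cursor on a failed search makes every later request resume at the same dead end:

```cpp
#include <cstdint>
#include <map>

// Toy cursor-based first-fit allocator over free extents (offset -> length).
struct CursorAllocator {
  std::map<uint64_t, uint64_t> free_extents;
  uint64_t cursor = 0;  // where the next search resumes

  // Returns the allocated offset, or UINT64_MAX on failure.
  uint64_t allocate(uint64_t want) {
    for (auto it = free_extents.lower_bound(cursor);
         it != free_extents.end(); ++it) {
      if (it->second >= want) {
        uint64_t off = it->first;
        uint64_t len = it->second;
        free_extents.erase(it);
        if (len > want) free_extents[off + want] = len - want;
        cursor = off + want;  // resume just past this allocation next time
        return off;
      }
    }
    // The fix described above: rewind on failure so the next request
    // rescans from the start instead of resuming at the same dead end.
    cursor = 0;
    return UINT64_MAX;
  }
};
```

Without the `cursor = 0` line, a single failed search would leave the cursor parked past usable extents, which is the repeated-failure behavior the PR addresses.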
A: I think that's been reviewed and there have been some updates, though I'm not sure exactly what. And then there's Adam's PR here. I think parts of that maybe have been merged in other PRs, and it looks like Matt's just asking here if there's anything left from it. Adam, is Adam here? Yeah, Adam! Did I get that right?
B: There are things left from that, but they pretty much need to be redone. Casey made a much better string iterator class that's more idiomatic, so yeah, most of that can probably go, and it should also probably just be broken out.
A: Okay, cool. Lots of stuff in the no-movement category; I don't think there's anything here that's super pressing.
A: Okay, so someone asked in the discussion topics if we should continue the PG log discussion. I suspect maybe that was Gabby, but Gabby is not here yet, so maybe not.
A: Sure, it's possible too that core standup is still ongoing for some, so he might be here in a little bit. Let's just wait a little bit on that one. Okay, I'll move on to the next topic then: recapping the AVL allocator changes. I kind of already talked a little bit about what we did for this minimal change. I still think there is a valid reason to change the way that we are switching into best-fit mode from near fit.
A: Even with this kind of minimal change, we still see ourselves giving up on near fit really quickly and often, sometimes even on 4K searches, and it's just a result of the limits that we have in place.
A: We can increase those limits, but it's really unclear how many bytes forward we should search, or how many iterations we should allow in near-fit mode before giving up. So I have this other PR that basically changes to a time-based search; I'll copy it and paste it in the chat window here. Personally, this makes a lot more sense to me. Instead, we're basically just limiting the search to a certain amount of time, and I think you can go down to microseconds.
A: I don't know exactly what resolution it goes down to, but certainly down to the microsecond level we can say: okay, this is how long we want to search in near fit before giving up. And to me that's a lot easier to tune.
A: If we tune it to around 100 microseconds, we see that the number of near-fit searches that fail goes way down.
A: If we set it to something like 10 microseconds, we actually have, I think, fewer misses than we do currently, when we're only searching something like 16 megabytes forward or 100 iterations. But it's still pretty high; you can kind of see that the curve goes up.
A: It probably looks a little bit like an exponential curve. So I'd argue that at the very least we should increase the defaults, and preferably, from my standpoint, we change to a time-based search so that we're not trying to tune on multiple axes at once.
D: Well, maybe we can support both limits for a while, just to make sure this time-based limit works properly, and then maybe get rid of the bytes limit.
A: Look at the way I wrote it: I tried to avoid that by starting out doing eight iterations per timer check, and then, as you move on, you increase the number of iterations per timer check, assuming that if you set a higher limit you can go farther per check at the timer.
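The scheme described above (a wall-clock budget, with the clock checked only every few iterations and the batch size growing each check) can be sketched roughly like this. The names and structure here are invented for illustration, not taken from the actual PR:

```cpp
#include <chrono>
#include <cstdint>

// Run `step()` until it reports success or the time budget runs out.
// Returns the number of iterations on success, 0 on timeout (the
// caller would then fall back to best-fit mode). The clock is read
// only once per batch, starting at 8 iterations per check and
// doubling, so a slow run still bails out early while a fast run
// pays for very few clock reads.
template <typename StepFn>
uint64_t timed_search(StepFn step, std::chrono::microseconds budget) {
  using clock = std::chrono::steady_clock;
  const auto deadline = clock::now() + budget;
  uint64_t iterations = 0;
  uint64_t batch = 8;  // iterations between timer checks, grows each check
  while (true) {
    for (uint64_t i = 0; i < batch; ++i) {
      ++iterations;
      if (step()) return iterations;  // step() returns true on a fit
    }
    if (clock::now() >= deadline) return 0;  // give up
    batch *= 2;
  }
}
```

As noted in the discussion, this trades a little precision for overhead: a run can overshoot the deadline by up to one batch of iterations.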
A: I mean, you could try to probe how many iterations you can do in a certain amount of time and then iterate at that resolution. We could do it that way; I don't know that we need to. This is a little sloppy, and you see that sometimes you go a little over your timer setting, you might be like 10 over or something, but is that that big of a deal? I don't know.
G: Yeah, I like Matt's idea of adapting the number of steps of the algorithm by time: you adapt how many steps you do per unit of time, you just make it adaptive, and that will basically give us just one or two timer reads per action.
A: Adam, the reason I don't necessarily like that, unless you're rechecking it periodically, is that if all of a sudden you start going slow, you might have a lot of iterations to go through. The way I've written it, you start out small and grow it each time, so that if it's going slow for some reason, you start out with a low number of iterations per check.
A: That's why I like restarting this each time. Even though, yes, you might have a couple more time checks involved, at the rate we're going right now it doesn't really matter, honestly.
A: You can tune it though, Matt, right? If we get to the point where we're fast, you can increase the minimum number of iterations per time check. The way it's written right now, that's tunable. It shouldn't need to be tuned until we actually see it being a problem, but right now we're doing eight iterations minimum per time check, up to a maximum of, like...
A: Yeah, and keep in mind here too that the overhead for a large IO is really different than for a small IO. If we add a couple of nanoseconds to a four-megabyte allocation search where we're already descending down something like an 800,000-node AVL tree, the cost of that search in the AVL tree is the dominating factor. Whereas for a 4K IO we might not have to search nearly as much, but we actually, surprisingly, do a lot of searching in that tree.
A: It's not good, but that's a different topic. Even there, presumably you'd have a much faster time of finding a slot, finding space, and hopefully you're not doing as many iterations, which means you're not calling the time lookup as often. Actually, one thing we should look at is maybe doing the upfront search of eight cycles without doing any time lookup at all. That would be an optimization for the first pass.
F: Sounds like Adam's got some other interesting points there: a giant AVL tree that everything hits.
A: Yeah, so as I've looked at this, I've been kind of eyeballing it and thinking maybe we need to rethink some of our strategy with some of the allocators here, but that's a much bigger topic. Very minorly here, what I want is to make it fairly easy for people to tune: they shouldn't have to understand how far they want to search in this tree or how many iterations that is, which is so meaningless to tune. You know, it'd be better.
A: If, with some degree of accuracy, they could just say, okay, here's how long to spend in near fit before we give up, that's easier, at least for me, to wrap my head around. But the bigger picture of all this, I think, is that we need to avoid doing really deep searches in this tree, and maybe we shouldn't be doing searches in a tree like this at all, at least not an AVL tree. But that's a much bigger discussion, a much bigger topic, maybe.
A: In any event, the fact that we're seeing misses with an empty disk for 4K allocations is surprising. I see us actually giving up after searching forward 16 megabytes, or even more than 16 megabytes, and then going into best fit for a 4K allocation. Something still seems kind of off here to me.
A: But in any event, more to do there, definitely. So we have everyone from core standup now. Oh, actually, does anyone else want anything regarding allocation stuff? I think we probably covered most of what I wanted to talk about there anyway.
C: It just seems like a good topic for CDS. Gabby, I understand you did some testing where you eliminated the PG info writes?
H: I did eliminate it, but I was testing something else. I realized that disabling that column family, while increasing IO performance, caused some write amplification, about five percent extra write amplification, which made absolutely no sense to me, because we write less data to RocksDB. How come it generates...
H: A change in the ratio: the write amplification for 4K writes is about five percent higher when we don't put the allocation map in RocksDB. Now, the thinking is that by stopping the allocation map from going to RocksDB we made ourselves faster, and by being faster we increased the distance between PG log creation and PG log deletion, because we are now able to push more information in the same amount of time, and that means the PG log entries now exist in more levels.
H: So to test this idea, I disabled the PG log, and with the PG log disabled I tried once writing the allocation map to RocksDB and once without it. Now, I got this result: I was expecting that when allocation maps are not going to RocksDB, we'd be doing five percent less disk writing. Does that make more sense?
H: But it's not. If you use the mechanism we use for the deferred write, and Adam, please correct me if I'm wrong, it means you need to update RocksDB, and RocksDB would write to the write-ahead log. But my understanding is that with a deferred write, when we do an update we generate yet another object; every update is just yet another object.
C: The advantage is that it's going into the write-ahead log, so it has the same lifetime semantics as the write-ahead log. If...
C: It's because the lifetimes are far shorter. If it's actually still a problem, if trimming the write-ahead log creates the same tombstone problems that the PG log does, then part two would be updating BlueStore to write directly to the write-ahead log. I understand it's possible to co-opt RocksDB's own log.
C: Yeah, so this is doing it in two different steps. The proposal is to use the data payload portion of BlueStore objects to store the PG log; that's part A. Yes, immediately, in the very short term, it would still be written to the write-ahead log keys in RocksDB, but further optimizations to BlueStore would improve that pathway as well, eliminating that component.
C: So I think what I'm saying is there are two problems here. The first is that we are writing the PG log into RocksDB directly as its own keys. That part is a problem because it more or less guarantees tombstones. So the first step is to change the BlueStore interface we're using to do writes, to use a normal object payload.
H: Right, the write-ahead log protects you against failures, as you understand, of course, and it allows you to bundle multiple updates together. At the moment, what we do is write the onode, PG info, and PG log as a single update to the write-ahead log, and the onode we know that we need.
H: So we piggyback the PG log on the onode's write to the write-ahead log, and that's the reason we assumed that pushing it to RocksDB was going to be very cheap: the write-ahead log is the actual cost, and this thing is piggybacked on the onode. What we didn't take into consideration is that eventually the memtable gets flushed, and a tombstone follows for every object.
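The flush-time cost being described can be modeled with a toy memtable. This is not RocksDB internals, just a sketch of the mechanism: every key that is written and later deleted leaves a delete marker that must still be written out at flush so older levels learn the key is gone, which is why "cheap" piggybacked PG log keys still cost on the flush path:

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>

// Toy memtable: a value of nullopt is a delete marker (tombstone).
struct ToyMemtable {
  std::map<std::string, std::optional<std::string>> ops;

  void put(const std::string& k, const std::string& v) { ops[k] = v; }
  void del(const std::string& k) { ops[k] = std::nullopt; }

  // At flush, live values go to an SST; delete markers become
  // tombstones that are written out too, one per deleted key.
  size_t flush_tombstones() const {
    size_t n = 0;
    for (const auto& [k, v] : ops)
      if (!v) ++n;
    return n;
  }
};
```

Even if every PG log entry is created and trimmed before the flush, the trims still materialize as tombstones.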
C: There is no benefit, that's what I'm saying. So what I'm saying is: we are using a key range in RocksDB as a write-ahead log. Alternately, RocksDB has an interface it uses to access the sequential-write file that it uses for doing the write-ahead log in the first place; we plug that through to BlueFS. What I'm suggesting is that if writing the PG log via RocksDB keys becomes too expensive, we could bypass that layer and write it directly to the underlying journal.
H: Yes, it is doable, but it's very tricky. That's one idea we suggested in the past, and the logic is very clear, it should fix the problem, except that that code is tricky to get right. Now, there is another option I tried to suggest that keeps the same semantics. The option goes like this: bundle PG logs from different PGs, don't differentiate between PGs, and create a new mechanism, a PG log repository, once a request arrives to the system, to the messenger.
H: But eventually, when this thing reaches execution, it's going to say: I need you to commit it. By that time, the new mechanism would see how many objects it got, bundle all of them together, and create a single update of a bigger PG log. It's going to be a PG log container; it's just going to flush everything it got into a single object.
C: That has a bunch of other problems. For instance, you don't actually know what order those things are going to go to disk in. We haven't done the work yet to find out whether we can even serve that IO synchronously; it may still need to block on recovery or any of the other ten things, so it's much more complicated than that.
C: That may be, but I'll point out that it does exactly the same thing you would achieve if you instead wrote those PG log updates to the write-ahead log and then batched them after the fact.
C: So if we batch up, let's say, eight client writes, those eight client writes cannot commit until their corresponding log entries come out. So one way or another, we have to write those PG log entries with the corresponding object writes. The whole point of using a journal, or one of the optimizations available to you if you use a journal, is that you don't actually have to do that: you can perform the writes as they show up, with lower latency, and then retire them after the fact.
H: No, I do not. I don't understand: what is it that you batch in your design? At the moment, what we batch together is the onode, PG log, and PG info, and if we happen to have more of them, then we do them too. But usually, because of the way we split things on PG boundaries, if you've got 128 PGs, then you're never going to have more than two of them on the same PG, never more than two active together.
G: I'm just thinking of maybe some solution that might be more feasible with the current architecture. In RocksDB there is an ability to insert a merge operation. Could it be...?
G: Basically, I guess I'm asking: do you think it would be possible to somehow keep having as many objects as we have PGs, and create a merge operation that will basically modify that PG info and log state somehow? Then on writes we would only add that merge operation into the write-ahead log, and if something goes wrong and we have to recover from failure, we reconstruct the entire PG info and log state after re-reading the write-ahead log. Do you think that's the...
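The merge-operation idea can be sketched with a toy model. This is not RocksDB code (the real hook would be a `rocksdb::MergeOperator`); all the names here are invented for illustration. Instead of rewriting the whole PG info/log value on every write, each write appends a small operand, and recovery rebuilds the full state by folding the operands over the last persisted base value:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-PG log state.
struct PgLogState {
  std::vector<std::string> entries;
};

// One small record appended per client write.
struct MergeOperand {
  std::string new_entry;  // entry added by this write
  uint64_t trim_to = 0;   // drop this many entries from the front
};

// Fold one operand into the state: the "merge" in merge operator.
void apply(PgLogState& s, const MergeOperand& op) {
  s.entries.push_back(op.new_entry);
  if (op.trim_to > 0 && op.trim_to <= s.entries.size())
    s.entries.erase(s.entries.begin(),
                    s.entries.begin() + static_cast<std::ptrdiff_t>(op.trim_to));
}

// Recovery: replay the logged operands over the last persisted base.
PgLogState recover(PgLogState base, const std::vector<MergeOperand>& wal) {
  for (const auto& op : wal) apply(base, op);
  return base;
}
```

The appeal is that steady-state writes only touch the log of operands; the full value is only materialized at compaction or recovery time.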
H: I think what you suggest is something we tried, and that's how this whole thing started. We tried recycling the PG log keys: we had a mapping from the real PG info to the PG log key, so we kept recycling the same keys and just kept doing updates. And at that time, this thing actually caused a performance degradation.
H: My thinking now is that even if you keep recycling, at some point the memtable is going to reach the disk, because every time the write-ahead log is filled and then removed, everything is flushed. So by recycling and never deleting (at the time, we did recycle and never delete) I think maybe that was the problem; maybe we should have introduced a secondary garbage collector.
H: We kept, I think, three thousand entries in the PG log table, and everything was remapped, so we never had to do a delete. But what happened is that when the write-ahead log was removed, this thing arrived at the disk, and then a few seconds later those things reached the disk. So every few seconds we created entries which were never removed.
C: So let me ask about what you were talking about before: how hard actually is it to modify the way we do deferred writes so that they don't become RocksDB keys?
H: It is possible, but once you do that, you need to take control of the flush operation, because once the write-ahead log is filled it gets removed, and you need to know that it's been removed, because once it's removed you need to store whatever you have in memory in some other place. The write-ahead log is not persistent storage; it just persists the writes.
A: I think that Igor's plan to try looking at implementing our own write-ahead log outside RocksDB has a lot of merit.
H: Yeah. I still don't know Igor's work plan, but if Igor would in fact replace or separate the RocksDB write-ahead log, then we'll be positioned to make the other change. And I think, after making the first change, separating the write-ahead log from RocksDB, then giving us a way to store PG logs without pushing them to RocksDB is an incremental improvement which should not be very hard.
J: Yeah, I thought Gabby had said that he moved the PG log out and it increased the number of IOPS being used by twenty percent, the IOPS that we were seeing on the object.
A: One early attempt at this, too: Lisa from Intel had tried to just write out PG log updates to 64K allocations in BlueFS, and she was not seeing any benefit to it, so she gave up on it pretty quickly. But it was never clear to me whether or not that was really a good test, and it sounds like, Igor, you're having much better success with yours.
A: All right, well, we're at the hour, guys. Corey, if you're still here, I apologize we didn't get to all of your excellent work. I will have it as the first topic for next week, if that's okay with you.
A: All right, well, we are rapidly losing people. Oh good, Corey, okay, great, thank you. I want to do your work justice, because it's excellent. So next week, first thing, let's talk about your work with RocksDB iterator boundaries, and then potentially we can continue this PG log discussion as well after that. So thank you everyone for coming, have a great week, everybody, a happy holiday weekend for those that are celebrating it, and we'll meet next week.