From YouTube: Ceph Performance Meeting 2022-03-24
Description
Join us weekly for the Ceph Performance meeting: https://ceph.io/en/community/meetups
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
A: Well, it looks like we've got people from core now, so I'll get the meeting started. Okay, I didn't have a whole lot of updated or new PRs this week, but that's okay; we're still in the kind of final stages of Quincy here. So the only new one that I saw specifically regarding performance was Gabi's PR.
A: This came out of some testing that I'll talk about a little bit later in the discussion topics, but this is to fix an issue with the NCB ("no column B") code, where it was defaulting to using the bitmap allocator in certain circumstances. So hopefully I'll be testing that pretty quickly here, just to see if it helps resolve the performance regression that we're seeing. There are two updated PRs, just some updates on the tracer code.
A: I don't remember that it was anything real major, just more discussion going on there. And then this MDS PR from Patrick; I think there was a little bit more discussion also going on with that PR. Or actually, perhaps not: when I look at this, I think maybe it was supposed to move down to the "no movement" section. In any event, not a whole lot there. Everything else that I looked through, I didn't see updates for.
A: If I missed anything, let me know; I've got kind of a head cold right now, so I'm a little out of it. Any updates I missed, for anyone?
A: All right, well then, we've got a couple of different discussion topics to cover, so that'll be good. All right, first topic: the Quincy large-write performance regression. We've been doing performance tests on Quincy versus Pacific to make sure that everything looks good for the release, and in aging tests we saw that we were regressing.
A: You can see it in the first link that we've got in the Etherpad; I'll put that in the chat window as well for folks here. But it's pretty obvious, the regression there. It took a little bit of work to narrow it down to NCB, but we're pretty sure that's what it is now.
A: Hopefully Gabi's PR will resolve it, but if not, then we'll have to make a decision about how to proceed: whether we want to try to get NCB fixed quickly before we release, or whether we disable it for the first release and then fix it and re-enable it later in a follow-up release.
A: Let's see. So one offshoot of that testing that was kind of interesting is that I went back and actually looked through both Pacific and Quincy with the different allocator implementations, doing the same kind of aging stress test. It's in the same document, in another tab, but I've put the link directly to the tab in the chat window too, if folks are interested. There's a really big variability in the kind of aging performance that we see in this hypothetical aging test with the different allocator implementations. I found it really, really interesting.
C: But "stupid" is reserved for the allocator.
A: You know, the funny thing, though, is that the stupid allocator's not doing terribly in these tests; it's actually performing fairly well. And I will say, though, that this is only a 3.2 percent fill of the disks. My intention, if I had time, was to go back and do the same iteration of tests with much fuller drives, but I haven't been able to do that.
A: Not yet; I've been just too busy trying to get this other set of testing done. But it is kind of interesting that stupid is doing pretty good in these. Having said that, the hybrid allocator back in Pacific, at the very least, was quite good, right?
D: Then I have a comment about that: it looks like Pacific 16.2.7 and prior versions lack at least one, well, they actually lack a set of fixes for the AVL and hybrid allocators, and I could see a pretty significant performance drop with this version of Pacific and the hybrid allocator when the disk is highly fragmented. Okay, and the disk is highly fragmented here, and that's the only disk for those tests.
A: I had a feeling that if I did more tests with the disk extremely full, these results might look a little different; I just hadn't been able to run them yet. But yeah, this is a very small overall data set relative to the size of the disks, but with a lot of rework over that data set.
D: Just an example: database compaction using the hybrid allocator took 10 times more time, something like 50 minutes, versus, let's say, five minutes using the stupid allocator. (Wow.) I could see up to 100 millisecond latency per single allocation for BlueFS when using the hybrid allocator.
D: I can later show you the cluster latency when using the hybrid allocator versus a subsequent switch to the stupid one.
D: Yeah, bitmap is somewhere in between. So it's okay; it's better than the hybrid allocator in this case, but the stupid one is even better.
A: Well, good, good! Okay, I'll take down a note to keep that in mind. It wouldn't be worth doing a bunch of testing on 16.2.7 if it's just telling us something we already know and have fixed. Okay, any thoughts or any other questions?
E: For anyone? I'm still trying to understand the second graph. What's the difference? There are two graphs, and the one showing the "60 NVMe BlueStore allocator aging test"...
A: Oh shoot, I didn't fix the labels on those. I'm sorry, Gabi; it's supposed to be 4KB, and I just didn't get the title and the labels fixed. I apologize. Okay, there we go; that should now hopefully make it more clear what's going on.
A: So, Gabi, you were explaining what your PR does; you had started a little bit. Do you want to continue that and explain? Yeah.
E: Just a quick explanation: the way NCB keeps the knowledge, when we are writing to disk, of whether we keep using RocksDB or not, is by overloading the freelist-manager type. When we move to NCB, I was setting the value to null, and then later I would put it back to the bitmap allocator.
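A minimal sketch of the mechanism being described, with invented names (the real BlueStore code differs): the freelist-manager type persisted in the superblock doubles as the NCB marker, and the fallback path is what forced the bitmap allocator regardless of configuration.

    #include <iostream>
    #include <string>

    // Hypothetical sketch: "null" as the stored freelist type means NCB is
    // active and no freelist is maintained in RocksDB.
    std::string pick_allocator(const std::string& freelist_type,
                               const std::string& configured /* e.g. "hybrid" */) {
      if (freelist_type == "null") {
        // The defect the PR addresses, as described in the meeting: this
        // path ignored the configured allocator and always chose bitmap.
        return "bitmap";
      }
      return configured;  // honor the configured bluestore_allocator
    }

    int main() {
      std::cout << pick_allocator("null", "hybrid") << "\n";  // prints "bitmap" (the bug)
    }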
E: Without the new PR, which I just made and sent to you, when you use NCB you always get the bitmap allocator; you cannot use anything else. So I don't know how come we are seeing different numbers here, because you are showing that on Quincy, AVL has different numbers than bitmap and than stupid, but I don't see how that's possible, because I think you always...
A: Yeah, I mean, looking at the allocator tests, with Quincy, when you choose hybrid, it's significantly slower than everything else with NCB enabled. So to me, that seems to indicate that this is something that goes beyond just which allocator is being used. But, you know, we can try the...
B: Well, I will chime in; my thinking was, I don't know what's going on with the copy, but with NCB enabled we basically disable all functionality in the freelist manager, so there is only one allocator, exactly the allocator that tracks and gives out disk space, and on shutdown we iterate this allocator just to persist the free space to the store. That's my thinking; that's why I explain it that way.
B: I don't really follow the PR that was proposed, and I did not want to, well, criticize it until I understand it, because I'm definitely not understanding the issue here.
E: Okay, sorry, I was just... I'm now back on, so I might be missing something here. But on startup, when we read the freelist type from disk, from the super/meta column family, I was always setting it, no matter what was inside: if it was NCB, I would always set it to bitmap. But could the freelist type also have values like AVL or hybrid?
D: So the difference, again, is during regular operation, which is write handling. With NCB we do not have any different or additional logic; we just remove the freelist manager from the path. How could it be slower than before?
D: But again, NCB on just removes the freelist manager from the path and removes the DB update.
A: So, Gabi, the plan I had before you had your PR out was to go back and rerun the tests, but then start investigating the behavior of the OSD and BlueStore while it's happening. Maybe I'll revert back to that plan if we don't think that your PR will necessarily fix it. What do you...
A: These tests take about two hours to run, so... just compile. And I had to modify your PR a little bit to get it to apply to Quincy; it looks like it must depend a little bit on stuff that's in master.
B: Well, Mark, I can offer my help today to keep digging into it, but for now I don't have anything; I don't think we should continue talking about it. There is something there; clearly Pacific was faster than Quincy is now. But one thing that stands out: you are testing it on a very fast machine. You have numbers that I would never be able to get.
A: It shouldn't be that bad to just run through a test with it, and then after that, assuming that it hasn't fixed everything, I'll go through... I think I will just run a quick test, seeing if I can investigate what's happening inside the OSD in the test that I actually have right now. But I don't know; maybe it makes sense to see if I can replicate it on a single OSD. We'll see; I'll decide, I guess, when I get there, but one way or another I'll try to do both.
D: Well, in fact, you don't need these eight iterations. A single... well, at least two iterations is enough to see the difference.
A: Yeah, I started out with three. The first time we saw it was with three, and that was almost by accident, because when you just did one, it wasn't consistent: sometimes it looked fine and sometimes it was bad.
D: Well, I'm curious whether, if there is no interleaving of megabyte and 4K chunks, it is still reproducible after a while or not. So is it really this chunk-size interleaving which causes that picture, or just writing for long enough?
A: I don't know for sure. I crafted that benchmark because I have another set of benchmarks, just kind of a standard set that iterates over a couple of different test types, so random reads and random writes at different IO sizes, and that was where it was first showing up. So I figured that this was probably a smaller set of tests I could do that would make it show up, and it very much did. I haven't tried it with just doing block...
A: All right, well, I think we've probably beaten that one to death. I'll keep working on it, Igor and Adam and Gabi; I'll keep you guys in the loop and let you know what I see.
A: All right, yeah, let's move on then. Okay, Gabi and Adam, since you're both here, do you guys want to talk a little bit about your idea for PG log optimization?
E: So it's still not fully baked, and we do wish to get more ideas from everybody who is more familiar with the code. But there are two parts: the PG log and the PG info. The PG info is the easier of the two. We have very few of them, so the number of objects is minimal; it's maybe 100 or 200 PG infos.
E: There are two types of information: there is the global state, like the snap version and such, and there is a lot of statistics and information that we update on every IO. The first type we could maintain, and it's no big deal; we have very few objects. We could probably just keep them in the default column family as they are. I think they already are, but if not, we can put them in the default column family: very few objects, and they are relatively big, but they don't update often.
E: The cost of executing a full RocksDB update on every IO is very expensive. The PG info is a relatively big object, and when you try to push it to RocksDB, you need to serialize all the fields, so everything has to be copied and merged and encoded and decoded and such. And all we get from this is just some information that we have anyway and don't really need to make persistent, because that information is part of the global state.
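A rough sketch of the cost shape being described, with hypothetical stand-in types (the real PG structures are much larger): every per-IO mutation re-encodes and rewrites the whole info blob even though only a counter changed.

    #include <cstdint>
    #include <map>
    #include <string>

    // Hypothetical stand-ins for the real PG structures.
    struct PGInfo {
      uint64_t last_update = 0;
      uint64_t num_objects = 0;
      // ... many more fields in the real structure ...
    };

    // Full serialization of every field: the expensive part being described.
    std::string encode(const PGInfo& i) {
      return std::to_string(i.last_update) + "," + std::to_string(i.num_objects);
    }

    struct KVStore {
      std::map<std::string, std::string> kv;
      void put(const std::string& k, const std::string& v) { kv[k] = v; }
    };

    // Today (as described): each IO pays a full encode plus a DB put.
    void on_io_today(PGInfo& info, KVStore& db) {
      ++info.last_update;
      db.put("pginfo", encode(info));  // whole object re-encoded per IO
    }

    // Proposal: mutate in memory only; persist on clean shutdown, and
    // rebuild the statistics by scanning onodes after a crash.
    void on_io_proposed(PGInfo& info) { ++info.last_update; }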
E: So what we could do instead is make the PG info just keep the big state object, the global state like snap info, while the per-IO updates are kept in memory. When we shut down, we should persist it to a file, like we do with the allocation map, and restore it from the file. And if we had a disaster, we are already iterating over all the onodes...
E: So while iterating, we could accumulate all the statistics that the PG info is holding: how many onodes you've got, how many shards, how many blobs, how much allocated space, how much in-memory space. All this kind of information could be rebuilt on the fly. Recovering the allocation map is what adds a significant amount of time; this thing, I think, is going to add maybe one second to the whole process, because we iterate over the onodes anyway, so you just need to aggregate information.
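A minimal sketch of the rebuild-by-scan idea, with hypothetical names: since the NCB recovery path already walks every onode, the per-PG statistics can be folded into the same pass at near-zero extra cost.

    #include <cstdint>
    #include <vector>

    // Hypothetical onode summary visible during the existing recovery scan.
    struct Onode { uint64_t allocated_bytes; uint32_t num_blobs; };

    struct PGStats { uint64_t objects = 0, blobs = 0, allocated = 0; };

    // Fold statistics into the scan the allocation-map rebuild already does.
    PGStats rebuild_stats(const std::vector<Onode>& onodes) {
      PGStats s;
      for (const auto& o : onodes) {   // one pass, shared with NCB recovery
        s.objects++;
        s.blobs     += o.num_blobs;
        s.allocated += o.allocated_bytes;
      }
      return s;
    }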
E: That's about all we must do, and I don't think there's anything we can't recover from the objects. Josh mentioned today that we already have some kind of a scrub task which can rebuild the PG info state. So again, if this scrub really does a good job, it means that the state can be recreated from everything else that we have in the system, and we should not be paying a per-IO update cost.
H: Oh, I was going to wait until you got to the PG log part. Okay. As far as I recall, the only bits that matter to me there are the last-update and stats. You're right, we could rebuild them from the store, but the primary value has been the ability to cross-check them against the contents of the store; that's how we catch most bugs, whether in the store or the OSD.
H: There are two obvious pieces of information that get updated on every IO. There's the last_update field in the info, which obviously we can rebuild if we have a way of rebuilding the PG log. The other piece, the stats, is indeed duplicative of the contents of the store, but that's been a feature, not a bug, in the past, as it has allowed us to detect a lot of bugs.
H: I said there are two things, obviously, that we update on every IO: last_update, which we can reconstruct if we have a way of reconstructing the PG log, as it's simply the last version of the log. The other is the soft stat information. You're right, it's duplicated with the contents of BlueStore; it's just that the fact that it's duplicative has allowed us in the past...
H: Sorry, the worst mic. So there are two things that obviously get updated on every IO, and some other things that I think are less important. There's the last_update field, which is basically just the version at the end of the log; so obviously, whatever your strategy for the log is would also apply to that, so we'll ignore it for now.
E: Two things I'm saying. Onodes: there is no way around it; they're always going to be committed to RocksDB. But do we need to commit the PG info after every IO?
E: I'm not saying that's the only problem; what I'm saying is it's a problem, and I think we should address the problems one by one. I should probably come back to you with numbers, like we did before, and try to remove the PG info from the write path and see what kind of performance increase or decrease we get. I'm expecting to see some benefit. My take on this is that there are two things; the biggest one is, well, I don't know about the biggest, but one big one is encoding.
E: Encoding the object is a very CPU-intensive operation, and it also consumes memory operations. The second one is the RocksDB update operation; it's also expensive. That thing depends on how many objects you've got in the memtable: the more objects you've got, the more operations you have to do. But really, at the moment everything is speculative; we can just run the test with the PG info update disabled and see what kind of...
A: But I want to highlight what they say all the time: don't look at performance, look at behavior, right? The reason why we see performance gains is because of the single kv sync thread, and any CPU time savings there gives us a win. That's why any of this stuff kind of matters. But it's not really good enough just to look at performance wins; we want to look at the behavior, right? We can get performance wins in other ways.
H: It's genuinely useful having PG stats that are maintained independently of the contents of the store, because it allows us to detect bugs. I'm willing to get rid of them if there's a good reason that they're expensive to maintain, and we expect that reason to be durable across multiple ObjectStore implementations.
E: Okay, so let's go now to the PG log. So, the PG info: I need to measure, because at the moment everything we talk about is just theory. If we see that the performance impact is significant, then we can go back to this and see how much of it we should remove. And then, going back to the PG log: the PG log is a different kind of beast.
E: We really abuse a mechanism. RocksDB was never meant to hold something like the PG log, because RocksDB doesn't assume, by design, that objects are deleted so frequently. I actually tried to push a change to RocksDB to allow an object to be marked as single-occurrence, so that when you delete it, you can delete it inside the memtable. They refused to take it because...
E: The argument was that we abuse the system; the system was not designed for this, so they don't want this optimization. From their perspective, we should be doing something else; we should not be using RocksDB for this, but we do. So the problem is, in RocksDB we create a PG log object and later we create a tombstone.
E: Even if the tombstone finds the object inside the memtable, it's not going to eliminate it; it would still go to disk, to level 0. RocksDB does not support in-memory deletions; nothing is deleted in memory, only on disk when they do a merge. I don't know if the move from the memtable to level 0 does the delete, or maybe from level 0 to level 1...
E: I am not clear about it, but that's something we know is a problem, because tombstones always go all the way down, and since we are pushing so many of them, it becomes expensive. Again, it's a theory. Some performance analysis, I think by Mark and others, showed that you can get about 40 percent extra IOPS.
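To make the tombstone pattern concrete, here is a small RocksDB example of the lifecycle being described (the key names are illustrative only): every PG log entry is a put followed shortly by a delete, and the delete is itself a record, a tombstone, that must flow down through the levels during compaction before the space is reclaimed.

    #include <cassert>
    #include <rocksdb/db.h>

    int main() {
      rocksdb::Options opts;
      opts.create_if_missing = true;
      rocksdb::DB* db = nullptr;
      rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/pglog-demo", &db);
      assert(s.ok());

      // The PG-log-like pattern: short-lived keys, written once, deleted soon.
      s = db->Put(rocksdb::WriteOptions(), "pglog.0001", "entry payload");
      assert(s.ok());

      // This does not erase the key in the memtable; it writes a tombstone
      // that is only resolved during compaction, level by level.
      s = db->Delete(rocksdb::WriteOptions(), "pglog.0001");
      assert(s.ok());

      delete db;
    }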
E: Yeah, so again, it's worth repeating the test, because for everything we do now we need numbers. But I think there's a significant gain to be made there. And the next thing, as a side effect: once we are able to get rid of the PG log and PG info, mainly the PG log in that case, then the next goal would be to shrink the memtable.
E: If you do a write, if you write to disk, and SSDs now use 4K or even 8K sectors, then every 32-byte write would become 8K.
H: Okay, so the traditional solution to that problem would be: you would add an interface to ObjectStore that defines a special buffer object, with a special interface and a special transaction operation, that allows BlueStore in the back end to aggregate these very small writes into the journal and...
D: So, at this point I'm pretty sure we can use that new write-ahead log I'm working on for these sorts of things. And even that issue you mentioned with space amplification, write amplification, is probably not that critical, since you have to merge your PG log updates.
D: So you need to track them as a single transaction: you merge this data before writing to the write-ahead log, and hence you don't need to write just these 16 bytes you mentioned. You would write both the onode metadata update and the PG log update to the write-ahead log, and then not commit that to RocksDB.
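A sketch of the batching being suggested, with hypothetical structures: the onode metadata update and the PG log entry for the same IO travel in one WAL record, so the tiny log entry never costs a device sector by itself.

    #include <string>
    #include <utility>
    #include <vector>

    // Hypothetical WAL record: one IO's onode update and PG log entry
    // are merged into a single transaction before hitting the device.
    struct WalRecord {
      std::vector<std::string> onode_updates;  // metadata deltas
      std::vector<std::string> pglog_entries;  // ~tens of bytes each
    };

    struct Wal {
      std::vector<WalRecord> records;
      void append(WalRecord r) { records.push_back(std::move(r)); }
    };

    void commit_io(Wal& wal, std::string onode_delta, std::string pglog_entry) {
      WalRecord rec;
      rec.onode_updates.push_back(std::move(onode_delta));
      rec.pglog_entries.push_back(std::move(pglog_entry));
      wal.append(std::move(rec));  // one device write; nothing goes to RocksDB here
    }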
E: That should be doable. I mean, I'm not saying everything here is extremely doable; the design, the theory here, is simple. I think the implementation is going to be a bit tricky, because you need to know when you are able to release the write-ahead log, because the write-ahead log is...
H: Wait a minute, that's not quite the proposal I had. My suggestion is that we go ahead... So your concern is that if we only write this to the write-ahead log, then we have to keep the segment of the write-ahead log that we wrote the PG log entry to until that PG log entry gets trimmed, right? Yes, exactly. So...
H: The natural solution to that is to keep a buffer of the portions that are only in the PG log, or in the WAL rather, and write them to a real object when they get trimmed. Yes, it means we write twice, but you get to do the writes in big pieces, so the write amplification component isn't a huge deal. It's slightly less efficient, but it entirely removes the coupling between the PG log entry lifetime and the WAL lifetime.
H: Yeah, so my suggestion is: the ObjectStore interface defines a special object type, a log object, which you write sequentially. Every PG gets one, and every PG structures its log updates as writes, or as appends, to this log object.
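A minimal sketch of that interface shape, with invented names (the real ObjectStore transaction API is different): the log object supports only sequential append and tail trim, which is what lets the backend treat it specially.

    #include <cstdint>
    #include <string>

    // Invented interface sketch: a per-PG log object supporting only
    // sequential append and tail trim, so the backend can aggregate.
    class LogObject {
    public:
      // Append one encoded PG log entry; returns its offset in the log.
      virtual uint64_t append(const std::string& entry) = 0;
      // Drop everything before 'offset' (trimmed PG log entries).
      virtual void trim_to(uint64_t offset) = 0;
      virtual ~LogObject() = default;
    };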
A: Sam, can you talk a little bit more about ensuring that the cadence is right, and making sure that in the fast case we're not ending up in the scenario where we're, you know, just passing things through?
H: So, specifically, what we're talking about is a versioned log, right? Updates to BlueStore add things to the end of the log, or trim from the log, or both. In memory we keep a record of which entries are currently live for each of these objects. Remember, there are only like 100 of them, so we can afford to spend a very small amount of memory state on this.
H: As long as the head and tail bounds are within the current WAL, we don't need to do a writeout; it's only when we trim a WAL entry that contains something in that range that we need to do a writeout. So it's not that we ensure that the cadence is correct; we simply ensure that if it is, we don't do extra work.
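A sketch of that liveness check under assumed names: per log object we track head and tail offsets, and a WAL segment can be retired without a writeout only if no live byte of the log still depends on it.

    #include <cstdint>

    // Hypothetical per-PG tracking: which byte range of the log object is
    // live, and which WAL segment the oldest live byte resides in.
    struct LogLiveness {
      uint64_t tail;             // oldest live offset (advances on trim)
      uint64_t head;             // newest written offset (advances on append)
      uint64_t tail_wal_segment; // WAL segment holding the tail
    };

    // When retiring WAL segment 'seg': if the live range still has data in
    // it, that data must be written out to the real object first.
    bool needs_writeout(const LogLiveness& l, uint64_t seg) {
      return l.tail < l.head && l.tail_wal_segment <= seg;
    }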
H: The PG info now simply remembers the head and the tail of that structure, of where we are in that buffer. We simply write to the head of the buffer and trim from the tail. From BlueStore's point of view, this simply looks like a sequence of random writes to a 16 MB object, right? And what is BlueStore going to do? It's going to write these writes into the WAL, and then later it'll do a deferred write, at least in the most common configuration.
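A sketch of the circular-buffer view (sizes and names assumed): appends wrap within a fixed 16 MB object, so from the backend's perspective they are small writes at rotating offsets.

    #include <cstdint>

    constexpr uint64_t kLogObjectSize = 16ull << 20;  // assumed 16 MB hint

    // Hypothetical: map a monotonically increasing log offset to a position
    // in the fixed-size object; appends simply wrap around.
    struct CircularLog {
      uint64_t head = 0;  // next append offset (monotonic)
      uint64_t tail = 0;  // oldest live offset (monotonic)

      uint64_t append(uint64_t len) {
        uint64_t pos = head % kLogObjectSize;  // where this write lands
        head += len;
        return pos;
      }
      void trim_to(uint64_t offset) { if (offset > tail) tail = offset; }
      bool full(uint64_t len) const { return head + len - tail > kLogObjectSize; }
    };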
H: So the problem with this approach is that we're going to end up writing out a lot of entries that don't turn out to be live, because by the time the WAL gets around to that point, we'll probably already have trimmed that PG log entry, so the write we're doing is pointless. With me so far?
H: Yeah, okay. So the reason why that's happening is that BlueStore doesn't know that we trimmed the entry. It doesn't know that it's an entry at all, right? It just sees some very small write. So to fix that, we change the ObjectStore interface to make explicit the fact that we're sending versioned updates and that we're trimming versioned updates.
H: So now we're still semantically doing the same writes; we're still sending the same writes we would have sent before, but now we also send a piece of metadata that allows BlueStore to know "oh, this is version 2000", and we're also sending things that say "by the way, trim up to version 1796".
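A sketch of what those two extra hints might look like on a transaction, with invented names: the payload write is unchanged, but it now carries its version, and a separate op advertises the trim point.

    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>

    // Invented transaction ops mirroring the description: each log write is
    // tagged with its version, and trims are first-class operations.
    struct VersionedWrite { uint64_t version; std::string payload; };
    struct TrimTo        { uint64_t version; };  // "trim up to version 1796"

    struct Transaction {
      std::vector<VersionedWrite> writes;
      std::vector<TrimTo> trims;

      void log_write(uint64_t v, std::string p) { writes.push_back({v, std::move(p)}); }
      void log_trim(uint64_t v)                 { trims.push_back({v}); }
    };

    // Usage for one IO: append version 2000, advertise trim point 1796.
    //   Transaction t;
    //   t.log_write(2000, encoded_entry);
    //   t.log_trim(1796);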
H: So now, when we're doing our deferred writes, we'll do two things. One, we're not going to try to do deferred writes for objects in this category until as late as possible, until we're actually trimming the entry from the WAL in the first place; and two, we won't bother to do them at all if we've already trimmed the entry in question.
H: So we're still writing to the data payload of an object, and this is still an internal detail of BlueStore. But for a very quickly updated PG, one that updates at a cadence roughly in line with or faster than the WAL, you don't end up doing the writes. On restart, you simply replay the WAL, you rebuild the in-memory buffer representing this object, and when the PG goes to read it, you pull it out of memory.
E: So the idea was to walk over all the objects and find... every object has the version number, and then you see how many versions you miss. If you miss that many versions, somebody needs to push them to you, which in the worst case is a full resync. I mean, it's not going to be a full resync; it will only be for every object you missed.
H: So now we get into reasons why Gluster's recovery strategy doesn't work very well. It gives you no way to detect divergent updates; that's the biggest problem. One of the things the log gives us is the ability to detect when an OSD comes back up after peering has happened and has updates that never actually happened.
H: With scanning the local OSD's object store to rebuild the log, the challenge there is that not all log entries are creations or mutations of objects. We have some log entries that encode things related to tiering; those would still need to be written down. They wouldn't happen in the common path, but they would still need to be supported. But the real...
H: Yes, which means, from a complexity point of view, in addition to still needing to maintain all of the code we currently maintain for the PG log, we would also need to maintain code for rebuilding the mutations. So that may be a viable strategy, but my suggestion is that we find a way to make writing the log down cheap; that would be a lot easier.
B: Yeah, Sam, I've got one question: how much improvement do you think would justify that additional complexity, exactly in the form of, or similar to, what you just proposed?
B: I was just asking for your feeling, because if we do something similar to what you proposed, we will have some additional complexity, and my question was how much improvement in IOPS or throughput would justify that additional complexity.
E: Okay, what about another approach, in which you sync everything you do with everybody around you; you tell them what you do. So for the PG log, you don't commit the entries locally; you use some kind of network memory, so it's going to preserve something, and every 100 entries, once you reach 100, then you do one commit, whatever.
H: Yes, but I'm telling you that it's extremely common. So, as a general rule, I'm going to start by reasoning about the failure case and then worry about optimizations. Reasoning about the failure case here means that all of the OSDs would have to do a scan to find the most recent update, the most recent several updates, in fact.
E: But sorry, I'm talking about performance from a customer perspective, since delete is not all they do; a customer doing just that wouldn't be doing anything, it wouldn't be a customer. But I'm saying, and I'm talking now about performance, not about code writing: when we do a delete, we will use exactly the same code that we use today. If there is no delete, then we could buffer things until we get a delete, and anything else which is complicated requires special handling.
H: Yes, I understand that it'll be faster. So let me observe that that particular proposal does not in any way require network memory.
H: All you need to do is, on startup, rebuild the PG log from what's in BlueStore. Yes, right? Yes. So that's theoretically possible. I mean, I'm not massively opposed to that strategy, because it does not change recovery protocols: once you've booted up, you have the same structures in memory that you normally have, and in that sense it's exactly as persistent as a normal PG log.
H: Yes, right. So that's tricky, because now we actually embed metadata that allows recovery to only recover the mutated portions of an object, which was a pretty substantial recovery performance win, if I recall, and that information is embedded in the PG log entries and cannot be recovered from the objects. So that's a challenge.
H: It's not a property of the write when it happens, but the other...
E: Yeah, if you got opt-in, then you'd have a very inexpensive delta.
H: Yeah, it makes sense. The OSD does not care whether BlueStore is writing the PG log entries to a RocksDB entry or not. I mean, it does at the moment, because it's using the omap interface, but there's no reason we actually have to do that. If we instead write to a circular buffer, with some cleverness to avoid writeout, then from the OSD's point of view what we have is a log object that it just reads like a big buffer on startup; from BlueStore's point of view...
A: Right, Gabi, we're not doing it the way that we're doing it right now. The way that Sam's describing this, we'd be writing this out to, like, a static set of BlueStore objects. Or, you know, it doesn't have to be static, but it could be; it doesn't matter, it's an implementation detail inside BlueStore. This is removing this from the RocksDB WAL.
H: Or, to put it another way: right now, when the PG is created, what we do is create a PG metadata object with an omap region that we use primarily for the PG log. The keys are the eversions, the values are the encoded buffers, and the data payload of the object is, I think, the PG info, I can't remember. But instead, what we'll do is make a new object, a PG log object; we will give BlueStore a 16-megabyte allocation hint, and then from then on, forever...
E: But the object is going... okay, so now we need to maintain... how do we do that? Because the object is going to have PG log update, update, update, PG log delete, delete, delete inside the same object. You might have an uneven number: the number of opening brackets and closing brackets might not be the same, so you still might owe some delete operations.
E: Okay, and you say, since this thing is working... sure, there is some cleverness happening underneath, because somebody needs to manage what happens when stuff is on disk and in memory, and whether you have duplication, and how you remove things. Because once you've committed stuff to its own disk, you need to know that this thing should not go against the disk again. I mean, if you put...
H: Well, I mean, again, the external interface of BlueStore requires that that part work. The challenge is, it won't be efficient, at least not particularly efficient; it will be better than what we have now. I think the first step is to go ahead and implement this thing I'm describing in a PR. It's been done a few times in the past, because RocksDB being slow at doing PG log entry updates is a pretty common observation.
H: So let's go ahead and do that experiment. But there is a way to improve this if we want to make BlueStore more complicated, which we can discuss later. But yeah, the simple version is just building on things BlueStore already does; it would require no changes in BlueStore whatsoever.
H: But that's the difference. I'm just going to tell you, you need to read the RADOS paper. There are a whole bunch of reasons why that won't work, at least not simply. The most obvious one is that you can't actually rebuild the PG log that way, because you won't have the intermediate log entries, so it would break a lot of stuff in the OSD.
H: I'll tell you exactly why we did it: we assumed RocksDB would have a good implementation, which it sort of does, in some ways; RocksDB does a moderately okay job of this. But once you saturate the OSD hard enough, as you guys have observed, the tombstones make garbage collection a problem. So it turned out to be a bad choice, but it's an easily undone one.
A: All right, well, have we beaten this to death for this meeting?
H: Oh no, I don't want to do that, because if we do it that way, SeaStore has to duplicate that trick, and it seems really fragile.
H: The second piece is that the omap isn't actually constrained in the way we want it constrained. There's nothing stopping the OSD from removing omap entries from the middle of a PG log, which, for what we want to do here, would be bad; that would break any optimization you were trying to do. So that imposes a restriction on the OSD's behavior that wouldn't be obvious from the OSD code. Since we actually control the code in both places, I'd really rather just fix it. Does that happen?
H: If you want to poke something like this up through RADOS, that's a different discussion; that's really a question about protocols. Okay, that's not quite the same thing, but yeah, okay, it would be an important building block, yeah. Any RADOS users that are attempting to use the omap to do exactly this would be running into the same problems the PG log is: they'd be generating spurious tombstones we don't really want, and otherwise generating pointless traffic in RocksDB. So, yes.
H: There's a simple version that shouldn't be very complicated; I'm wondering if there's a way to do better, but no, I don't think it's a big deal. Okay, what you have in your head right now as the way this would work in RADOS, at least at an interface level, is probably correct.