From YouTube: 2019-12-10 :: Crimson SeaStor OSD Weekly Meeting
A
Okay, last week I was on PTO, and I just started working on reviewing brunei's PR to bring the support of stock to crimson. I think I will finish reviewing the PRs preparing the alienized BlueStore soon. But I think you could amend the commit message with the details and the strategy of your change; that would be helpful.
C
Yeah, so for me, like same as last week, I have been working on CBT. I'm actually at the debugging stage right now, just figuring some stuff out. I'm really close to finishing and being able to present the CPU cycles per operation; then I'll make a pull request. Currently it's on my fork and I'm just fixing some problems, and after that I will continue working closely with Radek.
B
So I had to disable that logger in the alienized BlueStore, and then I replaced the code. I got some compile errors when I used the Seastar allocator, so I have asked Radek to help me solve that problem.
B
When I use the Seastar allocator there are some compile errors that I don't really understand; currently I just read the code. There was a "use of deleted constructor" error; that one was fixed with Radek's help. Radek told me how to fix it, but there are still some build errors when I use the Seastar allocator.
B
So I think, once the build errors are fixed, maybe the release version will work, because the memory issue, I think, I have solved.
C
Hello, I'm on a video, so I'm working in a kind of tourist mode here. Okay, first of all, I'm working on a problem in the CBT op metric, and I also posted a new series of patches for the persister work according to Harvey's comments, and added some new comments there.
C
Yeah, and I need to backport the non-Crimson change; I need to backport the zeroization, the typicalization, to Nautilus. That's all.
C
I also added this better shutdown code, and all of this is working. If I get the comments right, I'll deliver everything today, including the clang-format script I used for these files. So if anybody wants to tweak the clang-format file, it will be part of the commit message. Other than that, I'm reading and trying to learn things related to Seastar, and now this document.
D
Sam, do you want to do the status things first? That way I can just dig right away into the document.
D
For crimson, at least, I've been mostly working on the SeaStore design doc. I did have a quick question; well, I guess I'll talk about that at the end. Let me see if I can get this shared, half a sec.
D
Okay, so this won't take super long, because the overall concept is pretty high level at this point. But the right sort of design guideline to think of here is btrfs on a log-structured file system.
D
So if you know anything about btrfs, it uses essentially a wandering-tree sort of deal with a single superblock that it periodically writes out so you can find the root, but it does not attempt to do the log-structured thing where it always cleans a segment before it writes back to it. So we're going to make a few changes so that we are able to do that part efficiently, but other than that, the overall design should look at least a little bit like btrfs.
D
It clearly doesn't share any structures with btrfs, as we are not implementing a file system, and there are a lot of things we don't need to do, but the overall tree structure is analogous, I think.
D
Who just joined? ... No? Anyway, welcome. Okay, so the first part is just some details about the physical layout on disk. I think this shouldn't be terribly surprising.
D
The goal here is twofold. One, we want to be able to physically read the disk and interpret which things are data blocks versus which things are metadata, just at a coarse level. So each... oh, and I made a few choices for terminology: rather than talking about segments or zones, I'm talking about streams, but for "streams" you can definitely read "zone" if you want, that's fine; blocks mean aligned... go ahead.
D
Here it's not necessarily just two, and this is per shard, by the way. So if we have, let's say, 32 cores on a disk, that would mean 64 streams open, two per shard. But I really want to...
D
I am not putting a flag in the ground about this. All I'm trying to point out is that there are at least two different regimes: disks, where we probably don't want more than one, and SSDs, where more than one is probably fine. Depending on exactly how many streams we can productively have open, we will make different choices, so the metadata layout needs to be generally flexible enough to permit writing to more than one stream at once.
D
Does that make sense? Okay, thanks. Yeah, thanks. I'm specifically trying not to introduce details of the physical structure of the disk beyond the assumptions we actually care about, which are sequential access, pretty much, and that reads are generally cheaper than writes.
D
I'm going to point out the following things that are maybe not super obvious. For a single shard (and for the rest of this conversation we'll just talk about a single reactor, so a single core), all operations are serial, and we're not going to make statements about what things transactions can span. So all transaction deltas have to live in a single stream. I'll let that settle for a bit, but it does not follow that every block has to be in the same stream.
A
If I understand correctly, what it means is that a transaction should only write to a single stream, instead of scattering different changes to different streams just because they belong to different trees.
D
By contrast, deltas are what we would normally consider in a file system to be a journal, and this part is sort of a departure from btrfs; as far as I know, btrfs doesn't work exactly like this. I like this design because it makes it easy to overlay different internal structures. I'll talk about that a bit more.
D
But the idea is that if we want to write out a transaction that changes the allocation tree, or, let's say, we want to write out a transaction that creates an object: what do we have to touch? We have to create a new block containing the data for that object; let's say it's like a 2K object.
D
So all of those different updates need to land in the same data block and get written at the same time, because they have to happen atomically: during recovery we have to recover either all of them or none of them. Does that make sense? It's why journals are always written to a contiguous segment of disk.
A
I'm not sure I agree that it's essential to use a single monotonically increasing sequence number for a transaction.
D
If you go look at RocksDB, what you do is: the journal agent inside of the file system or storage system writes to a declared contiguous portion of disk and then writes a checksum at the end. When you're doing recovery, you read journal segments until you find something that doesn't checksum right, and that's when you know you've reached the end of the journal. That way, every journal transaction you read is either all there or you completely ignore the whole thing; you never apply a part.
D
That's just sort of basic WAL stuff. So in this design, that unit is a record containing all of the deltas that comprise the transaction, and, optionally, any blocks that you want to include. You may have chosen to write out blocks to another stream prior to this commit, but those blocks won't be committed until you commit the deltas that link them into the tree.
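As a sketch of the commit unit being described (all names invented here, not actual SeaStore code): a record carries the transaction's deltas plus any blocks written alongside, and is sealed with a checksum over the whole thing, which is what makes replay all-or-nothing.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical record layout: the unit of commit. Deltas are
    // unaligned (they get decoded on replay anyway); blocks are aligned
    // so they can be loaded straight into the cache.
    struct record_t {
      uint64_t journal_seq;         // position in the journal
      std::vector<uint8_t> deltas;  // encoded mutations for this txn
      std::vector<uint8_t> blocks;  // optional blocks committed with it
      uint32_t crc = 0;             // covers everything above
    };

    // Any real checksum works; FNV-1a stands in to keep this
    // self-contained.
    inline uint32_t fnv1a(uint32_t h, const uint8_t* p, size_t n) {
      for (size_t i = 0; i < n; ++i) { h ^= p[i]; h *= 16777619u; }
      return h;
    }

    // Seal before writing. Recovery recomputes the checksum; a mismatch
    // means a torn record, which marks the end of the journal.
    inline void seal(record_t& r) {
      uint32_t h = fnv1a(2166136261u,
                         reinterpret_cast<const uint8_t*>(&r.journal_seq),
                         sizeof(r.journal_seq));
      h = fnv1a(h, r.deltas.data(), r.deltas.size());
      h = fnv1a(h, r.blocks.data(), r.blocks.size());
      r.crc = h;
    }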
D
It's more like it's forward-allocated; different file systems have different terminology. It's definitely free: we definitely wrote a block that is available for writing; we haven't overwritten anything, but we haven't updated the tree so that it can actually be found, so that part is what the transaction that commits it is for. To do that, we need to update the leaf in the allocation tree where the old block used to be.
D
We need to update the leaf in the allocation tree where the new block is now. We also need to update its parent pointer, let's say in the extent tree for a particular onode. So those three changes all have to be written atomically for the move to be atomic.
D
Otherwise we could end up with two pointers to the same place, or a block that is part of the tree but which we consider to be free, either of which would be really bad. Those would be, you know, corruption.
C
Okay, and then, two questions. First: don't we end up with a block that is forever lost to us? Any further work on that block will require this...
C
Yeah, what I was saying is that this block should be marked as only partially allocated, because... yeah, but we...
D
...if it was sealed properly, that is, we got to the end, we wrote a checksum, and we closed it. And if not, we're going to walk backwards or forwards through the stream from the first record until we find an incomplete record; then we'll close the stream with information indicating where it really ended, and that's it. So that block that we wrote will not be in the allocation tree, so it's going to be free; it's no different from a block we wrote and then overwrote later.
D
So the fact that we wrote to the block does not mean that we can actually find it again. We'll keep in-memory state saying that we did in fact write to this block, so future writes should be after it, but that's it; we don't have to remember anything else.
D
This is the core concept for the rest of it: each one of these deltas is basically an atomic modification to a block. So it's worth, I guess, talking about how recovery works. When we recover, first we scan each stream quickly to find all of the open ones; then we figure out where the end of the most recent journal is; then we replay from the beginning of that journal...
D
...until we get to the end of the journal. Assuming that algorithm works correctly, every block that was dirty when the crash happened should now be present in cache again, with the exception of anything that was only dirty in memory because we hadn't actually committed the write yet. And it should be atomically correct: we shouldn't have any partial transactions, because we won't have read any partial records, because they're checksummed.
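A minimal sketch of that recovery loop, again with invented names; the important part is that a bad checksum ends replay, so a torn tail is ignored wholly:

    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <vector>

    struct delta_t {
      uint64_t paddr;               // block this delta mutates
      std::vector<uint8_t> bytes;   // encoded mutation
    };

    struct record_t {
      std::vector<delta_t> deltas;
      bool crc_ok;                  // result of re-checksumming on read
    };

    struct JournalReader {          // yields records in journal order
      std::vector<record_t> records;
      size_t pos = 0;
      std::optional<record_t> next() {
        if (pos == records.size()) return std::nullopt;
        return records[pos++];
      }
    };

    // Replay from the head of the newest journal epoch: every block that
    // was dirty at crash time becomes dirty in cache again.
    template <typename Cache>
    void replay(JournalReader& j, Cache& cache) {
      while (auto rec = j.next()) {
        if (!rec->crc_ok) break;              // torn tail: end of journal
        for (auto& d : rec->deltas) {
          auto& block = cache.load(d.paddr);  // read block if not cached
          cache.apply(block, d);              // redo the mutation
        }
      }
    }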
D
That's just one of the immutable laws of this kind of storage. The good news is it's not so bad: we can batch up multiple transactions into one record if we happen to have them available. So if the OSD is actually busy, we don't have to write out an empty record; we can write out one with several different transactions baked into it.
D
And the deltas don't have to be aligned. I'm going to point out that the data blocks have to be, because we want to load them directly into memory and it's inefficient to load unaligned stuff from NVMe; but deltas we have to decode anyway, so we don't really care about that.
D
That is, it knows if it's a... it just has a little integer that says: am I a B-tree interior node? Am I an extent? Am I a block? Am I whatever? Then code can go look at the block itself and figure out what it needs to do with it. In general, however, we figure out whether a block is in use by using a secondary structure that I haven't introduced yet.
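Sketching the little type tag just mentioned, with hypothetical names; note that liveness is deliberately not answered here but by the secondary structure (the allocation tree):

    #include <cstdint>

    // Hypothetical per-block header. The tag tells code reading the disk
    // what the block is at a coarse level; it does not say whether the
    // block is in use.
    enum class block_type_t : uint8_t {
      ROOT,             // special: meaningful even though it is not in
                        // the allocation tree
      ALLOC_TREE_NODE,  // special: the allocation tree can't contain
                        // itself
      ONODE_TREE_NODE,  // B-tree node of the onode tree
      EXTENT_NODE,      // extent-tree node
      DATA,             // object data
    };

    struct block_header_t {
      block_type_t type;
      // ... per-type metadata follows ...
    };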
D
That hobject, if you've looked at the OSD code, translates to a seven-tuple that begins with a hash and a pool id. So we traverse the hobject tree from the root that we have in memory down until we get to the onode, which has the extent tree in it. We then traverse the extent tree until we get to offset 256, which should point to an actual on-disk block containing the data for that block. We then read the block.
D
Some of this may be in cache; some of it won't be. Any prefix of the tree will always be in cache, so if we have any block in cache, we have all of its parents in cache too; not an uncommon design choice, honestly. So much of that traversal will be free, and then the rest of it turns into log(n) reads.
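Compressing that read path into one hypothetical helper (the types are placeholders, not real crimson interfaces): any prefix already cached is free, and the remainder costs O(log n) reads.

    #include <cstdint>

    // hobject -> onode -> extent tree -> data block.
    template <typename Cache, typename OnodeTree, typename Hobject>
    const void* read_block(Cache& cache, OnodeTree& onodes,
                           const Hobject& hoid, uint64_t offset) {
      auto onode = onodes.find(hoid);                 // onode B-tree
      auto paddr = onode.extent_tree().find(offset);  // extent tree
      return cache.read(paddr);                       // cache hit or disk
    }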
C
Okay, and here we have a problem: we have a larger size for the log, which is written to disk, compared to the cache that we have.
D
It's no different from doing writeback in any other file system; a little more expensive, a little less expensive, it depends. So let's say we have a dirty block and we're experiencing cache pressure. What do we do? We start at the leaf node and we begin writing out dirty nodes.
D
Each time we write out a node, we of course dirty the parent, because we have to change its location. But, generally speaking, we'll write out in a breadth-first... my bad.
D
Whichever one it is, the one where you do the leaves first (I want to say breadth-first, but whatever): we write out leaves before parents. That way, if we do that properly, we'll tend to amortize several updates per parent when we finally do the write-up. It sounds like a lot of write amplification, and it is, but the truth is that every metadata technique involves a lot of write amplification.
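A sketch of that writeback order with an invented node type; the point is just that relocating a child re-dirties its parent, so flushing leaves first amortizes several child updates into one parent rewrite:

    #include <vector>

    struct node_t {
      bool dirty = false;
      node_t* parent = nullptr;
      std::vector<node_t*> children;
    };

    // Post-order flush under cache pressure: children before parents.
    template <typename WriteFn>
    void flush(node_t* n, WriteFn&& write_out) {
      for (node_t* c : n->children)
        flush(c, write_out);          // leaves first
      if (n->dirty) {
        write_out(n);                 // relocates n on disk...
        n->dirty = false;
        if (n->parent)
          n->parent->dirty = true;    // ...so its parent must be updated
      }
    }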
C
I have probably a more basic question: if you have cache pressure and you decide to write a block and all its pointers...
A
Yeah, that's good until... that node does not need to be split or anything, okay.
D
And that's a real modification; that's an actual real thing. So when we do that, we get to do the same thing: we record a delta for... like, let's say we need to do a B-tree split. That works fine: the code that controls the B-tree allocates two new blocks with the two split halves and writes them to wherever, and then, either as part of that transaction or... sorry.
D
Precisely, and that's why I'm choosing to be really agnostic about how the deltas work. It means that, as we write the tree structures that we use for our metadata lookups, we don't have to reinvent... we don't have to write a whole new allocator each time; we can just use this one, and it's easy to combine all of the different transactions into one transaction.
D
Any other questions? Okay, so the rest of this I'm not attached to; I'm just trying to work out which on-disk state I need to do garbage collection. But, broadly speaking, there are metadata structures that need to be in the stream so we can figure out things like... So for the journal stream: since we only have one journal per CPU, we could just number them. Each journal epoch gets a number, and when we start writing out the next journal epoch...
D
...eventually we finish, and we write a record in that stream that says: okay, this is the real new journal now. That implicitly deallocates all previous journals, which means that all we ever have to keep track of, for a stream, to find out whether its deltas are still relevant...
D
...is the newest journal id that has deltas in that stream. Because if we're on journal sequence 10 and we're looking at a stream that only has journal ids up to nine, then we know that none of the deltas matter, and we don't even have to look at them.
D
I'm sorry, that's what I meant to say; that's a good question, I meant epoch, yeah. So if we're on epoch 10 and we're looking at a stream where, within that stream, it has no deltas for a journal sequence above six, then we know that none of those deltas matter; we only have to worry about garbage collecting blocks.
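The per-stream bookkeeping this implies is tiny; as a sketch (names invented):

    #include <cstdint>

    // Newest journal epoch that wrote deltas into this stream. If we're
    // on epoch 10 and a stream's deltas stop at epoch 6, none of them
    // can matter for replay; only the stream's blocks still need GC.
    struct stream_info_t {
      uint64_t newest_delta_epoch = 0;
    };

    inline bool deltas_relevant(const stream_info_t& s,
                                uint64_t current_epoch) {
      return s.newest_delta_epoch >= current_epoch;
    }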
D
Okay. I will also point out that, just by nature, the beginning of every journal must contain the roots of the tree.
D
So necessarily there's a top of the tree, which I'll get to in just a second. It's basically the address of the top of the onode tree and the top of the allocation tree. So we need something that references those from the beginning of the journal, because other than that they're always in memory.
D
We will record special deltas that tell us where the roots are, but honestly that's not even necessary; we don't ever have to move the root. If you think about it, because the root is necessarily very, very small, a delta to the root and the root itself are the same thing.
D
So I think, instead, what we're going to do is mark... like, this header thing here: we'll have some metadata for each block, and there will be two block types that are special. Root blocks are special, and you have to look at them whether they're in the allocation table or not, because they won't be; and allocation tree blocks you have to look at, because allocation tree blocks are not themselves in the allocation tree. But that's just a detail of how to bootstrap the garbage collector.
D
So, at a high level, if we look at the logical structure (all of this is overlaid on what I described before): the root is a single block, and it has references...
D
It has physical references to the top of the onode tree and the top of the allocation tree. The onode tree is a map from hobject_t to onode_t; that is, it is a B-tree with leaf elements that are encoded onodes. And the allocation tree is a tree with keys that are a tuple of a stream id and a stream offset, which is to say a physical address, and the tree leaves are an allocation unit, or an allocation-information thing, which needs to have a few things in it.
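The logical structure, sketched as types; these names are illustrative, not from the design doc itself:

    #include <cstdint>

    struct paddr_t {            // physical address
      uint32_t stream_id;
      uint64_t stream_offset;
    };

    // The root block: physical references to the two tree roots.
    struct root_block_t {
      paddr_t onode_tree_root;  // B-tree: hobject_t -> encoded onode
      paddr_t alloc_tree_root;  // tree keyed by paddr_t
    };

    // Leaf value of the allocation tree: enough backpointer information
    // to find every block that references this physical address (for an
    // object's data block, just the object name and offset within it).
    struct alloc_info_t {
      uint64_t length;
      // ... backpointers ...
    };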
D
I'm not going to go into a lot of detail about that. What's important is that if you look in the allocation tree and find a physical address, the information recorded as back pointers is enough to find every block that points to it.
D
For a data block for an object, it basically is just the object name and the offset within that object, because, if you think about it, descending from the root with that is enough to find the block.
D
Moreover, the allocation tree is conservative, or rather is consistent, at all times. So if it's in the allocation tree, it really is referenced by the onode tree; and the flip of that is that any non-root, non-allocation-tree block that is not in the allocation tree is free to be reused. That's really important, because, if you remember, before I mentioned that for every byte we write, we need to garbage collect one byte.
D
But the hope, if we've done our jobs well and the workload has behaved, is that by the time we choose to clean a stream, most of the blocks in the stream have already been replaced, because we're mostly trying to clean streams that have short-lived data.
D
For the most part, streams that have long-lived data we're not going to want to touch; they'll only become free very slowly, so we wouldn't touch those if we don't have to. So, mostly, when everything is working well, we're garbage collecting streams where most of the blocks have already been replaced because they were overwritten or something.
D
So it's a bunch of CPU and read overhead to answer a question we were already pretty sure was false. So it's, in my opinion, probably worth maintaining a secondary structure, namely this allocation tree, that gives us an almost constant-time ability to scan forward in a stream to find the next live block.
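A sketch of why that scan is cheap: the allocation tree is ordered by physical address, so "next live block at or after this offset" is a single ordered lookup. std::map stands in for the on-disk tree here:

    #include <cstdint>
    #include <map>
    #include <optional>
    #include <utility>

    using addr_t = std::pair<uint32_t, uint64_t>;  // (stream id, offset)

    // Find the next live block in a stream while cleaning it, without
    // re-deriving liveness from the logical trees.
    inline std::optional<addr_t>
    next_live(const std::map<addr_t, int /* alloc_info_t */>& alloc_tree,
              uint32_t stream, uint64_t offset) {
      auto it = alloc_tree.lower_bound({stream, offset});
      if (it == alloc_tree.end() || it->first.first != stream)
        return std::nullopt;   // nothing live left in this stream
      return it->first;
    }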
D
I was going to do that in the onode; I figure the onode will just track mutation-region recency.
D
Right, because when we're traversing down to the onode, we have to load the onode, so we can keep the most recent write time in the onode. That way, when we're doing the relocation, we'd have a pretty good sense of when the object was written and when it was last rewritten. But, yeah, we can embed additional information in the allocation tree too; I'm open to all of that.
D
Well, no, sorry; when I say RGW objects, I mean the actual data objects.
D
That's a good question, though; that's a tuning question we'll have to think about. And if we want to, we can also stamp these blocks, as we write them, with the timestamp we wrote them at, and then when we pull them back off disk we can go: oh okay, this is however many seconds old.
D
So if we did want to make that happen, we could; that would allow us to detect, like, cold subtrees. So let's say someone's using RGW for timestamped data, and it happens to be the case that their object names are sorted by time.
D
...like user-provided file names with no locality properties of any kind, in which case the key space just kind of grows, with keys inserted wherever the hell they want to; and in that case the B-tree as a whole actually mutates pretty quickly, and none of the nodes would be stable. So both are possible.
D
Okay. The btrfs version of this is the backref tree, if you're interested; in fact, I'm going to link the paper in here after this meeting, I forgot to. It's a good paper, or, well, it's an okay paper; it does an okay job of explaining the concerns. Anyway, right.
D
This GC part is really the Achilles' heel of every log-structured file system ever created, and we're making sort of one other change that systems usually don't do: I'm not really assuming the existence of asynchronous garbage collection. I am assuming that we will actually do it inline with every client operation.
D
If all used space is live, then there's no garbage collection work to do unless you're out of space; and if most of your used space is dead, you should do some garbage collection work. So we could create a range where you're allowed to have up to 20 percent of dead data, provided you have the free space, and when we're not servicing transactions we'll go ahead and do GC.
D
But if you do have a consistent, permanent client workload, then you'll quickly get to that 80 percent threshold, and every single client transaction will also have GC work mixed in. That's the reason why I really wanted this sort of online GC structure, where all we have to do is maintain a very little bit of state about the allocation tree and the stream we're currently cleaning, and we can always squeeze in a little bit of extra work into each transaction that moves a block when we need to.
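A sketch of that policy; the 20 percent figure is from the discussion, while the rest of the numbers are illustrative tunables, not decided values:

    #include <cstdint>

    struct gc_policy_t {
      uint64_t used_bytes = 0;
      uint64_t dead_bytes = 0;   // used but no longer live

      // Past the allowance, GC work is mixed into client transactions;
      // below it, GC only runs when the OSD is otherwise idle.
      bool must_gc_inline() const {
        return dead_bytes * 5 > used_bytes;   // more than 20% dead
      }

      // Units of cleaning work to piggyback on one client transaction,
      // growing as dead space accumulates.
      uint64_t inline_work_units() const {
        if (!must_gc_inline()) return 0;
        return 1 + dead_bytes * 5 / used_bytes;
      }
    };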
D
Does that make sense to you guys, both how we're doing it and why? Yes? Cool. I'm also really hoping this will help, because right now we're having this problem with BlueStore where we don't have a good way of assigning a cost to a transaction, because it's wholly dependent on BlueStore's internal state: we have no idea whether it's about to do a compaction or not, and we don't know how much work it's built up. This way it should be really easy to account; we'll always know.
D
...that just really exploits the pmem to the greatest amount possible. So, for that reason, if you look down at components, I plan on abstracting all of these things out. Like the allocation tree: with this design as outlined here, it has a dependency on the stream manager stuff, but that doesn't have to be true if it's backed by pmem.
D
So, in that case, you take the allocator part and the onode part, and possibly even the omap part, depending, and you would replace them with implementations that use pmem instead. GC would still consume the allocator to figure out which blocks need to be cleaned, but it would never have to clean any of the metadata trees; it would only ever be cleaning actual data blocks.
D
I know it's fuzzy, but I hope that was enough of an answer for how we plan to evolve in that direction.
D
That's what I was going to do next: I was going to start actually writing code, well, at least interfaces, and start writing the stream manager component, and then develop a Gantt chart for which parts can be developed in parallel. In other words, the brief answer to your question is that nothing I've addressed here has anything to do with persistent memory.
D
You're correct, the journal would be another component that could live up there; I forgot to add a journal part.
D
And the other thing is the cache: if the dirty blocks in cache are actually persistent, you don't really have to write them back.
D
The cache is obviously very important for how this works; I was a little sketchy there, but the goal clearly is for each core to have its own pool of memory and...
D
...allocate strictly out of that, and then we'll use some variant of, or perhaps literally, Mark Nelson's cache fairness system to ensure that, for instance, the onode cache and the block cache, and any additional caches we choose to create, are able to share the same pool of memory fairly.
D
Did anyone else have radically different strategies they wanted to talk about? It's worth not rat-holing on... or, rather, it's important not to get stuck on a single design too early in the process, and we are not late in the process; we can still choose a different design, though it would be good to voice it now. I will point out that none of this actually cares about B-trees.
D
And, like I said, the goal is for these components (the GC, the journal, the allocator, the omap) all to be pluggable, so we can create more than one implementation of the way the omap works, as long as it respects some basic assumptions about how it interacts with the allocator. But I don't think those assumptions are particularly onerous, so I think it should be fine.
D
They both actually do have to do range searches or range scans, as it turns out. The difference is that we know a lot about the key for the onode tree. For instance, we know the first 48... no, 64 bits will be the pool id and the hash, so if we wanted to, the top elements of the tree could be really heavily optimized for the fact that they're numeric keys.
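A sketch of exploiting that key structure near the top of the tree; the field widths and packing here are assumptions, not the real hobject_t layout:

    #include <cstdint>

    // Fixed-width numeric prefix of an onode-tree key: interior nodes
    // near the root can compare plain integers and never touch the
    // variable-length parts (namespace, object name, ...) that follow.
    struct onode_key_prefix_t {
      uint32_t pool_id;   // hypothetical packing into 64 bits total
      uint32_t hash;
    };

    inline bool operator<(const onode_key_prefix_t& a,
                          const onode_key_prefix_t& b) {
      if (a.pool_id != b.pool_id) return a.pool_id < b.pool_id;
      return a.hash < b.hash;
    }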
D
I suspect that will be a worthwhile optimization, because RGW bucket and item names are often hierarchical, and so they'll often share, like, path components and such that we don't necessarily want to keep copying around. But those are all details of how we implement the tree, and it is correct that it has a large bearing on our write amplification and efficiency, and also, I want to point out, perhaps an even larger bearing on our CPU efficiency as we move these things on and off of disk and construct deltas for them. So that'll be another concern.
D
I'll also point out (I think I skipped this part earlier) that deltas do not have to be byte-range modifications to blocks. The idea here is for every block to be typed, so when it gets read off disk the cache will know what type it is. So when we're going to apply a delta, it's not just applying a delta to a byte range.
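So a delta can be a logical, type-aware mutation rather than a byte patch; a dispatch sketch with invented types:

    #include <cstdint>
    #include <vector>

    enum class block_type_t : uint8_t { ONODE_NODE, ALLOC_NODE, DATA };

    struct block_t {
      block_type_t type;
      std::vector<uint8_t> bytes;
    };

    struct delta_t {
      std::vector<uint8_t> payload;  // encoded, type-specific mutation
    };

    // Apply a delta according to the block's type, e.g. "insert key K
    // into this B-tree node" rather than "overwrite bytes [a, b)".
    template <typename OnodeOps, typename AllocOps>
    void apply_delta(block_t& b, const delta_t& d,
                     OnodeOps& onode_ops, AllocOps& alloc_ops) {
      switch (b.type) {
      case block_type_t::ONODE_NODE: onode_ops.apply(b, d); break;
      case block_type_t::ALLOC_NODE: alloc_ops.apply(b, d); break;
      case block_type_t::DATA: /* data blocks are written whole */ break;
      }
    }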
D
Okay, so I'm going to start sort of working on this for the rest of the month, and I'll hopefully have something to report next week, although probably not a ton, I expect.
D
I'm going to start by writing this part, the stream layout: there's a stream manager that needs to actually do writes, so I'm going to write that part. Then I'm going to start working on the cache, the memory component, because that's what interprets blocks off of disk. By then I should have had to make decisions about enough things that this document will be wrong, so I'll fix it, and then we'll move from there.