From YouTube: Ceph Code Walkthroughs: SeaStore
Description
Find future Ceph Code Walkthroughs: https://tracker.ceph.com/projects/ceph/wiki/Code_Walkthroughs
We expect to see a lot of cost-focused capacity workloads using only QLC or ZNS flash devices or hard disks. We also want to be able to serve highly performance-oriented workloads using only fast NVMe devices. And finally, I expect that there will be a lot of demand for a combination of high-endurance, low-capacity devices with low-endurance, high-capacity devices. In other words, we'd really like SeaStore to be able to do tiering internally. It would take a lot of pressure off of the design parameters for setting up Ceph clusters, and it would allow people to better combine RGW-type capacity pools and RBD-type throughput-oriented pools without having to be quite so specific with the cluster setup.
So internally to the OSD, there is an interface called the ObjectStore. When I say "the object store" in this talk, that is what I am talking about, not anything else. The ObjectStore is the interface by which the OSD talks to its local storage. It has some properties: it is transactional, and it is a flat object namespace. Object names may be large, because RGW uses them to do direct lookups, so user-side S3 names may translate to ObjectStore object names within the OSD directly.
Each object contains a key-to-value mapping called the omap, as well as a data payload. We use that key-to-value mapping for things like CephFS directories, RGW bucket indexes, and certain kinds of RBD metadata.
It also needs to support copy-on-write object clones to support RADOS snapshots, and therefore CephFS and RBD snapshots. And we need to be able to support efficient, ordered listing of both the omap and object namespaces, for different reasons: the omap because it's part of the interface, and the object namespace because both scrub and recovery rely on being able to efficiently iterate over the set of objects in the same order on all the replicas.
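
To make those properties concrete, here's a minimal C++ sketch of the interface surface being described. The names are illustrative, not the actual declarations from the Ceph tree:

```cpp
// Illustrative sketch only -- not the real ObjectStore declarations.
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Transaction;  // a batch of mutations, applied atomically

// An object is a data payload plus a key-to-value "omap".
struct ObjectLike {
  std::vector<uint8_t> data;
  std::map<std::string, std::string> omap;  // sorted: ordered listing
};

class ObjectStoreLike {
public:
  virtual ~ObjectStoreLike() = default;
  // Transactional: every mutation in t commits, or none do.
  virtual int queue_transaction(Transaction& t) = 0;
  // Flat namespace with efficient ordered listing, in the same order on
  // all replicas -- what scrub and recovery rely on.
  virtual std::vector<std::string> list_objects(const std::string& after,
                                                std::size_t max) = 0;
  // Ordered omap listing is part of the interface contract too.
  virtual std::map<std::string, std::string> list_omap(
      const std::string& object, const std::string& after,
      std::size_t max) = 0;
};
```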
The internal B-tree nodes refer to their children by direct location on disk, and we use these two data structures to maintain an LBA indirection, so that the trees on the left side here, the onode tree, the omap B-tree, and the actual extents containing object data, do not use physical offsets. They use logical offsets.
So the last sort of important bit of layout for SeaStore is the way we do journaling. If you're familiar with the old FileStore or, to a lesser extent, BlueStore, the journaling mechanism, I believe, is expressed primarily in terms of the transactions the user submits, but with SeaStore we actually do all consistency at a block level. So SeaStore's consistency semantics are actually totally independent of the ObjectStore interface; they work in terms of blocks.
So if it's a B-tree block, for instance in the LBA tree or the omap tree or whatever, the deltas will look like "insert key", "remove key", that kind of thing, whereas for data blocks it'll just be changes to portions of the data range. After that come the newly written logical and physical blocks, and taken as a unit, the entire record represents one unitary change to SeaStore's committed state, including the changes to the LBA tree required to find the new blocks.
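
As a rough illustration of what a record holds, here's a simplified sketch; the real types and encoding in the SeaStore sources differ in detail:

```cpp
// Illustrative sketch of the record/delta idea, not SeaStore's encoding.
#include <cstdint>
#include <vector>

using laddr_t = uint64_t;  // logical block address

// A delta describes a mutation to an existing block in terms the owning
// extent type knows how to replay, e.g. "insert key" for a B-tree node.
struct delta_t {
  laddr_t target;                // the block being mutated
  uint32_t extent_type;          // which code interprets the payload
  std::vector<uint8_t> payload;  // e.g. encoded {op, key, value}
};

// A freshly written block is journaled in full.
struct fresh_extent_t {
  laddr_t laddr;
  std::vector<uint8_t> data;
};

// Taken as a unit, one record is one atomic change to committed state,
// including the LBA-tree updates needed to find the new blocks.
struct record_t {
  std::vector<delta_t> deltas;
  std::vector<fresh_extent_t> extents;
};
```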
So the architecture of SeaStore looks something like this. Up at the top there is the SeaStore class itself, which implements Crimson's version of the ObjectStore interface. Below this we have the onode manager, the omap manager, the object data handler, and a few other things responsible for dealing with metadata structures specific to the ObjectStore interface.
These all deal in terms of logical addresses and are largely oblivious to the underlying storage. Below that there's a big interface barrier called the TransactionManager. This is the layer that handles transactions and extent reads and writes, and it is capable of placing extents among different backing devices and doing garbage collection and tiering between them.
Below that we have an ExtentPlacementManager, responsible for deciding which device an extent actually goes onto, with AsyncCleaner and device implementations for dealing with the actual devices themselves and maintaining free space. And then we have an LBAManager responsible for the logical-to-physical mapping I described, and a Journal responsible for providing transactional consistency.
So without further ado: my strategy here is going to be to touch on each component, show you the files responsible for it, and then, if anyone has any questions, they can stop me. All right, so the first piece I'm going to touch on is the onode manager.
This is the metadata structure responsible for mapping a ghobject_t, which is the OSD's internal type for an object, to an onode, a name we use because BlueStore used it.
It's a perfectly good name for the metadata structure that is the root for a particular object, although other than that it has no actual relationship to BlueStore's onode: similar concept, totally different implementation.
The order of these elements is, unfortunately, obligatory, because it's how we define ordering. We expect objects listed out of the ObjectStore interface to sort with all of the objects in a particular placement group sorted together, so shard, pool, and key have to come first, followed by the object name and the namespace; and then, because we want snapshots and generations for a particular object to sort together, those have to come last.
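
A minimal sketch of that ordering constraint as a tuple comparison follows; the real type is ghobject_t, which has more fields and subtleties than shown here:

```cpp
// Sketch of the sort order described above; illustrative, not ghobject_t.
#include <cstdint>
#include <string>
#include <tuple>

struct obj_key_t {
  int8_t shard;
  int64_t pool;
  uint32_t hash_key;  // placement-group-derived hash
  std::string name;
  std::string nspace;
  uint64_t snap;
  uint64_t gen;

  // shard/pool/key first so a PG's objects sort together; snap and gen
  // last so all versions of one object sort together.
  bool operator<(const obj_key_t& r) const {
    return std::tie(shard, pool, hash_key, name, nspace, snap, gen) <
           std::tie(r.shard, r.pool, r.hash_key, r.name, r.nspace, r.snap,
                    r.gen);
  }
};
```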
So if you've ever designed a B-tree with long keys, one really common trick is to elide prefixes and suffixes: prefixes when they're common for an entire node, and suffixes when they differ for every element of a node. That's what the FLTree implementation does. As you descend the tree, at the top of the tree you primarily store keys containing just the shard, or the shard and the pool, of the key.
As you get to the middle of the tree, you have to store the actual object name, and then, if any particular object has so many snapshots that it fills up a whole leaf node, that leaf node won't need to store the object name, because we can elide everything to the left of that portion of the key.
Okay, so most things in SeaStore have this sort of interface-and-implementation split, mainly because it makes testing easier. So onode_manager.h contains the interface for any implementation of this metadata structure. It has roughly the methods you would expect; you have a way to check for the presence of an onode, for example.
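
As a rough picture of its shape, here's a hedged sketch; the real onode_manager.h is asynchronous (future-returning) and uses Ceph's own types rather than these illustrative ones:

```cpp
// Illustrative synchronous sketch of an onode-manager-style interface.
#include <memory>
#include <optional>
#include <string>

struct Transaction;
struct Onode;  // per-object metadata root
using OnodeRef = std::shared_ptr<Onode>;

class OnodeManagerLike {
public:
  virtual ~OnodeManagerLike() = default;
  // Check for the presence of an onode without creating one.
  virtual bool contains_onode(Transaction& t, const std::string& oid) = 0;
  // Fetch an existing onode, or nullopt if it is absent.
  virtual std::optional<OnodeRef> get_onode(Transaction& t,
                                            const std::string& oid) = 0;
  // Fetch an onode, creating it if needed.
  virtual OnodeRef get_or_create_onode(Transaction& t,
                                       const std::string& oid) = 0;
};
```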
Let's see, so the next major component up at the top of SeaStore would be the BtreeOMapManager. I mentioned that every object optionally contains this sort of omap structure.
The implementation of that in SeaStore is fairly basic, just string-based, because at this time we're not trying that hard to optimize for RGW; not much effort has gone into improving this implementation yet. Eventually, though, we'll probably want to add things like prefix and suffix elision to it. As with the onode manager, there's an omap_manager.h interface file with pretty much what you'd expect.
There are ways to list omap entries, remove key ranges, clear omaps, etc., with the corresponding implementation in btree_omap_manager.h. One thing I thought might be interesting here is to show you this file, omap_btree_node_impl.h, which contains the actual extent that we store on disk, or rather the data structure that represents it.
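
As a rough sketch of the shape of that interface (the real methods return futures and take the omap root as an argument, elided here; names are illustrative):

```cpp
// Illustrative synchronous sketch of an omap-manager-style interface.
#include <cstddef>
#include <map>
#include <optional>
#include <string>

struct Transaction;

class OMapManagerLike {
public:
  virtual ~OMapManagerLike() = default;
  virtual std::optional<std::string> get_value(Transaction& t,
                                               const std::string& key) = 0;
  virtual void set_key(Transaction& t, const std::string& key,
                       const std::string& value) = 0;
  // Remove every key in [first, last).
  virtual void rm_key_range(Transaction& t, const std::string& first,
                            const std::string& last) = 0;
  // Ordered listing starting after `after`, up to `max` entries.
  virtual std::map<std::string, std::string> list(
      Transaction& t, const std::optional<std::string>& after,
      std::size_t max) = 0;
  // Clear the whole omap.
  virtual void clear(Transaction& t) = 0;
};
```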
This is a bad example, but this inherits from OMapNode, which I believe inherits from CachedExtent. Many of these things, like duplicate_for_write and maybe get_delta_buffer, are methods responsible for hooking into the commit protocol in SeaStore. When we go to do a commit, we go through every extent and we call these methods on it to prepare it for that process and for putting it into the cache.
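
To make those hooks concrete, here's an illustrative sketch of the pattern; the actual CachedExtent in crimson/os/seastore/cached_extent.h is considerably richer:

```cpp
// Sketch of the commit hooks on a cached extent; names are illustrative.
#include <cstdint>
#include <memory>
#include <vector>

struct CachedExtentLike {
  virtual ~CachedExtentLike() = default;

  // Called when a transaction first mutates a clean, cached extent:
  // produce a private copy whose changes stay invisible until commit.
  virtual std::shared_ptr<CachedExtentLike> duplicate_for_write() = 0;

  // At commit time, return the compact encoded delta (e.g. "insert key")
  // accumulated since the duplicate was made, for the journal record.
  virtual std::vector<uint8_t> get_delta() = 0;

  // On replay, apply a journaled delta to bring the buffer up to date.
  virtual void apply_delta(const std::vector<uint8_t>& delta) = 0;
};
```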
The last sort of top-level data structure I'm going to touch on here is the ObjectDataHandler. Because we have this sparse logical-to-physical mapping, we don't need to explicitly represent an extent list or anything like that. Instead, what we do is reserve a large region of the logical address space and simply let the object use it.
Any portions of that space that aren't zero, and that fall within the size of the object, get extents mapped at the corresponding logical address; otherwise the space is simply left unmapped, which represents a missing portion of the object.
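
Here's a small self-contained sketch of reading through such a sparse mapping, where unmapped holes read back as zeros; the types and names are illustrative, not SeaStore's:

```cpp
// Sketch: a sparse read zero-fills any holes in the mapped extent set.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using laddr_t = uint64_t;
// Mapped extents within the object's reserved logical range.
using extent_map_t = std::map<laddr_t, std::vector<uint8_t>>;

std::vector<uint8_t> sparse_read(const extent_map_t& extents, laddr_t off,
                                 std::size_t len) {
  std::vector<uint8_t> out(len, 0);  // holes stay zero-filled
  for (const auto& [start, data] : extents) {
    laddr_t end = start + data.size();
    if (end <= off || start >= off + len)
      continue;  // no overlap with the requested range
    laddr_t lo = std::max(start, off);
    laddr_t hi = std::min<laddr_t>(end, off + len);
    std::copy(data.begin() + (lo - start), data.begin() + (hi - start),
              out.begin() + (lo - off));
  }
  return out;
}
```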
So we don't need a secondary extent map. That code can be found in object_data_handler.h, with a similar deal up at the top here. Oh yeah, here's a better example: struct ObjectDataBlock inherits from LogicalCachedExtent.
Oh, I have to actually show the tab. There we go. So now that we've gotten through the top-level structures, I'm going to look at read in SeaStore real quick, as soon as I can find the tab that has it. One moment.
Okay, so I mentioned before that Crimson has its own version of the ObjectStore interface, and this is it. It's called FuturizedStore, because it is basically the ObjectStore interface but reinterpreted to use Crimson-style futures. I'm not going to go into this too much, especially the errorator part, but the gist of it is that read, instead of taking a callback, returns a future object to which you can chain callbacks.
So it's sort of a more ergonomic way of dealing with asynchronous I/O. Anything that returns a future is asynchronous, and if you chain a callback onto its return value, it will either execute immediately, if the result is already available, or you will end up returning up to the reactor, and the reactor will run the callback later once the future is available.
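
A minimal sketch of that future-chaining style, using Seastar's API; read_object here is a hypothetical stand-in that happens to be immediately ready:

```cpp
// Requires Seastar; a sketch of chaining rather than passing callbacks.
#include <seastar/core/future.hh>
#include <seastar/core/temporary_buffer.hh>

// Hypothetical asynchronous read, immediately ready for this example.
seastar::future<seastar::temporary_buffer<char>> read_object(
    uint64_t /*off*/, size_t len) {
  return seastar::make_ready_future<seastar::temporary_buffer<char>>(
      seastar::temporary_buffer<char>(len));
}

seastar::future<size_t> read_and_measure(uint64_t off, size_t len) {
  // No callback parameter: chain a continuation on the returned future.
  return read_object(off, len)
      .then([](seastar::temporary_buffer<char> buf) {
        // Runs immediately if the result was ready; otherwise the reactor
        // invokes it later. It looks synchronous, but it is not.
        return buf.size();
      });
}
```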
So these things look synchronous, but they are not. Other than that, the interface looks very much like the ObjectStore interface, for good reasons, and it uses the same transaction structure as the classic implementation. The implementation is partitioned so that each Seastar reactor gets its own local port into SeaStore, so that we can avoid synchronization. Other than that, it's a fairly straightforward implementation of that interface in terms of the components I mentioned before. Internally, SeaStore has a Transaction structure.
That Transaction gets passed around to every component that needs to perform reads or mutations. While there are public methods here, users above the TransactionManager don't really interact with them; you use TransactionManager or other component methods to interact with a Transaction instead. Transaction has two responsibilities: it tracks the extents the transaction has read and mutated.
This avoids the need to do synchronization while constructing the transaction, but it does mean that it's possible a transaction will have to be retried if a concurrent transaction turns out to conflict. That state tracking is pretty similar to the state tracking we already need to do for commit, so it doesn't turn out to be a very large extra bit of overhead.
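
Here's a plain synchronous sketch of that optimistic retry pattern; create_transaction and try_commit are hypothetical stand-ins for SeaStore's future-based machinery:

```cpp
// Sketch of optimistic concurrency: no locks, retry on conflict.
#include <functional>
#include <memory>

struct Transaction { /* tracks the extents read and mutated */ };
enum class commit_result { SUCCESS, CONFLICT };

std::unique_ptr<Transaction> create_transaction() {  // hypothetical
  return std::make_unique<Transaction>();
}
commit_result try_commit(Transaction&) {  // hypothetical: always succeeds
  return commit_result::SUCCESS;
}

void with_transaction(const std::function<void(Transaction&)>& mutate) {
  for (;;) {
    auto t = create_transaction();
    mutate(*t);  // reads and mutations are recorded on the transaction
    if (try_commit(*t) == commit_result::SUCCESS)
      return;
    // A concurrent transaction touched the same extents: discard the
    // projected state and retry.
  }
}
```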
All right, so here's the read method in SeaStore, sort of linking together these concepts. repeat_with_onode here is going to pull out from the onode manager the onode corresponding to this oid, which got passed in from the user, or rather from the OSD, as a ghobject_t, and then it will call this callback with a transaction and the onode.
From the onode we can tell whether it has any data at all, and the reserved data base, which would be the logical address corresponding to offset zero in the object. After that, it pulls from the TransactionManager the set of pins, which would be the set of all mapped extents that fall between the offset and the length of the object, and then it iterates over these extents, ignoring the ones that are zero, and populates the return buffer as needed.
Anyone have any questions so far? I know this is a sort of pointless depiction of SeaStore, I suppose. I don't know if this is linking up.
All right, we'll move on to the next thing then, jumping back to the TransactionManager. The beauty of this abstraction, to the extent that there is one, is that the relatively complex code for especially the onode tree, but also the omap tree and the object data, doesn't need to worry about the actual location on disk, and it can be entirely agnostic to tiering and garbage collection in general.
It also means that we can allow a great deal of transaction concurrency without needing to know ahead of time what will be read or written. So we don't need a sophisticated locking strategy, because most of the time we won't have transaction conflicts in the first place.
Possibly of some interest to Mark, anyway. So the TransactionManager presents a uniform transactional interface for allocating, reading, and mutating logically addressed extents. Mutations to these extents can be expressed as a compact, type-dependent delta, which will be included transparently in the commit journal record.
For example, the BtreeOMapManager is able to represent the insertion of a key into a block by simply encoding the key-value pair, rather than needing to encode a delta moving all of the subsequent keys one position to the right, which would potentially be much larger. The code for this is in transaction_manager.h.
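
As a back-of-the-envelope illustration of why that delta is compact, assuming an encoding along these lines (purely illustrative):

```cpp
// Sketch: an insert is journaled as {op, key, value}, not a node rewrite.
#include <cstddef>
#include <cstdint>
#include <string>

enum class omap_op : uint8_t { INSERT, REMOVE };

struct omap_delta_t {
  omap_op op;
  std::string key;
  std::string value;  // empty for REMOVE
};

// The journaled payload is proportional to key + value size...
std::size_t delta_size(const omap_delta_t& d) {
  return sizeof(d.op) + d.key.size() + d.value.size();
}
// ...whereas rewriting the whole node would cost a full block
// (e.g. 4 KiB) for the same logical change.
```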
The interface looks like a bunch of ways to manipulate logical mappings and to perform reads and writes on extents. get_pins is the method I showed you before, and it simply returns the set of all pinned extents beginning at offset and extending length; you'll notice it's a fairly direct call through to the LBAManager. The editor keeps doing this highlighting. read_extent is also probably fairly useful: we call get_pin to find the pinned extent at offset.
Okay, so for a number of reasons SeaStore has a cache, both for the usual performance reasons but also for correctness ones. While there are transactions in progress, the cache represents a projected outcome. So when you go to commit a transaction, the updated versions of those extents go into the cache, so that pipelined transactions get the correct versions, especially of metadata structures.
It has sort of the usual stuff you'd expect. CachedExtent, the parent interface, has intrusive members that allow relatively memory-inexpensive membership in all of the cache-internal data structures, including the LRU and other things. There is a basic LRU for managing a configurable amount of memory to be kept around, and some logic for forcing things to remain in cache when they need to be, for instance for parts of the LBA manager's B-tree.
It also includes some other things like reference counts, so that extents that are referenced from multiple places in the LBA tree can be tracked and we can release them at the correct time. There's also a backref tree implementation that allows us to do garbage collection more efficiently.
It's structured very similarly to the LBA tree and uses the same B-tree implementation, but the mappings are backwards, from physical addresses to logical addresses. At some point in the future we will also add the ability to do checksums, so that we can get full end-to-end checksumming from the root of the tree all the way down to the leaves; that should be relatively straightforward in SeaStore, since we already do updates that way. That code is in, yeah, this tab.
So, as usual, there's an LBAManager interface that governs how the rest of the code interacts with this. The interface deals mainly in terms of mappings: for mappings we believe already exist, there are getters; if we need to create a new mapping, there's an alloc call, etc.; and there are ways to increment and decrement ref counts.
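
A hedged sketch of the shape of that interface follows; the real lba_manager.h returns futures and pin/mapping objects rather than raw addresses, and these types are illustrative:

```cpp
// Illustrative sketch of an LBA-manager-style interface.
#include <cstdint>
#include <vector>

struct Transaction;
using laddr_t = uint64_t;  // logical address
using paddr_t = uint64_t;  // physical address (device id + offset)

struct mapping_t {
  laddr_t laddr;
  uint32_t len;
  paddr_t paddr;
};

class LBAManagerLike {
public:
  virtual ~LBAManagerLike() = default;
  // Getters for mappings we believe already exist.
  virtual std::vector<mapping_t> get_mappings(Transaction& t, laddr_t off,
                                              uint32_t len) = 0;
  // Allocate a new mapping for a freshly written extent.
  virtual mapping_t alloc_extent(Transaction& t, laddr_t hint,
                                 uint32_t len, paddr_t addr) = 0;
  // Reference counting, so shared extents are freed at the right time.
  virtual void inc_ref(Transaction& t, laddr_t addr) = 0;
  virtual void dec_ref(Transaction& t, laddr_t addr) = 0;
};
```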
The corresponding extent for this is in lba_btree_node.h, and it actually uses this FixedKVInternalNode, which is a templated extent implementation that's shared among all of the B-trees that use a fixed-size-key, fixed-size-value layout kind of mapping. So, basically, just this and the backref tree; but all of the code responsible for splitting and merging nodes, for instance, is in that template. So, some code reuse.
Next is the Journal, which has operations for submitting a record. This is the primary commit pathway: you submit a record, where a record is a struct containing deltas and extents, and the journal then writes it down. The commit protocol is actually agnostic as to the final location of the record, the idea being that if we're using a device with anonymous append, where you don't know ahead of time what the offset will be, the protocol will still work correctly.
Internally to the deltas and records, references to other extents within the same record are relative to the record base. So any time you read an extent, particularly a metadata extent, you have to adjust the in-memory representations of the child pointers to reflect the actual location of the extent.
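
An illustrative sketch of that fixup, assuming a tag bit marks record-relative addresses; the actual paddr_t representation in SeaStore differs:

```cpp
// Sketch: resolve record-base-relative pointers once the record's final
// on-disk location is known (e.g. after an anonymous append completes).
#include <cstdint>

using paddr_t = uint64_t;
constexpr uint64_t RELATIVE_BIT = 1ull << 63;  // illustrative tag only

paddr_t make_absolute(paddr_t stored, paddr_t record_base) {
  if (stored & RELATIVE_BIT)
    return record_base + (stored & ~RELATIVE_BIT);
  return stored;  // already absolute
}
```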
And then there are methods for replay and that sort of thing. There are two implementations of Journal: one for segmented devices, which would be devices where we greatly prefer to do sequential writes (that would be ZNS and QLC devices in general, but also some garden-variety flash devices as well), and then there is a circular bounded journal implementation for very fast NVMe devices, which don't have endurance problems when written to randomly. At the lowest level of SeaStore, most things work this way: there's an RBM journal and a segmented journal.
There are RBM and segmented devices with different update rules, which I'll get to in a moment. Papering over that difference is the ExtentPlacementManager.
So, as I mentioned at the very beginning, part of the core idea with SeaStore is that we want to be able to deal with heterogeneous device configurations.
We want to be able to deal with OSD deployments combining both fast, high-endurance devices and slow, low-endurance devices, which necessarily means that SeaStore needs to be able to manage multiple devices of different performance classes. The ExtentPlacementManager is where all of that business logic lives.
So if you have a segmented device, either as your only device or as a cold tier, then there is a SegmentCleaner that performs garbage collection on the segments within that device to free them up so they can be reused. For fast NVMe devices, where random writes are efficient, there is an RBM cleaner background process instead. And by background process I do not mean a separate thread, and certainly not a separate process; in both of these cases, these are simply callbacks that are called periodically by the reactor.
The journal protocol I mentioned before isn't the only way you can write extents. You're allowed to write extents down before the commit, but they will not be linked into the metadata tree until afterwards; that's this out-of-line (OOL) extent concept.
Like everything else, these things are registered on the transaction, but the ExtentPlacementManager is responsible for threading those writes through the commit process.
Same deal with read: the ExtentPlacementManager knows how to translate these paddr_t's, which contain a device ID, into the correct device. First we get the device ID, and then we use the implementation corresponding to that device ID to do the actual read. There are also some affordances here for doing space management; the ExtentPlacementManager is also responsible for tiering.
At this time it's still a bit primitive, but quite a bit of the underlying implementation is already present: when there are two devices configured, and we do garbage collection, the ExtentPlacementManager is given information about how long ago the extent was written, so that it can make new choices about where to put it. So over time, extents that don't get read or written very often will tend to find themselves demoted to the cold tier.
That is an example of something this component will handle. If we decide we care, and I think we probably do, we will probably need to make changes to the way the cleaner works so that, for instance, when we clean an extent corresponding to an object, we also go and get the other nearby extents and write them all out together.
I don't think that work has been done yet, but this is the component it will live in. I see; we are leaving the doors open.
[Audience] Yeah, actually, supporting spinners in SeaStore was the driving factor behind the question.
Yeah, we do plan on supporting spinners, but it's not the immediate goal, as in the next three months. Basically, as soon as someone's willing to start working on it, it'll become more of a priority. It's just a matter of what things are important, and BlueStore is quite good at spinners, so it's slightly lower priority.
As I mentioned, the SegmentCleaner is the cleaner responsible for doing garbage collection on segmented devices. This is particularly important compared with RBM devices, because segmented devices have this property that you cannot overwrite written extents.
The interface is meant to model the way ZNS devices behave. If you aren't familiar with those devices, ZNS is an NVMe interface modification that restricts your access to the disk to appending sequentially to a large, gigabytes-in-size segment, closing it, and releasing it; releasing it releases the entire segment, and you're not allowed to overwrite portions of the segment.
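
A small sketch of the segment lifecycle being described; the types are illustrative, not the ZNS or SeaStore API:

```cpp
// Sketch: append-only segments, reclaimed only a whole segment at a time.
#include <cstdint>
#include <stdexcept>

struct segment_t {
  uint64_t write_pointer = 0;  // next sequential append offset
  uint64_t capacity;           // gigabytes in size, in practice
  bool open = true;

  explicit segment_t(uint64_t cap) : capacity(cap) {}

  // Only sequential appends; no overwriting earlier bytes.
  uint64_t append(uint64_t len) {
    if (!open || write_pointer + len > capacity)
      throw std::runtime_error("segment closed or full");
    uint64_t off = write_pointer;
    write_pointer += len;
    return off;
  }

  void close() { open = false; }

  // Releasing frees the *entire* segment; any live extents must be moved
  // elsewhere first -- that's the segment cleaner's job.
  void release() {
    write_pointer = 0;
    open = true;
  }
};
```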
So for that reason, before you can release and reuse a segment, you need to move any extents that are still there somewhere else. The intention is that, if you're familiar with how flash actually works internally, there are portions of the device that can't be overwritten without erasing them, so internally a conventional flash device does something like this anyway, but exposes a mutable interface to you.
The cost is that you have this non-deterministic, device-controlled garbage collection process that tends to impact tail latencies. It also tends to produce a lot of extra writes for every write the user does. That was fairly tolerable for three-level-cell (TLC) devices, but it's a problem with QLC devices, because their write endurance is so poor.
Managing it ourselves also means that we get to do the garbage collection when we want to do it, not when the device feels like it, so we have better control over our tail latency. For these and other reasons, even for non-ZNS devices, we'll probably want to use the segmented interface for slower flash, even if it happens not to be ZNS.
So, as I mentioned, the cleaner runs within callbacks that are periodically scheduled within the same reactor. Logical extents are simply remapped within the LBA manager: when we want to move a logical extent around, all we have to do is update the location of the extent in the LBA manager.
The conflict detection mechanism will correctly deal with the case where we're relocating an extent that's being accessed by another transaction, because they'll both touch the same LBA-tree extents during the process, so one or the other will retry in the unlikely event that that happens.
It's also responsible for throttling foreground work based on the amount of pending garbage collection work. We want to avoid running out of space and having to do a bunch of garbage collection before I/O can resume, so instead we inject increasingly long pauses as the device approaches its sort of hard cap on fullness. I expect that heuristic will need to change quite a bit, but that is where we are now.
Anyway, it's in async_cleaner.h. The SegmentCleaner implementation is in here as well, actually; yeah, here it is. You'll notice it inherits from AsyncCleaner, and it implements SegmentProvider, because it's responsible for tracking which portions of the backing segmented devices are free.
Yeah, so if you're interested in contributing: we're at a place now where we have fairly basic multi-core support, although Chunmei is working on completing that, and it should be testable if you're willing to tolerate a certain amount of flakiness. We're particularly interested in random-write and random-read workloads on RBD.
[Audience] Is this already integrated into the Crimson teuthology suite? I believe it is in the Crimson experimental suite, and it does not usually pass; the regular Crimson suite just runs on BlueStore. Chunmei, you can correct me if I'm wrong, but I think all of the SeaStore tests are still in the Crimson experimental one.