From YouTube: 2018-June-26 :: Ceph Code Walkthrough: BlueStore part 2
Description
Second part of Sage Weil's presentation about BlueStore code and internals.
Ceph Code Walkthrough BlueStore (part 1):
https://www.youtube.com/watch?v=f0H-XhcZGP0
All right, let's start in the header file; a good place to start is just walking through some of this, around where the perf counters are. So: class BlueStore — obviously, that's the ObjectStore implementation. It's a config observer and handles config changes; there are a bunch of helpers that are called on startup and whenever a config option changes. They take the config settings for checksums and compression and so on, and then initialize various in-memory structures based on that.
Compression, for example: it gets a handle for a compressor, and it figures out the blob sizes based on the configured mins and maxes, and other assorted stuff.
TransContext is the declaration of a class that tracks a write — a transaction that's being assembled and then making its way to disk. I'll get to that shortly; it's sort of the key piece of the write path.
Buffers exist in an LRU, and there are also lists by state, so we can identify all the writing buffers easily. There's a bunch of helpers for things like truncating — getting data off the end — and maybe reallocating the buffer list, so it's one buffer. Nothing terribly crazy. BufferSpace is sort of the next level up in the cache. This is for a single object: it's a mapping of object offsets to buffers. In the Linux kernel, the analogous mapping would be the address_space, I guess.
So it has a map of buffers here, which is an intrusive red-black tree — actually no, it's just a regular std::map of offsets to Buffer — and there are some helpers for adding buffers and removing them.
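The offset-to-buffer map just described can be sketched roughly like this. These are hypothetical names, not BlueStore's actual classes — just a minimal illustration of an ordered map keyed by offset, with the kind of lookup that finds the buffer covering a given byte:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical sketch of the BufferSpace idea: a per-object map from
// object offset to a cached buffer, a bit like a tiny address_space.
struct Buffer {
  uint64_t offset = 0;
  std::string data;  // stand-in for a real bufferlist
  uint64_t end() const { return offset + data.size(); }
};

struct BufferSpace {
  std::map<uint64_t, Buffer> buffer_map;  // keyed by offset, as in a std::map

  void add(uint64_t off, std::string data) {
    buffer_map[off] = Buffer{off, std::move(data)};
  }

  // Find the buffer covering a byte, if any: step back to the last
  // buffer starting at or before `off` and check that it reaches `off`.
  const Buffer* lookup(uint64_t off) const {
    auto it = buffer_map.upper_bound(off);
    if (it == buffer_map.begin())
      return nullptr;
    --it;
    return (off < it->second.end()) ? &it->second : nullptr;
  }
};
```

A write at a new offset simply inserts a new Buffer; the real code additionally handles overlaps, clipping, and the cache hooks discussed below.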
There are helpers to clip at a particular position and so on, and for doing discards, writes, reads. These implementations are a little bit tricky because there's this indirection through the cache that contains the BufferSpace, which I'll get to in a minute. The other sort of interesting thing going on here is the write path: a write into a BufferSpace creates a new Buffer with the data at a particular offset.
Yep — the difference is that it's actually not the entire cache, it's a shard of the cache. The way it's structured, BlueStore has some number of cache shards matching the OSD's work queue shards — five, or eight, I can't remember. What it is by default depends on whether it's a hard disk or an SSD; BlueStore sets the same number of cache shards, and the idea is that a given shard tends to be handled on the same CPU — it's always the same work queue.
There's generally no contention on the locks for the cache at all, except when it's doing things like checking stats for trimming: there's a background thread that periodically wakes up and trims stuff, and that goes through and takes the lock on each shard in turn. Right — so Cache is an abstract base container. It does a few things.
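The sharding idea can be sketched roughly like this. Names and structure are hypothetical, not the actual BlueStore classes — the point is just that each shard carries its own lock, and an object's hash always selects the same shard, so unrelated objects rarely contend:

```cpp
#include <array>
#include <cstdint>
#include <list>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical sketch: a cache split into N shards, each with its own
// lock and LRU, so mutations on different shards never contend.
struct CacheShard {
  std::mutex lock;             // taken for every mutation on this shard
  std::list<std::string> lru;  // most-recently-used at the front
  std::unordered_map<std::string, std::list<std::string>::iterator> pos;

  void touch(const std::string& key) {
    std::lock_guard<std::mutex> g(lock);
    auto it = pos.find(key);
    if (it != pos.end())
      lru.erase(it->second);
    lru.push_front(key);
    pos[key] = lru.begin();
  }

  size_t size() {
    std::lock_guard<std::mutex> g(lock);
    return lru.size();
  }
};

template <size_t N>
struct ShardedCache {
  std::array<CacheShard, N> shards;

  // The same hash always lands on the same shard, mirroring how an
  // object's cached state stays on one shard (and one work queue/CPU).
  CacheShard& shard_for(uint64_t hash) { return shards[hash % N]; }
};
```

A background trim thread would then iterate the shards, taking each shard's lock in turn, as described above.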
It's got a lock and some counters. I think the counters are atomics just so that we can read them without taking the lock; you're generally always holding the lock when you're modifying the cache, so the atomicity isn't needed for the writes. You'll also notice that all these methods are virtual.
Basically there are two types of data structures that the cache manages — onodes and buffers — and they're managed mostly independently, except when it comes to the space accounting, which is a little bit wonky. You'll see basically parallel sets of methods for add, remove, and touch for those two different types of structures. And there's a trim, where you pass in the maxes and it'll trim down to them.
Yes, it's implemented the same, but it's a bit more complicated in terms of how things get faulted in and whatever. All right, so that's the cache. So that's why this class over here, BufferSpace — these implementations are weird because they have various hooks into the cache to add the buffer. All these audit calls on it: that's basically a debug method that just traverses the whole structure to make sure everything is in sync, but it compiles out in normal builds; I think you have to define something to turn it on.
Right, so those are buffers. First, you'll remember from last time the on-disk data structure: you have onodes, which represent objects, and the data portions of onodes are mapped to blobs. A blob is basically just a hunk of data that's stored — not necessarily in one extent, but it's sort of treated the same, and the metadata is managed together: it's got all the checksums, how it's compressed, the metadata about that data on disk.
Because we can clone objects, sometimes blobs are referenced by multiple objects, and when that happens there's a shared blob that's actually stored on disk that keeps track of the reference counts on those extents. In memory that's also true, but it's slightly different, because the in-memory cache is always associated with the SharedBlob structure. So any blob that's ever instantiated in memory has a SharedBlob — a second allocation alongside the blob — and the cache, the BufferSpace, hangs off of it.
The buffer caching is attached to that. So you'll see here there's a SharedBlob. It belongs to a collection — I think just via a cache pointer on here... I guess not; the collection is mapped to a cache shard. It has some flags to indicate whether the state for that shared blob is loaded into memory or not, and whether it actually has any persistent state. Most of the time objects aren't cloned, and so these are both false, because there isn't actually an instantiation of that shared blob on disk — it's not shared at all.
It's just the blob, and the SharedBlob is basically just a container for the BufferSpace. But when it is shared, nothing changes: you still have the buffer memory associated with that shared structure. Hopefully that makes sense — it's a little convoluted. So they have a shared blob ID, a unique identifier for the shared blob, which is only defined if it's loaded or if it's persistent; if it's not persistent, then I think this doesn't even return anything — it would be undefined. And we track references on them.
So there are some helpers here, but mostly you can just think of SharedBlob as a container for the in-memory buffers, plus a pointer to the on-disk state when that exists. The SharedBlobSet is a slightly more tricky mapping that lets you actually find these. Most of the time, when you load a blob into memory, you just look at the onode: it has a flag that indicates whether the blob is shared — it's just part of the blob's flags field.
So if it's not shared, then it just creates a SharedBlob and uses the BufferSpace attached to it, and that's it. But if that flag is set on the blob, then the shared blob actually has a persistent counterpart, and when it allocates the SharedBlob it also registers it with the SharedBlobSet, because it'll have an ID associated with it.
(This is what happens when you don't prepare.) Anyway, if it is actually shared, then it'll register with this SharedBlobSet, which is basically just a hash table from shared blob ID to SharedBlob pointer. And what happens when you clone: as soon as the blob becomes shared — when you clone it — it becomes immutable, and so we basically just copy the blob metadata to the target object. So there are two copies of the blob, with all the extents and the flags and everything, in both objects, and then we aren't allowed to change them.
This is slightly tricky, because you have cases where the reference counts for the shared blob go away and you have to deregister it from the SharedBlobSet — so you can only look one up if its reference count is still nonzero. This isn't done easily; you have to take a lock to remove it from the set.
I think we talked about this before — the extents and blobs, I believe. All right, at least peeling this apart: extents are basically maps. You have an onode; you have a collection, which has a map of onodes; each onode has an extent map, which maps logical regions to blobs; and each blob may point to a SharedBlob. Possibly multiple objects will point at those blobs — the clones.
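The hierarchy just listed can be sketched as plain data structures. These are hypothetical stand-ins for the real types, shown only to make the ownership chain concrete: Collection → Onode → extent map → Blob → (optional) SharedBlob:

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>

// Hypothetical sketch of the containment hierarchy described above.
struct SharedBlob {
  int ref_count = 0;  // only matters when clones share the bytes
};

struct Blob {
  bool shared = false;                      // the "shared" bit in blob flags
  std::shared_ptr<SharedBlob> shared_blob;  // refcount state, when shared
};

struct Extent {
  uint64_t length = 0;
  std::shared_ptr<Blob> blob;  // the hunk of on-disk data for this range
};

struct Onode {
  std::map<uint64_t, Extent> extent_map;  // logical offset -> blob
};

struct Collection {
  std::map<std::string, std::shared_ptr<Onode>> onode_map;  // per-collection lookup
};
```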
Those blobs — the SharedBlob — will point to buffers owned by the cache shard. Similarly, all the onodes will also be owned by a particular cache shard. Most of the time, all this stuff lines up with the shards and the blobs: when you clone an object, you're always cloning to an object that has the same hash ID, so it always lives in the same shard, in the same collection. So these things essentially never move around in the namespace.
When onodes do get moved, they go from one cache shard to the other cache shard, and they also move to the destination collection's onode map, because the onode lookup hash tables are per collection — so they're localized and the hash tables stay smaller — and then it also has to move the shared blobs, buffers, and all this stuff. So there's all this moving of stuff here to keep that straight. That's, I think, most of it. So, just a quick example: if you do get_onode... actually, let's do fault_range.
All right, so it'll load up a bunch of the code... it decodes a shard that comes out of the key-value database — a shard of an onode's extent map — and in here we have a whole bunch of blobs that are encoded. It's a really annoying and clever encoding that tries to make them compact and also reasonably fast to decode.
I don't know, sorry — yeah, this is a different blob ID. But basically, if we get down here, if it's a shared blob, then we have to open the shared blob, which is another helper function. In the general case — the common case — the blob wasn't shared: it just creates a SharedBlob and it's done. But if it is shared, then we first try to look it up; if we already had it, great; if we didn't, then we create it and we register it. That's the basic thing.
A
The
interesting
thing
here
is
that
when
you
load
the
blobs,
you
don't
actually
load
the
shared
blob,
necessarily
because
it's
actually
a
different
key
value
and
the
only
thing
that
it
stores
is
the
ref
counts
on
the
extents.
So
if
you're
just
reading
an
object,
you
don't
care
what
other
clones
are
also
referencing,
those
same
bytes.
The
blob
itself
has
enough
metadata
to
to
read
the
extents
and
we
have
the
Czech
stones
in
both
objects,
and
so
it
doesn't
matter
the
only
time
you
actually
have
to
load.
A
The
shared
blob
is
if
you're
modifying
the
object,
and
you
have
to
change
the
reference
count,
which
is
nice-
that
only
happens
in
the
right
path,
but
soon
you
do
have
to
load
it.
Then
you
go
and
you
actually
get
the
key
out
of
the
database.
You
set
loaded
to
true
you
load.
This
persistent
allocate
this
persistent
shared
blob
T
just
for
the
on
desk
structure
in
memory
I'm
in
decoded
make
sure
blob
is
used
in
the
case
for
your
cloning,
an
object
so
initially
the
blob
isn't
shared.
Obviously,
but
you
clone
it.
You have to make all the blobs that aren't shared yet shared, and so it marks them dirty, sets the shared flag, allocates the persistent piece, and initializes the reference counts for the two objects that will initially be referencing it — or maybe just the first one, and then the actual clone code does the rest. There is also some code in BlueStore to try to unshare blobs, because with erasure-coded overwrite workload patterns we make a clone of the object before we do...
A
An
erasure
could
overwrite.
So
we
can
roll
back
and
so
there's
sort
of
this
blue
and
blue
sort
of
try
to
like
clean
that
up,
so
that
that
we
don't
end
up
littering
the
entire
every
object
with
all
these
shared
blobs.
And
so
that's
what
the
make
lob
unshared
is
it's
a
set
of
heuristics
that
work
most
of
the
time
unless
you
have
sort
of
a
weird
workload
padding.
So
it's
good
enough,
but
this
is
just
the
opposite:
write
it
under
registers.
All right, so let's move on. We looked a little bit at read before — should we look at that again, or should we look at write? Write is where all the complexity is, so let's just get right to it. Everything in the ObjectStore layer goes through queue_transactions.
This bit is for debugging, where we sort of pretend that we're throwing away IOs. You get a pointer to the collection and to the sequencer — these map one-to-one, but they're different in-memory structures, because the sequencer survives across collections of the same name. But the main thing here is that we create a TransContext. This is the data structure that tracks the state associated with an in-flight transaction.
A
It's
a
child
of
a
io
context
because
it
initiates
AOS
and
it
has
a
completion
callback.
It's
like
it's
called
on
it
by
the
block
there.
It's
got
a
whole
bunch
of
states
and
that
it
works
its
way
through
in
the
process
of
actually
doing
their
right
and
the
prepare
is
what
happens.
Initially,
we
might
be
waiting
for
the
initial
rights
once
those
are
all
done.
A
So
the
sequencer
has
a
an
ordered,
intrusive
list
of
the
transactions
that
are
in
flight,
and
so
this
is
our
handle,
our
whatever
node
in
that
list,
each
transaction
is
a
cost
associated
with
it.
I'm
related
to
the
number
of
things
we're
doing
and
how
many
bytes
are
writing
these
fields
track
and
which
own
ODEs
are
being
modified,
which
objects
are
being
modified,
shared
blobs
are
being
modified
or
being
written,
completions
to
finish
collections
that
got
deleted
that
need
to
be
mopped
up
at
the
end.
...which blocks on disk were allocated or released as part of this transaction, and the delta for our stats. BlueStore transactionally maintains, with each update, a count of the number of bytes of different types — how many bytes are compressed, how many uncompressed, how many allocated and unallocated, yada yada — and those roll up and eventually populate the statfs output. So this is whatever delta is incurred by this transaction.
Yeah, that's mostly what the TransContext holds. Okay, so when you do queue_transactions, we create the TransContext — we can look here at _txc_create — and basically all it does is allocate it and get a handle to the current RocksDB (or key-value DB) transaction that is being assembled. So if you have a whole bunch of transactions, they'll all be piling their work onto the same KV transaction...
...on top of it, and we'll eventually commit the batch. Then this basically adds it to the sequencer's list of transactions that are in flight, and then we call add_transaction. This takes the transaction context that the OSD passed down and does all the work of figuring out what KV bits are happening, what IO, whatever — this is where everything actually happens; I'll get to that in a second. It also calculates the cost, which is based on the number of bytes we've written.
A
We
take
any
o
nodes
that
were
dirty
and
we
write
them
by
basically
putting
them
in
the
key
value
transaction,
so
they're
ready
to
go
and
if
there's
any
deferred
work
deferred,
I/o,
we
encode
those
keys,
also
and
then
finalize
sort
of-
let's
go
through
these
inner
of
this,
so
the
main
one
is
GXE
a
transition.
That's
where
everything
tiding
happens,.
Usually it's just one transaction, and then we iterate over the operations. They're sort of grouped here: things that operate on collections — which may or may not exist in memory — happen right here, where we have a reference to the collection; things like remove-collection and create-collection. All the collection stuff is piled in here. Then, grouped down below, there's a bunch of ops that implicitly create objects.
A
We
set
this
to
true
and
then
we
actually
look
up
the
O
node
and
if
it
doesn't
exist,
but
this
is
an
OP
that
creates
it
then
we'll
actually
create
otherwise,
we'll
return.
You
know
it
and
actually
usually
error
up
at
this
level.
We
basically
are
never
supposed
to
add
an
arrow
st
is
supposed
to
prepare
well-formed
transactions
into
the
store,
so
this
yeah,
so
ghetto
node,
will
either
look
up
an
O,
node
or
if
the
second
argument
is
true,
then
it'll
also
create
a
new
O
node
or
a
new
object.
A
Only
where
these
operations
will
always
get
a
node
back
just
might
be
empty
for
other
ones.
That
might
be
there
might
be
an
old
reference.
So
these
things
like
touch
and
rate
all,
are
getting
a
nice
clean
pointer
to
the
collection
and
to
the
object
and
the
arguments
that
are
actually
gonna
be
the
operation.
A
So
the
the
interesting
one
is:
let's
look
at
touch
the
safar,
the
simplest
one,
just
to
see
what
it
does,
and
here
we
already
got
an
O
node
right
because
ghetto
no
de-allocated
one,
because
the
great
flag
was
true.
So
all
we
have
to
do
is
make
sure
that
there's
a
unique
identifier
assigned
to
that
object
and
NID
sorta
like
an
inode
number,
and
then
we
mark
in
the
transaction
that
we
should
write
the
node
because
it's
been
modified
and
that's
it.
Well, basically it figures out whether it needs to reshard the onode, and then, assuming it's past all that, it will encode the onode into an in-memory buffer — which may live inline.
A
No,
let's
see
extend
it
anyway,
yeah
down
here,
it'll
encode,
the
Oh
node
until
its
key,
possibly
also
the
extents,
the
extent
Maps
records.
These
are
also
changed
and
then
it
will
set
that
key
in
the
transaction.
That's
gonna
be
passed
or
SUV,
but
have
some
stats
about
what
how
do
know
is
being
written?
A
Sometimes
we
modify
objects.
But--
didn't
affect
the
note.
I
can
remember
why
that
happens.
We
have
to
keep
track
of
those
two.
If
we
modify
the
shirt
blobs,
then
we
similarly
we
need
to
go
and
get
the
key.
If
we
deleted
a
shirt
bob,
we
have
to
delete
the
key.
If
we
modify
that,
then
we
have
to
encode
it
and
to
buffer
and
write
that
into
the
transaction,
and
so
that's
basically
it
right.
Now.
It's
either
update
co,
node
and/or
the
share
bobs
that
are
affected
by
that
by
that
transaction.
I guess those are just the XORs — okay, so that's finalize; it should probably be named something like finalize-allocation-changes or whatever. There's some throttling here that slows things down. This will probably go away once we're throttling at a higher level in the OSD with the new queueing stuff, but until then this is where the throttling happens — so queue_transactions can actually block sometimes.
A
Finally,
we
get
down
to
this
thing
where
it's
unprocessed,
this
exe
state
proc
so
remember
the
transit
context
had
those
like
twelve
states
and
it's
basically
a
simple
state
machine
machine
that
goes
think
always
an
order.
Sometimes
it
comes
back,
but
this
just
txt
state
proc
basically
works
its
way
through
those
states.
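The state machine can be sketched like this. The state names are paraphrased from the talk and the transitions are simplified (the real code has more states, e.g. for the deferred-IO branch) — this is an illustration, not the actual implementation:

```cpp
// Hypothetical sketch of the TransContext state machine: each write works
// its way through a fixed sequence of states.
enum class TxcState {
  PREPARE,
  AIO_WAIT,
  IO_DONE,
  KV_QUEUED,
  KV_SUBMITTED,
  KV_DONE,
  FINISHING,
  DONE,
};

struct TransContext {
  TxcState state = TxcState::PREPARE;
  bool has_aio = false;  // did prepare queue any async IOs?
};

// One step of a simplified state_proc: skip AIO_WAIT entirely when there
// is no pending aio; otherwise the IO completion advances us later.
TxcState advance(TransContext& txc) {
  switch (txc.state) {
  case TxcState::PREPARE:
    txc.state = txc.has_aio ? TxcState::AIO_WAIT : TxcState::KV_QUEUED;
    break;
  case TxcState::AIO_WAIT:
    txc.state = TxcState::IO_DONE;  // set from the IO completion callback
    break;
  case TxcState::IO_DONE:
    txc.state = TxcState::KV_QUEUED;
    break;
  case TxcState::KV_QUEUED:
    txc.state = TxcState::KV_SUBMITTED;
    break;
  case TxcState::KV_SUBMITTED:
    txc.state = TxcState::KV_DONE;
    break;
  case TxcState::KV_DONE:
    txc.state = TxcState::FINISHING;
    break;
  default:
    txc.state = TxcState::DONE;
    break;
  }
  return txc.state;
}
```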
It starts in the PREPARE state — _txc_create starts everything out in PREPARE. If there are no pending aios, we go all the way through; if it does have aios, we go to AIO_WAIT, saying we're waiting for those aios, and we call this function here that actually queues the aios to the kernel device.
When those complete, _txc_finish_io basically says: okay, we're done with the IO; it switches to the IO_DONE state and updates the accounting. But the trick here is: say you had two transactions, A and B, and they both started some aios — A came first, B came second — but say B's IOs finished before A's do. B will be in the IO_DONE state, but the transaction ahead of it is still in the AIO_WAIT state, and so we can't actually do anything when B's IOs finish.
A
We
solved
away
for
a
to
finish
first,
so
this
update
updates
us
and
then
we
basically
start
at
the
front
of
the
list,
and
we
look
for
things
that
are
have
finished
I/o
and
then,
if
they
are,
then,
if
a
is
done
and
it
looks
at
as
many
transactions
as
are
done-
and
it
finishes
them
all
so,
basically
loop
so
for
starting
at
the
front,
all
the
transactions
or
an
iodine
state.
We
process
them
in
order,
and
this
is
the
thing
that
ensures
that
even
though
I
always
happen.
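The in-order completion rule just described can be sketched as follows. This is a hypothetical model, not the real intrusive-list code: transactions sit in submission order, and when an IO completes we only commit the prefix of the queue whose IO is done, so B can never pass A:

```cpp
#include <cstddef>
#include <deque>

// Hypothetical sketch: commit transactions strictly in submission order,
// even when their IOs complete out of order.
struct Txc {
  int id = 0;
  bool io_done = false;
};

struct Sequencer {
  std::deque<Txc> q;  // front = oldest in-flight transaction

  // Called when some transaction's IO completes; returns how many
  // transactions were committed (popped in order from the front).
  size_t on_io_finished(int id) {
    for (auto& t : q)
      if (t.id == id)
        t.io_done = true;
    size_t committed = 0;
    while (!q.empty() && q.front().io_done) {
      q.pop_front();  // commit in submission order
      ++committed;
    }
    return committed;
  }
};
```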
...we still commit them in order at the end. So that means we're calling back into state_proc with IO_DONE down here. We already have the queue locked here — it's a little bit wonky, but basically it's a lock that protects this list of transactions. Again, these locks are almost never contended, because we're always doing this in the same thread; it's pretty rare that you're actually blocking on the mutex, but they're needed for a few unusual cases.
Normally this condition is false; normally we go into the KV_QUEUED state, and we basically just get put on a list — kv_queue — of all the transactions that are ready to be committed, and then we wake up the kv sync thread to go commit them. There are a bunch of optimizations here that we've played around with: basically, for certain cases the transaction can be committed synchronously, so you might have, you know, eight threads in the OSD that are processing these transactions, and they could all be queueing directly to RocksDB from those worker threads.
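The kv_queue hand-off can be sketched like this. It's a hypothetical model (no real threads or RocksDB here): workers enqueue transactions that are ready to commit, and the single sync thread swaps the whole queue out under the lock and commits it as one batch:

```cpp
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical sketch of the kv_queue hand-off between the OSD worker
// threads and the single kv sync thread.
struct KvQueue {
  std::mutex lock;
  std::deque<int> kv_queue;  // txc ids ready to be committed

  // Called by worker threads once a transaction's IOs are done.
  void enqueue(int txc) {
    std::lock_guard<std::mutex> g(lock);
    kv_queue.push_back(txc);
  }

  // One iteration of the sync thread: grab everything queued so far,
  // then (in the real code) flush the device and commit the batch.
  std::deque<int> take_batch() {
    std::lock_guard<std::mutex> g(lock);
    return std::exchange(kv_queue, {});
  }
};
```

Batching like this is why the average queued latency is roughly half the sync-thread's commit interval, as discussed further down.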
A
When
that's
the
case,
then
you
can
get
some
of
better
performance,
but
they're
a
bunch
of
commission.
You
can
only
do
that
when
there
are
no
ordering
issues
with
the
iOS
and
a
whole
bunch
of
complicated,
complicated
conditions,
though
there's
a
subset
here,
you'll
notice.
If
we
just
want
to
do
it
synchronously,
then
it
can't
have
allocated
a
new
ID,
because
we
have
to
update
a
key
that
has
the
max
it
has
to
be.
Oh.
A
The
previous,
if
the
previous
one
is
already
going
through
the
key
value
thread,
then
we
have
to
continue
going
through
the
key
value
thread,
or
else
we'll
sort
of
jump
ahead
of
an
order.
If
there's
unstable,
I/o
with
other
transactions,
then
we
can't
jump
ahead
whatever.
But
if
we
get
lucky,
then
we
can.
We
can
do
it
directly.
A
Now
we
jump
straight
to
submit
it
and
we
caught
applied,
but
usually
it
doesn't
happen
so
well,
so
ignore
that
so
mostly
most
the
time
you
get
put
on
the
kbq
and
then
there's
another
thread,
maybe
think
that
you'll
hear
mark
talk
about
all
the
time,
because
this
is
what
does
most
of
the
work.
Maybe
not
must
work
a
lot
of
the
work
in
blue
store.
This
is
Katie.
Most of the time it'll do a flush on the block device, to make sure all of those asynchronous IOs that were sent to disk and completed are actually committed durably to storage; it'll block waiting for that. Once that happens, we've got a handle on the transaction and we iterate over...
Submit — that's not blocking... I'm not really seeing it right now, but basically we have this whole batch that's all ready to go, and then we push the whole thing through with the one call that actually has the flag that says, for RocksDB: you should write this, flush it to disk, and wait until it's actually durable.
A
So
this
blocks
does
a
whole
bunch
of
work
in
rocks
TV
and
when
it
finally
returns,
then
that
whole
batch
is
committed,
and
here
we
basically
take
all
of
those
lists
of
transactions
that
we
had
and
we
hand
them
off
to
a
new
set
of
lists
that
are
yet
another
thread
is
gonna
mop
up
at
the
end,
this
is
one
of
the
performance
things
that
happened
over
the
last
year,
a
little
bit
better
performance
on
SSDs.
If
you
sort
of
separate
the
part,
that's
pushing
directs
to
be
an
in
sort
of
doing
that
after
effects.
A
So
this
pushes
it
all
into
the
these
new
lists
called
committing
to
finalize
and
down
here.
There's
the
finalize
thread
that
wakes
up
for
all
of
these
things
that
just
got
finalized.
It
calls
you
know,
it
asserts
that
it's
in
the
submitted
state
they
showed
just
been
submitted,
and
then
we
called
state
brock
back
over
here
and
as
action
is
still
working
its
way
through.
A
Or
is
it
submitted?
We
did
it
calls
a
little
helper
txt
committed
kv
that
basically
triggers
the
finishers
accuse
the
finish
for
work
associated
with
that
transaction
update
some
stats
and
then
we're
now
in
the
kV
done
state,
and
then
we
go
to
of
one
of
two
ways
if
there
was
deferred
IO,
which
means
that
the
original
transaction
basically
Journal
of
entry
that
says
I'm
going
to
do
some
ayah
later
and
then
it
committed
now
we
can
actually
do
that
asynchronous
IO.
If
that's
the
case,
then
we
put
it
on
the
deferred
queue.
A
Or
is
this
all
another
work
you
if
all
that
different
stuff
that
gets
patched
up
and
pushed
out?
Otherwise
we
go
straight
to
finishing
and
in
the
deferred
I'm
not
going
to
go
into
that
right
now,
but
in
the
deferred
work.
You
basically
similar
thing
where,
once
we
clean
up
blah
blah
blah,
we
get
pushed
into
finishing
state
and
it's
the
same
thing
with
finishing
we're
kind
of
like
with
the
IO.
It
might
be
out
of
order.
A
We
might
have
one
transaction
that
doesn't
have
the
furred
io
another
one
that
does
and
another
one
that
doesn't.
We
don't
actually
clean
all
these
transactions,
we
clean
them
up
in
order,
so
we
have
to
wait
for
the
for
the
slowest
one
which
might
have
to
Freud
IO.
That
has
to
finish,
but
notably
the
things
that
happen.
In
finish.
Are
we
mark
the
write
complete,
though
those
buffers
states
change
from
writing
to
clean?
This used to be done quite a bit earlier, but we pushed it back to the very, very end, because there were some really crazy race conditions that could happen where we would overwrite data that hadn't fully been dereferenced yet. Oh yeah — because the deferred IO might have been on an extent that later got deallocated, and so we want to make sure that we deallocate in order, after the deferred extents' IOs have actually completed. And that's it — again, stuff to make sure that deferred IO...
A
Doesn't
stall
zombie
sequencers
are
sort
of
one
annoying
thing
where
you
might
have
a
collection,
1.2
or
PG
1.2
it
gets
deleted
and
then
you
recruit
the
OST
recreates
the
same
collection,
but
maybe
the
one
that
was
deleted
had
a
bunch
of
deferred
I/o
that
was
sort
of,
even
though
the
transaction
that
deleted
the
connect
collection,
finished
and
completed,
and
the
OST
says
it's
all
done.
It's
actually
like
still
doing
a
bunch
of
work.
If
you
rien
Stan,
she
ate
that
same
collection.
A
We
want
to
make
sure
that
we
use
the
same
sequencer
so
that
it
the
ordering
it's
all
all
correct,
and
so
that's
what
the
zombie
ones
are
if,
if
the
collection
gets
recreated
before
all
references
of
it
have
basically
drained
out
of
the
system,
then
we'll
sort
of
resurrect
that
sequencer.
Yeah — and actually you can look at these: if you run the daemon perf command on an OSD, there's a whole bluestore group, and they've called out the ones that are the most interesting. But let me look at the actual states so I can remind myself... there's going to be aio_wait, and kv_queued — there's going to be some latency there, because the kv sync thread is one thread that's just taking a batch and committing it.
A
So
the
average
latency
is
like
whatever
half
that
half
that
time
interval
right
and
then
there's
also
some
time
and
submitted.
And
then,
if
you
look
at
the
F
at
the
top
here,
where
we
initialize
our
perfect
order.
...there's how long the initial flush is taking: before we even submit the key-value data, we have to do a hardware flush on the device to make sure everything is stable, so this measures how long that takes. That will vary depending on the type of SSD or hard disk — hopefully it's small on an SSD; on a hard disk it'll be big. And then there's the commit latency, which is basically how long it takes for the batch to actually be committed to RocksDB.
A
Which
part
that
is
I,
don't
know,
I
think
there's
one
of
them,
that
is
sort
of
the
whole
the
whole
span
and
one
of
them
and
others
are
so
narrowing
in
on
this.
Just
on
that
key
piecing
thread,
I
can't
remember
exactly
how
they
go.
Unfortunately,
look
at
the
code,
but
yeah
there
there
bunch
of
these,
oh
and
then
how
long,
how
long
waiting
average
ao8
latency.
There's not much more here — a bunch of stats for things like hit rates on the onode cache and the buffer cache, how much time we're spending compressing and decompressing or checking checksums, how successful the compression is actually being, and so on. So there's a whole bunch of stuff.
A
Ok,
so
let's
look
at
let's
go
to
look
at
you
right,
cuz
I
said
over
the
last
piece
and
it's
kind
of
nutty,
but
this
is
most
of
the
interesting
transactions
actually
write
data
and
that's
for
a
lot
of
the
complexity
of
all.
This
stuff
falls
out
because
we
have
to
wait
for
Aereo's
and
deferred
iOS
and
so
on.
So
in
do
writes
for
lucky
we're
not
actually
writing
anything,
but
usually
we
are
choose
right
up
in
options
fills
out
this
right
context.
A
So,
whether
or
not
this
right
is
going
to
be
buffered
or
that
we're
gonna
write
into
the
buffer
cache
or
not
I'm,
whether
we're
going
to
compress
it.
How
big
we
want
the
blobs
to
be
how
big
we
want
to
check
some
chunks
to
be,
and
they
can
vary
on
a
per
write
basis
because
they
vary
per
pool
I'm.
You
can
have
Poole
settings
that
specify
whether
this
pool
is
compressed
or
not.
A
They
can
vary
by
I/o
because
you
might
have
hints
so,
for
example,
rate
escapable
hint
that
most
reeds
are
large
and
sequential
and
rights
also,
and
so
will
make
big
checksum
chunks
and
that
will
use
less
metadata
because,
usually
you
don't
have
to
be
a
very
small
read
and
you
wouldn't
have
the
read
amplification
and
associate
with
that.
So
choose
the
options
and
we
make
sure
that
the
extent
map
has
all
the
relevant
extents
for
the
node
in
memory.
We
actually
do
the
data
right.
We
allocate
I,
do
a
bunch
of
stuff.
A
Do
write
data
this
is
sort
of
the
preparation
stage,
so
any
any
right
to
an
object.
There's
going
to
be
an
alignment
to
the
metallic
size
and
so
Bernie
right,
there's,
maybe
a
small
piece
of
the
beginning.
That's
a
partial
allocation
unit,
there's
a
part
in
the
middle,
that's
a
multiple
of
allocation
units
and
aligned
to
those
and
then
there's,
maybe
a
little
bit
at
the
end.
That's
also
another
lined.
So
that's
the
head,
the
middle
on
the
tail,
and
so
we
have
two
helper
functions
to
write,
small
and
do
write
big.
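The head/middle/tail split can be sketched as a small piece of arithmetic. This is a hypothetical helper, not the real code — given a write at [offset, offset+length) and a minimum allocation size, it returns the lengths of the unaligned head, the aligned middle, and the unaligned tail:

```cpp
#include <cstdint>
#include <tuple>

// Hypothetical sketch of the head/middle/tail split by min_alloc_size:
// head and tail go down the small-write path, the middle (whole aligned
// units) goes down the big-write path.
std::tuple<uint64_t, uint64_t, uint64_t>
split_write(uint64_t offset, uint64_t length, uint64_t min_alloc_size) {
  uint64_t end = offset + length;
  // First aligned boundary at or after the start of the write.
  uint64_t head_end =
      (offset + min_alloc_size - 1) / min_alloc_size * min_alloc_size;
  if (head_end > end)
    head_end = end;  // tiny write entirely inside one allocation unit
  // Last aligned boundary at or before the end of the write.
  uint64_t mid_end = end / min_alloc_size * min_alloc_size;
  if (mid_end < head_end)
    mid_end = head_end;  // no fully-aligned middle at all
  return {head_end - offset,   // head: partial unit at the start
          mid_end - head_end,  // middle: whole aligned units
          end - mid_end};      // tail: partial unit at the end
}
```

For example, a 10000-byte write at offset 1000 with a 4096-byte unit splits into a 3096-byte head, a 4096-byte middle, and a 2808-byte tail.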
A
We
do
small,
writes
on
the
little
end
bits
and
then
do
write
big
on
the
big
part
in
the
middle.
Do
write
big
is
nice
and
simple,
because
it's
we're
writing
full
allocation
units,
and
so
we
can
just
write
to
a
completely
new
part
of
the
device
I'm
doing
copy-on-write
or
whatever
you
want
to
call
it.
I'll
just
write
to
a
new
newly
allocated
region.
A
So
we
throw
out
the
old
extents
that
reference
that
part
of
the
the
object
we're
gonna
allocate
a
new
set
of
blobs
we're
gonna
loop
over
the
length
figure
out
how
big
this
blob
is
going
to
be
based
on
the
next
blob
size
or,
however,
big
of
a
right
we're
doing.
If
we're
not
compressing,
then
well,
there
complicated
bits
here,
because
we
try
to
reuse
old
blobs
because
it
saves
a
lot
of
CPU.
A
So
the
code
is
kind
of
complicated
here,
I,
don't
understand
it
super
well,
because
you
can
read
it,
but
the
short
version
is
that
if
it's
a,
if
it's
a
new
extent
for
them
for
those
big
writes,
we
just
write
a
new
space
and
we'll
create
new
blobs
in
sort
of
the
general
cases,
some
of
you're,
not
reusing,
and
that,
where
that
actually
happens,
is
we
create
a
new
blob?
We
said
you
bought
the
true
and
we
call
right
context
right,
Oh
back
in
the
header
you'll.
It looks at the existing extents and tries to reuse them, and if it can't, then it allocates a new blob that layers over them. So if there's an existing blob for the region of the object that we're considering, and it's immutable — not mutable — then we can't do anything with it, blah blah. But sometimes it is a mutable blob: say the minimum...
A
Minimum
allocation
unit
is
64
K
and
we
were
only
wrote
four
K
into
it
on
the
last
right,
and
so
the
other
60
K
is
unused.
Then
you
small
right
we'll
come
back
and
I'll
say:
oh,
this
blob
is
still
mutable
and
it
hasn't
been
used
in
this
other
reason
that
I
want
to
write
it
into
and
so
I'm
going
to
write
into
the
an
existing
blob,
an
unused
portion
of
an
existing
Bob,
and
so
that's
what
one
of
these
cases
handles
right
here:
direct
right
into
unused
blob,
so
the
existing
mutable
blob.
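The "unused portion" bookkeeping described above can be sketched with a toy bitmap. BlueStore keeps similar state as an unused bitmap on the blob; this is an illustrative stand-in, not the real structure, and `MiniBlob` and its members are invented names:

```cpp
#include <cstdint>

// Toy blob tracking which 4K chunks of a 64K blob have never been
// written: one bit per chunk, set = still unused.
struct MiniBlob {
  uint16_t unused = 0xffff;  // bit i set => chunk i never written
  static constexpr uint64_t chunk = 4096;

  // Is the whole range [off, off+len) still unwritten?
  bool is_unused(uint64_t off, uint64_t len) const {
    for (uint64_t o = off; o < off + len; o += chunk)
      if (!(unused & (1u << (o / chunk)))) return false;
    return true;
  }
  // Record that [off, off+len) has now been written.
  void mark_used(uint64_t off, uint64_t len) {
    for (uint64_t o = off; o < off + len; o += chunk)
      unused &= ~(1u << (o / chunk));
  }
};
```

A small write first asks `is_unused`; if the target range was never written, it can go straight into the existing blob, otherwise it falls into the overwrite paths described next.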
At the same time, we get the blob as non-const, so we can then call something that actually modifies it, like updating the checksum. Then we set the extent-map extent that maps to that updated blob, mark that region used, and so on. So the next time this happens, it'll be an overwrite, and we can't do this write into an unused region again.
Sometimes we're writing into a blob but we need to read something in first, in order to do a read-modify-write. When that happens, we figure out what we have to read, and somewhere in here we actually call _do_read. A write here might actually have to read in the old blob so that we can overwrite one byte in the middle of it and queue the whole thing out to be written again.
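The read-modify-write step just described can be sketched in a few lines. This is a hedged illustration of the idea, not BlueStore's code; `read_modify_write` is a hypothetical helper:

```cpp
#include <cstdint>
#include <vector>
#include <algorithm>

// To change a few bytes inside a checksummed, whole-block region we must
// read the old contents, splice in the new bytes, and queue the whole
// block to be written again.
std::vector<uint8_t> read_modify_write(
    const std::vector<uint8_t>& old_block,   // block read back from disk
    uint64_t off_in_block,                   // where the new bytes land
    const std::vector<uint8_t>& new_bytes) {
  std::vector<uint8_t> out = old_block;      // start from the old data
  std::copy(new_bytes.begin(), new_bytes.end(),
            out.begin() + off_in_block);     // splice in the overwrite
  return out;                                // caller queues this for write
}
```

This is why such writes can block: the read has to complete before the new block can even be assembled.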
So sometimes this will block, which is annoying, and again there's all sorts of complicated padding and so on. This was painful code to write and debug, but it's quite stable now. Again, we try to reuse blobs; this is the thing that Igor did that saves CPU, but it's an optimization. If all of that fails, then fine, we'll just allocate a new blob.
A
So
say
they
say
this.
This
region
is
a
shared
blob
that
was
cloned,
and
so
we
can't
modify
it
and
we
can't
do
anything
tricky
to
fill
it
in.
We
just
have
to
allocate
another
blob,
that's
sort
of
logically
layered
over
it
in
the
and
the
extent
dancing.
So
we
allocate
a
new
blob,
we
write,
punch,
a
hole,
we
write
in
to
be
right
into
it
and
it
gets
added
into
the
extent
map
waiting
for
that
new
blob
right.
So basically, so far we've gone through and built out this write context with all the writes. The next step is _do_alloc_write, which actually allocates space and sends it to disk. What happens here is we go through all these writes and say: wait a sec, should we compress them first? If so, then we figure out what the compressor handle, the codec, is, and what the checksum settings are.
A
If
we
need
to
compress,
then
we
compress
all
the
pieces
figure
out
how
big
they
are
allocate
a
bunch
of
space
if
we're
not
compressing
that
we
already
know
how
much
space,
so
we
sort
of
skip
all
the
way
down.
Here
we
allocate
space
to
read
all
this
data,
and
then
we
actually
finally
go
to
the
writes
and
we
actually
can
write
it,
and
we
end
this.
In
the
stage
we
calculate
checksums
your
figure
out
what
the
checksum
water
is
checksum
length.
My
neutralize
check
sums.
A
A
A
We update the extent map to point to these new blobs, and we write into the buffer cache. We do this unconditionally, even if it's not a buffered write, because we have to track in-flight buffers that are on their way to disk, in case we need to find them again or read them. Then we finally queue the I/O. This one actually queues it all the way down... oh wait, this is different, that's something else, yeah.
Normally we call bdev->aio_write, which doesn't actually queue the write yet; it's just put on this IO context, which describes all the I/Os that are going to happen. And then finally _do_write is going to call _wctx_finish, and this is going to take all the old extents, update the stats for them, deallocate them maybe, and handle the shared-blob case, for the EC optimization, at the end of the day.
Yeah, _wctx_finish; it also updates the onode size. This garbage collection stuff is basically to make sure you don't have too many layers of blobs stacked on top of each other, giving you a really complicated map. It collapses them down once it gets inefficient, but it's also complicated, so we won't go into it right now.
What it's really saying is that if you have two extents, say you did a write where you wrote 4K and it ended up in this blob, and then you wrote the next 4K and it ended up in the same blob because that space was unused, so we picked that efficient path in the small-write code, then you'd have a second extent that points to that same blob, adjacent to the first. compress_extent_map just walks the extent map, looks for contiguous adjacent extents pointing to the same thing, and turns them into one reference.
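The merging pass just described can be sketched like this. It's a simplified model of the idea behind compress_extent_map, with made-up types (an `int` blob id stands in for the blob pointer), not the actual Ceph code:

```cpp
#include <cstdint>
#include <vector>

// One logical extent: maps [logical_off, logical_off+length) of the
// object onto [blob_off, blob_off+length) of a blob.
struct Extent {
  uint64_t logical_off;
  uint64_t blob_off;
  uint64_t length;
  int blob_id;  // stand-in for the shared blob pointer
};

// Collapse adjacent extents that point at contiguous regions of the
// same blob into a single extent.
std::vector<Extent> compress_extents(const std::vector<Extent>& in) {
  std::vector<Extent> out;
  for (const Extent& e : in) {
    if (!out.empty()) {
      Extent& p = out.back();
      if (p.blob_id == e.blob_id &&
          p.logical_off + p.length == e.logical_off &&
          p.blob_off + p.length == e.blob_off) {
        p.length += e.length;  // merge into the previous extent
        continue;
      }
    }
    out.push_back(e);
  }
  return out;
}
```

So the two adjacent 4K writes from the example collapse back into one 8K reference.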
At the end of this whole process, then, we have a bunch of I/Os that are described in that IO context, we have onode data structures in memory that have been modified, and onode extent-map records that have been modified, and all the blobs that hang off them have all been modified in memory, and you'll remember those go into the transaction.
A
B
A
Yeah, yeah, exactly: compression. You can't modify a compressed blob, because it's compressed and you don't know where the data is, and you can't take it apart; it's one big lump. So if you're doing random overwrites with compression turned on, you end up with blobs that are layered over each other, and the garbage collection code tries to account for how much space is wasted.
Basically, in these layered blobs, whenever the waste crosses a threshold, it just reads them, recompresses the data, and writes it out as a new blob. It does that as part of the write path, so there's always a bound that's enforced; there's no background activity that cleans it up. It's like: once you do the third write that's layering over it, that write does the work of reading it and putting it back in an efficient form, yeah.
If we're overwriting part of an allocation unit, then instead of creating a new blob, we should overwrite the existing one. But we can't overwrite it before we commit the transaction, or else we might corrupt the old data before the new operation commits. So we have to commit the operation with a promise that we're going to overwrite it, and then do the overwrite afterwards; that's what happens in here.
So any time you do a write that's smaller than a block, it has to be a deferred write, unless it's a new object and you're not actually touching anything that was there before, because then we can just write it directly. But if you're overwriting existing data at sub-block size, then it has to be deferred. It's also... if you're overwriting a blob and you're smaller than the checksum... no, it's actually just any time you're smaller than the blob, then you have to do a deferred write.
The exception is if the blob was not completely written before. Say it's a 64K allocation unit, the first 4K is written and the rest is unwritten, and we track that state in the blob; then you can write into the unused portions. But assuming it's been written before, we can't overwrite it in place; we have to commit first and then do a deferred write. There's one other case where we do deferred writes.
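The rules above can be condensed into a small decision sketch. The function and parameter names are illustrative, not BlueStore's, and the single size threshold is a stand-in for the "prefer deferred" style tunable the talk brings up:

```cpp
#include <cstdint>

// Hedged sketch of the deferred-write decision: an overwrite that is
// smaller than the target blob must be deferred (journaled in the
// database and applied after commit), because overwriting in place
// before commit could corrupt old data. Writes at or below a tunable
// threshold may be deferred anyway because batching them wins.
bool use_deferred_write(uint64_t write_len,
                        uint64_t blob_len,
                        bool overwriting_existing_data,
                        uint64_t prefer_deferred_size) {
  if (overwriting_existing_data && write_len < blob_len)
    return true;  // can't safely overwrite in place before commit
  return write_len <= prefer_deferred_size;  // optional perf tunable
}
```

A brand-new object (no existing data touched) with the tunable at zero always takes the direct path.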
There's a tunable called something like min deferred write, I think, and the observation was that 4K, or anything below 64K, basically small writes on BlueStore ended up being slower than on FileStore. Because up to the allocation unit, we were previously always writing to new blocks on disk, which meant that for a write, the latency was: do the I/O, wait for it to complete and flush, and then commit the transaction, and then wait for that to complete and flush. For small I/Os...
...it's actually faster to just write something to the database that says: this is a transaction, and this is the data that I'm going to write, and then asynchronously go write it. Because the deferred writes get batched up, especially if they're sequential, and they get pieced together into one big I/O. So if you're doing, say, 4K sequential writes, each one will have a transaction to the database; they'll commit, and then we come back and batch up all these deferred writes, and then, once the batch is big enough...
...they'll all get flushed out sequentially. So it helped a lot for small sequential I/Os on hard disks. Okay, so sometimes we do a deferred write even if we didn't really have to, because it wins, and sometimes it's even faster on SSDs, as I recall. You can just play with the values; it doesn't seem like it should help there, but I seem to recall that in some cases it actually did.
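The batching effect described above can be sketched with a toy batch structure. This is an illustration of the coalescing idea, not the real deferred-batch code; `DeferredBatch` here is an invented simplification:

```cpp
#include <cstdint>
#include <map>
#include <vector>
#include <iterator>

// Toy deferred batch: pending writes keyed by disk offset. A write that
// is contiguous with an already-pending one is appended to it, so a run
// of 4K sequential deferred writes flushes as one large sequential I/O.
struct DeferredBatch {
  std::map<uint64_t, std::vector<uint8_t>> ios;  // disk offset -> data

  void add(uint64_t off, std::vector<uint8_t> data) {
    auto it = ios.lower_bound(off);
    if (it != ios.begin()) {
      auto prev = std::prev(it);
      if (prev->first + prev->second.size() == off) {
        // Contiguous with the previous pending write: coalesce.
        prev->second.insert(prev->second.end(), data.begin(), data.end());
        return;
      }
    }
    ios.emplace(off, std::move(data));  // otherwise a new pending I/O
  }
};
```

Flushing then just walks the map and issues one I/O per entry, which on a hard disk turns many tiny writes into a few big sequential ones.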
[Audience question]
Yep, but that's actually mostly a good thing, because each I/O is basically going to go into one of these RocksDB transactions. Adding an extra 4K into RocksDB is not that much additional work, because on a hard disk it's already writing a big blob to the RocksDB log. And then they'll go into the deferred batch, which I can show right here, which is basically a whole bunch of deferred I/O buffers and locations.
[Audience question]
Exactly, right: you get back the benefit we had with FileStore, where all the I/O went into the SSD journal without blocking, and then you sort of lazily flushed it all to the hard disk. This brings back that behavior, but with a better design and a lot more control, because you can still read all of it; all that deferred I/O is in memory.