From YouTube: Ceph Code Walkthrough 2020-08-25: kRBD I/O Flow
B
Implicitly, there is no feature bit for it, because it is based on snapshots and therefore doesn't require a journal or any other modifications to the I/O path. Also keep in mind that for actually talking to the cluster, the rbd driver depends on another kernel module named libceph, which is basically a stripped-down version of the librados library implemented in kernel space. It is located in the net/ceph subdirectory, and here you can see the authentication framework.
B
At this point, it's just messenger v1, because we discovered several cryptographic weaknesses in messenger v2, and it's not conducive to implementing in the kernel client for a couple of implementation reasons. And so we came up with a new revision of the messenger protocol called messenger v2.1, which fixes those issues and will hopefully come to the kernel client soon, bringing support for full-blown on-the-wire encryption.
B
And then we have the monitor client, the osd client (which is basically the equivalent of Objecter in librados), and osdmap.c, which is where we decode the osd map, invoke CRUSH and post-process CRUSH to get the actual object placement, because we need to account for things like pg_temp settings, pg_upmap settings, you know, primary affinity, etc.
B
So the way rbd images are mapped and unmapped is through the sysfs interface. It's done by writing to these write-only files, which are called attributes, and when you type rbd device map or rbd device unmap on the command line, the rbd tool constructs the configuration string and writes it to one of these attributes.
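To make that concrete, here is a minimal sketch of such a configuration string and where it would be written; the monitor addresses, key and names are hypothetical placeholders, and the exact string format should be checked against the kernel's sysfs-bus-rbd ABI documentation:

```shell
# Hypothetical example values; the general shape is:
#   <mon addrs> <options> <pool> <image> [<snap>]
CONF="1.2.3.4:6789,1.2.3.5:6789 name=admin,secret=AQBplaceholder rbd myimage -"
echo "$CONF"

# Mapping is then a single write to one of the add attributes
# (requires root and a reachable cluster, so it is shown commented out):
# echo "$CONF" > /sys/bus/rbd/add_single_major

# Unmapping is likewise a write of the device id to a remove attribute:
# echo "0" > /sys/bus/rbd/remove_single_major
```

In practice the rbd tool assembles this string for you; writing it by hand is mainly useful for debugging.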
B
Of course, there is a bit more to it, because at least for mapping, the command line tool has to parse the supplied options, and it has to fetch the CephX key and add it to the kernel keyring. After the image is mapped, it waits until udev finishes processing the associated events and creates various symlinks. (Although that does interfere with container use cases, and we're currently working on fixing that.) But in a nutshell, all it takes is a single write system call. Now, the reason that there are two add attributes and two remove attributes...
B
That is, add and add_single_major, and remove and remove_single_major. The reason is that we support two major/minor device numbering schemes. Under the legacy scheme, each rbd mapping consumes a major number for itself, and the problem with that is that on a typical linux system there are only about 200 majors available. And so this limits the number of rbd mappings you can have on the node, and potentially also cripples the node, because if all available major numbers get allocated to the rbd driver, attempts to create other devices might fail.
B
Under the single major scheme, we share a single major number among all rbd mappings and, as a consequence, there is virtually no limit: you can have thousands of rbd mappings if you want to.
B
And so here we take the configuration string, and the first thing we do is bump a reference count on our kernel module. The reason we have to do this is the way module lifetime works in linux: the reference count keeps the module from being unloaded while it is in use.
B
So it's basically a comma separated list of monitor IP addresses, followed by a comma separated list of mount-like key=value options, a pool name, an image name and an optional snapshot name. And notice that there is no field for a namespace name here; the namespace is optional and...
B
Well, let me unshare the screen, maybe it'll help. How about now?
B
We were on namespaces. So, the namespace name is optional, and it gets passed as one of these key=value options.
B
The output is a set of libceph-specific options in struct ceph_options, a set of rbd-specific options in struct rbd_options, and the names (the pool name, the image name) are stashed in struct rbd_spec. rbd_spec also stores ids for the entities that have them; those are filled in later, one by one, as they are discovered. Namespaces don't have ids.
B
We get a libceph instance, and a new libceph instance means a new messenger, a new set of sockets, a new osd client, etc. If we were to create a fresh libceph instance for each new mapping, we would run the system out of resources really fast.
B
So unless the user asks us not to share libceph instances with the noshare option, we go through the list of existing libceph instances and attempt to find one with the same options.
B
And here, if we find one, we bump a reference count on it and return it. And here we ensure that the returned libceph instance has the latest osd map, because it may have been sitting there idle, and it could be that the pool we are mapping an image from has just been created and this libceph instance hasn't learned about it yet, because its osd map is, you know, a few epochs behind.
B
And otherwise, if we are not successful in finding an existing instance, we create a new one, and this is where we open a new monitor session, authenticate, and wait until both the monmap and the osdmap are received. And once that happens, we add the newly created libceph instance to the list here.
B
Next, we look up the pool id based on the pool name that we have, and that's trivial, because we have the osd map at this point, and then we allocate a struct rbd_device.
B
It holds the mapping state, the information about the parent image and some other things, and it is passed pretty much everywhere; it's, you know, more or less a global variable, which we probably need to split into a couple of smaller structures. And the next thing we do...
B
This function is called recursively for probing parent images, and the kernel stack is rather small: it's just 16 kilobytes on x86, I believe, and even smaller on other architectures.
B
So we need to be careful not to overrun it. The first thing we do here is look up the image id based on the image name. There is a special rbd_id object for each image, and what we do here is we format its name, which is a well-known prefix followed by the image name, and we call a class method...
B
An object class method called get_id on it, and you can see that all it takes is allocating a couple of buffers and calling into libceph. Once we have the image id, we can construct the header object name, because that again is a well-known prefix, followed by the image id.
B
I'm concentrating on format 2 now, because image format 1 has been deprecated. The next thing we do is, unless our mapping is read-only, we need to register a watch on the image header, and the way it's done is, again, we call into libceph and we supply two callbacks. The first callback is called on every notification.
B
This is how we know that the image has been resized or a snapshot has been created, for example, and then we need to refresh our view of the image header. Also, this is how the exclusive lock is implemented, so you can see some exclusive lock notifications here, apart from the header update one. The second callback is called on errors, and here what we do is just let everyone know that there has been an error and queue a work item to reestablish the watch when that becomes possible.
B
While we look at the parent information: we want to get the spec of the parent image, so that's the pool id, the namespace name, the image id and the snapshot id, and also the parent overlap value.
B
Say you have a 10 gigabyte clone. If you shrink that clone down to, say, 5 gigabytes and then grow it back to 10, the second half of the parent image can no longer be visible to the clone, because it should be all zeros, just like if you had shrunk a regular standalone image and then grown it. The parent overlap value is how we track this, and whenever we calculate a mapping of clone extents onto the parent extents...
B
We take it into account and prune extents that are outside the overlap region. And here we actually probe the parent image. If we didn't find any parent in the previous function, we bail; otherwise, we make sure that our parent chain is not too long and, after creating a struct rbd_device for the parent image, call rbd_dev_image_probe recursively with an incremented depth. This can go on for a while, because this parent can have its own parent and so on.
B
So now, if there is no object map, we can announce the disk. But if there is an exclusive lock or an object map, what we want to do is grab the exclusive lock here and, once that's done, load the object map, because we want to attribute the latency of grabbing the exclusive lock and loading the object map to the mapping process instead of to the first I/O request.
C
Hey Ilya, I think your screen is...
C
It's still there, but it's still showing dev_image_probe.
B
Can you see it now? I am at rbd_add_acquire_lock. (Much better, thanks. Yep.) And yeah, so once the lock is acquired and the object map is loaded (the object map gets loaded from the post-acquire handler that the exclusive lock state machine calls when the exclusive lock is acquired; I'm not going to show that), we announce the disk to the world and return to sysfs, and from there...
B
All right, so moving on to unmapping: that's another sysfs store callback and, as you can see, it's the same story, we have two of them and they forward to the same function.
B
The input is a block device id and an optional force flag, and here, after parsing that, we look up the struct rbd_device with the id that we need and check the open count. If the image is still opened by someone, we refuse to unmap, but the force flag overrides the open count check. This override was added with a specific iSCSI-related use case in mind, and it isn't really useful for anything else, so don't use it.
B
But if the open count check was overridden, we may have some outstanding I/O, and so we need to freeze the queues and wait for the outstanding I/O to complete.
B
After that we kill the gendisk, tear down the associated sysfs attributes, close the object map and unlock the image header. Here we unregister the watch on the header and flush the notification workqueue, and this is the unprobe function.
B
Here we put a reference on the parent image, which in turn puts a reference on its parent and so on; the entire parent chain gets cleaned up. And then we clean up our own state, in which we free the object map, drop the reference on the snapshot context, and free various header fields, and so on.
D
I have one question regarding the outstanding I/Os. What I see is that the driver is using the multi-queue block layer; if it were a single queue, would anything change?
B
Well, the function is called freeze queue, so it's singular, but really it takes care of all outstanding I/O requests. We actually set the queues up in that rbd_init_disk function that I showed earlier.
B
All right, let's jump to serving I/O. The entry point here is the rbd_queue_rq function; there's the definition. It is registered with the block layer, and the block layer calls it for each I/O request it wants us to handle. Here we grab the pre-allocated struct rbd_img_request, translate the block layer op code into the rbd op code, mostly just for historic reasons (as you can see, the mapping is now one-to-one), and initialize the image request.
B
Note these two operations, discard and zero out. They weren't different for a long time: we guaranteed that discard would zero the discarded region, and there used to be a sysfs attribute called discard_zeroes_data, which was set to true, and some people relied on that. But with the addition of the zero out op, this is no longer the case.
B
The semantics have been relaxed: if your discard request is small enough and we suspect that nothing will actually be deallocated on the OSDs, we will simply drop it; or, if your discard request is big but not suitably aligned, we will reshape it and discard a smaller region. The zero out op, on the other hand, is guaranteed to zero every single byte; we never second-guess zero out requests.
B
So if you want zeroing semantics, that's what you should use. And in the end, we offload the actual work of putting together the image request to the workqueue. This is done because the block layer has certain restrictions on what this queue_rq handler can do, in particular it cannot sleep, but we actually need to take a couple of sleeping locks in the process.
B
So we offload to this rbd_queue_workfn function, and here we grab the offset and length from the block layer request (which is referred to as rq here), run some checks, capture the snapshot context and some other relevant fields from the image header, and move on to filling the image request.
B
Discard and zero out requests don't have any data pages, but read and write requests of course do. The block layer request is more or less just a singly linked list of these so-called bio structs. Each bio struct contains a bunch of data pages in it, which we're going to need to either write out or read into, and here we pass the pointer to the head of the bio list.
B
There are two cases here: the case of a default simple layout and the case of a so-called fancy layout. A fancy layout, for our purposes, is any layout where the stripe unit size is not equal to the object size. I'm going to focus on the simple case here, because the case of the fancy layout is more complicated.
B
We do two passes there instead of one, but I want to point out that the "nocopy" in the name of this function refers just to the fact that, in the complicated case, we have to make a private copy of the page descriptor array. Page descriptors are small structs, and we never copy the actual user data.
B
Instead, we manipulate those page descriptors and arrange the data pages during those two passes in such a way that they are fed to the messenger in the right order. So it is always zero copy, in the sense that the data goes to the wire from its original source and arrives from the wire at its final destination.
B
And here, whenever we deal with block layer requests, the number of image extents is going to be one. It can be greater than one only when we deal with parent images, so this loop is going to be a no-op in most cases. ceph_file_to_extents is a libceph function that does the striping work.
B
Basically, it works through the given image extent and does the mapping one piece at a time. Any time it encounters a new object, it calls this alloc_fn callback, and any time it encounters a new stripe unit, it calls this action_fn callback. And if you look at how we invoke it, for the new object callback we pass...
B
This alloc_object_extent function, which just allocates a new object request and adds it to the current image request; the image request basically serves as a container for the group of related object requests. For the stripe unit callback, we pass a function appropriate to the context, and again, this is how data pages get added to the right object requests in the right order, no matter how fancy your layout is. And on return from ceph_file_to_extents...
B
And here, for each object request, we do some op-specific initialization, because everything before this point was just generic striping stuff. The interesting bit here is that we may end up deleting some of the object requests that we've just added.
B
If you recall the difference between discards and zero outs, and how we can drop discards if we don't think they're going to be useful: this is where it happens, and the logic is in rbd_obj_init_discard. You can see that we do some rounding up and rounding down based on the alloc_size value, and it defaults to 64 kilobytes.
B
As a compromise between bluestore's min_alloc_size values for HDDs and for SSDs, and filestore, which doesn't really have that concept. And if, after reshaping, the object request becomes smaller than alloc_size, this will return a positive value and the request gets dropped on this condition here. And at this point we're done: everything is set up and we are ready to kick off the image request.
B
And this is where it happens, so we are back to that workfn function where everything started.
B
And the actual state machine resides in this function that begins with two underscores. It returns a bool, and when it reaches a final state, it returns true. When that happens, there are two cases: if this image request is not a request to a parent image, then we just let the block layer know that we're done; but otherwise, we need to kick the state machine in the image above us, and this is what this branch does.
B
Here it is and, as you can see, it's pretty simple, because all we do here is: if the exclusive lock is needed and we are not the owner, we kick off the work to acquire the exclusive lock, and when that completes, the post-acquire handler will kick the state machine and we will land in this state.
B
We assert that either we don't need the lock or we are the lock owner, and we kick off the object request state machines. Again, an image request is really just a container for object requests, so all the work happens inside the object request state machines; here we just wait for them to complete and gather the result. Once the pending count hits zero, we return true, which signifies the termination of the state machine.
B
There we don't have to deal with object map updates and the whole separate copy-up state machine business; discards and zero outs go through the write state machine, because they modify the image just like a regular write does. And so, on a read, the first thing we do is consult the object map, if it's present. If the object map says that the object doesn't exist, we move on to handling ENOENT, and this is where it happens.
B
Otherwise... well, a note on the object map: notice that the query is called "may exist" instead of just "exists". This is because the object map is allowed to go inconsistent, but only in one direction. On a write, the object map is updated first and the objects are written to second.
B
If the client crashes after updating the object map but before creating the object, the object map will have a record of an object that doesn't actually exist in RADOS, and that's fine. But the reverse is not possible, because if it were possible, it would lead to data corruption: if the object exists in RADOS, the object map will always have a record of it.
B
The same goes for deletes: first the state is transitioned from "exists" to "pending deletion", then the delete is performed, and then the state is transitioned from "pending deletion" to "nonexistent". This is how we ensure that the reverse inconsistency can't ever happen. Back to the read state machine: if the object map says that the object may exist, we issue a RADOS read. It can complete with ENOENT if the object is not actually there, and that would mean that our object map is inconsistent.
B
But again, that's fine. And here is how our read is issued: we allocate an OSD request, format the read OSD op, allocate the messages and submit them to libceph. When we allocate an OSD request, we provide a callback which libceph invokes when it's done with the request, and in this callback we basically just grab the return value and kick the state machine for the associated object request.
B
So again, back to the read state machine. Let's say we got an ENOENT and there actually is a parent image and there is some overlap: in this case, we end up here and reverse map the object extent onto the parent image.
B
So here we call into libceph to do the reverse striping math, and you can see it's just a bunch of 64-bit divisions and multiplications, which are a bit cumbersome in the kernel, because on 32-bit architectures we cannot use the compiler intrinsics, and so, instead of using the regular slash operator, we have to use these macros.
B
And here is where the parent extents are pruned: the overlap is passed in, and...
B
The extents that are completely beyond the overlap mark are dropped, and the final overlapping extent is trimmed. It could very well be that all parent extents get dropped here and we return an empty array, and in that case it is no different from an ENOENT without a parent image; we handle it again by zeroing the request. But if the pruning process left us with at least one parent extent...
B
What we do is we kick off a read to the parent image, and you can see that here we create the child image request.
B
We associate it with the object request that we're currently processing, capture the necessary fields from the header, and fill it in much the same way as we filled the original image request that was initiated by the block layer. The only difference is that this image request is initiated by the object request that we're currently trying to process, and it's going to get one or more of its own object requests, and they will be filled again...
B
The same way: their page descriptors will be set to point to pages that were handed to us by the block layer for the original image request. So again, there are no temporary buffers or anything of that sort; everything is zero copy.
B
The image request to the topmost parent image can spawn another image request to the second topmost parent image and so on, but you can see that when we do this spawning, we do it via a work item.
B
Again, this is to avoid building up the stack due to the state machine recursion, but eventually we would either hit an object with some data in one of the parent images or hit a hole in the bottommost parent, and at that point the chain of image requests would be unwound by repeatedly taking that slightly obscure branch with the goto label...
B
The one that I showed earlier. And eventually we would get back to the original object request and land in this read state machine, in the read-from-parent state. Here we have to realize that, because we pruned the list of parent extents based on the current parent overlap value, we haven't read anything past the overlap mark, and so there is nothing there as far as we're concerned.
B
And back to the image request state machine: the pending count here will be decremented, we'll take the result (whether it's an error or the number of bytes that we've read) into account, and again return true, this time from the image request, and get back to the dispatch function.
B
I think that's it for reads, and we are nearly out of time, so I probably won't be able to cover the write state machine; let me just quickly go through a couple of states.
B
Here it is, and you can see that it's more complex. In particular, you can see that there are two object map related states here; that's what I mentioned with, you know, handling object deletions.
B
This is where we would flip from "pending deletion" to the "nonexistent" state, and we have an entire separate internal state machine here for copying up data from the parent image when we need to overwrite it in the copy-on-write fashion. You can see here we have states for, again, reading from parent, which is similar to when we read from parent for a regular read, but we have to deal with object maps here and with the interaction with the fast-diff and deep-flatten features. For fast-diff, we have a fourth object map state called...
B
"Exists clean", which allows the fast-diff logic to actually work. And with four states, the object map is a bitmap with two bits per object.
B
Absolutely, those interactions are hidden in these functions, but again, the structure of the write state machine is the same: ultimately, we kick off some requests and we wait for them to complete, with the same pending count and result gathering helper, just like in the read state machine.
B
And yep, I think that's it in a nutshell, but the write state machine is probably a topic for a whole separate walkthrough, just to explain all the copy-up intricacies and the snapshot context logic for deep-flatten and things like that.
D
Thanks, I have one question. What I basically understood from the code walkthrough is that the driver interacts with the multi-queue block layer. What I understand from the multi-queue block layer is that there are multiple software staging queues, and then, from what I saw, there's a function pointer which has a hardware dispatch queue as well.
B
Yes, that's what I showed earlier in this rbd_init_disk function: this parameter refers to the number of hardware queues. The number of software queues is not controlled by the block device driver at all; it's totally up to the multi-queue framework. The block device driver controls the number of hardware queues, and we set it to the number of present CPUs in the hopes of increasing parallelism.
B
We haven't actually benchmarked this; this is a fairly recent change. We used to have a single hardware queue, but that was because we had some global locks in the libceph kernel module, which any sufficiently parallel submissions from the rbd driver would bump into. But those have been fixed a long time ago, and so we've flipped the number of hardware queues to the number of CPUs.
D
And I understand that this is basically the most optimal case, that you can have hardware queues equal to the number of CPU cores. So that's the optimal case. Okay, okay, and last question.
D
What I basically understood from the code walkthrough, and when I read some scientific literature regarding RBD as well: is there any way someone can abstract the kernel module to user space? Because what I understood is that the entire Ceph client is in the kernel module, so it has no user space involved.
B
Yes, there is no user space involved; it's a complete reimplementation in C in kernel space, with all of the associated, you know, constraints and restrictions. These are two totally separate code bases.
B
Yes, if you want to make use of the user space code but get a kernel block device presented, we have the rbd-nbd driver for that. That's basically, you know, the kernel has an NBD client, and we place an NBD server on top of an instance of librbd, and they talk via the NBD protocol, and that way the rbd image ends up being exported to the kernel. But for krbd and libceph and CephFS...
B
These are totally separate, from-scratch implementations in kernel space. There's nothing shared, except for the implementation of the CRUSH algorithm, which is the same; it's written in C, so it's the same in librados and in libceph. And also some header files are shared, which define stuff like the latest feature bits and some parts of the on-wire format, such as messenger tags, the message header and the message footer.
B
You know, what OSD ops get serialized to, what MDS ops get serialized to, stuff of that nature. But those are the only things that are shared; everything else is pretty much separate.
D
Thanks, thanks. So I think, for user space: if someone is interested, the nbd driver, like you said, is the right driver to look at if you have to use user space.
B
Yes, if you're interested in utilizing those features that krbd doesn't support, so, for example, journal-based mirroring, it makes sense to use rbd-nbd.
B
I hope this was useful. Feel free to reach out on IRC or email with anything related to the kernel client or Ceph stuff in general, and thanks for...