From YouTube: Ceph Code Walkthroughs: Librbd Part 2
Description
Every month the Ceph Developer Community meets to discuss one aspect of Ceph code, spread knowledge of how it works and why it works that way.
This month we're joined by Jason Dillaman to go over part 2 of librbd.
Part 1: https://www.youtube.com/watch?v=L0x61HpREy4&list=PLrBUGiINAakN87iSX3gXOXSU3EB8Y1JLd&index=2
Find future Ceph Code Walkthroughs: https://tracker.ceph.com/projects/ceph/wiki/Code_Walkthroughs
Ceph Code Walkthrough Playlist: https://www.youtube.com/watch?v=nVjYVmqNClM&list=PLrBUGiINAakN87iSX3gXOXSU3EB8Y1JLd
A
Welcome, everybody, to another episode of Ceph Code Walkthroughs. It is February 23rd, and I am joined by Jason here, who is going to be giving us a part two on librbd, following on from what we went over in part one. I know that a variety of people have reached out asking for part two, so Jason is giving us some time now to go over that. So Jason, will you please take it away?
B
All righty, thank you. So, yes, in part one we covered just the introduction to how I/Os (read and write requests) are sent to librbd, so I'll quickly recap it, and then we'll dive a little bit deeper, touching on some of the more complicated state machines in the I/O path for this one. Since I cannot see anyone raising their hands, because my screen is now taken up with a terminal window, feel free, if anyone has any questions, comments or whatever, to unmute yourself and make it known. Otherwise, I will not know.
B
That also works. Thank you. All right, so during part one we talked about how I/Os (read and write requests) enter librbd. If we look at where we are in the Ceph tree, under src/librbd is where we keep 99.99% of librbd-related functionality, and there's one .cc file in here called librbd.cc. This is basically the entry point for the C and C++ APIs for librbd. The read and write methods don't really do a lot in this file; they redirect somewhere else deeper within librbd. So this file is basically what keeps the stable API that's available in the library.
In this case, this is a C API. For every function you see in librbd.h or librbd.hpp, you'll find the corresponding implementation in librbd.cc. For the I/O path, the function then gets bounced onward: if you look at the subdirectory layout here, there's an api subdirectory.
B
So that's where we start breaking things out into different lower-level components and features within librbd. For I/O, all the I/O methods start within this api/Io.cc file, and you'll find that the corresponding read and write methods are really just yet another layer of indirection. They kick off, in this case, what's called an image dispatch spec. An image dispatch spec just describes an I/O generally and gives it any details: if this is a write operation, we have the offsets and lengths that we're going to write, the data we're going to write, and any flags that we're going to send. Different I/Os create different dispatch specs. A discard creates a discard dispatch spec. A write-same, which is when you give it just a little bit of data and then tell it to keep writing the same data over and over again, gets a different dispatch spec, and so on and so forth. But the dispatch spec itself is what really takes you into the I/O subsystem of librbd, most of which is under the librbd/io subdirectory.
We talked about the dispatch spec. One thing you might have seen there was that it has this concept of image dispatch layers. It starts off at the API start, but as the I/O moves through the dispatching layers, we have these pluggable layers that allow different functionality.
B
The first one is just to queue the event so that we can return control faster to the caller application. We have quality-of-service throttling. There is exclusive-lock blocking: we don't allow any writes to proceed if you don't own the lock. There is the refresh state machine: if the image has been modified somewhere and you need to refresh something like snapshots from on disk, we'll pause the I/Os while we do that refresh. We have a migration layer that handles any I/Os while an image is under migration. There is a journaling layer, which doesn't do anything right now, I think, but it will in the future. There is a write-block layer that we can use internally to pause all writes to the image; say we're going to take a snapshot, that's how we can quiesce any in-flight I/Os before we proceed with the snapshot. We have a cache layer. Brand new with Pacific, we have a crypto layer that implements LUKS1 and LUKS2 encryption directly within librbd.
And then finally, last but not least, we have the core layer, which is actually the meat of it, as talked about in part one, and that's the part that takes a given image-extent-based I/O. When we talk about image extents: think of a block device, say a one-gigabyte block device. When you reference any section of that block device, let's say bytes 0 through 1K, that is an image extent, so you're referencing image extent byte 0 through image extent byte 1024. What the core layer will do here is translate that image extent into object-based extents, because a given RBD block device is made up of, by default, four-megabyte backing objects that the image is striped across. So if you have, let's say, a 16-megabyte image, you have four different objects (object zero, one, two and three) that each represent four megabytes of that extent.
B
So that is the pluggable layering for how I/O is handled. As I've said, we have these different pluggable layers that just take an I/O, and they can do different things to it. The queue one will enqueue it: it basically puts it in a queue, and a thread will pop it off the queue in the background, which allows us to return control immediately to the API caller while librbd continues on its way. Quality of service is throttling. Exclusive lock is making sure that you own the exclusive lock before the I/O can proceed. Refresh, and so on and so forth. These are all things we can dive into. We'll dive into the object dispatch core layer again, but these were all covered to a deeper extent in part one.
C
So are these layers like communication protocol layers, or not?
B
They're literally just different class instances. Here's the QoS one, QosImageDispatch. When the image is opened, we'll create an instance of this class. Then what happens is the I/Os get passed from layer to layer to layer, in the order that you saw in the enum. So in this case, this layer could get a read request, and that read request was described by the image dispatch spec. In this case there's a new feature to exclude particular operations from QoS, but otherwise, if we need to throttle this operation for a read request, this needs_throttle method actually does it, and it stops the further dispatch of that I/O. Eventually, when the throttle becomes ready, we can release the I/O again and let it proceed to the next layer down in the image stack.
The original version, from several years ago when I took over librbd, was the equivalent of a bunch of if-then-else statements running through the I/O flow, and then, as we added more and more features to librbd, it became an unmaintainable mess. So this gives us a way to logically segregate different sections of code. Take the example of the new Pacific feature, crypto: we now have a new image dispatch layer. The crypto layer handles encrypting and decrypting the disk. This layer can get dynamically loaded only if encryption is enabled; then it can handle decrypting reads and encrypting writes on the disk transparently. It's a layer that we can dynamically plug in, and we don't need to add little if-then-else hooks throughout librbd. When we want to add this kind of functionality, we can tightly constrain it and keep all this logic consolidated in one specific spot. Does that answer your question?
C
Yeah. Well, maybe the next question is, layer to layer, is there a common interface between a layer and the layer above or below it?
B
Yeah, so there's a templated dispatcher, this little helper. There are two versions of it: there's the image dispatcher and there's the object dispatcher, because we have image layers and we have object-based layers, depending on the type of extents. The functionality is the same. Basically, it finds a given dispatch layer, from where the I/O first enters the system or where it next needs to be handled, grabs the dispatch for that layer, and sends the op to that dispatch layer. If the dispatch layer says it has handled it, that is, it returns true, we're done. That's how we can pause the further propagation of an I/O. Otherwise, if it didn't handle it, it'll just keep going to the next layer, and the next layer, until it has hit every single layer in the stack.
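The layer-to-layer hand-off described above can be sketched roughly as follows. This is a minimal Python illustration, not librbd's C++ code; `Layer`, `dispatch`, and the three example layers are hypothetical names chosen for the sketch.

```python
from typing import Callable, List


class Layer:
    """Hypothetical dispatch layer: send() returns True if it took ownership of the I/O."""

    def __init__(self, name: str, handles: Callable[[dict], bool]):
        self.name = name
        self.handles = handles

    def send(self, io: dict) -> bool:
        io["visited"].append(self.name)
        return self.handles(io)


def dispatch(layers: List[Layer], io: dict, start: int = 0) -> int:
    """Offer the I/O to each layer in order, starting at index `start`.

    Returns the index of the layer that handled (and so paused) the I/O,
    or len(layers) if every layer passed it through.
    """
    for index in range(start, len(layers)):
        if layers[index].send(io):
            return index
    return len(layers)


layers = [
    Layer("queue", lambda io: False),
    Layer("qos", lambda io: io.get("throttled", False)),
    Layer("core", lambda io: True),
]

throttled_io = {"op": "read", "throttled": True, "visited": []}
stopped_at = dispatch(layers, throttled_io)  # the qos layer holds the I/O

# Once the throttle is ready, resume from the layer after the one that paused it.
throttled_io["throttled"] = False
finished_at = dispatch(layers, throttled_io, start=stopped_at + 1)
```

The point of the sketch is only that a layer pauses an I/O by claiming it, and that the I/O later re-enters the chain at the next layer rather than at the top.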
B
When it comes to actually sending it, we just use a visitor pattern. In this case it's an image dispatcher, so it takes the image dispatch spec and, whatever spec type it is (be it a read, a discard, a write, a write-same, a compare-and-write, a flush, whatever), it invokes the associated method on the provided dispatch layer's pointer. So that's how, based on the incoming dispatch spec, we can invoke a specific function within a given layer generically, without having to hard-code all this logic. That's the logic you see here for the image dispatcher.
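As a rough illustration of that visitor idea, the sketch below picks a layer method from the type of the request. It is written in Python for brevity, and every class and function name in it (`Read`, `Discard`, `Flush`, `LoggingLayer`, `send_visitor`) is hypothetical rather than taken from librbd.

```python
from dataclasses import dataclass


@dataclass
class Read:
    offset: int
    length: int


@dataclass
class Discard:
    offset: int
    length: int


@dataclass
class Flush:
    pass


class LoggingLayer:
    """Hypothetical layer exposing one method per request type."""

    def __init__(self):
        self.calls = []

    def read(self, req: Read) -> bool:
        self.calls.append(("read", req.offset, req.length))
        return False  # not handled; let the next layer see it

    def discard(self, req: Discard) -> bool:
        self.calls.append(("discard", req.offset, req.length))
        return False

    def flush(self, req: Flush) -> bool:
        self.calls.append(("flush",))
        return False


def send_visitor(layer, request) -> bool:
    """Select the layer method from the request's type, with no per-layer special cases."""
    method = {Read: layer.read, Discard: layer.discard, Flush: layer.flush}[type(request)]
    return method(request)


layer = LoggingLayer()
results = [send_visitor(layer, r) for r in (Read(0, 1024), Discard(4096, 512), Flush())]
```

Adding a new layer then only means implementing the per-request methods; the dispatching code itself does not change.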
B
The object dispatcher has its own version of the visitor, but it's based on the object dispatch spec, which is similar in concept to the image dispatch spec but based on object extents. An object extent is basically a byte range within a given object: a given object number, and then byte ranges within that object. Think about the standard RBD image layout with four-megabyte objects. If I tried to issue an I/O from four megabytes through, let's say, five megabytes, the image dispatch layers will process it, and eventually, when it gets down to that core layer, it will be converted into individual object dispatch specs for that request. In the example I used, four megabytes to five megabytes, the object dispatch spec would be object number one and then object extent zero bytes through one megabyte, because that's how the striping pattern works to map that.
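The arithmetic behind that example can be sketched as below. This assumes the simple default layout (fixed-size objects, no custom striping parameters), and the function name `image_to_object_extents` is made up for the illustration.

```python
from typing import List, Tuple

MiB = 1024 * 1024


def image_to_object_extents(offset: int, length: int,
                            object_size: int = 4 * MiB) -> List[Tuple[int, int, int]]:
    """Split an image extent into (object number, object offset, length) pieces.

    Assumes the simple default layout with no custom striping.
    """
    extents = []
    end = offset + length
    while offset < end:
        object_no = offset // object_size
        object_off = offset % object_size
        # Never run past the end of the current object or of the request.
        piece = min(object_size - object_off, end - offset)
        extents.append((object_no, object_off, piece))
        offset += piece
    return extents


one_mib_at_4 = image_to_object_extents(4 * MiB, 1 * MiB)   # the 4 MB to 5 MB example
whole_image = image_to_object_extents(0, 16 * MiB)         # a 16 MB image
straddling = image_to_object_extents(3 * MiB, 2 * MiB)     # crosses an object boundary
```

The first call yields object one at offset zero for one megabyte, the second touches objects zero through three, and the third shows a single image extent splitting into two object extents.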
That striping is handled in this state machine; this is the core layer. This is more or less legacy code that has been around. In the case of a generic write request, we break the request down into the individual object extents, and then for each of the object extents we create an object request. This is a generic abstract base class, but if it was derived from, let's say, a write request, the create-object-request method here just goes and creates a brand new object dispatch spec representing the object number, the object extent offset and the object extent length. We can infer the length from the length of the buffer list passed in, because we will have already snipped the buffer list from the image-extent point of view to the object point of view by this point in time.
B
But yeah, getting back to the dispatch layers, where we left off last time was in the object dispatch layers. We have the cache layer, which handles the write-through, write-back and write-around caches. We have the new crypto layer; that's actually where the meat of the LUKS1 or LUKS2 based encryption is handled in librbd. It's handled on low-level objects. We have the journal layer. This one makes sure that writes cannot proceed until the journal has been properly flushed to disk. That way, we don't end up inconsistent between what's in the journal and what's actually on the disk: we need to make sure the journal is up to date and committed to disk before we let the I/O proceed. We have this new thing from Octopus called the parent cache, which allows read requests to be cached locally for immutable parent images, for example if you had a golden image, let's say of an operating system, and then you cloned that image to a new block device. Next is the scheduler. All the scheduler does is act as a very simple disk scheduler that tries to batch up operations that are happening against the same object. So if you had a bunch of small sequential reads or a bunch of small sequential writes occurring at the same time, it'll batch those up into a single operation to the OSD, so that it doesn't overload the OSD.
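The batching idea can be illustrated with the toy function below, which groups pending operations per object and merges contiguous ranges. It is only a sketch of the concept under that assumption; it does not model the timing or the actual merging rules of librbd's scheduler, and `batch_by_object` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def batch_by_object(ops: List[Tuple[int, int, int]]) -> Dict[int, List[Tuple[int, int]]]:
    """Group (object number, offset, length) ops per object and merge adjacent ranges."""
    per_object = defaultdict(list)
    for object_no, offset, length in ops:
        per_object[object_no].append((offset, length))

    batched = {}
    for object_no, extents in per_object.items():
        merged: List[Tuple[int, int]] = []
        for offset, length in sorted(extents):
            if merged and merged[-1][0] + merged[-1][1] == offset:
                # Contiguous with the previous extent: extend it.
                merged[-1] = (merged[-1][0], merged[-1][1] + length)
            else:
                merged.append((offset, length))
        batched[object_no] = merged
    return batched


# Four small sequential reads on object 0 plus one unrelated read on object 3.
pending = [(0, 0, 4096), (0, 4096, 4096), (0, 8192, 4096), (0, 12288, 4096), (3, 0, 512)]
batched = batch_by_object(pending)
```

Here the four 4 KB reads on object 0 collapse into one 16 KB range, so a single request would go to the OSD instead of four.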
B
There are similar concepts for a given object request: we have writes, discards, write-sames, compare-and-writes and reads. The way it's broken down, we have ObjectRequest as the base class, and then we derive from that classes for the specific operations, like, in this case, a read request. We again try to document, through little pictures, how the state machine progresses. In the case of a read, we first go to the OSD and try to read from the associated object, but if the OSD tells us that the object does not exist and it's a cloned image, we then reissue the read request against the parent image to try to read that data. We'll dive deeper into that.
Then, in terms of write requests, they all derive from this AbstractObjectWriteRequest, but you have a little more complicated state machine. If you have object map enabled, we can detect that there's nothing for us to do. Let's say there's a discard and we know the object doesn't exist: we don't need to send anything to the OSDs. We can just no-op the write and continue. If we're actually writing, we might have to pre-update the object map before we write: if we're going to write data and the object map says that the object does not exist, we need to update the object map first before allowing the write to proceed. There are some more optimization paths: if the object map told us that the object doesn't exist, then we can go straight into the copy-up state machine. We'll get into copy-up later; it has to do with how we take data from the parent and copy it over into the child. Otherwise, if we know the object exists, we can just issue the write against it. Finally, the last step is the post-update object map stage. That only gets hit if you're deleting the object: when you're deleting the object, the pre-update state sets it to pending, meaning pending removal, and then the post-update state flips it to say that the object no longer exists, with a guard on it to make sure that other I/Os coming in that hit the same object won't stomp on each other and mark the object as non-existent if it got recreated in the meantime.
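One way to picture the write-side flow just described is as a function that lists the steps a request would go through. This is a simplified Python sketch under stated assumptions: only "write" and "remove" are modelled, the state names are invented for the example, and the guard on the post-update is left out.

```python
from typing import List, Optional

NONEXISTENT, EXISTS, PENDING = "nonexistent", "exists", "pending"


def write_steps(op: str, map_state: Optional[str], has_parent: bool = False) -> List[str]:
    """Return a hypothetical sequence of steps for one object write request.

    `op` is "write" or "remove"; `map_state` is None when there is no object map.
    """
    # Without an object map we must assume the object may exist.
    may_exist = map_state != NONEXISTENT
    if op == "remove" and not may_exist:
        return ["no-op"]  # nothing to remove, so no round trip to the OSD

    steps = []
    if map_state is not None:
        new_state = PENDING if op == "remove" else EXISTS
        if map_state != new_state:
            steps.append(f"pre-update object map -> {new_state}")

    if op == "write" and not may_exist and has_parent:
        steps.append("copy-up")  # known-missing object in a clone: go straight to copy-up
    else:
        steps.append(op)

    if map_state is not None and op == "remove":
        steps.append(f"post-update object map -> {NONEXISTENT}")
    return steps


remove_missing = write_steps("remove", NONEXISTENT)
first_write = write_steps("write", NONEXISTENT)
first_write_in_clone = write_steps("write", NONEXISTENT, has_parent=True)
remove_existing = write_steps("remove", EXISTS)
```

The removal case shows the two-phase update: the map entry goes to pending before the OSD operation and to nonexistent only afterwards.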
B
So yeah, let's dive into this path a little bit, starting with the read. In general, for any request that comes in, we have some helper methods at the top. There are write hints: for anything we write, we tell the OSD the max object size. It was important with FileStore. I don't think it's really that important with BlueStore anymore, at least it doesn't seem to be, but it's there for legacy reasons. Then there's the compute-parent-extents helper method. We talked about having a cloned child image that hasn't been overwritten yet, where we need to get the parent's data. This helps us take the object extents of the child object and transform that coordinate space back into the parent image, because the parent image's coordinate space might actually be different from the child image's coordinate space. As an example, say the child had a four-megabyte object size by default, and I was reading starting at the four-megabyte offset. That corresponds to object one in the child, because object zero covers bytes zero through four megabytes and object one covers four megabytes to eight megabytes. So it's object one at an offset of zero, because that's the start of the object where the image extent of four megabytes starts. But if I needed to read from the parent, and the parent had an object size of eight megabytes instead of the default of four megabytes, I would not be reading from object one. I would be reading from object zero, and my object offset would not be zero; it'd be object offset four megabytes, because in the parent, with eight-megabyte backing objects, the first object, object zero, covers the extent space of zero megabytes through eight megabytes. So, if I'm reading starting at four megabytes, I would read from object zero. This is just some generic math to do all that transformation back into the image extents of the parent.
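That coordinate transformation boils down to going from child object space to image space and then into the parent's object space. A minimal sketch, assuming the simple default layout and using made-up helper names:

```python
MiB = 1024 * 1024


def object_to_image_offset(object_no: int, object_off: int, object_size: int) -> int:
    """Convert an (object number, object offset) pair to an image byte offset."""
    return object_no * object_size + object_off


def image_offset_to_object(image_off: int, object_size: int):
    """Convert an image byte offset to an (object number, object offset) pair."""
    return image_off // object_size, image_off % object_size


# Child uses 4 MB objects: object 1, offset 0 is image offset 4 MB.
image_off = object_to_image_offset(1, 0, 4 * MiB)

# Parent uses 8 MB objects: the same image offset lands in object 0 at offset 4 MB.
parent_location = image_offset_to_object(image_off, 8 * MiB)
```

The image offset is the common coordinate: both images agree on it even though their object numbers and object offsets differ.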
So yeah, getting into the read state machine. The first entry point here is the read-object step, as the diagram showed. We have this little optimization here that says: if we're reading from the same snapshot that the image is currently set to, and we actually have an object map (the object map is not a null pointer), and the object map says that there's no way that this object exists (object_may_exist is false), then we can skip this step and go on to the next step, which is the read-from-parent part of the I/O state machine. Otherwise, we go through all the object extents that are present in that request. We might do a request that's, you know, read bytes 0 through 1K and then read 5K to 6K; we support these I/O-vector-style reads and writes internally within librbd. We kind of split it up. We have this little helper: if the read is greater than this threshold size, which I think is set to 64 kilobytes by default, then we'll tell the OSD to give us a sparse read. So the OSD won't return a bunch of zeros if there's no data there; it'll just tell us that there's no data from this extent to that extent. Otherwise we just do a standard generic read, and for a standard read, the OSD will say: well, if there's no data in that space you gave us, but the object does exist, it's going to get zero-filled for the length you asked for. Then it just sends that request to the OSD, and when the OSD comes back, it's going to call the handle method, in this case handle_read_object. This is not about whether there was data there or not: if the object itself does not exist, the OSD will return an ENOENT error to us, and then we'll go into the read-parent state. If we get any other errors, we'll bubble the I/O error up to the caller.
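The two decisions in that passage, which kind of read to issue and what to do with the OSD's return code, can be sketched as follows. The 64 KB default is taken from the speaker's "I think", the strict greater-than comparison follows the wording of the talk, and the function and state names are hypothetical.

```python
import errno

SPARSE_READ_THRESHOLD = 64 * 1024  # assumed default, per the talk


def choose_read_op(length: int, threshold: int = SPARSE_READ_THRESHOLD) -> str:
    """Large reads ask for a sparse read so holes are not shipped back as zeros."""
    return "sparse_read" if length > threshold else "read"


def next_state(result: int) -> str:
    """Map an OSD return code to the next (hypothetical) state of the read request."""
    if result == -errno.ENOENT:
        return "read_parent"  # the object itself is missing: try the parent image
    if result < 0:
        return "fail"         # any other error bubbles up to the caller
    return "finish"


small = choose_read_op(4096)
large = choose_read_op(1024 * 1024)
```

Note that ENOENT here means the whole object is absent, which is different from an existing object that simply has no data in the requested range.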
B
We told it where to put the data when it reads it. For a plain read, we gave it a buffer list and said, when you read that data, put it in here. For a sparse read, we said, put the data in this buffer list and also give us the sparse extent mapping of what data is actually in that buffer list and how it maps to the object. We'll dive deeper into how that then gets converted back up into image space, because this, again, is object extent space. Those buffer lists then need to get reassembled back up into image space, because that's what the caller uses; they always reference the I/Os in image coordinate space. So, if I was reading bytes zero through eight megabytes, that's two different requests for a four-megabyte default striping size: it's going to get broken up into an object request against object zero and an object request against object one. We send both of those requests at the same time, in parallel, to the OSDs. They're most likely on different OSDs and different PGs, and we'll get the responses back at different times. This class in itself just represents one of those requests, and the next layer up handles reassembling and concatenating those two separate buffer lists, in that example, back together into a single buffer list that will be used and expected by the client.
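A toy version of that reassembly step is sketched below. It assumes each object reply arrives as a list of (object offset, data) pieces, as a sparse read might report them, and it uses an 8-byte object size purely to keep the example readable; `assemble` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def assemble(image_offset: int, length: int, object_size: int,
             replies: Dict[int, List[Tuple[int, bytes]]]) -> bytes:
    """Place per-object reply pieces into one buffer in image coordinates.

    `replies` maps object number -> list of (object offset, data) pieces.
    Ranges with no data are left as zeros.
    """
    buffer = bytearray(length)  # zero-filled, so holes read back as zeros
    for object_no, extents in replies.items():
        for object_off, data in extents:
            start = object_no * object_size + object_off - image_offset
            buffer[start:start + len(data)] = data
    return bytes(buffer)


# Read image bytes 0..16 with 8-byte objects. Object 1 answers first; object 0
# is sparse and only has data in its first four bytes.
replies = {
    1: [(0, b"BBBBBBBB")],
    0: [(0, b"AAAA")],
}
image_data = assemble(0, 16, 8, replies)
```

Because each piece is positioned by its image offset, the order in which the OSDs respond does not matter.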
B
So yeah, in the finish here, if it was able to read any data, we're done; otherwise it'll go into the next step, which would be to try to read from the parent. We have some internal flags to basically say don't read from the parent, but assuming that's not turned on, we kick off a helper function. This is semi-new, I think with Pacific, but really all it does is take those extents, do that conversion from the cloned (child) image's object space back into the parent's image extent space, and issue a read in the exact same fashion that we saw before, through the API I/O layer, to kick off a brand new image extent read. But it kicks it off against the next image in the chain, which is the parent image. So in this case it's essentially acting like a generic librbd client, like QEMU, because it's issuing a read request in image extent space and it gets a response back in image extent space, and that's just handled here. Eventually the data will come back; if there was a parent with data, it'll get handled and fall through. You can think about it as a recursive algorithm, where it's following the chain up and causing a read on the next parent, and the next parent, and the next parent. That's how we can support clones of clones of clones. It just keeps recursively calling into each parent and issuing reads until it actually gets some data. So if it gets back to here and there's no data, we know that no parent has any data that covers that extent, and we can just handle that generically. When we reassemble the I/O, we can give a buffer of zeros back to the user and say, yep, there's no data there; it was all zeros. Or, if there are any errors, we can bubble those up.
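The recursive walk up the clone chain can be modelled with a few lines of Python. The `Image` class and `read_block` function are hypothetical stand-ins that store whole blocks in a dictionary; they only illustrate the control flow, not real librbd behaviour such as parent overlap handling.

```python
from typing import Dict, Optional


class Image:
    """Hypothetical image: a sparse map of block number -> data, plus an optional parent."""

    def __init__(self, data: Dict[int, bytes], parent: Optional["Image"] = None):
        self.data = data
        self.parent = parent


def read_block(image: Image, block_no: int, block_size: int = 4) -> bytes:
    """Read a block, falling back to each parent in turn; zeros if nobody has it."""
    data = image.data.get(block_no)
    if data is not None:
        return data
    if image.parent is None:
        return bytes(block_size)  # no parent has data: hand back zeros
    return read_block(image.parent, block_no, block_size)


golden = Image({0: b"base"})
clone = Image({1: b"mid!"}, parent=golden)
clone_of_clone = Image({}, parent=clone)

from_grandparent = read_block(clone_of_clone, 0)
from_parent = read_block(clone_of_clone, 1)
from_nobody = read_block(clone_of_clone, 2)
```

Each level treats its parent like any other image, which is why the same logic works for arbitrarily deep clone chains.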
The last step of the state machine is copy-up, which on the read path is essentially a copy-up on read. It's not enabled by default. We're going to get into the copy-up state machine next, after we go through the write state machine, so I won't dive into this too much. But basically it says: if this feature is enabled, if someone said they wanted copy-up on read and opted into this, then we'll kick off a copy-up request to copy the data from the parent down into the child image, so that we never have to read from the parent again for that given object extent.
B
The write state machine is very similar. In terms of where the state machine starts, in the send method, we kind of have the same optimization here. If we don't have an object map, then we just have to assume the object may exist, because we don't know without the object map. Otherwise, we can look it up directly in the object map to ask it if the object may exist. If it knows for sure that the object does not exist, we can potentially take some optimizations here. If we were trying to remove an object and the object map says it doesn't exist, that's the example of how we can skip a no-op on a non-existent object: we can end the state machine early without a round trip to the server. Say there's a brand new RBD image and you did an mkfs.xfs, which loves to do a giant whole-disk discard. If you didn't have object map enabled, you would literally have to send discards for every extent of the image, to which the OSD would say: yeah, there's no data, there's nothing for me to remove. But if you had object map enabled, and the object map is saying there's no data here, nothing exists, that mkfs call is basically no-op'ed away into a very fast operation instead of a very long operation on big images.
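To put a number on that mkfs example, the sketch below counts how many per-object discards would actually be sent for a whole-image discard, with and without an object map. The function name and the 1 GiB image size are assumptions made for the illustration.

```python
from typing import List, Set

MiB = 1024 * 1024
GiB = 1024 * MiB


def discards_sent(image_size: int, object_size: int,
                  existing_objects: Set[int], use_object_map: bool) -> List[int]:
    """Return the object numbers for which a discard would go to the OSDs."""
    object_count = (image_size + object_size - 1) // object_size
    sent = []
    for object_no in range(object_count):
        if use_object_map and object_no not in existing_objects:
            continue  # object map says it cannot exist: skip the round trip
        sent.append(object_no)
    return sent


# A brand new, empty 1 GiB image with 4 MiB objects.
without_map = discards_sent(1 * GiB, 4 * MiB, set(), use_object_map=False)
with_map = discards_sent(1 * GiB, 4 * MiB, set(), use_object_map=True)
```

Without the object map every one of the 256 objects costs a round trip; with it, none do, and the saving grows linearly with image size.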
A
Hey Jason, while you're in this layer, can you also at some point cover how a delete request happens in the flow as well?
B
Yeah, so delete is a write request, one of these abstract object write requests. It covers writes and discards, where a discard is zeroing an extent within an object, truncating the end of an object, or removing an object entirely.
B
So, as shown on that little ASCII diagram, now we're here. The next step is to update the object map: if we're actually going to write some data, we want the object map to say that the object exists, or, if we're going to remove an object that exists, we need to first set it to the pending state.
B
Otherwise, if it's just a plain vanilla write operation, all we do is tell the object map: hey, asynchronously go update the object map on disk for this given object number with my new state. The new state is determined by the derived class; these are virtual functions. Once that completes, we advance into the handle-pre-write-object-map-update state, which doesn't really do much except check for errors so it can bubble the error up; otherwise it proceeds to the actual write phase of the write request. And again, we're talking generically here: the write request is any write operation, be it discards, writes, write-sames, zeroing, truncation, removal, compare-and-write, yada yada. The first step here is if we potentially have copy-up enabled, and for the definition of having copy-up enabled there are two cases here.
B
We have this special assert here that the OSD will evaluate when the I/O gets to it. It basically says: hey, we are asserting that the current object that's on disk has a particular snapshot sequence number. If it doesn't, it errors out, and we can detect that error so that we know that we're doing a stale write operation or something like that, to avoid stomping on someone else's changes. This is new logic from when we added the live migration feature, in Mimic, or, no, I can't remember when it came. Otherwise, the normal, original path was just this assert that the object exists. This is only for the case when you have a cloned child image, because for the very first write to a child image, we need to make sure that all the data from that overlapping object extent is copied from the parent image down to the child image, with your write then applied on top of it. So if the object doesn't already exist, and that's what this assert-exists check handles, it's going to fail out and not actually perform the write operation, and it's going to allow us to go sideways in the state machine and do a copy-up request, which is a different class.
B
So this is where the derived classes add their ops: a delete would put in a delete, a write would put in a write, and a compare-and-write would put in the compare operation and then the write operation. Then it sends the request to the OSD for the current context; we convert that object number into the corresponding rbd_data.<image id>.<object number> object name. When we get the callback from the OSD, asynchronously, we'll advance the state machine to the handle_write_object state.
B
So what could happen here? The first error condition we could get would be ENOENT, and that really can only happen because the assert-exists failed. The assert is that the object exists; the object doesn't exist, so it fails out and atomically aborts the write operation with error code ENOENT. That's how we know to go kick off the copy-up state machine. If we got an ERANGE and we were guarding that write operation because we're doing a live migration, that's how we know, again, that we've got to go through this special path to do a copy-up, and the copy-up state machine also handles this live migration path. Otherwise, there's this race condition case where you were flattening an image while writing to the image, so the parent might actually disappear while your I/O is in flight. This is just something that says: let me go restart the state machine back up at the top. The illegal-sequence error (EILSEQ) is from the compare-and-write operation.
B
The one thing I skipped over here is the copy-up, and that's going to be the next state machine I'll dive into. The first thing it does is look to see if we already have a copy-up in flight for a given object number, because I could have multiple I/Os that all land against the same object and might kick off multiple copy-up state machines concurrently; we can coalesce those writes down into the same copy-up. If we couldn't find one, we'll create a brand new copy-up state machine, keep track of our new copy-up request so that any other request can find it, and then start that state machine off. Otherwise, if we found an exact match for an in-flight copy-up for this given object number, we can just append ourselves to that request. What's going to happen is that when that copy-up state machine runs and finally gets to the point where it's going to add in all the write operations, it'll go in the order in which it saw these writes appended and append those write operations to a single request to the OSD.
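The coalescing behaviour just described can be sketched with a small tracker keyed by object number. `CopyupTracker` and its methods are hypothetical names; the sketch ignores locking and the actual parent read, and only shows how later writes attach to an in-flight copy-up.

```python
from typing import Dict, List


class CopyupTracker:
    """Hypothetical registry of in-flight copy-ups, keyed by object number."""

    def __init__(self):
        self.in_flight: Dict[int, List[str]] = {}
        self.started: List[int] = []  # object numbers that got a new copy-up

    def submit(self, object_no: int, write_op: str) -> None:
        pending = self.in_flight.get(object_no)
        if pending is None:
            # No copy-up running for this object: start one and register it.
            self.in_flight[object_no] = [write_op]
            self.started.append(object_no)
        else:
            # A copy-up is already in flight: append ourselves to it.
            pending.append(write_op)

    def complete(self, object_no: int) -> List[str]:
        """Build the single combined request, with writes in arrival order."""
        return ["copyup"] + self.in_flight.pop(object_no)


tracker = CopyupTracker()
tracker.submit(7, "write A")
tracker.submit(7, "write B")   # lands on the same object while the copy-up is pending
tracker.submit(9, "write C")
combined = tracker.complete(7)
```

Only two copy-ups are started for three writes, and the request for object 7 carries both writes behind the copied-up parent data.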
B
So yeah, as we get down here, we can start seeing all the derived classes. So an object write request, for its write hint versus write ops: for the write ops, if your write request was going to hit the entire extent of the object, it just calls a slightly different OSD operation, which is write_full. Otherwise, it just calls a write with an offset and the data you wanted to send. Then there's discard.
B
This is where it starts getting more involved: it's got different actions. So if we know that we were discarding and it was going to result in the entire object getting discarded, we can actually remove the object, except for one case: if we're a cloned child image, when we do a discard we still want to hide the parent image. So what we actually do is create an empty object and truncate it down to essentially size zero.
B
In this case, the truncate case is if we were discarding and it just touched the trailing end of an object. So if you had a four-megabyte object and your discard hit the object extent from two megabytes to four megabytes, it would just truncate that object down to two megabytes in size. And the zero case is just whenever you have a discard that's going to hit somewhere in the middle, not at the end, and it doesn't take up the entire object itself.
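The write and discard decisions described above can be summarized in a short sketch. The enum names and helper functions here are hypothetical labels chosen for illustration, not the identifiers librbd uses, and the helpers assume a non-empty extent that lies entirely within one object.

```cpp
#include <cstdint>

// Hypothetical labels for the choices described in the talk.
enum class WriteOp { WriteFull, Write };
enum class DiscardAction { Remove, RemoveTruncate, Truncate, Zero };

// Assumes length > 0 and offset + length <= object_size.
inline WriteOp choose_write_op(uint64_t offset, uint64_t length,
                               uint64_t object_size) {
  // A write covering the whole object can replace it outright.
  return (offset == 0 && length == object_size) ? WriteOp::WriteFull
                                                : WriteOp::Write;
}

// Assumes length > 0 and offset + length <= object_size.
inline DiscardAction choose_discard_action(uint64_t offset, uint64_t length,
                                           uint64_t object_size,
                                           bool has_parent) {
  if (offset == 0 && length == object_size) {
    // Whole object: remove it, unless a parent image would then show
    // through, in which case keep an empty object to hide the parent.
    return has_parent ? DiscardAction::RemoveTruncate : DiscardAction::Remove;
  }
  if (offset + length == object_size) {
    // Trailing end of the object: shrink it down to `offset` bytes.
    return DiscardAction::Truncate;
  }
  // Somewhere in the middle: zero the range in place.
  return DiscardAction::Zero;
}
```

With a 4 MiB object, a discard of the range from 2 MiB to 4 MiB maps to `Truncate`, matching the example in the talk.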
B
Then write same: the write same operation just copies the data down. The compare-and-write invokes two operations. It does this compare extent, which is kind of like a guard. That's the guard operation to say, "Make sure the data at this object matches this buffer list," and if that passes, then atomically do the actual write operation, be it a write_full if it's a full object or a write. And it's got this special filter
B
write result callback, just because we need to store the mismatch offset of where the data first mismatched in the I/O, and the OSD kind of gives us that data through the return code. So we just have to extract it out to get the object offset, so we can return it back to the user, and convert that object offset from object extent space to image extent space.
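A rough sketch of that extraction step follows. Both the encoding and the mapping are assumptions made for illustration: the sketch supposes a mismatch arrives as a return code at or below `-kMaxErrno` with the object offset folded in, and it maps object space to image space with the simplest possible layout (no striping across objects). `kMaxErrno`, `CompareResult`, and `filter_compare_result` are hypothetical names.

```cpp
#include <cerrno>
#include <cstdint>

// Assumption for this sketch: return codes at or below -kMaxErrno
// carry a mismatch offset instead of an ordinary errno value.
constexpr int kMaxErrno = 4095;

struct CompareResult {
  int r;                           // return code handed back to the caller
  bool mismatched;                 // true if the compare step failed
  uint64_t mismatch_image_offset;  // valid only when mismatched is true
};

// Assumes a simple layout where object N covers image bytes
// [N * object_size, (N + 1) * object_size).
inline CompareResult filter_compare_result(int r, uint64_t object_no,
                                           uint64_t object_size) {
  if (r <= -kMaxErrno) {
    // Recover the offset within the object from the return code.
    uint64_t object_off = static_cast<uint64_t>(-kMaxErrno - r);
    // Translate from object extent space to image extent space.
    uint64_t image_off = object_no * object_size + object_off;
    return CompareResult{-EILSEQ, true, image_off};
  }
  // Success and ordinary errors pass through unchanged.
  return CompareResult{r, false, 0};
}
```

Ordinary results such as `0` or `-ENOENT` pass through untouched; only the special range is rewritten to the illegal sequence error plus an image offset.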
B
In general, the little ASCII chart here shows how the state machine functions. When it first starts up, consider the standard case: a cloned child image having to copy data from its parent image down to the child image so that we can actually overwrite some data.
B
The first step is that it reads the corresponding data from the parent image and then proceeds. If we're doing a live migration, we take this alternate path, which is a deep copy path. That's a different state machine that we probably won't have time to get into, because that one's a real doozy. But what the deep copy does is it actually handles going through and reading snapshot deltas from a parent.
B
The deep copy will actually just take the deltas for each of those snapshots from each of the parents and then reapply just those deltas to the child image. And that's how live migration handles the process of getting a fully deep-copied, yet still sparsely allocated, version of your source image to your destination image in a consistent fashion that doesn't require exclusive locks or anything like that. But yeah, once we do the read from parent,
B
We can update the object maps, and it's "maps" plural because we might have one or more snapshots on our child image, or destination image in the case of live migration. So we would need to update all object maps from the beginning of time to the current HEAD revision, and then we actually perform the copy-up operation.
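The ordering described above can be written down as a tiny transition table. This is only a sketch of the sequence as narrated in the talk; the enum and function names are hypothetical, and error handling and the "skip if not needed" shortcuts are left out.

```cpp
#include <vector>

// Hypothetical state names following the order described in the talk.
enum class CopyupState {
  ReadFromParent,    // standard clone: read the data from the parent image
  DeepCopy,          // live migration: alternate path via snapshot deltas
  UpdateObjectMaps,  // one object map per snapshot, oldest through HEAD
  Copyup,            // send the copy-up (plus appended writes) to the OSD
  Done
};

inline CopyupState next_state(CopyupState current) {
  switch (current) {
  case CopyupState::ReadFromParent:
  case CopyupState::DeepCopy:
    return CopyupState::UpdateObjectMaps;
  case CopyupState::UpdateObjectMaps:
    return CopyupState::Copyup;
  case CopyupState::Copyup:
  case CopyupState::Done:
    return CopyupState::Done;
  }
  return CopyupState::Done;
}

// Lists the states visited, in order, for either entry path.
inline std::vector<CopyupState> plan_copyup(bool live_migration) {
  std::vector<CopyupState> steps;
  CopyupState state = live_migration ? CopyupState::DeepCopy
                                     : CopyupState::ReadFromParent;
  while (state != CopyupState::Done) {
    steps.push_back(state);
    state = next_state(state);
  }
  return steps;
}
```

Both entry paths converge on the object map update before the copy-up itself is issued.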
B
So, state machines: maybe you can see a pattern here. The entry point is the send method. The first step in the state machine, though, is just reading from the parent.
B
There's a little weird case here: if there's a potential race condition between kicking off a copy-up request and the parent image getting flattened away and removed from the child image, we can just return an error saying, "Hey, the parent image doesn't exist anymore," which would restart the state machine back in the object request state machine. If it's a deep copy, that is, live migration, it goes to that alternate path from the ASCII diagram. Like I said, we probably won't have time to cover that here.
B
Otherwise it does the normal, quote-unquote generic, librbd copy-up operation. What happens is, as I mentioned before, to read from the parent it first had to get the parent image extents, the image extents that correspond to the child's object that was being copied up, from the object request state machine. So as part of the constructor for the copy-up request, it's passed the image extents that this current object represents. So what happens
B
is that the copy-up state machine creates a brand new ImageDispatchSpec read request against not the image itself, but the parent image. So this parent is basically a pointer to another image context. That's how you can start getting the chain of a parent of a parent, up the layers.
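The chain of parent pointers mentioned above can be pictured with a small sketch. `ImageSketch` and `resolve_layer` are hypothetical names, not librbd types, and the sketch makes simplifying assumptions: every layer uses the same object numbering, and details such as parent overlap and snapshots are ignored.

```cpp
#include <cstdint>
#include <set>
#include <string>

// Hypothetical stand-in for an image context with an optional parent.
struct ImageSketch {
  std::string name;
  std::set<uint64_t> objects;           // object numbers present in this layer
  const ImageSketch* parent = nullptr;  // nullptr for a base image
};

// Walks up the parent chain and returns the first layer that holds the
// object, or nullptr if no layer in the chain has it.
inline const ImageSketch* resolve_layer(const ImageSketch* image,
                                        uint64_t object_no) {
  for (const ImageSketch* layer = image; layer != nullptr;
       layer = layer->parent) {
    if (layer->objects.count(object_no) != 0) {
      return layer;
    }
  }
  return nullptr;
}
```

A child cloned from a clone simply has a parent whose own `parent` pointer is set, so the same loop covers any depth of layering.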
B
We tell it that we're going to start dispatching at the start of the internal layer, so that bypasses those QoS layers and bypasses the queue layer, the things that are normally only meant for the end-user API. We give it the completion function, which is what's going to get called when the data is available.
B
Once we get the data back from the parent, assuming it's not a failure of some I/O that we need to bubble up, there's this new thing related to crypto: if we have a LUKS header on the image, we just need a generic way to basically map the data and slide the data down or up to cover up the LUKS header.
B
Otherwise, now that we've gotten the data from the parent, if you remember that copy-up request list, we basically take ourselves off that request list, because now we're getting to the point where we're going to actually send the write requests, and we don't want any incoming I/Os to try to append themselves to us while we've already sent the data to the OSD.
B
The next step is we prepare the copy-up data, and really all that does is loop through all the object request instances that were tagged and appended to this copy-up request and say, "Give me your data if you're a write request," or "Give me your write ops," and things like that. There are some optimizations here to know if everything we're writing is going to be zeroed.
B
A
B
Yeah, there's never enough time. I was trying to get through the copy-up state machine, but that's in general the object flow. The only thing that was left was when we were talking about reassembly of data, and maybe this is just an exercise for the reader.
B
But when we talked about the read results, we can basically use this helper to reassemble the data back into the form that the caller was expecting it in, again in a generic fashion, because we have different formats that we want the data to end up in. We have different source formats too, so the data could be sparse or it could not be sparse. So this is kind of a generic thing that covers all those cases,
B
you know, for reading image extents versus object extents, and so on and so forth. So yeah, in general, that was the I/O path and how it bubbles through all the layers, gets to the OSDs, and comes back. Any high-level questions in the last five minutes or so?
A
I'm looking at chat right now. We don't have any particular questions listed; you've already covered some of them, or rather, you covered all of them. So let's see: if anybody that is on BlueJeans wants to speak up, now would be a good time. We've got about five minutes left. Otherwise I'll be watching chat.
B
Okay. So in general, the important thing to know is that we try to let you follow the state of the world through at least pictures, to try to make it easy to follow the control flow if you're just looking at this from the outside. And just know that basically everything in librbd is an asynchronous state machine, where we kick off an asynchronous request via the librados API and eventually we're going to get a callback to our handler function
B
for that request, and that's how we advance the state machine. I think that's in general how you could approach librbd. Every state machine is usually in its own file; at least everything new is in its own file. That's just to try to make things easier to understand, given the potential for complications, and to make the flow easier to follow.
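The general pattern described above, where each step kicks off an asynchronous request and a handler callback advances the state machine, might look roughly like this. It is a self-contained sketch rather than librbd code: `FakeAsyncBackend` stands in for an asynchronous API such as librados, `TwoStepRequest` is a made-up state machine, and the per-step results passed to the constructor exist only to simulate success or failure.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical async backend: completions are queued and only run
// later, when run_all() is called.
class FakeAsyncBackend {
public:
  void submit(int result, std::function<void(int)> on_complete) {
    m_pending.push([result, on_complete]() { on_complete(result); });
  }
  void run_all() {
    while (!m_pending.empty()) {
      std::function<void()> fn = std::move(m_pending.front());
      m_pending.pop();
      fn();  // may queue further work, which this loop also drains
    }
  }
private:
  std::queue<std::function<void()>> m_pending;
};

// Hypothetical two-step state machine: send() is the entry point and
// every send_x() has a matching handle_x() callback.
class TwoStepRequest {
public:
  TwoStepRequest(FakeAsyncBackend& backend, int first_result,
                 int second_result, std::function<void(int)> on_finish)
    : m_backend(backend), m_first_result(first_result),
      m_second_result(second_result), m_on_finish(std::move(on_finish)) {}

  void send() { send_first(); }
  const std::vector<std::string>& trace() const { return m_trace; }

private:
  void send_first() {
    m_trace.push_back("send_first");
    m_backend.submit(m_first_result, [this](int r) { handle_first(r); });
  }
  void handle_first(int r) {
    m_trace.push_back("handle_first");
    if (r < 0) {
      finish(r);  // bubble the error up and stop
      return;
    }
    send_second();
  }
  void send_second() {
    m_trace.push_back("send_second");
    m_backend.submit(m_second_result, [this](int r) { handle_second(r); });
  }
  void handle_second(int r) {
    m_trace.push_back("handle_second");
    finish(r);
  }
  void finish(int r) { m_on_finish(r); }

  FakeAsyncBackend& m_backend;
  int m_first_result;
  int m_second_result;
  std::function<void(int)> m_on_finish;
  std::vector<std::string> m_trace;
};
```

Calling `send()` only queues the first step; nothing else happens until the backend delivers the completion, at which point `handle_first` decides whether to finish with an error or move on to `send_second`.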
B
And, like I said, 99% of it is in the librbd subdirectory, and that's what your QEMUs of the world are going to attach to and utilize. Even if we look at other functions, we have things in the tools directory. We have the rbd CLIs and the tools for RBD mirroring. That's a subject in itself, but yeah.
B
Obviously the rbd-mirror daemon is there, the RBD network block device daemon has got its own directory, and so on and so forth. And this is new as of Pacific: this is the new Windows NBD daemon from SUSE and Cloudbase. But yeah, we try to keep it semi-organized where everything goes, to make it hopefully easier to understand. But if there are questions, comments, or concerns, there's the mailing list, there's IRC; feel free to reach out.
A
Okay, well, thank you, Jason, for taking the time and covering part two of librbd. It looks like we still have even more to cover, but I, of course, don't want to take up more of your time while we're working on the Pacific release, because hopefully we're looking at early March, right?
A
That's always the goal. So thanks, everybody, for joining us live, and of course this will be recorded and put up on the archive and on the Ceph YouTube channel. So thank you, Jason, again for taking the time, and we'll see everybody on the next
code.