From YouTube: Ceph Code Walkthrough: LibRBD I/O Flow Pt. 1 2020-10-27
Description
Apologies for the screen share quality in this episode.
https://tracker.ceph.com/projects/ceph/wiki/Code_Walkthroughs
Hello, everyone. My name is Jason Dillaman, and I am the current tech lead for the RADOS Block Device portion of Ceph. That's the goal of this talk. I guess, to start, I am going to share my screen.
Yep — all right, that's a plus-one. All right, so I'll count that as "it's working." As I get going, since I won't have the chat window visible, if anyone has any comments or questions, I'll stop from time to time and ask if anyone has any comments or questions that I can try to answer inline. Otherwise, if I forget — and I might — feel free to speak up and chime in, so we don't get too far ahead.
Let's say it's a one-terabyte block image: you're going to be able to address all the bytes between byte zero and byte one terabyte. But the way that is internally represented and stored within Ceph is that RBD — whether via krbd, using the kernel driver, or via librbd, the user-space library — breaks those requests down into much smaller, much more manageable backing objects in the Ceph storage cluster.
All of this is documented on docs.ceph.com: if you go to the architecture section, there's a link on how data striping works. Regardless of the size of the image, RBD only talks to backing Ceph objects of a fixed size; the default fixed size is four megabytes. So if you just do, from the rbd CLI, an "rbd create --size 1T" plus an image name, to create a one-terabyte image —
It's going to use the default four-megabyte backing object size when you first start writing data to that image. In RBD, everything is thin-provisioned, so there's actually no data — besides a little bit of metadata describing your image — when you first create the image. It's only as you start writing data to the image that backing data actually starts getting written to the OSDs.
The default case is that the stripe unit equals the object size. So if the object size is four megabytes, the stripe unit is four megabytes, and then the stripe count is one. In that case, bytes zero through four megabytes of the image go into object zero, bytes four megabytes through eight megabytes into object one, and so on and so forth.
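For the default layout just described (stripe unit equals object size, stripe count of one), the offset-to-object arithmetic can be sketched as follows. This is an illustration, not the actual librbd code:

```python
# Sketch of the default RBD striping math: stripe unit == object size,
# stripe count == 1, so the mapping is a plain divide-and-remainder.
OBJECT_SIZE = 4 * 1024 * 1024  # default 4 MiB backing objects

def map_offset(image_offset):
    """Return (object_index, offset_within_object) for an image byte offset."""
    return image_offset // OBJECT_SIZE, image_offset % OBJECT_SIZE

# Bytes [0, 4 MiB) land in object 0, [4 MiB, 8 MiB) in object 1, and so on.
```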
If we look at this rbd_types.h file for the new — and I say "new" loosely, because it's been the default for a while now — RBD image format 2: the way those backing data objects are represented in Ceph is that you have these objects prefixed with "rbd_data.", followed by the unique image ID, which is generated when the image is created, and then a sequence number — like an index into the image.
So again, for the simple case of a four-megabyte object size and a stripe count of one: you can just take that index and multiply it by the object size — four megabytes — and that's how you know which object to go read from and write to when it comes to getting your data.
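Putting the naming scheme and the index math together, a hypothetical helper might look like this. The "rbd_data." prefix and the image-ID component are as described above; the exact zero-padded hexadecimal width used for the index is an assumption for illustration:

```python
def data_object_name(image_id, image_offset, object_size=4 * 1024 * 1024):
    """Name of the backing object covering a given image byte offset."""
    index = image_offset // object_size
    # librbd formats the object index as zero-padded hexadecimal; the
    # 16-digit width shown here is an assumption for illustration.
    return "rbd_data.%s.%016x" % (image_id, index)
```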
So, as I mentioned, what this talk is about is librbd. This is a user-space library that different programs can compile and link against to utilize and access RBD images. I think probably one of the most well-known integrations is going to be QEMU. Here I have the QEMU source code, and in the block subdirectory there's this rbd.c.

This is the driver for how QEMU interacts with librbd. QEMU has its own internal hooks for how to do reads and writes and flushes and all that; all QEMU has to do is swap its internal API calls into librbd API calls. It just so happens that it always uses the same helper function, which is this rbd_start_aio.
It determines if it's a way-old version of librbd, in which case it needs to use a bounce buffer: if so, it has to copy all the I/O — because it might be an iovec — into a bounce buffer before issuing the read or write. But if it's a new enough version, it can just pass that data right to librbd.

So in this case you have librbd calls for AIO write-vector, read-vector, flush, discard, and so on and so forth.
So, at a high level, for somebody that's trying to use RBD, that's how they translate their I/O requests into librbd.

As for those APIs: we provide them in both C/C++ and Python bindings. The Python bindings are actually just a thin wrapper around the actual C API, so it's not a native Python binding.
When it comes to the C API bindings — that's what QEMU uses, because that's what QEMU is built on — if I search for one of these calls, we have both synchronous blocking and asynchronous variants.

The handle is something you get when you invoke the rbd_open call: you give it a RADOS I/O context — which basically is a connection to the cluster plus a reference to a specific pool in the cluster, let's say the rbd pool — you give it the image name, and then it'll do all its magic on the inside and populate this rbd_image_t handle, such that all the remainder of the image-specific API calls are good to go. On the C++ side it's similar, except everything is kind of wrapped in objects.
So in this case we have an Image object that has various methods on it — you know, reading and writing — and the AIO equivalents of all those methods as well.

But librbd specifically is located in ceph/src/librbd, and it's pretty much self-contained from there. We try to break down all the various internal sub-components of what librbd can do into different classes and namespaces — namespaces get their own subdirectory underneath the main librbd subdirectory — but in terms of where the API methods first hit, it's in this file called librbd.cc.
So this is just a giant translation file that keeps the ABI stable for the API, so that no matter what version of the librbd shared library you're running — dynamically linked — against, hopefully (that's the goal) it doesn't matter what you compiled against. This librbd.cc file is what's responsible for maintaining that API's binary interface. And what it will eventually do — if I go and find one of the API methods, like read, like image read...
In the C world, we're given just straight-up buffers — pointers to locations in memory — and a length, so we have to kind of translate around some of those things. Internally we natively use the C++ bufferlists, so we wrap that buffer and length in memory with little helper functions, to represent where we're going to store the read result. In the case of a write, we actually have to copy: we have to copy the memory from the C buffer into our internal bufferlist implementation, which is what you see actually happening here — it's taking the raw-pointer buffer you're providing and creating a bufferlist that points to that data for use internally.
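The asymmetry described here — a read result can target a view of the caller's memory, while a write must be copied into storage the library owns before the asynchronous operation completes — can be illustrated with a small Python analogy (memoryview standing in for a zero-copy bufferlist; this is not librbd's actual implementation):

```python
def wrap_for_read(caller_buffer):
    # Reads: no copy needed -- build a view that points at the caller's
    # memory, so the read result lands straight in their buffer.
    return memoryview(caller_buffer)

def copy_for_write(caller_buffer):
    # Writes: snapshot the caller's bytes into library-owned storage, so
    # the data stays valid even if the caller reuses their buffer before
    # the asynchronous write completes.
    return bytes(caller_buffer)

buf = bytearray(b"payload")
view = wrap_for_read(buf)
owned = copy_for_write(buf)
buf[0:3] = b"XXX"  # caller reuses the buffer...
# ...the read view sees the change; the owned write copy does not.
```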
So this is just the API translation layer, but the next layer it gets into is where we actually start implementing how these I/Os are going to get broken down and sent where they need to go. The historical trend of librbd is that all the internal, non-ABI-locked functions used to be in this file called internal.cc, which grew to a giant file — you know, thousands and thousands and thousands of lines — which was hard to maintain.
...which is where we then handle all the internals of what we need to do. The first set of I/O methods up here are all the synchronous versions — blocking reads, blocking writes, blocking discard calls — but what we kind of do behind the scenes is just translate all those synchronous calls into asynchronous calls, and then wait for the asynchronous completion to complete before we move on. So there's really not much to any of these.
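The sync-wraps-async pattern described here can be sketched like this — a toy model of the idea, not librbd's actual classes:

```python
import threading

class Completion:
    """Minimal stand-in for an AIO completion object."""
    def __init__(self):
        self._done = threading.Event()
        self.result = None

    def complete(self, result):
        self.result = result
        self._done.set()

    def wait(self):
        self._done.wait()
        return self.result

def aio_read(offset, length, completion):
    # Pretend the backend finishes the I/O on another thread.
    threading.Thread(target=lambda: completion.complete(b"\x00" * length)).start()

def read(offset, length):
    # The synchronous call is just the asynchronous call plus a wait.
    c = Completion()
    aio_read(offset, length, c)
    return c.wait()
```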
All of those start down here with the aio_ prefix. So, yeah, when we get a request in, it comes with a completion callback: this AioCompletion object holds enough state information that, when the I/O has completed, we have a pointer back to a function the user provided so that we can call it back — or, if the user didn't provide a callback function, because they're going to do something like polling, we can at least mark the completion as complete.
So when the user that's using the librbd API next checks the AioCompletion object, it's marked as complete and they know it's ready for use. The calls like aio read and aio write have an offset and a length. This offset is in image space — think about it like an LBA address or something like that on a hard disk.

It's representing the absolute position of the byte within the RBD image. So this is before we start translating it down into the internal object-based, object-extent-style I/Os that actually get sent to the Ceph cluster; at the API-level layer, at this point, everything is image-extent-based I/O.
We have some boilerplate code that's been in here — I don't think it's actually ever been used besides by some testing groups — where there was this goal to be able to trace all I/Os flowing through the system, from the front end of the librbd API all the way to the OSDs, to the replication across the OSDs and back, so you could get a complete picture of how a single I/O moved through the system.

So I don't think anyone ever turns this on, but that's some boilerplate code for that legacy tracing functionality. First step: we come in and kind of initialize the completion — we basically say, hey, we started the I/O at this time — and that's how we can start tracking latency statistics later on, once the I/O completes.
There's an alternative: as I mentioned, you can provide the AioCompletion a callback function, or you can do a polling loop. This is for if you have marked your image saying you're going to use an event socket to get notified on, instead of having a callback function.
That just represents that I/O request, and this is what helps keep track of our I/O state as we move through the different dispatching layers — I'll get into that shortly. You can just think of this ImageDispatchSpec as having just enough data to describe a given I/O — in this case a read I/O — but if you go through it, it's the same pattern over and over again: here's an ImageDispatchSpec that describes a write, or a discard, or a write-same operation.
It handles cases where the discard alignments don't properly line up with Ceph's expectations and alignments.
A compare-and-write operation just says: you give it a buffer to say, "I expect the data to look like this on the disk"; if it does, then you can overwrite it with the data I give you; if it doesn't, throw an error and give me the offset at which the data mismatches. And then there's a flush operation, which just ensures that any writes you had issued have actually been persistently written to the backing OSDs before completing the flush.
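The compare-and-write semantics just described can be modeled in a few lines — a toy in-memory version, purely to illustrate the contract (compare first, write only on match, report the first mismatch offset otherwise):

```python
def compare_and_write(disk, offset, expected, data):
    """Toy model of compare-and-write: if the bytes at `offset` match
    `expected`, overwrite them with `data` and return None; otherwise
    return the absolute offset of the first mismatching byte."""
    current = bytes(disk[offset:offset + len(expected)])
    for i, (have, want) in enumerate(zip(current, expected)):
        if have != want:
            return offset + i  # report where the mismatch starts
    disk[offset:offset + len(data)] = data
    return None
```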
So now we've taken an I/O from the front API layer and translated it to the internal API layer, which now starts the process of pushing the data through our internal I/O dispatching engine.
So the first thing we saw was that basically every single I/O in the system gets translated into this ImageDispatchSpec, which just describes the I/O: it has enough data to store whether it's going to be a read, a discard, a write, a write-same, or a compare-and-write, and it stores that in effectively a giant union, so that the data structure only occupies the size of the largest substructure.
...what the offsets and lengths are — and you can have more than one — internal tracing information, and so on and so forth. The other piece of data it stores is the image dispatch layer. As we'll talk about next: when you send and issue this request, the way current librbd works is that, instead of hard-coding a bunch of if-then-else statements throughout...
...throughout the code for handling how we interact with different plugins, we now have a way where we can dynamically and programmatically take an I/O and iterate it through different hooks, letting each hook manipulate the I/O as it wants to. And those hooks are defined here as an enum — I'm in the io subdirectory, the io namespace, and there's the types header — and here's just an enum that describes the various dispatch layers we have for manipulating and doing things with incoming image-based I/O.
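The layered-dispatch idea can be sketched as an ordered enum plus a loop that offers the I/O to each hook in turn. The layer names below are paraphrased from the talk, not copied from Ceph's actual types header, and the dispatch loop is a simplification:

```python
from enum import IntEnum

class ImageDispatchLayer(IntEnum):
    # Ordered hooks an image I/O passes through (paraphrased from the talk).
    QUEUE = 0
    QOS = 1
    EXCLUSIVE_LOCK = 2
    REFRESH = 3
    MIGRATION = 4
    JOURNAL = 5
    WRITE_BLOCK = 6
    WRITEBACK_CACHE = 7
    CORE = 8

def dispatch(io, hooks):
    """Walk the I/O through each registered layer in order. A hook may
    claim the I/O by returning True; return the layer that claimed it."""
    for layer in sorted(hooks):
        if hooks[layer](io):
            return layer  # handled at this layer
    return None
```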
So the first real layer we have is this queuing layer. All it does is take your I/O, put it on a work queue, and then basically return control back to your calling application. Because, when you think about it, if I'm QEMU calling into my API function, it's not until the I/O hits this queuing layer that control returns back from — you know — let's say that aio write call from QEMU. So all that layer does is throw the I/O on a work queue for another thread — a librbd thread — to pick up.
We have a quality-of-service layer that handles throttling. So any time you define throttling parameters — say, a given image is only allowed to utilize 100 IOPS — that's handled by the QoS layer. The exclusive-lock layer is just a hook that, any time it sees a write operation come in, makes sure that the exclusive lock is actually acquired for the image; and if it's not acquired for the image, it attempts to acquire the exclusive lock.
...while QEMU is running. So I could have an rbd CLI process on node A manipulate an image, but that image is currently being used on node B by a QEMU process. We need some way to instruct that RBD client on node B: hey, some data has changed about the image; you have to go refresh the metadata you know about — you know, a new snapshot, or that the image has been expanded, and so forth.
So the refresh layer is responsible for detecting those changes and then issuing an asynchronous call to go refresh the image in the background, and it pauses the I/O while that refresh is occurring. The next layer is a new layer for the Pacific release: the image migration layer. This is in support of instant image restoration from a read-only source. So I could have an RBD image...
...that's on an S3 endpoint, and I could use the rbd CLI to define a new image and say: hey, the parent of this image is actually on this remote HTTP/S3-protocol endpoint; whenever this clone effectively doesn't have the data, go get the data from that external endpoint. So that migration layer is what actually takes care of translating I/Os into whatever format the data might be stored in on that endpoint — because, realistically, it's probably not going to be stored in the native Ceph format.
The journaling layer — this is for the RBD journaling feature. There's a write-blocking layer that's just used for internal metadata: whenever we, let's say, create a snapshot or something like that, we need to basically pause I/O while that snapshot is being created; we can use that write-block layer to basically block all writes from occurring while we're doing this internal bookkeeping, and then we can resume all writes. The writeback-cache layer — this is related to Intel's work on a persistent writeback cache.
That would be a local cache that writes to an SSD or Optane device on your local node; the goal is to hopefully reduce tail latencies on writes. And then, finally, the next layer is the core layer: if the I/O gets all the way to the core layer, that's actually what will send the I/Os on the next step, to get them ready to go to the Ceph OSD cluster.
So, you know, here's the queue image dispatch, here's the quality-of-service image dispatch layer, here's the write blocker — they're all pretty easy to find and, hopefully, well named. But the one I want to focus on next is the core one, the core layer, and that's what starts the process of doing that striping operation — to determine: hey, I'm talking about the image extent from four megabytes through eight megabytes; that means I'm actually going to send this I/O to object one in the backing cluster, using object extent zero bytes through four megabytes, because that's just the way the translation works, if you go look back at that cheat sheet on the data layout. So the core layer is in this...
...this image dispatch — this is the core layer. All it really does is act as a proxy/translation layer between this newer-style pluggable I/O-handling engine and the original legacy I/O state machine, which is this ImageRequest class. So the dispatch layer will invoke the appropriate method — read, write, discard, write-same, compare-and-write, flush, you name it — but here in the image dispatch class it's going to translate it to the associated state-machine class in ImageRequest.
...is that they're just factory methods: what happens, at the end of the day, is that each of those methods instantiates a very particular kind of object — be it an ImageReadRequest, an ImageWriteRequest, an ImageDiscardRequest — and then just invokes the send method on those objects to actually kick off the state machine.
A
It's
unfortunate
and
I'm
apologize
in
advance,
but
we
try
to
do
a
good
job,
at
least
on
all
our
modern
state
machines
that
we
actually
have
a
little
ascii,
drawing
to
describe
the
straight
the
state
transitions
and
between
the
between
the
state
machines.
Unfortunately,
because
this
is
so
old,
it's
one
of
the
one
of
the
original
functions.
It's
just
been
tweaked
and
tweaked
and
tweaked
as
the
years
has
gone
on
yeah
it
doesn't
have
the
the
ascii
drawing.
A
That
makes
it
a
little
more
clear
as
to
how
the
state
machine
transitions
between
the
different
functions,
but
we
can
we
can
just
dive
into
it.
So
if
we
look
at
the
image
read
request
state
machine,
you
first
saw
actually
that
everything
invoked
a
send
method.
It
reads
so
it
creates
instantiates
the
object
and
calls
the
send
method.
That
said,
method
is
the
same
virtual
method
on
every
single
class,
which
is
in
the
base
class,
which
is
the
image
request.
So this is how those timestamps — the modify timestamp and the access timestamp — are updated: there's a little side state machine that, based on your image properties, says "I'm going to update my modified time or my access timestamp every 20 seconds," or whatever your settings are — because it doesn't update it with every I/O.

It just does it on a periodic interval as your I/O comes in. So if that period of time has come up, it'll kick off the state machine to basically update the timestamp, which is just sending off an I/O request to that rbd_header.<unique image id> object to track the access or modified timestamp. But that doesn't really affect your I/O flow; that's just something that gets kicked off and runs concurrently with your I/O.
For the read, the first thing it does is say: well, if you're using the internal RBD cache, and you haven't said that you're doing random I/O, and you have read-ahead enabled, it just optionally kicks off another state machine that says: hey, analyze this I/O pattern and see if someone's attempting to do a bunch of sequential I/O; if they are, in the background — asynchronously to whatever this user is requesting — start reading ahead.
So that's what this mapping does: it's this little helper striper function that does all the calculations to say, given these image extents, map them into one or more object extents. Because if we go back to the picture of how striping might look: striping might get complicated, and it might be that my image extent has to go across multiple objects, multiple times potentially, and include multiple extents of multiple objects as it loops through.
I might have to read stripe units zero, one, two, three, four, five — which means I'm reading two sections from object zero, two sections from object one, one section from object two, and one section from object three — and when all that reading completes, I need to be able to reassemble all that data appropriately, so it goes back in the correct order the user actually expects it in. Because the user shouldn't have to know about the internal details of how RBD is striping its images; that's by design.
A
That's
all
this
function
does
right
here
this
file
to
extends
it.
It
maps
image
extents
file
to
object
extents
in
this
case
extents,
that's
the
naming
is
kind
of
legacy
and
leftover
from
ceffs
and
the
original
cfs
client.
So
that's
why
the
help
the
method
name
is
called
file
to
extends
and
not
like
image
extents
to
object.
Extents.
You
know
because
it's
it's
it's
from
a
point
of
view
of
seth,
but
it's
the
same
math
for
both
and
we
also
keep
track
of
this
buffer
offsets.
A
Is
it
also
as
it's
as
it's
doing
these
mappings?
It's
also
keeping
track
of
when
it
has
to
reassemble
this
data?
All
the
stripe
data
needs
to
know
where
to
go
put
that
data
back
into
the
the
buffer
that
the
user
has
provided,
so
that
the
data
is
in
the
right
order.
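The striping walk described above can be sketched in miniature. This is a simplified model of the file_to_extents idea — it walks stripe-unit by stripe-unit and does not merge adjacent pieces or track buffer offsets the way the real Striper code does:

```python
def file_to_extents(offset, length, su, sc, os_):
    """Map an image extent to (object_number, object_offset, length)
    tuples for stripe unit `su`, stripe count `sc`, object size `os_`.
    Simplified sketch: one tuple per stripe unit touched, unmerged."""
    units_per_object = os_ // su
    extents = []
    end = offset + length
    while offset < end:
        unit = offset // su                       # which stripe unit overall
        stripe = unit // sc                       # which row across the objects
        object_set = stripe // units_per_object   # which group of sc objects
        objectno = object_set * sc + unit % sc
        obj_off = (stripe % units_per_object) * su + offset % su
        n = min(su - offset % su, end - offset)   # stay inside this unit
        extents.append((objectno, obj_off, n))
        offset += n
    return extents
```

With the default layout (stripe unit == object size, stripe count 1) it degenerates to the simple divide; with a stripe count of 2 and two units per object, the "units zero through five" example from the talk yields two pieces from object 0, two from object 1, and one each from objects 2 and 3.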
A
So
so
now
we
have
a
bunch
of
this.
This
file
two
extensions
put
populating
this
object:
extense
collection.
A
We
give
the
the
a
completion,
we
had
stored,
the
read
results
in
it
and
we
also
basically
tell
the
read:
result:
hey
when
you
have
to
reassemble
it.
Here's
the
original
image
extension
of
how
everything
gets
reassembled,
there's
a
couple:
there's
a
couple
different
permutations
that
actually
need
those
image
extents.
Some
of
them
don't
need
it
at
all,
so
they.
This
is
a
no
op
function.
We
can
dig
into
that
later,
but
we
actually
issue
the
request.
A
So
this
ao
completion
we
tell
it
hey,
you're
gonna,
expect
you
know,
object,
extent,
count
number
of
requests
that
are
actually
going
to
be
issued
concurrently.
So
you
have
to
wait
for
this.
Many
internal
requests
to
complete
before
the
actual
like
user
side,
ao
completion
is
actually
complete
and
then
it
iterates
through
all
the
object
extents
and
it
issues
it's
a
very
similar
pattern
here.
So
now,
instead
of
an
image
dispatch
back,
it's
actually
going
to
issue
these
object,
dispatch,
specs
and
it's
very
similar
it.
You
know
it
describes
it
as
a
read
describes.
A
It
doesn't
write,
describes
it
as
a
discard,
depending
on
whatever
state
machine
you're
in
so
this
is
the
process
now
to
start
kicking
off
a
read.
If
I
we
go
to
this
right
side,
most
of
it's
the
same
so
there's
between
the
different
right
methods,
you
know
write
a
discard,
a
write
same
and
a
compare
and
write.
So
most
of
the
methods
actually
get
kicked
off
from
this
generic
send
request
method,
but
it's
very
similar.
A
It
does
the
same.
Computations
of
converting
image
extends
to
object,
extents.
It
does
some
extra
work
about
pruning,
which
is
basically
saying
if,
if
the
the
right
goes
beyond
the
object
boundaries
and
things
like
that,
then
we
we'll
truncate
those
ios
so
that
we're
not
going
past
the
end
of
the
image.
A
But
then
the
same
thing.
We
set
the
number
of
requests
that
we
expect
and
then
the
one
place
we
start
differing
here
is
if
the
journaling
mode
is
enabled-
and
this
is-
this-
is
still
legacy
code,
because
the
journaling
hasn't
been
broken
out
into
its
own
dispatch
layer.
Yet,
but
when
it
does,
this
will
all
be
the
the
read
and
write
methods
will
basically
look
exactly
the
same,
but
right
now
we
have
this
special
hook.
...if you had the writeback cache enabled, your I/O would appear to complete faster, but it's really just being held in memory while the journal event is being appended; and then, once the journal event is securely written to disk, that writeback is allowed to proceed. But the send-object-request step — if we go look at this, there's going to be a different send-object-request for each...
A
Thing
where
it
iterates
over
the
extents
and
then
it
calls
up
your
virtual,
create
object,
request
which
creates
an
object
request
and
then
it
sends
it
so
the
difference
the
different
methods
create
different
objects.
So
here's
a
here's,
a
write,
request,
creating
object
quest,
so
it
just
creates
the
image
dispatch,
object,
dispatch
back,
create
write,
you
know,
you'll
see.
The
same
thing
for
here
is
a
discard
request.
All
it's
doing
is
creating
a
discard
object,
dispatch
back
right,
same
same
thing,
compare
and
right
same
thing.
Any questions? The next section I'm going to talk about is object I/O dispatching — but any questions on image-extent dispatching, on how the I/O has now gotten broken up into object I/O extents? These are the I/O ranges within individual objects: it's going to identify the I/O, saying "this belongs to object zero, and I want to read the extent of that object from byte zero through four megabytes."
A
All
right,
well,
the
object,
dispatch
spec
is
actually
very
similar
in
design
and
function
to
the
image
dispatch
back.
It
just
has
to
store
a
little
bit
different
data,
specifically
the
most
important
piece
of
data
it
needs
to
store,
is
the
object
number
in
which
that
I
o
is
going
to
get
issued
against,
but
on
that
it's
very
similar.
It's
it's
offsets
and
lengths
and
various
other
properties.
A
So
it's
same
same
concept
where
we
store
it
in
a
giant
essentially
union,
all
the
possible
different
io
types
restore
the
current
layer
in
which
this
particular
I
o
is
currently
processing
on.
We store our callback so that we know how to basically complete the I/O, which then hooks back into the AIO completion to finish it off. If we go look at the types again: the object dispatch layer types have a caching layer — this is, you know, the legacy librbd in-memory cache; it works at the object level.
This is different from the Intel persistent write-log writeback cache; this is an in-memory-only cache — writeback, write-around, or write-through. We have a new crypto layer coming in with Pacific, where we can internally have, let's say, LUKS encryption on an RBD device, and librbd handles all the encryption internally; so this crypto layer is actually going to handle block alignments and encrypting and decrypting I/Os as they come in.
We have a journal layer that's responsible for blocking I/Os that haven't committed to the journal yet, based on the journal TID. We have a parent cache — this is something that came in with Octopus. If I have a cloned image, all the data of the parent is read-only; so the parent cache talks, via a domain socket, to this daemon called the ceph immutable object cache daemon, which is responsible for basically promoting hot read-only objects to a fast local cache on your device, so it can serve reads locally instead of redirecting the reads back to the Ceph cluster. If you have a lot of golden images or something like that in RBD, in theory that's what the parent cache could help you with. There's also work, I think, being done to incorporate it into RGW for immutable objects in RGW.
A
Next
layer
is
a
scheduler
layer.
All
this
does
is
it
tries
to
determine
if
you
have
a
bunch
of
sequential
ios.
So
this
is
just
a
very
dumb.
I
o
scheduler.
That
says
it
looks
like
you're
doing
a
bunch
of
sequential
ios
to
a
given
object,
so
it
it'll
try
to
collapse
all
those
sequential
ios
into
a
single.
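The batching idea behind that scheduler can be sketched as merging adjacent extents — a toy illustration of the concept, not the actual scheduler code:

```python
def batch_sequential(extents):
    """Collapse adjacent (offset, length) extents aimed at one object
    into single larger extents -- the idea behind a 'simple' scheduler."""
    merged = []
    for off, length in sorted(extents):
        if merged and merged[-1][0] + merged[-1][1] == off:
            # This extent starts exactly where the previous one ends: extend it.
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((off, length))
    return merged
```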
Similar to how it's hopefully easy to find the image dispatch layers, hopefully it's also easy to find the object dispatch layers: here's the scheduler dispatch layer. The initial implementation we just called "simple," because it really doesn't try to do anything fancy — it's just trying to detect sequential I/Os and batch them up. So that's the simple scheduler, and then the core layer comes down to just the object request.
A
It
looks
very
similar
to
the
image
dispatch.
All
it
does.
It's
kind
of
a
translation
layer
between
the
old
world
and
the
new
world,
the
old
world.
Being
this
object,
read
request
or
discard
request
to
write,
requests,
state
machines
and
the
new
world
being
this
plugable
dispatch
layer.
So
it
just
translates
between
this
api
and
the
original
legacy
library
api.
So, as the different methods are invoked by the dispatcher — if it's an ObjectDispatchSpec that's describing a read, it's going to invoke the read method; if it's a discard, it'll invoke the discard; if it's a write, it'll invoke the write method — they all get broken down into the appropriate state machines, which are all described here. These are the core state machines that actually perform the I/Os against the cluster, and these ones actually do start to have some ASCII drawings in them.
...the write state machine — but if we go look at, let's say, the first one, the read request: it gets entered with send. So the first state is that it's going to go and just issue an object read request to the cluster.

First, it determines what you're trying to read — whether you're trying to read the head of the image or a given snapshot of the image. If you're trying to read the head of the image — that is, you're not trying to read from a snapshot —
A
That
means
that
we
can
then
go
check
the
object,
map
and
things
like
that,
because
the
object
map
will
be
in
memory.
So
we
can
go
check
to
see
you
know,
hey.
Does
the
object
map
say
if
the
app,
if
we
know
the
object,
may
or
may
not
exist?
If
we,
if
the
object
map,
says,
there's
no
possible
way
for
that
object
to
exist,
we
can
actually
just
skip
ahead
and
we
don't
have
to
issue
that
initial
read
request
to
the
to
the
osds.
We can issue either a sparse read to the OSD or a regular read to the OSD, and we just have a little hint here that says: based on this configuration setting — which, I think, defaults to 64 kilobytes in the config — if your read request is greater than 64 kilobytes, it tells the OSD to try to do a sparse read instead. Because what a sparse read does is not return blank sections of the object: if there's no data there, it'll return no data. It'll say, hey, the data I gave you back really only represents extents from point A to point B and from point C to point D; there's a gap somewhere in there, and it's up to you to do the math to figure it out. But it doesn't inject a bunch of zeros where there weren't any zeros before; it keeps the data thin. And then, yeah, we just execute the method using the librados API.
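The "do the math yourself" part of a sparse read — rebuilding a flat buffer from an extent map, filling the gaps with zeros on the client side — looks roughly like this (an illustrative sketch, not the librbd code):

```python
def assemble_sparse_read(extent_map, total_length):
    """Rebuild a flat buffer from a sparse-read reply: `extent_map` is a
    list of (offset, data) pairs for the regions that actually held
    data; the gaps between them are zero-filled by the client."""
    out = bytearray(total_length)  # implicit zeros for the gaps
    for offset, data in extent_map:
        out[offset:offset + len(data)] = data
    return bytes(out)
```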
This is an asynchronous, non-blocking method, so this goes away and, eventually, once librados comes back with an answer, it's going to invoke its completion callback, which is this handle-read-object. If we get told by the OSD that the object doesn't exist, we may have to go read from the parent, if it's a cloned object; otherwise, if it truly is an error, we'll bubble that error up to the user.
A
Now, if we're good to go, then this request is done, and what this finish does is it goes and tells the original AIO completion that this particular request is complete; and the AIO completion, which is tracking all in-flight requests, once it gets down to zero in-flight requests, will finish itself off.
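The completion bookkeeping just described can be sketched like this. This is a single-threaded toy model (the real AioCompletion uses atomics and locking); the names are illustrative:

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Each in-flight sub-request bumps a pending count; the user callback
// fires only when the last sub-request completes. The first error seen
// becomes the overall result.
class AioCompletion {
 public:
  explicit AioCompletion(std::function<void(int)> on_finish)
      : on_finish_(std::move(on_finish)) {}

  void add_request() { ++pending_; }

  void complete_request(int r) {
    if (r < 0 && result_ == 0) {
      result_ = r;  // remember the first error
    }
    if (--pending_ == 0) {
      on_finish_(result_);  // zero in-flight requests: finish ourselves off
    }
  }

 private:
  std::function<void(int)> on_finish_;
  int pending_ = 0;
  int result_ = 0;
};
```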
A
But say we need to read from the parent. Minus this little special flag that you can pass to the state machine that says, hey, do not attempt to read from the parent, in 99.99% of cases it is going to read from the parent. In which case we just invoke this little helper method here, which we can go look at in utils.
B
A
Apparently there it is, all right; this one is missing. All this is is just a helper method, because we had a few places in the code that were doing the exact same logic over and over again, so it got broken out into a helper method. When you want to read from your parent image, you already had an I/O that's coming from a given object; let's say object A.
B
A
That's what this method does: it's just the reverse of mapping an image extent to an object extent; this maps an object extent back into an image extent. And then, once we have these parent image extents, we determine if we have an overlap with the parent, because the parent might be smaller than the current child (because you expanded the child), and then we potentially prune those image requests based on that overlap, assuming we actually have data to read because there actually is an overlap with the parent.
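The overlap pruning can be sketched as follows. This is an illustrative function, not the librbd helper: since the child may have been expanded past the parent's size, any extent (or tail of one) that lands beyond the parent overlap is clipped off before reads are issued to the parent:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Clip {offset, length} image extents to the parent overlap. Extents
// entirely past the overlap are dropped; extents straddling it are
// truncated so only bytes the parent actually covers get read.
std::vector<std::pair<uint64_t, uint64_t>> prune_to_overlap(
    std::vector<std::pair<uint64_t, uint64_t>> extents,
    uint64_t parent_overlap) {
  std::vector<std::pair<uint64_t, uint64_t>> out;
  for (auto& [off, len] : extents) {
    if (off >= parent_overlap) {
      continue;  // entirely past the parent: nothing to read
    }
    out.emplace_back(off, std::min(len, parent_overlap - off));
  }
  return out;
}
```

With an 8-byte overlap, an extent at offset 6 of length 4 is truncated to length 2, and one starting at offset 10 is dropped entirely.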
A
For that given read request, then, really all we do again is just issue another read: we start a brand new AIO completion for tracking the read and we kick off a new read request, but instead of issuing the read to ourselves (this image context represents us), this image context's parent is actually another image context holding all the state about the parent. So it just issues the read directly to the parent image.
A
Yeah, going back to when you've read the parent: same thing, it checks for errors. If the parent didn't exist, that means there's no data to read; if there's an error, we bubble the error up. And then, optionally, there's this field, which I don't think anyone ever uses, but you can enable copy-on-read, or copy-up on read.
A
And all this is is just boilerplate for handling that case of kicking off that asynchronous request: it kicks it off, forgets about it, moves on, and finishes the read request. But on to writes, just noticing that we only have a little time left.
A
I just want to point out that it does very similar things. It checks to see if the object may exist and if there are any optimizations it can do; it potentially updates the object map. Because, again, this is legacy code from before we broke everything out into layers; in the future I would have said the object map would have been its own layer. And then, assuming everything's good to go, it kicks off the actual write request. And again, if there was a copy-up, it can handle that.
A
If it's a child, or a clone, it can put an assertion on it to say: hey, you're not allowed to write this unless I know this object already exists on the child image. Otherwise, it's going to write: give it a write hint, add some write ops, and execute the RADOS API call. And then, once it's done, it gets invoked back down into the handler, in which case, if the object didn't exist, really the only way to get that error would be that existence assertion.
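The guarded clone write can be sketched like this. The op names below are only stand-ins for the RADOS write ops being described: for a clone whose object may not exist yet, the write is prefixed with an existence assertion, so the OSD fails fast instead of writing to an object that should have been copied up from the parent first:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative op descriptor; real code builds a librados ObjectWriteOperation.
struct Op {
  std::string name;
};

// Build the op list for a write: clones with an object of unknown
// existence get an assert_exists guard ahead of the hint and write.
std::vector<Op> build_write_ops(bool is_clone_with_unknown_object) {
  std::vector<Op> ops;
  if (is_clone_with_unknown_object) {
    ops.push_back({"assert_exists"});  // -ENOENT here triggers the copy-up path
  }
  ops.push_back({"set_alloc_hint"});   // the "write hint"
  ops.push_back({"write"});
  return ops;
}
```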
A
A
Otherwise, the write should be complete. The "illegal sequence" error is related to compare-and-write; that's an expected error from compare-and-write if you had bad data. It completes, and then it potentially does a post-update of the object map, which only happens with a discard, because you might remove an object: the first state update would be to mark it as remove-pending, and then finally you go and mark it as non-existent. And, I'm sorry, we're running out of time.
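The two-phase object-map update around a discard can be sketched like this; the state names are illustrative, not librbd's exact identifiers. The object is marked pending before the remove is issued, and only marked non-existent after the OSD confirms it is gone:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum State : uint8_t { NONEXISTENT = 0, EXISTS = 1, PENDING = 2 };

// Phase 1: before issuing the discard, record that removal is in flight.
void pre_discard_update(std::vector<uint8_t>& object_map, uint64_t objno) {
  if (object_map[objno] == EXISTS) {
    object_map[objno] = PENDING;
  }
}

// Phase 2: after the OSD confirms the remove, the object is truly gone.
void post_discard_update(std::vector<uint8_t>& object_map, uint64_t objno) {
  if (object_map[objno] == PENDING) {
    object_map[objno] = NONEXISTENT;
  }
}
```

If a crash happens between the two phases, the pending state records that the object's existence is uncertain and must be rechecked, rather than wrongly claiming it is gone.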
A
Maybe we'll schedule another talk about this to dive more into it. But with a couple minutes left, are there any other comments?
B
A
Definitely, yeah, we have quite the extensive unit test library; we definitely have way more code in unit tests than we have in actual library code, which is great. It gives us, hopefully, good confidence that the code we're putting out there is going to continue to function, and we don't just have to rely on high-level integration tests; we can actually get down into all these classes and test them.
A
Well, I'm sorry that we ran out of time. I was hoping to get a little bit further, but, like I said, we can definitely schedule another one of these to dive in again. I guess thank you for joining, and you can reach me on the mailing list.