From YouTube: 2015-FEB-26 -- Ceph Tech Talks: RBD
Description
A detailed look at the inner workings of the RADOS Block Device for Ceph.
http://ceph.com/ceph-tech-talks
Thanks, Patrick. So today I'll be talking about RBD, the RADOS Block Device, and going into a bit of detail about how it works and the mechanisms it uses from RADOS to do the things that it does. At a high level, Ceph is broken down into several different components built on top of the low-level RADOS layer, and RBD in particular uses a number of RADOS features that make it much easier to build something like RBD.
It would be much harder without those features, if you were building it on top of, say, a plain file system, or without a nicely consistent object store underneath. But today we're not going to be talking too much about RADOS itself, so I'm going to assume that you're already familiar with how it works. If you're not, basically just assume that you have a strongly consistent object store that gives you all kinds of different transactions and partial reads and writes, and look at the RADOS material for more details.
The RADOS Block Device is typically used for storing disks for virtual machines. In that case it goes through the user-space library, librbd, which is linked into a hypervisor like QEMU; on the back end this talks to the Ceph cluster, and to the virtual machine it just looks like it has a regular disk. The guest can't tell whether it's using RBD or local storage or anything else.
The other main way of consuming RBD is via the Linux kernel module. With this you can map an RBD device directly onto a Linux host and it just appears like a regular Linux block device: you can put a file system on it, mount it, and do what you want with it from there. Again, it's accessing the Ceph cluster on the back end, not using local disk at all.
The high-level features of RBD are that it presents a block device interface, but it's actually storing the data across the entire cluster, or at least one pool in the cluster. It has snapshots, which are read-only, and copy-on-write clones, for when you want to create a copy of something quickly without taking up extra space, that is, a usable copy of a block device, quickly. And it's integrated into a lot of different projects, including QEMU and others, to mention just a couple.
There are the different iSCSI targets, the TGT user-space one and the LIO kernel one, and several different cloud platforms, or management platforms, whatever they're called these days, like OpenStack, CloudStack, OpenNebula, Proxmox, and more, with others coming soon. RBD also provides incremental backups relative to snapshots, which are pretty handy for being able to take backups of things offline or online. Since RBD also supports taking snapshots online, you can then do the backups based on those snapshots.
It's all built on top of RADOS. The RADOS interface basically provides an API for doing transactions on single objects, one object at a time, where an object is composed of a sequence of bytes, like a file, but also a set of attributes and key/value data. RADOS allows you to do partial overwrites or partial reads of existing data, and also compound operations which are atomic on a single object.
So you can do things like set an attribute and write some bytes to an object, or check that an attribute has a certain value and only do an update if it does, atomically. It also provides the concept of RADOS classes, which are essentially like stored procedures in a database. What they allow you to do is load code into the storage daemons themselves, the OSDs, and call those methods from a RADOS client, passing them whatever custom data format you want. That's used extensively already to abstract away some of the low-level storage details.
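To make that interface concrete, here is a minimal sketch using the python-rados bindings (the pool name 'rbd' and object name 'test-object' are just placeholders); it shows plain partial reads and writes plus an attribute, not the compound operations or RADOS classes described above.

    import rados

    # Connect using the usual client config and keyring.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('rbd')                       # placeholder pool
        ioctx.write_full('test-object', b'hello rados')         # create/replace the object
        ioctx.write('test-object', b'RADOS', 6)                 # partial overwrite at offset 6
        print(ioctx.read('test-object', length=11, offset=0))   # b'hello RADOS'
        ioctx.set_xattr('test-object', 'owner', b'rbd-demo')    # attribute stored with the object
        print(ioctx.get_xattr('test-object', 'owner'))
        ioctx.close()
    finally:
        cluster.shutdown()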
So RBD has a few kinds of objects in which it stores data. First, it has some metadata in each pool where you have RBD images. There's an rbd_directory object which keeps an index of name-to-ID and ID-to-name mappings for all the images in that pool. That's handy so you can list all the images in a pool, and it's also used for supporting cloning.
There's also an rbd_children object, which contains lists of the clones in a pool, indexed by their parent. That's there so that when you're trying to unprotect a snapshot that is used for cloning, you're able to tell whether there actually are still any clones belonging to that snapshot.
The way RBD uses that is it goes through every pool, looks at the rbd_children object in each pool, and checks whether there are any references to that parent snapshot at the time you're trying to unprotect it. It's a bit more complicated than that, because it has to change state; there's a tri-state for the protection status of a snapshot, so that there are no race conditions.
So when you start unprotecting a snapshot, it first sets the status to 'unprotecting', which prevents any new clones from appearing. Then it goes and checks all the rbd_children objects for any references to the snapshot, and finally marks it 'unprotected' if there were no references. Then there are a few per-image objects.
These are really the format 2 metadata; format 1 isn't really too relevant anymore, so this describes format 2. Format 2 has an rbd_id object, which is named after the user-defined name of the image, and that's so you can go and look up what the actual internal ID is. That way RBD can support renaming images without having to do complicated things with references, like clones.
An RBD image is thin provisioned, so when you first create one you have all these metadata objects but no data objects at all. When RBD goes to read something from an object and it doesn't exist, it just treats that as a bunch of zeros, similar to how a file system treats a hole in a file.
There's nothing fancy going on with the data object format right now, and snapshots, which we'll talk more about later, are all handled by RADOS itself. It's worth noting that RADOS objects in general can be sparse, and for RBD they often are. So even though RBD has a default striping size of 4 megabytes, it will be doing updates of much smaller sizes, and the whole 4 megabytes might not even be allocated or present.
On striping: by default, RBD stripes everything across the block device in 4-megabyte chunks, so each 4-megabyte chunk of the device is one object. That's just a simple default, and it's kind of optimized for spinning disks, where that's about one seek's worth of data that you can read reasonably fast.
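As an illustration of that layout, here is a tiny self-contained sketch (plain Python, not Ceph code; the rbd_data object-name format shown matches the usual format 2 naming, but treat the details as illustrative) of how a byte offset in the block device maps to a 4 MB object and an offset within it.

    OBJECT_SIZE = 4 * 1024 * 1024   # default 4 MiB objects (order 22)

    def offset_to_object(image_id, byte_offset, object_size=OBJECT_SIZE):
        """Map a block-device byte offset to (object name, offset within that object)."""
        object_no = byte_offset // object_size
        within = byte_offset % object_size
        # Format 2 data objects are named rbd_data.<image id>.<object number in hex>.
        return "rbd_data.%s.%016x" % (image_id, object_no), within

    print(offset_to_object("1234abcd", 10 * 1024 * 1024 + 42))
    # ('rbd_data.1234abcd.0000000000000002', 2097194)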
Those objects are distributed pseudo-randomly across the OSDs by CRUSH, so one block device is potentially spread across all the OSDs in the cluster, or at least all the OSDs in one pool. That means when you have lots of parallel requests to a given image, you're spreading that work over many spindles and not bottlenecking on any one OSD.
It also means that there's no single set of servers responsible for an image. So if you have some kind of catastrophic failure, and you happen to lose, say, three nodes permanently because three disks died at the same time, you might lose data for some images, but only a subset of them; for a much larger subset you only lost one copy of the data, or two copies, and you can recover from that.
Using small objects like these 4-megabyte objects is also important to RADOS in general, for a couple of reasons. One is space utilization: because RADOS is doing randomized placement, it's very important that there isn't a very high variance in the utilization among the OSDs. If you have even one OSD that's getting too close to full, then because we're doing randomized placement, we can't really tell when it will stop being safe to keep accepting I/O.
If an OSD ever actually becomes full, it's quite a problem, because since we do journaling, we actually need to have some space available to be able to delete something. So having a wide variance is a problem, because we need to block writes when the cluster becomes full, or when any OSD becomes full, since data may be placed there.
Then there's the librbd cache. If the guest is asking for data to be flushed, or after a few seconds, or when the cache is full, librbd will start writing back to the actual RADOS cluster. So if you have a 4k write coming down, it'll go into the cache as just 4k, and if no other writes come in next to it before it gets written back, it'll get written back to RADOS as another 4k write to a RADOS object.
It's worth noting here that the internal librbd cache, even though it's working in terms of objects, is really doing finer-grained things than that. It's all determined by whatever requests the VM happens to send and how much I/O is done; it's not fetching or writing entire objects at once, it's only writing whatever the VM has requested and reading whatever the VM has requested.
It's also worth noting that the cache has recently been enabled by default in Hammer, but with an extra option so that even when it's configured in writeback mode, it stays in writethrough mode until it detects a flush from the guest. This is for extra safety: it's to make it basically behave like a well-behaved hard drive cache.
Even though the cache is in memory, all your writes will be flushed out to disk, as the VM expects, when the virtual machine actually sends a barrier or flush request down through the hypervisor. That's exactly the same guarantee that a local hard drive gives you.
But if the VM is too old, or if barriers are turned off in its kernel, it won't send these flushes, and it won't be safe if the hypervisor fails; it might corrupt the file system or possibly lose some data. Some people run in that mode for performance, and you can turn off the flush requirement and turn off that safety in RBD. But by default it's in writethrough mode, meaning it waits for writes to go all the way through to the cluster before returning back to the VM, until it actually sees a flush request.
That flush is what tells librbd that the VM is probably being safe about sending flushes, and then it can switch to writeback mode. The cache here mainly serves to coalesce writes and provide much lower latency for bursts of writes from the VM; as long as they don't overwhelm the cache, it helps quite a bit.
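For reference, this behaviour is driven by client-side configuration. Here is a hedged sketch of opening an image with the Python bindings while passing those settings at connect time; the option names rbd_cache and rbd_cache_writethrough_until_flush are the standard librbd settings, but the image name is a placeholder and you should check the documentation for your release.

    import rados
    import rbd

    # Client-side librbd cache settings passed as config overrides.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf',
                          conf=dict(rbd_cache='true',
                                    rbd_cache_writethrough_until_flush='true'))
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('rbd')
        image = rbd.Image(ioctx, 'test-image')       # placeholder image name
        try:
            image.write(b'\x00' * 4096, 0)           # buffered by the librbd cache
            image.flush()                            # what a guest flush/barrier maps to
        finally:
            image.close()
        ioctx.close()
    finally:
        cluster.shutdown()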
One of the more interesting things that RBD supports is snapshots, and these are actually done by RADOS itself, not by RBD, in kind of an interesting way. They're done copy-on-write, on a per-object basis, in RADOS, and the way they're maintained is by having what's called a snapshot context, which is a list of snapshot IDs plus the latest snapshot ID that an object was part of.
Basically, whenever you take a snapshot, you allocate a new snapshot ID from the monitors, add it to this list of snapshot IDs, and send this snapshot context with every write to the image. That way you can do snapshots on a per-image basis without having to do them across the entire pool. Doing these contexts on a per-image basis like this is called self-managed snapshots in RADOS, because we're managing the snapshot context and storing it at the RBD level.
One other thing about snapshots is that they're actually deleted asynchronously. When you delete a snapshot, RADOS will, in the background, add it to the OSD map as a snapshot that's been deleted, and as that gets propagated, the OSDs will, in the background, handle the deletion of the actual snapshot data.
RADOS also helps out even further by keeping overwrite statistics, so that whenever a copy-on-write happens, it records how much data has changed since the previous snapshot. That makes it very simple to get diffs out of it, and that's how RBD incremental backups can work so easily, just by relying on the existing RADOS storage of that metadata about which things changed in a given object.
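That diff metadata is what the Python bindings expose through Image.diff_iterate; here is a hedged sketch (the image and snapshot names are placeholders, and the exact callback signature can vary a bit between releases).

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')
    image = rbd.Image(ioctx, 'test-image')           # placeholder image name

    changed = []
    def record_extent(offset, length, exists):
        # exists is False for extents that were discarded/zeroed since the snapshot.
        changed.append((offset, length, exists))

    # Extents that changed between snapshot 'snap1' and the current image head;
    # this is the same metadata that incremental backup builds on.
    image.diff_iterate(0, image.size(), 'snap1', record_extent)
    print(changed)

    image.close()
    ioctx.close()
    cluster.shutdown()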
Let's go through that process to help you understand how this snapshot context works. Say we have an RBD image that has no snapshots yet. It'll have a snapshot context with an empty list of snapshot IDs, but the highest snapshot ID will be the highest one that's currently relevant for the pool; let's say in this case that would be seven. So when you do a write, librbd will send that snapshot context along with the write.
The OSD will note that there are no snapshot IDs in it, so it doesn't have to do any copy-on-write or any other work. Now create a snapshot: the new snapshot ID is the next one allocated in the pool, so let's say it's mapped to eight. librbd will then update the header for that image, adding snapshot 8 to the list of snapshot IDs and incrementing the sequence number.
When the next write arrives with a snapshot context that includes snapshot 8, for an object which hasn't been written to since that snapshot was taken, the OSD will go ahead and do the copy-on-write. It does a full copy of the object on XFS or ext4 or most other backends; if you're on btrfs it can do a special ioctl on the file system, a clone range, and save space there. But for the common case, with XFS, it's doing a full copy, and that's the copy-on-write.
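Just to make that mechanism concrete, here is a minimal, self-contained sketch in plain Python (not Ceph code, and the data structures are invented for illustration) of the per-object decision an OSD makes from the snapshot context.

    # Plain-Python sketch of the copy-on-write decision, not Ceph code.
    class FakeObject:
        def __init__(self):
            self.data = b""
            self.snap_seq = 0        # newest snapshot id seen when this object was last written
            self.clones = {}         # snapshot id -> preserved copy of the old data

    def osd_write(obj, new_data, snap_context):
        seq, snap_ids = snap_context
        if snap_ids and obj.snap_seq < seq and obj.data:
            # Object was last written before the newest snapshot: preserve the
            # old contents first (this is the copy-on-write step).
            obj.clones[seq] = obj.data
        obj.snap_seq = seq
        obj.data = new_data

    obj = FakeObject()
    osd_write(obj, b"v1", (7, []))     # no snapshots yet: plain overwrite
    osd_write(obj, b"v2", (8, [8]))    # first write after snapshot 8: a clone is kept
    print(obj.clones)                  # {8: b'v1'}
    print(obj.data)                    # b'v2'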
So that's essentially how snapshots work. But if you're familiar with RBD already, you'll notice that you're able to take snapshots with the command-line tool while the VM is running, or while the image is in use elsewhere, so you may be wondering how that works when you need to be able to send the snapshot context with every write. For that, we take advantage of another primitive in RADOS called watch/notify.
This is basically a way for clients of librados to send very infrequent, small communications to each other via RADOS. How this works is: whenever you open an RBD image, librbd or the kernel client will set a watch on the header object for that image, and that matters whenever there is a change to some kind of information on the header, like the image being resized or flattened, or a snapshot being added or removed.
Whoever is doing that update will notify the other clients that are watching the same header via this watch/notify mechanism, and they'll know that they need to go and reread the header to see what's been changed. For example, if you have a VM using an image, it'll have a watch on the image header object in particular, and if you're creating a snapshot, say with the rbd command-line tool, that command-line tool will also create a watch when it's doing the snapshot, since it's opening the image for write.
The command-line tool will add the snapshot information to the header like we just described: it'll allocate the snapshot ID from the monitors, then update the snapshot context and save any snapshot metadata associated with it, like the current size of the image, the current features of the image, and any other immutable state that needs to be stored with the snapshot. Then it will perform a notify, and the way notify works in RADOS is that the notify waits for all watchers to respond or time out.
This watch/notify mechanism has recently become even more robust in the Hammer release as well. There's now actually a separate operation at the RADOS level, a watch2 operation, which has slightly better semantics: it actually reports an error if the watch times out for some reason. Previously, the lower-level machinery inside librados would continuously refresh the watch.
But you wouldn't get a notification if the watch timed out. Now, with the watch2 primitive in RADOS, if the watch does time out, librbd gets a callback, and it can tell: okay, that watch timed out, I need to go and reread the header data because I may have missed notifications. And it re-establishes the watch at the librbd level instead of leaving it to the librados level.
The other major feature of RBD is not just read-only snapshots but clones, which are writable. Clones are also copy-on-write, at the object level again, but they're done at the RBD level instead of the RADOS level, since RADOS only supports transactional or atomic operations on single objects, and RBD clones conceptually need to do this for an entire image.
Clones can have entirely different settings from each other: they can have different striping policies and different features enabled, they can use different object sizes (one can use 1-megabyte objects, another 4-megabyte objects), and they can be in different pools.
So you can have a parent that's in, for example, an SSD pool, if you have a very read-heavy workload, and you can have many, many clones of that data in perhaps a slower pool, because you don't expect many writes, so you're actually mostly reading from the parent and getting the benefit of the fast pool. Clones are all based around a snapshot: you create a clone from a snapshot, in particular a protected snapshot.
I kind of mentioned this earlier, before really explaining what protection was about or what clones were, but protection basically means preventing a snapshot from being deleted, so that you can make sure you don't delete a parent snapshot that a bunch of clones are referencing, which would basically corrupt them.
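In the Python bindings, that whole flow (snapshot, protect, clone) looks roughly like the sketch below; it's hedged, the pool and image names are placeholders, and layering has to be enabled on the clone.

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')              # parent and clone in the same pool here

    parent = rbd.Image(ioctx, 'golden-image')      # placeholder parent image
    parent.create_snap('base')                     # read-only snapshot
    parent.protect_snap('base')                    # must be protected before cloning
    parent.close()

    # Thin, copy-on-write clone of the protected snapshot.
    rbd.RBD().clone(ioctx, 'golden-image', 'base',
                    ioctx, 'vm-disk-1', features=rbd.RBD_FEATURE_LAYERING)

    ioctx.close()
    cluster.shutdown()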
So, for I/O on a clone: if you're doing a read and the object doesn't exist, it'll just go directly to the parent. When you first create a clone, since it's thin provisioned and copy-on-write, there will be no data objects in the clone at all; it'll just be the rbd_header and rbd_id objects, and that header will have a reference to the parent snapshot.
It references the parent not by name but by ID: the snapshot by ID, the image by ID, and the pool by ID, so that those things could potentially be renamed but the references would still be intact. So whenever RBD does a read on the clone and the object isn't there, it gets back a response saying the object doesn't exist, and it goes and reads from the parent instead. That's the copy-on-write read path.
If you're doing a write and the object doesn't exist yet, we first optimistically send an operation with a guard that says: fail if this object doesn't exist, otherwise do the write. So if the object already exists, we just need to send one operation. If it doesn't exist, we get back a response saying the object didn't exist, and then we need to go and copy up the data from the parent and actually do the copy-on-write at the RBD level.
So
then
we'd
go
read,
read
the
entire
object
from
the
parent
or
the
entire
logical
portion
of
the
parent
that
the
span
of
which
would
will
be
contained
in
the
child.
Object
that
we're
trying
to
write
to
and
then
we'll
send
a
new
operation.
Once
we've
read
that
data,
which
does
two
things
comically
one
is
the
copy
up,
which
is
writing
all
that
parent
data
to
the
object
to
the
clone
object.
B
Only
if
the
object
doesn't
still
does
not
exist
to
protect
that's
protect
from
races
where
you
might
be
doing
calc
yep
the
same
object
more
than
once
at
the
same
time.
So
you
only
want
you
want
to
make
sure
that
you
don't
clobber
rights
and
other
rights
at
the
same
time,
so
we
only
copy
the
parent
data.
If
that
still
doesn't
exist,
and
then
we
do
the
right
that
was
originally
requested.
On
top
of
that.
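Here is a minimal, self-contained sketch of that clone write path in plain Python (not Ceph code; the stores are just dicts standing in for the parent snapshot and the clone): a guarded write, then a parent read plus a guarded copy-up and the original write.

    # Plain-Python sketch of the clone write path, not Ceph code.
    parent = {"obj.0": b"P" * 8}       # objects of the parent snapshot
    child = {}                         # the clone starts with no data objects

    class DoesNotExist(Exception):
        pass

    def guarded_write(store, name, offset, data):
        """Fail if the object doesn't exist, otherwise apply the write."""
        if name not in store:
            raise DoesNotExist(name)
        buf = bytearray(store[name].ljust(offset + len(data), b"\0"))
        buf[offset:offset + len(data)] = data
        store[name] = bytes(buf)

    def clone_write(name, offset, data):
        try:
            guarded_write(child, name, offset, data)     # common case: one round trip
        except DoesNotExist:
            parent_data = parent.get(name, b"")          # read the parent's copy
            if name not in child:                        # copy-up only if still absent
                child[name] = parent_data
            guarded_write(child, name, offset, data)     # then the originally requested write

    clone_write("obj.0", 2, b"xx")
    print(child)                       # {'obj.0': b'PPxxPPPP'}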
So that's basically how cloning and snapshotting and pretty much all the I/O paths work today in Firefly and Giant. Now I'll talk a little bit more about how to actually use RBD, and then go into more detail about what's coming in Hammer and in the future.
When you're using an RBD image, basically, like any RADOS client, you need a few different pieces of information, including authentication details, and for virtualization that's usually handled by putting a secret key into the libvirt secret store.
If you don't set the cache setting to writeback in QEMU, whether through the libvirt interface or QEMU directly, then even if the guest is sending flushes, the flushes won't be propagated to librbd, because QEMU will just ignore them.
So it's very important, if you're using writeback caching, to enable the writeback setting in QEMU as well as in librbd. There's also the bus to which the device is attached: in QEMU the most common and most recommended ones are virtio-blk or virtio-scsi, since they have better performance than the legacy ones like plain SCSI or plain IDE.
The one thing that virtio-blk doesn't support is discard, which is like the TRIM operation on an SSD; it's just zeroing out a range of data on a block device. RBD supports this, but QEMU doesn't let the guest actually use it unless you set some special options which are kind of hard to specify and tend to change between QEMU versions, so it's not the best-supported feature at the virtualization management layer.
This is kind of how the picture looks in terms of the setup when you're actually running RBD. Your cloud management stack, or even you manually, would give libvirt some XML to define your VM; libvirt would translate that and configure things into a QEMU command line and then run QEMU. QEMU would then start up, load librados and librbd, and open the image.
So libvirt, in most use cases, isn't actually using librbd at all; it's just passing around some configuration data, essentially. The one exception to that is if you're using libvirt storage pools, which are actually linked to librbd directly, because they're about managing RBD images, or disk images in general, and so they can work with those images directly.
So that's basically what the virtualization side looks like in general, and there could be any kind of management layer on top of this setup: it could be CloudStack, OpenStack, Proxmox, many things, and some of them might even just call QEMU directly and bypass libvirt.
In terms of the kernel client, it's a bit simpler. 'rbd map' is the command that sets everything up; it's a pretty simple interface that talks to the kernel to set up the RBD device and actually make it visible, going through sysfs, and then udev takes care of making a nice, human-readable symlink so you can tell which device and image it actually is, instead of using just the kernel-assigned /dev/rbd0, /dev/rbd1, and so on.
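A hedged sketch of that flow from a script's point of view: this just shells out to the rbd CLI (assuming it's installed and you have the needed privileges), and the /dev/rbd/&lt;pool&gt;/&lt;image&gt; path is the usual udev symlink naming.

    import subprocess

    pool, image = 'rbd', 'test-image'              # placeholders

    # 'rbd map' asks the kernel module (via sysfs) to attach the image and
    # prints the resulting block device; udev then adds a readable symlink.
    device = subprocess.check_output(['rbd', 'map', '%s/%s' % (pool, image)],
                                     text=True).strip()
    print('kernel device:', device)                            # e.g. /dev/rbd0
    print('udev symlink: /dev/rbd/%s/%s' % (pool, image))

    # When finished, unmap it again.
    subprocess.check_call(['rbd', 'unmap', device])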
Not everything from librbd applies there: some features aren't implemented in the kernel yet, and others, like RBD caching, aren't really relevant for the kernel client, since kernel RBD devices can take advantage of the page cache through the file systems on top of them like they normally would, so there's no need for that extra level of caching.
The kernel RBD client doesn't quite support all the librbd features yet, since we add them in userspace first; it's much easier to write them there than to get them into the kernel. It does support clones, and format 2 in general, but it does not yet support fancy striping, which is striping in a more complicated fashion than just one blob of data, then the next blob, then the next blob.
librbd, on the other hand, has a number of settings. Some of them are off by default but can be enabled in special circumstances, or may be more useful in the upcoming release, Hammer. Copy-on-read is a setting that applies to cloned images where, instead of doing the copy-up operation on a write, you can do it on a read instead. So instead of just doing whatever read the block device requested, say you only requested 4k, it will still fetch the entire object from the parent and then do the copy-up.
The copy-up happens in the background, while the read is satisfied with just the data that was requested. That's a nice way to speed up copy-up and do it opportunistically, in a bit of a more efficient way: if you know that you eventually want these clones to become full copies, it's a more efficient way to do that.
You'd want that for having lower read latency, and copy-on-read lets you get there more efficiently by making the copy-up opportunistic and in the background, rather than paying the penalty on each write. I mentioned earlier as well that we enabled RBD caching by default in that safe mode, where it only switches to writeback once it sees a flush.
It's writethrough to start out with. That's been on the horizon for a while, but it just finally happened now; we'd been recommending it be enabled in this mode for a long time. Adam Crume from UCSC added readahead support for RBD. So when you have the cache enabled and you have a bunch of sequential I/O coming in because a VM is booting, for example, we do readahead into the RBD cache, up to fifty megabytes by default, or until the I/O stops being sequential.
The reason we stop doing it after this 50 megabytes, or this boot phase, is that after that point there is generally a kernel running inside the VM which is going to be doing its own readahead.
But during that boot-up phase, at least for Linux and I think for several other OSes as well, there's not a lot of smarts in the BIOS or the other things reading data at the very beginning of the boot process, and so this actually speeds up booting quite a bit, especially for legacy buses like IDE, where you're doing lots of very small requests, and it helps a bit for newer buses like virtio as well.
He also added a bunch of LTTng tracepoints to librbd and librados, as well as a tool for collecting these traces, which basically gives you a view of all the I/O that's been done: not the actual data, but just when different reads and writes happened, how large they were, and at what offsets. So we can collect those traces now, and there's also a tool called rbd-replay to take those traces and play them back against a different RBD image.
So you can benchmark performance against an actual known workload that you're running in your actual cloud, instead of just guessing with custom benchmarking tools. Hopefully we can get people started gathering traces from actual use cases and get more performance data that might be more relevant than synthetic benchmarks.
Allocation hints were something that was added, I think it might have been in Giant; I can't remember exactly, and it was disabled in Giant because there turned out to be a kernel bug in the way XFS handled the ioctl we use for it. Basically, when we send these allocation hints with a write, we're telling RADOS that we know these objects are going to be four megabytes eventually, so just allocate them that way right now, on the first write.
There are also cache hints you can specify, and these are used within the RBD cache when it's doing writeback now, to tell the OSDs not to cache the data and not waste the OSD page cache, because there's no need to do any double caching there. They're also used by things doing sequential reads and writes, like rbd import and export, since in those cases again it's not useful to waste time storing things in the OSD page cache when they're only being read once. And in the future we might be able to use those hints in more places.
The last two features here are the larger ones, which are right now off by default, since they're newer and we want to make sure they're really fully solid before we enable them. Exclusive locking was added to RBD so that when a client opens an image and tries to do an operation that is some kind of write, whether it's creating a snapshot, resizing the image, or just doing a plain old write or a discard, it needs to hold a lock on the header object.
This is coordinated with watch/notify, as well as the existing RADOS lock class, so that we can make sure writes don't happen from multiple places. With this exclusive lock set up, whoever has the lock will actually get notifications from other clients that are trying to do these things, like resizing or flattening or creating a snapshot, and those requests are basically forwarded to whoever holds the lock and already has the image open for writes, and they'll perform the snapshot instead of having a different client do it.
If a client tries to open an image for writing and it detects that there was an existing lock holder, but that holder no longer has a watch on the image and doesn't seem to be re-establishing its watch, librbd will go ahead and blacklist that old client at the RADOS level. That means the client's address and nonce, which is unique per session for a RADOS cluster connection, will basically be added to the OSD map, and the OSDs will refuse any operations from that client.
This provides some extra safety if you have, say, virtual machines that might be stuck running somewhere, and you thought they were down, so you've already started them somewhere else.
This exclusive locking, with blacklisting, will help with that kind of case, but more generally it's good groundwork for features that will require a single client to be updating something in the future. The first one of those is the object map, which is also new in Hammer and off by default. It's basically an index of which objects exist in an image, and it's mainly a performance enhancement.
For example, with clones, when you're going to do a read from a clone, you have to check whether the object exists or not, and then go back to the parent if it doesn't. With an index of which objects exist, you can immediately go to the right place; even if there are a lot of levels of parents, you don't need to make requests at several levels, it's just one request, as long as the object map has been enabled for all of those images.
It would also enable things like better utilization tracking in the future. You can see from the object map which objects exist very quickly, and therefore roughly how much space is used in a snapshot, or how much space was added between a clone and its parent, that sort of thing. This is also off by default for now, because we want to make sure it's totally solid.
Right now it's also only possible to enable it for new images, but we want to add the capability to add it to existing images, or remove it from an image if you wanted to. For example, if you ever wanted to map an image with an older kernel client that doesn't support object maps, you could just remove the object map from it and then be able to use it with the kernel client.
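As a hedged sketch of how those features get turned on per image with the Python bindings (the feature constants and the features argument exist in python-rbd, but availability and defaults vary by release, and the pool and image names are placeholders):

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx('rbd')                      # placeholder pool
        features = (rbd.RBD_FEATURE_LAYERING |
                    rbd.RBD_FEATURE_EXCLUSIVE_LOCK |
                    rbd.RBD_FEATURE_OBJECT_MAP)
        # Create a 10 GiB format 2 image with the newer features enabled.
        rbd.RBD().create(ioctx, 'test-image', 10 * 1024 ** 3,
                         old_format=False, features=features)
        ioctx.close()
    finally:
        cluster.shutdown()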
In terms of things beyond Hammer, the largest thing is definitely RBD mirroring, which is the idea of asynchronous replication between different clusters, and that would also be built on top of the exclusive locking. Basically, there would be a log of all writes done to an RBD image, written out as they're happening, and then there'd be some other process reading that log and asynchronously mirroring those writes to a different image in a different cluster, or maybe even the same cluster in a different pool.
In general, Ceph takes the approach of blocking rather than giving I/O errors, because file systems don't really handle I/O errors very well. In any kind of condition where an object is temporarily inaccessible, because recovery is happening or the cluster is full, and since the block layer doesn't have any way to report that it's out of space, it'll just block those I/Os and wait for that condition to be resolved, hopefully within a period of time that's acceptable for the application sitting on top of it waiting.
Question: when we use an RBD volume from a VM, is the kernel driver not used? Yeah, that's right. Usually when you're using RBD from a VM, you're using it with QEMU using librbd directly and not going through the kernel driver at all. QEMU is linked against librados and librbd, and it's just calling the librbd write and read functions directly from that user-space process.
All right, well, thank you very much, Josh, this was great, and this should show up on YouTube here within the next day or two. So stay tuned, and we'll see you guys next month for the next Ceph Tech Talk (great, thanks, Patrick), which I believe is Yahoo talking about the RADOS Gateway, so we'll see everybody then.