From YouTube: Ceph Month 2021: RBD Update
Description
Presented by: Ilya Dryomov
Full schedule: https://pad.ceph.com/p/ceph-month-june-2021
Hi everyone, my name is Ilya, I'm the tech lead of the RBD team. I will be giving an RBD update, going into a little bit more detail on the features that shipped in Pacific and also covering what you can expect in Quincy.
The advantage is that the image becomes available for use as soon as the migration is prepared, which is basically instantaneous, because it just creates an empty image and makes a record of the data source and the data source format in the image header. Supported data sources include a file, both local and remote, served over HTTP or HTTPS, and an object in any Amazon S3-compatible object store. That file or object has to be in either raw, QCOW, or QCOW2 format, but note that the advanced QCOW and QCOW2 features, such as compression, encryption, cloning (referred to there as "backing file"), external data file, and some others, are not supported in Pacific. So once the link to the data source is set up, the image, even though it is still empty, can be opened as if it was already fully imported.
A read on an uninitialized area is redirected to the data source, and a write on an uninitialized area checks if it overlaps with something initialized in the data source and, if so, triggers a so-called deep copyup, which pulls the data into the image that is being imported.
Of course, one thing to keep in mind is potentially high latency if your data source is remote, because the image can be literally anywhere. Here are some examples of how you might import a QCOW2 image from a remote HTTP server. The first one is the most straightforward and also the simplest: we just fetch the file, convert it from QCOW2 format to raw in a separate step, and then call the rbd import command.
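A rough sketch of that first method (the URL, pool, and image names are placeholders):

    # fetch the QCOW2 file from the remote HTTP server
    $ wget http://example.com/disk.qcow2
    # convert it from QCOW2 to raw format in a separate step
    $ qemu-img convert -f qcow2 -O raw disk.qcow2 disk.raw
    # import the raw file into a new RBD image
    $ rbd import disk.raw mypool/myimage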
So the image can be opened and used right after the rbd migration prepare command, that is, right after the first step. Then the rbd migration execute command can be invoked at any time to hydrate, or populate, the image by forcing those deep copyups to happen in the background, and this can happen while the image is being actively used. Finally, once the image is fully imported, the rbd migration commit command can be used to disassociate it from the data source, and at that point the migration is finished.
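A sketch of that three-step flow (the source-spec JSON follows the format documented for Pacific; the URL, pool, and image names are placeholders):

    # step 1: create the empty image and link it to the remote data source;
    # the image is usable as soon as this command returns
    $ rbd migration prepare --import-only \
        --source-spec '{"type": "qcow", "stream": {"type": "http", "url": "http://example.com/disk.qcow2"}}' \
        mypool/myimage
    # step 2: hydrate the image in the background, at any time, even while in use
    $ rbd migration execute mypool/myimage
    # step 3: disassociate from the data source once fully imported
    $ rbd migration commit mypool/myimage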
The second feature definitely worth highlighting, even though it did not get completed in Pacific, is built-in encryption.
The need for encrypting data on the client at image granularity, with per-image keys, comes up more and more, and not just from financial or other similarly regulated industries, but from small regular users as well.
If you don't want that, you basically end up having to abandon thin-provisioned clones in favor of thick copies, and to solve that, support for LUKS-based encryption has been incorporated within librbd.
Right now this is what LUKS defaults to on regular devices, and as far as the implementation goes, there are no reinvented wheels here, at least not yet.
We use libcryptsetup for working with the LUKS header and the OpenSSL library for the actual encryption. Both LUKS1 and LUKS2 formats are supported, and for LUKS2 we set the sector size to 4 kilobytes for better performance across the board. But this can hurt some workloads, because any writes smaller than or not aligned to 4 kilobytes will incur an expensive read-modify-write cycle.
So if you have a 512-byte workload, you might want to stick to LUKS1. Clone images inherit the parent's encryption profile and key, and unfortunately only flat, that is non-cloned, images can be encryption formatted in Pacific.
This means that the limitation I just mentioned is still there in Pacific, but it will be resolved in Quincy. Here is an example of encryption formatting an image and mapping it with rbd-nbd. The rbd encryption format command generates the LUKS header; sorry, generates a LUKS master key, adds the supplied passphrase to key slot 0, and then writes out the LUKS header. And because it is a standard LUKS format, the standard cryptsetup tool can be used to add additional passphrases or perform any other maintenance operations that you would typically do on a LUKS volume. Another cool consequence of using LUKS is that, as long as the image is not an encryption-formatted clone, which is coming in Quincy, its layout is understood, for example, by dm-crypt. So you can see an example here of mapping the same image with krbd, which doesn't know anything about librbd encryption, and we just open the LUKS container on it and it works.
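Roughly, those two examples look like this (the image name and passphrase file are placeholders; the option spelling follows the Pacific documentation):

    # format the image: generates a LUKS master key, adds the passphrase
    # to key slot 0, and writes out the LUKS header
    $ rbd encryption format mypool/myimage luks2 passphrase.txt
    # map it with rbd-nbd, with the encryption loaded in librbd
    $ rbd device map -t nbd -o encryption-format=luks2,encryption-passphrase-file=passphrase.txt mypool/myimage
    # or map it with krbd, which knows nothing about librbd encryption,
    # and open the standard LUKS container with cryptsetup
    $ rbd device map mypool/myimage
    $ cryptsetup open /dev/rbd0 myimage-plain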
Next up are two performance items. The first one is a vast improvement in small I/O performance, thanks to work that started in Octopus and was completed in Pacific. A single librbd client often topped out somewhere between twenty and thirty thousand 4-kilobyte IOPS, mainly due to issues with the internal threading architecture.
You know, one-off threads and thread pools within librbd and some in librados. This work resulted in generally lower latency and up to a three-times increase in IOPS for some benchmarks on all-flash clusters. Going forward, the switch to an asynchronous reactor model may also allow librbd and librados to integrate more tightly with SPDK.
In an ideal world, we would have a single reactor handling everything from SPDK, or maybe even QEMU in the very far future, all the way through librbd and the Objecter and down to the messenger layer in librados. That would be really cool and eliminate a lot of overhead along the way.
The second performance item is a client-side persistent write-back cache. For many different use cases and workloads, setting up a persistent write-back cache on the client is a sensible choice and an acceptable trade-off.
But the problem with doing it today is that layering something like dm-cache on top of librbd is too risky, because dirty cache blocks are flushed to the backing image out of order.
This is done according to whatever cache policy, but none of the supported cache policies can implement in-order flushing. And so with the current caching solutions, at least those that are widely used and easily available out of the box, you're sort of stuck at one of two extremes. Either the cache is not write-back and a synchronous write gets acked only when persisted in the cluster, which does not help the latency problem; or a synchronous write is acked when persisted on the cache device, but the write-back is not ordered, which leaves you with a backing image that is most likely corrupted from the file system or user application point of view, because it is essentially a mix of old blocks and new blocks, with old blocks potentially being really, really old for offsets that correspond to hot spots, such as the file system journal. As soon as that is off, even by a couple of bytes, the file system is gone.
They share a common core: the flushing logic, the I/O dispatch code, things like that. But they differ in the on-disk format and in how the cache device is accessed. In the persistent memory mode, which is referred to as RWL for historical reasons, the on-disk format takes advantage of the byte addressability of persistent memory, and it is just very simple head and tail pointers in the pool root structure, a contiguous log entry table, and a contiguous data area. The SSD mode is more complicated, because updates are done in 4-kilobyte blocks.
The log entry table is spread throughout the cache device in the form of a linked list, and there is a certain amount of coordination that needs to happen when the data is written, the log entry is committed, and then the root is updated. So it is more involved.
Additionally, everything gets zero-padded to 4 kilobytes, so you should expect some cache space wastage if your writes are smaller than that. For accessing the cache device, the persistent memory mode uses a library from Intel's Persistent Memory Development Kit, and the SSD mode reuses Ceph's block device abstraction, which was written for BlueStore and is basically just libaio plus the O_DIRECT flag to bypass the host page cache.
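For reference, enabling the cache in Pacific looks roughly like this in ceph.conf (a sketch based on the Pacific option names; the path and size are placeholders):

    [client]
    rbd_plugins = pwl_cache
    rbd_persistent_cache_mode = ssd             # or "rwl" for persistent memory
    rbd_persistent_cache_path = /mnt/pwl-cache  # where the cache device is mounted
    rbd_persistent_cache_size = 1073741824      # cache size in bytes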
The latency improvement is dramatic, particularly for 99th percentile latencies. Some reports claim an almost two orders of magnitude reduction in write benchmarks.
Latency reduction, I should say, which is huge for latency-sensitive workloads. But it currently has some rough edges. There is a cache reopen issue that affects both modes, the fix for which is expected in the 16.2.5 stable release, and, unfortunately, several stability and crash recovery issues that affect the SSD mode have been discovered after Pacific shipped. Most of them are already fixed in master, but a couple are still pending, and those fixes will be backported to future Pacific stable releases.
This is a couple of new RPC messages that allow coordinated snapshot creation. This comes up, for example, when creating scheduled mirror snapshots, which are initiated at the cluster level, so neither the hypervisor nor the user knows anything about them, even though ensuring that the file system, and possibly the user application, is checkpointed before taking a snapshot is always a good idea. By default, if any client fails to quiesce, the snapshot is not created; this behavior can be changed to ignore the error or to skip the attempt to quiesce entirely.
In Pacific, this has been wired up in rbd-nbd: if the --quiesce flag is supplied, the daemon attempts to freeze the file system mounted on top of its device before a snapshot is taken, and this is done from a simple shell script, ensuring application-level consistency. The goal is to integrate this into krbd and QEMU in the future.
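A minimal sketch of how that looks with rbd-nbd (the pool, image, and snapshot names are placeholders; the hook script can be replaced via the --quiesce-hook option):

    # map with the quiesce hook enabled; by default the hook is a shell
    # script that freezes the file system mounted on the nbd device
    $ rbd-nbd map --quiesce mypool/myimage
    # creating a snapshot now triggers quiesce before and unquiesce after
    $ rbd snap create mypool/myimage@consistent-snap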
But at least in the latter case it is somewhat challenging, because the quiesce ends up being initiated very low in the stack, in the block device driver, and it is a bit of a layering violation, because this information usually goes the other way in the stack.
Quite a lot of work in the Pacific cycle has gone into the kernel client. Support for the messenger 2.1 protocol was added in kernel 5.11 and is controlled by the new ms_mode mapping option. One caveat here is that currently there is no separate option for affecting only monitor connections.
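For example (mode names per the kernel client documentation; pool and image names are placeholders):

    # map over msgr2.1 with CRC integrity only
    $ rbd device map -o ms_mode=crc mypool/myimage
    # or over msgr2.1 in secure (encrypted) mode
    $ rbd device map -o ms_mode=secure mypool/myimage
    # prefer-crc and prefer-secure express a preference while letting
    # the cluster choose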
Also, the original messenger 2 protocol from Nautilus is not and will not be implemented, because it is deprecated and shouldn't be used. As a consequence, the kernel client requires a messenger 2.1-enabled release of Nautilus or Octopus, and you can see the particular versions on the slide. And, of course, the legacy messenger 1 protocol is still supported and will be for many years to come.
In fact, it is still the default in the kernel. We should probably change that.
So if your client is in or close to data center dc1, you would provide that CRUSH location string when mapping the image, and reads would be served by the OSDs located in dc1 for the PGs that have one.
There are some corner cases, though, where a read cannot be served by the replica, and in that case it is redirected to the primary OSD. This is the reason for needing Octopus OSDs for this to work; otherwise, you might run into data consistency issues. Also note that this feature only applies to replicated pools, because there was some confusion in the community about this.
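A sketch of such a mapping (the location string is a placeholder matching a hypothetical CRUSH hierarchy):

    # serve reads from the closest replica, as determined by comparing the
    # supplied client location against the CRUSH map
    $ rbd device map -o read_from_replica=localize,crush_location=datacenter:dc1 mypool/myimage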
The primary use case here is clusters that are stretched across data centers or cloud availability zones, where the primary OSD may be over a link that is not only higher latency but also higher cost, because inter-availability-zone traffic in the cloud is usually much more expensive. And finally, there is support for compression hints, which also landed in kernel 5.8.
You can use this to enable compression on an image that resides in a pool with the compression mode set to passive, or vice versa.
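For example (pool and image names are placeholders):

    # hint that this image's data is compressible, so that OSDs in a
    # passive-compression pool will compress it
    $ rbd device map -o compression_hint=compressible mypool/myimage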
In Pacific we have native Windows support. The Ceph client code was ported to Windows, so we now have librbd.dll, librados.dll, et cetera.
I/O requests go through to user space via device I/O control, which is much more efficient, and in user space the rbd-wnbd daemon picks them up, transforms them, and calls into librbd. This is similar in concept to rbd-nbd on Linux.
This daemon can also run as a Windows service, and in fact that's usually how it is run. In that mode it manages the mappings in a way that they're persisted across reboots, with the information stored in the Windows registry, and it also provides proper boot dependency ordering.
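From a PowerShell prompt, the flow is meant to look much like it does on Linux (a sketch; pool and image names are placeholders):

    # map an image; the I/O is handled by the rbd-wnbd daemon
    PS> rbd device map mypool/myimage
    # list the current mappings
    PS> rbd device list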
So RBD is really a first-class citizen on Windows now. There's also support for Hyper-V and, I believe, other integrations. Alessandro covered this in much greater detail in his Ceph on Windows talk earlier this month; the recording is already available, so I encourage you to take a look.
Starting with the usability and quality bucket, I already touched on encryption-formatted copy-on-write clones and persistent write-back cache improvements. This is the continuation and, hopefully, the completion of the work started in Pacific.
So that is, you know, thinly provisioned copy-on-write, as you would expect from a regular RBD clone.
Of course, it would not be compatible with standard dm-crypt anymore, so there will be a flag or a magic string to prevent the LUKS container open example that I showed on the previous slide from working.
As for the persistent write-back cache improvements, we already went over those: our crash recovery testing is going to be vastly expanded, and we also need to make the cache status easy to get and easy to interpret. Because, you know, tuning your cache... well, there aren't any tunables, but just seeing if it actually helps, and how much is dirty and how much is clean, is very useful.
Making the rbd_support manager module handle full clusters should allow removing images via the manager command interface when the cluster is full. This does not work today: the manager just hangs, and not necessarily inside of the rbd_support module.
So this is a wider issue that spans outside of RBD. Replacing the rbdmap script with a systemd unit generator would generate a systemd unit per device, allowing other units to depend on it and making systemd handle file system mounts and unmounts, including fancy things such as automounts that are triggered by accessing the mount point.
This is not a good user experience, so there will be a new option added to the rbd import and rbd export commands, and it would support both full and incremental updates.
Export and import of consistency groups is somewhat related: it should allow exporting and importing consistency groups in a manner similar to individual images. Again, both full exports and export-diff incrementals are meant to be supported, and this would introduce a new export file format for export files that can contain consistency groups.
On the multi-site front, we're looking to improve mirroring monitoring and alerting. The goal is to come up with a unified mirroring metrics schema, at least across RBD and CephFS, and expose these metrics to Prometheus.
Currently only journal-based mirroring is supported, and this would probably tie into the improved monitoring item, with perhaps fancier Grafana dashboards based on the exposed metrics. And snapshot-based mirroring of consistency groups should allow mirror-snapshotting consistency groups and then mirroring those consistency group snapshots, again similar to individual images.
For Quincy, on the ecosystem front, the big-ticket item is an NVMe-over-Fabrics target gateway. At a high level, this is similar to the existing iSCSI gateway, but hopefully more performant and scalable.
This would be used in the Kubernetes environment, where the rbd plug-in pod or container, which is responsible for mapping the image and which actually contains the rbd-nbd daemon responsible for the image, can be restarted or upgraded at any time, and we need a way to make that reattach safe.
It is relatively safe there because the environment is highly constrained, but to expose this functionality more widely, it is critical that there are some stopgaps in place, because currently you are, you know, one command, one keystroke away from corrupting the image.
A separate set of credentials, and so on. Improving the volume density, which again comes up in Kubernetes environments, is something that we're looking at for Quincy. And on the Windows support side, it would be nice to get a sustainable CI infrastructure set up, preferably upstream.
We do not have any test coverage there, and to add it we need somewhere to pull from and, you know, have them be up to date. And I think this is the last item: the QEMU block driver should receive a facelift in the upcoming 6.1 release of QEMU.
The most important thing is probably rbd write-zeroes support, which has been available since Nautilus, but the QEMU driver wasn't updated to take advantage of it. This should allow us to close some holes in our sparseness story, because in some cases, when importing an image (if you remember that example with the fancy qemu-img command that I showed on one of the first slides), that actually breaks sparseness, and this should help resolve that.
The second item is support for loading the encryption profile, so that an encryption-formatted image can be opened. I believe libvirt integration is planned as well, but that is blocked on this landing in QEMU first. And finally, there is a switch to the QEMU coroutine infrastructure from the legacy asynchronous I/O emulation that is currently in use.
Hopefully it made it into the recording.