From YouTube: Block Cloning by Pawel Jakub Dawidek
From the 2022 OpenZFS Developer Summit: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2022
Slides: https://drive.google.com/file/d/1eyvv_5madwBwlianA-Rb049gCmJ4GtMW/view?usp=sharing
I hope you guys can hear me. Okay, my name is Pawel and I want to talk about block cloning for ZFS. It's a project I've been working on for quite a while. It's not a huge project, but basically I work on it whenever I can. I think Matt knows that, so he invites me to talk about it, because he knows this will motivate me to work on it a bit more and push it forward.
Most of the stuff I will be talking about is actually written up in the PR for this feature, but nobody reads text these days, so I will record a TikTok video later. Okay, so the talk is mostly about the design, but I will say a few words first.
What are we actually doing here? Block cloning is like on-demand deduplication. Deduplication in ZFS is automatic; here I have to specifically say that I want to clone, for example, a specific file. The idea is that we don't want to copy any data; we just want to reference the same data blocks from two different files.
I'm connecting this to the copy_file_range system call. If you are not familiar with it: normally, if you copy a file, cp will read the data from one file into userland and then send it back to the kernel with the write system call. copy_file_range instead tells the kernel: here you have two file descriptors, an offset into the source file, an offset into the destination file, and a length; just do the copy in the kernel.
So don't bother sending the data to userland and back. It has some other nice properties too. Let's say you have an NFS mount: by using copy_file_range you can tell the NFS server to do a server-side copy, so you are not sending any data over the wire; you just tell the NFS server to copy the data at the server. I want to reuse this system call for this purpose.
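The semantics can be sketched in a few lines of Python. This is a toy illustration, not ZFS code: `os.copy_file_range` is Python's wrapper for the syscall and may be unavailable or unsupported on a given platform, hence the userland fallback that mirrors what plain cp does.

```python
import os

def clone_or_copy(src_fd, dst_fd, src_off, dst_off, length):
    """Copy `length` bytes between two file descriptors, preferring an
    in-kernel copy (no round trip through userland) when available."""
    if hasattr(os, "copy_file_range"):
        try:
            copied = 0
            while copied < length:
                n = os.copy_file_range(src_fd, dst_fd, length - copied,
                                       src_off + copied, dst_off + copied)
                if n == 0:          # nothing more to copy (e.g. past EOF)
                    break
                copied += n
            return copied
        except OSError:
            pass  # syscall unsupported here; fall back below
    # Fallback: read into userland, write back to the kernel.
    data = os.pread(src_fd, length, src_off)
    return os.pwrite(dst_fd, data, dst_off)
```

On a file system that supports cloning, the kernel is free to satisfy the in-kernel path by referencing blocks instead of copying them, which is exactly the hook being described here.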
You will say that you want to copy the data, but actually you will be cloning the data between two files. It's also possible to do this for zvols, but that is not implemented. Okay, so we cannot modify the block pointer, which means we cannot keep any reference counter in the block pointer or anything like that. We need an additional table.
You will notice that this is pretty similar to deduplication, but it's different; there are quite a few differences, actually. We keep this additional table with reference counts per top-level vdev, so each top-level vdev has its own block reference table (BRT), and I will talk about that a bit more later. When we clone, we don't read any data.
We just need the block pointers, and when we write we also don't write any data, so it's faster than even copying the blocks. We save space, and we also save time when you move files between datasets, because that is supported too.
When you move files between datasets, you don't really grow the table, because we just need to bump the refcounts for a short while, write the indirect blocks in the destination dataset, and then free the source ones. So you don't really grow the table. The feature is also always on, which has some consequences, but there is no additional cost when it's unused.
The BRT entry is extremely small. We need three things to tell which block it is: the vdev ID, the offset into this vdev, and the reference counter. It's easy to imagine that you have only a few vdevs but many blocks, so in most of those BRT entries the vdev ID would simply repeat. That's why we decided to keep a table per top-level vdev: we don't have to store the vdev ID.
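As a toy model (my own naming, not the actual on-disk layout), the shape just described, one table per top-level vdev, keyed by offset, storing only the extra references, could look like:

```python
class BlockRefTable:
    """Toy model of the per-vdev Block Reference Table: one dict per
    top-level vdev, keyed by offset, holding the *extra* reference
    count. Blocks with a single reference have no entry at all."""

    def __init__(self):
        self.vdevs = {}  # vdev_id -> {offset: extra_refcount}

    def clone(self, vdev_id, offset):
        table = self.vdevs.setdefault(vdev_id, {})
        table[offset] = table.get(offset, 0) + 1

    def free(self, vdev_id, offset):
        """Drop one reference; return True when the caller should really
        free the block on disk (no cloned references remain)."""
        table = self.vdevs.get(vdev_id, {})
        if offset not in table:
            return True          # never cloned: free as usual
        table[offset] -= 1
        if table[offset] == 0:
            del table[offset]    # last extra reference is gone
        return False             # block is still referenced elsewhere
```

Note how the entry disappears as soon as the extra-reference count reaches zero, matching the "no entries with a single reference" property discussed below.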
As I mentioned, there is no cost on write. With deduplication, we have to have the data: we are actually writing the data, ZFS calculates the checksum, and only then decides that it doesn't need to write the data to disk. You have to have the data so you can calculate the checksum. With block cloning, you need no data.
You just provide the source and destination, and it works with any checksum algorithm. With deduplication you have to use a cryptographically strong checksum; block cloning works even with checksumming disabled.
With deduplication the key is the hash, so, especially with a cryptographically strong hash, blocks that reside close to each other on disk can land in totally different parts of the dedup table. That's why it's really hard to cache the dedup table: the entries are just scattered across the entire table. With the BRT the key is the offset, so blocks that are close to each other will be close in the table.
There are no entries with a single reference. If you have a data block and you clone it, only then do we create an entry in the BRT, because there is an additional reference to the block; but once you free one of those copies, the entry is removed from the table. With deduplication you have to store all the blocks in the table, so on each write, on each block creation, an entry has to be created in the dedup table.
That's the huge problem with deduplication: the table grows even if you have a lot of blocks with just one reference. With the BRT there are no such entries in the table. And as I mentioned, it's on demand, so you specifically clone a file, a block, or a given range of bytes, and it's always available. Use cases: there are a few. Of course the big one is space savings when cloning files.
Like deduplication, you can have many thousands of references to a single block, and of course the savings will be huge. Another one is recovering files from snapshots. Let's say you took a snapshot, you removed a file by accident, and you want to recover it. Currently you have to copy the file from the snapshot into your dataset, so you actually duplicate the storage you need for this file. With block cloning,
you could easily just clone the file from the snapshot into your dataset, so you don't pay this extra cost of additional storage. Cloning is also extremely fast, because we don't read any data and we don't write any data; we just need the block pointers. So it's super fast, and it can also be used for moving files between datasets. And I was thinking about this when listening to the AWS talk:
it's really interesting to imagine that you have an NFS mount on ZFS, let's say across the ocean, and you need to copy a huge file. NFS will use copy_file_range, so you don't really transfer the data over the wire, and on the server side it will use block cloning to actually copy the file. So even though the server is somewhere else and the link is very slow,
copying extremely large files will be just super quick. That would be cool. Okay, so cloning is divided into two parts. When we do the clone, we call zfs_clone_range(). It reads the level-zero block pointers from the source file and remembers which blocks need a reference bump on a pending list.
Once we get to syncing context, we apply all the changes we made, so we increase the reference counts for some of the blocks and so on, and in syncing context we just sync everything to disk.
So there are two parts to consider when we free a block. If the block was cloned in the same transaction group and we are freeing it, we just remove it from the pending list and we are done. If the block was cloned and it's already on disk, so we have a BRT entry, then we have to decrement the BRT entry. There is an additional stage in the ZIO pipeline for this, which comes just before the dedup-free stage.
Okay, but we don't want to pay this cost on every free. We don't want to go to the disk, read the table, and try to find out whether we have an entry for this specific block, so we had to optimize this specific case.
What I do is divide each vdev into regions, let's say one gigabyte in size, and in this structure I keep track of how many additional references I have in the whole region. This is extremely small; we can hold it in memory. For one terabyte of storage with one-gigabyte regions,
you only need eight kilobytes of memory. Matt actually suggested that we could use much smaller regions, like one megabyte, and require eight megabytes of memory per terabyte of storage. This structure can tell us, when we are doing a free, through a function called brt_maybe_exists(), that for sure there is no BRT entry for this region at all,
meaning there is not even one block in this region that was cloned; or it will tell us that there might be a block in this region, and only then do we go to the disk and try to find the entry corresponding to this block. If it's fine-grained enough, I think that would be a huge optimization: we would almost never go to the disk at all to find out
whether the block was cloned or not. We will be able to just free as usual, and it will be cheap; there will be no additional work to free a block. This is what we want, because we definitely don't want to slow down freeing blocks for somebody who doesn't use the BRT, or who uses it for only a very small amount of data. And this array of reference counters is stored on disk as well.
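A sketch of the in-memory structure just described (sizes and names are mine; `maybe_exists` only mimics the no/maybe answer the talk attributes to brt_maybe_exists()):

```python
REGION_SIZE = 1 << 30  # one-gigabyte regions, as in the talk

class RegionRefcounts:
    """Toy model of the per-vdev array of per-region clone counts used
    to answer 'might this block have a BRT entry?' without disk I/O."""

    def __init__(self, vdev_size):
        nregions = (vdev_size + REGION_SIZE - 1) // REGION_SIZE
        self.counts = [0] * nregions

    def on_clone(self, offset):
        self.counts[offset // REGION_SIZE] += 1

    def on_entry_removed(self, offset):
        self.counts[offset // REGION_SIZE] -= 1

    def maybe_exists(self, offset):
        """False: definitely no BRT entry in this region, so a free
        needs no table lookup. True: the on-disk table must be read."""
        return self.counts[offset // REGION_SIZE] > 0
```

One terabyte of storage with one-gigabyte regions gives 1024 counters; at 8 bytes each, that is the eight kilobytes of memory mentioned above.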
So we keep track of this all the time. There is also an additional bitmap, because with large regions, like one gigabyte, the whole array takes eight kilobytes, so we can just sync the whole array every time. But say we wanted to switch to one-megabyte regions: then we would have eight megabytes, and we don't want to flush eight megabytes every time
somebody clones a block. So I keep an additional bitmap which tells us which part of this array is actually dirty and should be flushed to disk. I even have a picture; maybe it will be easier to understand. At the top we have the BRT table: we have one block at offset one gigabyte with five references, then another block with three references, and another one with seven references, but these are all within one one-gigabyte region.
In the middle we have this array of region refcounts (I didn't come up with any clever name for that, sorry), so we have the sum of all the refcounts in each region. As you can see, in regions 0 and 2 there are no cloned blocks, so if we are freeing a block within those regions, we can immediately tell there is no need to go to the disk and look for entries. But for regions 1 and 3,
if we are freeing some block there, we have to check whether the block is in the table or not. The dirty bitmap is only stored in memory, and it tells us which parts of this array are dirty, so we would write to disk only a small part of the region refcount array. Currently this is not yet implemented; there is some code, but with one-gigabyte regions it's probably not yet required. It definitely can be done, though.
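The dirty-bitmap idea, which as noted is not yet implemented, can be sketched like this (the chunk size and the 8-byte counter size are assumptions for illustration):

```python
CHUNK = 4096  # bytes of the on-disk refcount array covered by one dirty bit

def mark_dirty(dirty_bits, entry_index, entry_size=8):
    """A region refcount changed: mark the chunk of the on-disk array
    that holds this 8-byte counter as needing a write."""
    dirty_bits[(entry_index * entry_size) // CHUNK] = True

def dirty_chunks(dirty_bits):
    """Yield indices of array chunks that must be flushed to disk."""
    for i, bit in enumerate(dirty_bits):
        if bit:
            yield i
```

With an eight-megabyte array split into 4 KiB chunks, a single clone dirties one chunk, so a sync flushes 4 KiB instead of the whole eight megabytes.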
We also have to consider how block cloning interacts with deduplication. You can easily imagine a block that is in the dedup table, stored multiple times, so we have a few references in the table, and the block was also cloned using copy_file_range, so the block is in both tables. When we free the block, we have to choose in which of the tables we decrease the refcount.
We could do this in the dedup table, and we could also do this in the BRT. Alan came up with the idea that if somebody is already using the dedup table, we can just use the dedup table and not use the BRT in this case. So if somebody clones a block which already has the dedup flag set, we just increase the refcount in the dedup table, and there will be no additional entry in the BRT.
That way we have no such problem at all; we never have to decide in which table to decrement the counter. Okay, so crossing the dataset boundary is a bit challenging. It's perfectly doable, but
each dataset has its own separate intent log, and we have to clone blocks that are actually part of a file from a different dataset. So when we replay the ZIL, the blocks may already be freed.
Yesterday Matt actually mentioned an idea: we could make use of ZIL claiming to bump the references for the blocks before we actually replay the ZIL, because the ZIL is replayed when we mount the file system. We can import the pool without mounting all the file systems, then mount some of them, remove a file that had blocks we wanted to clone, and only then mount the file system that we cloned into; at that point those blocks are already freed.
In our case that wasn't possible, because I don't just want to reference some object from another file system; what I do is copy the block pointers into the log record. Maybe not something I'm super happy about, but I don't see any other choice for now.
And as I mentioned, there is a problem when the blocks reside on different file systems: the blocks might not be valid anymore. I mention some solutions here; for example, we could simply not use the ZIL at all in those cases, but maybe with the ZIL claim idea we could have the ZIL supported for this kind of use case as well.
There are three new pool properties, so we can see how the BRT is being used. The first value, brt used, is how much data was actually cloned; the second is how much space would be used without block cloning; and there is the brt ratio, which you can basically calculate by dividing those two values.
There are some special cases, because initially I thought this was very similar to deduplication, but actually it's not: not having the data changes a lot. The first one: the block we want to clone might have been created in the same transaction group.
Somebody writes the data and wants to immediately clone it, so the block pointer is not yet allocated and we cannot really clone it yet. In this case I simply wait for the transaction group to sync, and then we can continue and clone the block. Another one is pretty similar: the block might have been modified in the same transaction group in which we are cloning.
A block can also be cloned multiple times during one transaction group, so the pending list I was talking about is actually a tree, so we can quickly look up the entry and just bump the counter. With the pending list we can tell that, okay, this block was actually cloned five times in this transaction group, not only once.
Another case we have to consider is cloning a block and freeing that block in the same transaction group. This has to be handled as well.
And of course we have to make sure that we properly handle holes in the file, and also BPs with embedded data. There is another interesting case, which might even be useful for handling temporary files: I create a file, I delete the file, but I keep the file descriptor open and can just use it as a temporary file. If I crash, the data will be freed, and there is no file on the disk. I'm not sure if this will be useful, but it also needed some special handling. And some random notes.
Vdev growing is supported; shrinking is not. If the vdev grows, we automatically extend, not the table actually, but the region refcounts array. But we cannot shrink; shrinking is not supported.
If the BRT is not used or no longer used, for example when we free the last reference in the BRT, all the structures and all the objects in the MOS will be freed, so the feature will change from active to enabled in the zpool properties. Offsets and lengths in copy_file_range have to be recordsize-aligned, so for now I'm just returning an error
if there is no alignment, so the operating system can still do the regular copy. This could be fixed by copying the data which is misaligned and using cloning for everything which is recordsize-aligned, but for now I'm just returning an error, and there will be a regular copy in this case. Most of the time you want to clone the entire file, so it's not a huge deal, I guess.
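The alignment rule amounts to a check like the following sketch (the recordsize value in the test is illustrative; as noted above, the current code simply returns an error when this fails, and the OS falls back to a regular copy):

```python
def can_clone(src_off, dst_off, length, recordsize):
    """Offsets and length must all be recordsize-aligned for the clone
    path; otherwise the caller falls back to a regular copy."""
    return (src_off % recordsize == 0 and
            dst_off % recordsize == 0 and
            length % recordsize == 0)
```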
Also, copy_file_range on FreeBSD operates only within a single mount point; there is a check in VFS for that, but this can be easily fixed, and I actually have a patch for it. On Linux there was a recent change that allows copy_file_range to work even across different mount points, as long as it's the same file system type. So if it is the same file system type, let's say we are copying between ZFS and ZFS, then the ZFS method will be called and we will try to clone.
Of course it might still be different pools, but if it's the same pool and all the conditions are met, the datasets are not encrypted, etc., we can do the cloning. And a big one, which people cannot actually give up on: when we send a snapshot, when we send the data using zfs send, we lose all the savings. On the receiving end we are not able to rebuild the BRT.
We just send the data, and I have no solution for that. The only one that comes to my mind is to turn this into dedup: tell the receiver that, okay, those blocks were cloned on the source, so maybe add those blocks to the dedup table, because this data seems to be deduplicable. But that's the only idea I have, so that's unfortunate; this is how it works.
Some plumbing is required. Different record sizes: this one was easy; I actually did that yesterday. I just return an error if the record sizes don't match. Handling unaligned requests: again, for now I'm just returning an error, but we could partially do the copy and do the cloning for the majority of the request. And the ZIL, when we are cloning between datasets, is still unresolved.
I'm not sure if I will be able to repeat all of that, but the idea is to try to transfer the information that there are multiple references to a block within one zfs send stream, correct? Yes, but this will only work within a single zfs send stream, right? If we have duplicated data in one stream, then we could deduplicate it using the BRT.
Yes, we would have to bring back the deduplication option for zfs send for that. That would be possible. But again, this is not a full solution, right? It will only work within a single stream. If you do another zfs send, then you won't be able to figure out that you already have those blocks, actually the same data, right? Yeah.
So this is again only a partial solution to the problem; we could only use the BRT within a single zfs send stream. Of course it would be great if we could figure out how to rebuild the BRT somehow on the receiving side, but I don't know.
So the question is why the record sizes have to match. When ZFS is building a file, the block size grows until it reaches the recordsize, and then it just adds another recordsize block; only the last block can be smaller. So a block in the middle cannot have a smaller size.
All the blocks up to the very last one have to have the maximum size, the recordsize of the dataset, or at least the same block size. We cannot just punch a block with a different block size into the middle of a file; it simply wouldn't work. How does block cloning work with device removal? I think there is actually no special handling needed, because all the block pointers will simply keep working.
If, let's say, I free a block, I just use the old block pointer; if I clone the block, I can still use the old block pointer. So there is no interaction between block cloning and device removal. Actually, I think I tested that at some point, and there is nothing special we need to do there.
Okay, so the question is whether I considered a different data structure, like Bloom filters, for determining whether I need to consult the BRT when I free a block. Bloom filters are better for something like deduplication, I guess, but for this we don't really need anything fancy, because the structure is extremely small and it can already give us a very precise answer.
Why can't this be used for dedup? Because for dedup you have all those random keys; you would quickly fill the entire table, so every region would have some reference count, because it's just so random. So it won't be possible to make it work efficiently for dedup; for dedup you would need something like Bloom filters. But for this, it's much simpler.
Okay, so how does it work that we only need the offset and refcount in a BRT entry, and we don't need the length? Because the only thing we need during the clone is the block pointer, which already has the length; you can determine the size of the block from it. So we don't really need to store that in the entry. That's also why we have just one offset.
We don't really need the entire block pointer, as in the deduplication case, because a single offset plus the vdev ID is unique for each block pointer; there cannot be any overlap or anything like that.
Matt actually suggested that we could make the BRT entry even smaller. Right now there are two 64-bit values, offset and refcount, and maybe we could squeeze both of them into 64 bits. That would be possible, especially when you have, let's say, a 4K ashift or 4K sector size: then the offset cannot occupy those low bits, and we could use them for the refcount.
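That suggestion can be illustrated with a small sketch. This is purely illustrative: the real entry currently keeps two separate 64-bit values, and an actual implementation would need to handle refcounts that overflow the spare bits.

```python
ASHIFT = 12  # 4K sectors: the low 12 bits of any offset are always zero

def pack(offset, refcount):
    """Squeeze offset and refcount into one 64-bit word by storing the
    refcount in the offset's always-zero low (sector-alignment) bits."""
    assert offset % (1 << ASHIFT) == 0   # offset must be sector-aligned
    assert refcount < (1 << ASHIFT)      # refcount must fit in the spare bits
    return offset | refcount

def unpack(word):
    mask = (1 << ASHIFT) - 1
    return word & ~mask, word & mask
```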
Yes, so the question is what happens when we clone blocks which are not stored on disk yet, so we have to wait for the transaction to sync, and what if there is a problem during the sync and we cannot clone the entire range. It will just work as a partial write: if you do the clone and the cloning fails at some point, we just return partial success.
And I decided to wait for the transaction to sync, because the only other solution would be to just return an error, but that would tell the upper layers to just copy the data. I think it's better to wait for the transaction to sync than to do the copy, because somebody wanted to clone within the same transaction group.
But we definitely had this discussion about privileges, like when somebody does zpool sync, whether unprivileged users should be able to do that or not. So maybe that's not a problem, but this gives the user a way to force transactions to sync, just by creating a block and cloning it; this will force the transaction to sync every time. I'm not sure, I'm maybe not convinced yet, whether this is safe and whether we want to do that or not.
Okay, so the question is that we only read level-zero block pointers, and whether it would be possible to also clone higher-level block pointers. Maybe it would be possible, but for now we recreate the entire tree of indirect blocks. I'm not sure if the savings you would get from not doing this, if it's possible, would be worth the complication it brings. I didn't really try to do that.
My idea was to follow the experience from deduplication: with dedup you also only deduplicate data blocks. I wanted to copy the parts of the dedup experience that are good and not copy the parts that are bad, so I just focused on the data blocks.
Quota accounting works exactly the same way as with deduplication: each dataset is accounted for the data, so it will be accounted twice if you clone into different datasets, because we cannot really determine who the owner of the data is. We just have to do that. I'm not sure how deduplication accounts for it within a single dataset.
Does it support the FICLONE and FICLONERANGE ioctls from Linux? I implemented no interface for Linux to use this, but it definitely should be supported. You just have to teach the SPL layer for Linux how to use zfs_clone_range(): there is a zfs_clone_range() function in the operating-system-independent code which does the whole work, and there is a FreeBSD-specific method to call it. It has to be implemented for Linux as well, but I'm pretty sure it will be straightforward.
So the question is whether I did any testing to see what the practical savings look like. I don't really see how you can determine that, because for dedup you could use zdb to determine, for your live data,
how much is deduplicable and how much you could save by deduplication, but this one is on demand. Of course, some tools, for example cp on FreeBSD, already support copy_file_range, so cp will automatically be able to use this, but you still have to use the system call in order to do the clone.
So the only thing you could test is how many duplicates you have in your pool; maybe if I start to use this, I will be able to eliminate those duplicates, probably not all, but some, I guess. The question is about the --reflink option in the cp utility. Yes, on Linux there is an option like that, and it means to use this kind of feature; I'm not sure about the details.
There were some problems with how it works, because there is also --reflink=always, which won't work, interestingly. But on FreeBSD you don't need any options; it will always use copy_file_range, so once this is done, it will always clone, basically.
Okay, so the question is whether I did anything special for scrub. No. In dedup there is code which basically allows not scrubbing the same blocks multiple times. Unfortunately, with this it's not possible to the extent it is with dedup, so you will need to scrub the same data multiple times.
Probably something can be done to remember what we scrubbed, I don't know; we would need to maintain another table in memory during the scrub.
So the question is to try to explain once again why cloning between different record sizes doesn't work. Once you have a file in ZFS, it only uses one block size within the entire file; the only block that can be smaller is the last one. We won't be able to do it because the block size is a property of the znode, of the dnode; it's just one value for the entire file, so we cannot use different block sizes for different regions within the file. We just have to fall back to a full copy in this case.