From YouTube: File Cloning with Block Reference Table by Pawel Dawidek
Description
From the 2020 OpenZFS Developer Summit
slides: https://drive.google.com/file/d/1csE8OuPotfhaFi9KvTGKMGy86KxrBu2W/view?usp=sharing
Details: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2020
A
Block reference table. So, what's the general idea? The general idea behind this feature is the ability to reference one data block from two separate files. A lot of people, when they hear that ZFS is copy-on-write, wonder why this feature is not already implemented; it must be so easy.
A
You can think of this feature as, let's say, file cloning. You have a file, you want to clone it, and basically both copies of the file now have this copy-on-write property: if you modify one file, it doesn't modify the other file. So it's not a hard link, where you have two separate file names but it is basically the same file, the same data.
A
This is different: two separate files. They have their own properties, their own permissions, ownership, etc., but they can share data blocks.
A
It turns out that Linux already has this feature with Btrfs. There is a special ioctl, and there is an option for cp called --reflink that you can use to clone a file. And there is a dedicated system call on macOS, clonefile, which allows you to do exactly the same thing.
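As an illustration of the Linux interface mentioned here, the following is a minimal sketch that asks the kernel to clone a file and falls back to a regular copy when the filesystem refuses. It assumes the ioctl in question is the one Linux exposes as FICLONE; the function name `clone_or_copy` is made up for this example and is not part of any library.

```python
import fcntl
import shutil

# Linux FICLONE ioctl request number, i.e. _IOW(0x94, 9, int).
FICLONE = 0x40049409


def clone_or_copy(src_path, dst_path):
    """Try to share blocks between src and dst; fall back to a byte copy.

    Returns "cloned" when the filesystem accepted the reflink request and
    "copied" when it refused (for example on a filesystem without reflinks).
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            # The destination descriptor receives the ioctl; the source
            # descriptor is passed as the integer argument.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return "cloned"
        except OSError:
            shutil.copyfileobj(src, dst)
            return "copied"
```

Either way the destination ends up with the same contents; only the "cloned" path shares data blocks with the source.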
A
Another huge benefit is when you try to recover a file from a snapshot. Say you accidentally deleted a file and you want to bring it back. Now, if you just copy the file back, it will consume additional space, because the data blocks will be copied; it has to allocate additional space for all those data blocks.
A
Some people complain about this, so it would be nice not to have to pay that cost just to recover files from a snapshot.
A
Also, such a copy will be super fast. If you clone a file, we don't read the data blocks. We just read the parent blocks; we just need the block pointers that we will clone. So we read and write only a fraction of what is read and written when you do a regular copy, so it should be extremely fast. Another benefit is that you can move files between datasets.
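To put a rough number on "only a fraction", here is a back-of-the-envelope sketch. The figures are assumptions that do not come from the talk: a 128 KiB record size (the ZFS default), 128 bytes per block pointer, and no accounting for compression or for the extra levels of indirect blocks.

```python
def clone_io_estimate(file_size, record_size=128 * 1024, blkptr_size=128):
    """Return (data_bytes, pointer_bytes) for a file of file_size bytes.

    data_bytes is what a regular copy has to read (whole records);
    pointer_bytes is the block-pointer metadata a clone would touch instead.
    """
    blocks = -(-file_size // record_size)  # ceiling division
    return blocks * record_size, blocks * blkptr_size
```

Under these assumptions `clone_io_estimate(1 << 30)` gives 1 GiB of data against 1 MiB of block pointers, a factor of 1024.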
A
But with a block reference table, there is no cost on the write. With deduplication you have to do a lookup; you have to go through the dedup table.
A
So when you write and you have a large dedup table, there are a lot of problems with that. If it doesn't fit in RAM, you have all those performance problems.
A
If you do file cloning, of course, there is a cost on the write, but it's much smaller. As I mentioned, the copy will be super fast, and it works with any checksum algorithm. You don't need a cryptographically strong checksum in order to use the block reference table, because we don't care about the checksum itself. Another benefit relates to the fact that with the dedup table, because we use a cryptographically strong checksum, the blocks are scattered throughout the entire dedup table.
A
So, as I mentioned, dedup tables can grow very big. A dedup table entry is pretty large. This is the in-memory size of the dedup entry: it's almost 400 bytes. It's one-fifth of this for the block reference table, which also means that you can fit a much, much bigger block reference table in RAM.
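The size comparison can be worked through with simple arithmetic. This sketch takes the "almost 400 bytes" figure as exactly 400 and "one-fifth" as exactly 80 bytes; both are roundings of what is said in the talk, not measured values.

```python
DDT_ENTRY_BYTES = 400                   # "almost 400 bytes", rounded up
BRT_ENTRY_BYTES = DDT_ENTRY_BYTES // 5  # "one-fifth of this"


def entries_that_fit(ram_bytes, entry_bytes):
    """Number of whole table entries that fit in ram_bytes of memory."""
    return ram_bytes // entry_bytes
```

For 1 GiB of RAM this gives about 2.7 million dedup entries against about 13.4 million block reference table entries.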
A
In the dedup table you most likely have a lot of entries that have only a single reference. This is not possible with the block reference table: there are no entries with a single reference. If only a single reference is left, you basically just remove the entry from the block reference table, so the table only contains the references that are actually meaningful, that reference blocks that are referenced more than once. And, as I mentioned already, in the dedup table all the blocks can be scattered throughout the table.
A
There is a cost when you free a block: on every free we have to consult the block reference table. That's the difference between this feature and dedup. With dedup, we have a special flag in the block pointer, the D flag, that says that this block is in a dedup table. So if there is no D flag, we simply don't consult the dedup table. With the block reference table...
A
When we wrote the block, we simply didn't know whether this block was going to be cloned or not, so there is no special flag, and of course we cannot modify the block pointer. So on every free we simply have to go and check whether this block is referenced in the block reference table, and if it is, we have to decrease the counter.
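The bookkeeping described so far can be modelled in a few lines. This is a toy sketch, not the prototype's code: the class name, the dictionary keyed by (vdev, offset), and the choice to store the total reference count are all assumptions made for illustration.

```python
class ToyBRT:
    """Toy model: maps (vdev, offset) to a reference count that is always >= 2."""

    def __init__(self):
        self.refs = {}

    def clone(self, vdev, offset):
        """Record one more reference to an existing block."""
        key = (vdev, offset)
        # A block that is absent from the table has exactly one reference.
        self.refs[key] = self.refs.get(key, 1) + 1

    def free(self, vdev, offset):
        """Drop one reference; return True if the block itself can be released."""
        key = (vdev, offset)
        if key not in self.refs:
            return True  # never cloned, so this was the last reference
        self.refs[key] -= 1
        if self.refs[key] == 1:
            del self.refs[key]  # a single reference needs no table entry
        return False
```

Note that `free` has to look in the table every time, even for blocks that were never cloned, which is the cost on free mentioned above.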
A
So when it comes to design: as I mentioned, there is no BP rewrite, so we cannot modify the BP. We can neither put a special flag nor put a reference counter into a block pointer.
A
This might be a bit, let's say, discouraging. There is a similar limitation with dedup, but because of the checksum, dedup can reconstruct it: basically, when we send and receive the block, dedup can figure out, based on the checksum, that this block has more entries, and simply update the dedup table.
A
Here we have no idea whether the entry is on the target system, and the target system has no idea whether the data that is coming has more references. So if you use the block reference table a lot and you send a dataset like that over, unfortunately it will be much bigger than the original dataset.
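A toy calculation shows why the received dataset grows. It assumes, as described in the talk, that the stream carries each file's data blocks in full with no sharing information; the dictionary layout and both function names are invented for this sketch.

```python
def space_used(files):
    """Blocks allocated on the source, where shared blocks count once."""
    return len({blk for blocks in files.values() for blk in blocks})


def blocks_sent(files):
    """Blocks carried by a stream that has no notion of sharing."""
    return sum(len(blocks) for blocks in files.values())


# One three-block file and two clones of it; blocks are (vdev, offset) pairs.
original = [(0, 0), (0, 1), (0, 2)]
files = {"original": original, "clone1": list(original), "clone2": list(original)}
```

Here the source uses 3 blocks while the stream, and therefore the destination, holds 9.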
A
So that's a bit disappointing, in my opinion.
A
I would like this to work across datasets. I would like to be able to move or clone files between datasets, but of course I don't want this to work when there are different encryption keys. This is a similar limitation to dedup. If the encryption key is inherited, like through zfs clone,
A
then that should be fine. But if it's a totally independent encryption key, then we won't be able to clone files between datasets with different keys and, of course, between a dataset that is not encrypted and a dataset that is encrypted, and the opposite.
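The rule stated here reduces to a single comparison. In this sketch the key identifiers are hypothetical labels, with `None` standing for an unencrypted dataset; how the prototype actually identifies keys is not covered in the talk.

```python
def can_clone_between(src_key_id, dst_key_id):
    """Allow cloning only when both datasets use the same key, or neither is encrypted.

    Key ids are hypothetical labels; None stands for an unencrypted dataset.
    """
    return src_key_id == dst_key_id
```

A dataset created with zfs clone would carry the same key id as its origin in this model, so cloning between the two would be allowed.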
A
Okay, I was trying to be brief, but it really went quickly, so I guess there will be some more time for questions. I will just mention the status of the project. I don't want to get your hopes too high: this is just my hobby project, so the progress is very slow.
A
You can register additional system calls from kernel modules. I'm not sure about Linux, probably too, although Btrfs is using an additional ioctl, so maybe that is a better way to go, but for now I'm just using an additional system call.
A
There are some things that already work, but I'm sure there are a lot of corner cases we have to consider that are not considered at this point at all.
A
So it's a very early prototype.
B
And then you'll see in the Zoom there's a Q&A with those four questions queued there.
A
So I think yes, definitely, if you install the new, updated system files from some template and you use file cloning for that, and not just a regular copy or the install tool, which, of course, we can extend: cp and install could use file cloning where feasible.
A
So then, yes, definitely, and you can keep the savings even after updating your jails. I hope that answers the question.
A
So Christian is asking: what is the reason why BRT files would not be sendable?
A
The problem here is that when we send the file, we don't really transfer any information about the block pointer. Matt can correct me if I'm wrong here, but at the DMU level we lose that information. We simply send data blocks, and at the destination ZFS they will have totally different block pointers: there is a different vdev, a different offset, and this is what we use to reference the blocks.
A
So we lose that information when we send; it's totally different on the destination dataset. That is especially so because I would like this to be usable across datasets, but even within a single dataset we cannot assume that there is a copy of the block already, because zfs send and receive is one-directional.
A
We discussed that initially. If anyone has ideas on how to address that, I'm happy to discuss it. But at this point I think that's, unfortunately, a limitation of this feature.
A
Yes, although that's an interesting question: if I clone a block that is much, much older, it will be included in a specific snapshot range on the destination dataset. So maybe you could detect...
B
But in any case, the discussion is really about whether we could preserve the block sharing of BRT, not about whether send is going to work at all with this.
A
Cool. Okay, a question from Johnny: why do we need another table? Can't we simply use dedup tables? Initially I was thinking about that.
A
But I believe that an additional table is a better option, because it's much lighter. The entries are much, much smaller because there is no checksum: it's simply vdev, offset and counter. The dedup table is also a bit different: as I mentioned, we put every single block into the table, even blocks with a single reference.
A
With BRT there are no such blocks with a single reference. If the block reference count drops to one, it's removed from the table, so there are different dynamics. It's also a different way to clone files, because we have to do a special read, since we don't want to read the data, and we have to do a special write, because with dedup you actually write the data itself and then, within the ZIO pipeline,
A
you decide, after calculating the checksum, whether this entry should go to... well, you decide earlier, because the dedup property is set on the dataset. But basically you have to provide the data, which should go through the ZIO pipeline so we can calculate the checksum and either put the block into the table or just write the block to the disk. With BRT...
B
You wouldn't want to combine it with the dedup tables in case you're also using dedup. Whenever you're doing a free, you have to look in the block reference table, and you don't want that block reference table to be larger than necessary. You wouldn't want to have to go look in the dedup table when you're doing a free that doesn't involve dedup but might involve the block reference table.
A
Yeah, so my hope is that the BRT will be much, much smaller. Of course you can always go to extremes, but you need to do much more work to actually make BRT a performance problem because, as I said, it's much more compact and specialized. Okay, a question from Crest: could this accelerate copy_file_range for the NFS server? To be honest, I'm not familiar with the system call, so I'm not sure how it works exactly. Maybe somebody else knows how it works.
B
I think that it would be able to take advantage of this. It's basically like copying a range of a file to another file.
A
So that's interesting, because there are two approaches I'm considering. One is a clonefile system call that basically clones an entire file, but this can also be implemented as cloning individual blocks.
A
Then you could punch a block into the middle of a file that is basically a cloned block. But of course you have to preserve alignment, so you cannot punch the block anywhere into the file; it has to be at block-size boundaries. So it's a bit limited. If copy_file_range allows moving the data into any place within a file, then it won't work for the general case, but it can work for some specific cases where you preserve the boundaries.
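The alignment constraint can be made concrete with a small helper that decides how much of a range copy could be served by cloning whole blocks. This is a sketch under the assumption of one fixed block size for both files; the function name and the head/cloned/tail split are invented for illustration and do not describe how the prototype or copy_file_range behaves.

```python
def split_for_cloning(src_off, dst_off, length, block_size):
    """Split a copy request into (head, cloned, tail) byte counts.

    Only whole blocks that sit at the same position relative to a block
    boundary in both files can be cloned; everything else must be copied.
    """
    if (src_off - dst_off) % block_size != 0:
        return length, 0, 0  # boundaries differ, so nothing can be shared
    start = -(-src_off // block_size) * block_size       # round up
    end = (src_off + length) // block_size * block_size  # round down
    if end <= start:
        return length, 0, 0  # the range does not cover a whole block
    return start - src_off, end - start, src_off + length - end
```

With 4096-byte blocks, a fully aligned 8192-byte request is cloned entirely, while a request whose source and destination offsets differ by 512 bytes has to be copied in full.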
A
So, a question from Alan: what type of data structure does the BRT use? Is it stored at the top level of the pool? Yes. Basically, I simply try to reuse as much dedup code as possible, or at least use it more like an inspiration, so it is very similar to dedup in that regard. It uses similar data structures in the same exact place in the pool, so there's a lot of code that is reused from dedup.
A
Okay, and a question from Tenzin: if the source snapshot is removed, what happens to the cloned file? Does the file get copied up to the live system?
A
That's a very good question and, to be honest, I haven't yet figured out the interaction between BRT and snapshots. There are some open questions: how this fits the user's intuition, and what the life cycle of the block is, because the block pointer is not updated. So this is something I still need to...
A
Okay, so the last question, from Christian: you could have the zero hint whether it used the BRT or not.
B
Cool. Well, thanks, Pawel, for telling us about this. This is a really cool idea, and thanks to everyone who asked all those great questions.