From YouTube: Dedup Send Receive by Paul Dagnelie
Description
From the OpenZFS Developer Summit 2018
slides: https://docs.google.com/presentation/d/1Ecwg3L52FQ7B86QPleVqXDJe9g5Jv_S4TJsNyK5uJLI/edit?usp=sharing
ZoL issue: https://github.com/zfsonlinux/zfs/issues/7887
All right, thanks, Matt. Hello, my name is Paul Dagnelie, and I'm going to be talking to you today about ZFS dedup send: what it is, what trade-offs it has, and what its future might be. I know probably a lot of you already know what it is and how it works, but I'm sure there are some here who don't, and some people online who don't, so I want to go over an introduction of terms before we get started.
ZFS send serializes a file system: it finds all the blocks and writes them out over the network or onto a tape for later use as backups. It's very efficient, and you can do incrementals between one snapshot and another. There's a variety of very useful features that make it a very high-quality data replication and backup tool.
ZFS dedup is a feature of ZFS that allows you to deduplicate the data in your pool. It will find data with the same checksums, and then, rather than storing multiple copies of what is essentially the same data, it will store it once and keep references to the others. This is a very powerful tool; it can dramatically reduce your storage costs for certain workloads. But in order to keep performance acceptable it needs to store a lot of data in memory, so it can be a very expensive feature to use in practice.
Dedup send, despite the name, has very little to do with dedup; it does have something to do with ZFS send, so that's good. Basically, ZFS dedup send is a post-processing pass that goes over the send stream, finds records that contain the same data, removes the later ones, and points back at the earlier references in the send stream. It uses a special kind of record called a WRITE_BYREF record, which points at a dataset that has already been received as part of the same package. It has a couple of problems.
On the send side of dedup send, we create a hash table in userland that is very similar to the DDT (dedup table) stored in the kernel when you're using dedup. It's a mapping from a checksum to a combination of a snapshot's GUID, the object in that dataset, and the offset within that object. That's how it does the back-referencing to previously sent data. Every time it sees a new record, it looks at the checksum: if it's a new checksum, it stores it in the hash table; if it's an old one, it finds the value already in the hash table and replaces the entire record with a WRITE_BYREF record pointing at the other data.
All of the processing for this happens in a single thread in userland, and it has absolutely no connection to ZFS dedup: whether you have ZFS dedup enabled or not, whether your dedup table is stored in memory or stored on disk, none of it matters.
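As a rough illustration, the send-side pass described above can be sketched in Python. The record layout, field names, and use of SHA-256 here are illustrative assumptions, not the actual send stream format:

```python
import hashlib

def dedup_send_pass(records):
    """Sketch of the userland post-processing pass behind `zfs send -D`
    (simplified; record fields are illustrative).

    Each record carries a dataset GUID, object number, offset, and block
    payload. The first time a checksum is seen, the WRITE record passes
    through unchanged and its location is remembered; later records with
    the same checksum are replaced by a small WRITE_BYREF record pointing
    back at the earlier copy.
    """
    table = {}  # checksum -> (guid, object, offset) of first occurrence
    out = []
    for rec in records:
        # ZFS would use the block checksum already in the stream;
        # here we just hash the payload.
        csum = hashlib.sha256(rec["data"]).digest()
        if csum in table:
            guid, obj, off = table[csum]
            out.append({"type": "WRITE_BYREF",
                        "object": rec["object"], "offset": rec["offset"],
                        "ref_guid": guid, "ref_object": obj,
                        "ref_offset": off})
        else:
            table[csum] = (rec["guid"], rec["object"], rec["offset"])
            out.append(rec)
    return out
```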
You might expect it to make use of the DDT, but it does not, and this is one of its many misleading features. On the receive side, when you're receiving one of these WRITE_BYREF records, you have the GUID of the snapshot that the data is stored in, the object in that snapshot, and the offset within that object. So your next step is: you have to go and open that snapshot, find that object in that snapshot, and hold the appropriate dbuf with the data, so that you can write it into the correct location.
That all happens within one command, so we build a map from GUID to snapshot over the course of a command, and keep track of all the receives that are part of the same command by passing state back and forth to the kernel. Once we have found the right snapshot, we hold the right dbuf and copy the data from the existing snapshot into the new snapshot. And again, whether you have dedup enabled or not on the target, it doesn't matter.
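The receive-side lookup can be sketched the same way. Here snapshots are modeled as nested dictionaries, and the function name and record fields are illustrative assumptions rather than the real libzfs/kernel interface:

```python
def receive_byref(rec, guid_to_snapshot):
    """Resolve a WRITE_BYREF record on the receive side (illustrative).

    guid_to_snapshot is the GUID-to-snapshot map built up over the course
    of one receive command; each snapshot is modeled here as
    {object: {offset: data}}. Returns the block to write at the record's
    own (object, offset) in the new dataset.
    """
    snap = guid_to_snapshot[rec["ref_guid"]]  # find the source snapshot
    # In ZFS this step opens the snapshot, finds the object, and holds the
    # appropriate dbuf: an on-demand read, which is where the extra read
    # latency comes from.
    return snap[rec["ref_object"]][rec["ref_offset"]]
```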
You'll have to read the data from the other dataset when you're receiving a WRITE_BYREF record, and that involves doing an on-demand read before you can do your write, which can potentially increase latency and decrease your bandwidth. There is prefetching to help make it better, but prefetching does not always do what you want it to do, especially in high-I/O situations.
Overall, the big drawback to this feature is that when you're doing a ZFS send, the dedup feature only works within one send command. So if you're doing a normal send or an incremental, that's one snapshot, and the data in that one snapshot is all you can dedup against. If you're doing a send with capital -R or capital -I, you have a larger space that you can dedup over, but you also spend a lot more memory storing this dedup table in userland. And so for a lot of use cases, dedup send actually isn't going to get you very much, because you don't store a lot of copies of the same data in the same dataset. And of course, much like normal dedup, it only works on whole blocks.
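That whole-block limitation is easy to demonstrate: data only dedups when identical content lands on identical block boundaries. A toy sketch with an 8-byte block size (real ZFS record sizes are much larger, e.g. 128K):

```python
import hashlib

BLOCK = 8  # toy block size, purely for illustration

def block_checksums(data):
    """Checksum each fixed-size block: the granularity at which both pool
    dedup and dedup send can match data."""
    return [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]

# Four identical blocks all dedup down to one unique checksum.
a = block_checksums(b"ABCDEFGH" * 4)

# The same content shifted by one byte no longer lines up on block
# boundaries, so none of its blocks match any block of `a`.
b = block_checksums(b"X" + b"ABCDEFGH" * 4)
```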
It multiplies the possibilities you have to cover when you're writing tests, and so it adds maintenance cost, it adds development cost, it adds code bloat. It has a bunch of development-focused downsides in addition to the practical downsides of using it. So the question that this talk is intended to ask is: what happens if we deprecated it? This is an interesting question, because we've never really deprecated a feature like this before. ZFS has always had a very strong philosophy, a very good philosophy, that your data should remain accessible.
You should always be able to keep your data. You should always be able to look at old data: with new bits you can load old pools and still get all the functionality, and with new receive bits you can receive your old streams and be able to restore your backups. But as the file system grows, as we add more features and time goes on, if we never remove anything, the codebase will grow without bound, complexity will grow, and it will become harder and harder to maintain and understand all of this logic.
So even if dedup send isn't the thing to remove, it's a good template for thinking about what deprecation should look like for ZFS: how do we want to do something like this, if we decide that we do need to remove features to make our lives easier as developers? Here is a proposal for how we would do this for dedup send. We would start in much the same way that you deprecate basically any feature from any programming language or utility: you start throwing warnings when people use it.
Eventually you remove that, and then, after a while, when people have gotten used to it, you remove the logic from libzfs on the sending side. Then, before you actually remove the receive logic, you could create some sort of utility that would re-duplicate the data: basically, if you have your stream stored on disk, it can iterate over the stream, find the WRITE_BYREF records, and replace them with the appropriate WRITE records.
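Such a utility (OpenZFS later shipped one as `zstream redup`) would conceptually look like the following sketch; the record layout is again an illustrative assumption:

```python
def redup_stream(records):
    """Sketch of a stream re-duplication pass: walk a deduplicated send
    stream stored on disk and replace each WRITE_BYREF record with an
    ordinary WRITE record carrying the data it referenced.
    """
    seen = {}  # (guid, object, offset) -> payload from earlier WRITE records
    out = []
    for rec in records:
        if rec["type"] == "WRITE":
            seen[(rec["guid"], rec["object"], rec["offset"])] = rec["data"]
            out.append(rec)
        elif rec["type"] == "WRITE_BYREF":
            # Resolve the back-reference against data already in the stream.
            data = seen[(rec["ref_guid"], rec["ref_object"],
                         rec["ref_offset"])]
            out.append({"type": "WRITE", "guid": rec["guid"],
                        "object": rec["object"], "offset": rec["offset"],
                        "data": data})
        else:
            out.append(rec)  # pass all other record types through untouched
    return out
```

This works because dedup send only ever points backward within the same stream, so a single forward pass sees every referenced block before any reference to it.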
You do have to maintain this utility, but it doesn't need to consider new features that get added to ZFS; its test matrix is fixed in time, because once dedup send is gone, you no longer have to worry about adding new test cases for it or making it deal with new features and complexities. Then, finally, once everything is ready and we feel confident about it, you remove the logic from the receiving side and remove all traces of the deprecated code, and at that point the feature would be gone.
Obviously, this is not something that we could, should, or would want to do as a declaration, or as a patch that we would even propose, without getting a lot of feedback first. This is kind of a first for the community, and we want to hear from everybody, not only about potentially doing this to this feature, but also about what this process should look like in the future: people who use the feature, other developers of ZFS, other members of the community who have feelings about how this process should go.
We want to talk about this, have a dialogue, and figure out what the right thing to do is going forward, because it seems like at some point we are going to need to think about ways to stop the file system from just growing endlessly. At some point we do need to think about what things should be in and out of scope, and how we deal with making those changes and those decisions. So I really look forward to people asking questions about this talk and emailing me about it.
This affects the whole community, so I feel like it should probably be a discussion that happens in the open, on some sort of public forum. Maybe the right place is an issue on GitHub or a message on the mailing lists; probably the right thing to do is to look into something like that, where we can track the discussion and store the feedback in a way that is actually persistent, and not just receive emails that I will not be able to keep.
We don't have a good way to gather metrics on something like this. The limited evidence that I have comes mostly from looking at how bugs get discovered when we do release a feature that has bugs in it, and from looking at tools like zrepl and a couple of the other automated data backup solutions. None of them really take great advantage of this feature, and almost every tool I've seen that does use it doesn't use capital -R or capital -I.
[Audience] That kind of goes to what I was going to suggest, whether it's deprecated or not. It's been a while since I looked at the man page for it, but I think there is actually a subtle implication that it is in some way related to actual dedup. Cleaning that up, and adding explicit commentary that it in fact has nothing to do with the dedup feature and that there's a very limited use case for it, might be worth doing.