From YouTube: Fast Clone Deletion by Sara Hartse
Description
From the 2017 OpenZFS Developer Summit:
http://www.open-zfs.org/wiki/OpenZFS_Developer_Summit_2017
So, yeah, I'm Sara, I'm working at Delphix, and I'm going to talk about a ZFS performance optimization with regard to how we delete clones. Through my talk I'm going to cover how we delete clones now and why it can be really, really slow; the fast solution algorithm; the changes we've made to that algorithm to basically make it more scalable within the ZFS ecosystem; and then I'm going to talk about the performance gains and basically why this is an awesome, awesome thing.
So: clone deletion. As you know, clones are mutable copies of existing datasets. You take a snapshot, which is a frozen version of the dataset, and then you can make a clone, which means you can change things based on that snapshot. Copy-on-write means that creating a clone is as simple as pointing to the root of a given snapshot, so creating a clone is super fast and super inexpensive, and the clone then diverges from the snapshot throughout the course of its lifetime.
So here's an example: the green are the blocks that pertain to the clone, and the blue is the underlying dataset. In some cases this isn't too bad. One example is a contiguous write: down at the bottom are the blocks written by the user, and these are all contiguous. They take up one contiguous chunk of memory, so not that many indirect blocks have to be modified and traversed to find the things you want to delete.
On the other hand, there's no rule that your writes have to be contiguous, and if you have sparse writes all over the place, then you end up basically having to touch every part of the tree when you want to go and find what you have to delete. This can be really expensive, because you have to read in your indirect blocks and ask: okay, where are the other blocks, when were they modified, should I actually care about them?
So the takeaway from this is that, in the worst case, deleting a clone is proportional to the size of the original underlying dataset. It doesn't matter that you only made four tiny changes to the clone: if you made them in scattered enough places, your deletion cost can be proportional to whatever that clone was based on, and that can get really costly. This isn't a super dramatic example, but you can see these are all underlying datasets of varying sizes, while the amount of data written to the clone is fixed.
It's 500 MB, but you can see that 500 MB of writes leads to very different deletion times depending on how big the dataset is. The red bars are for random writes and the green are for contiguous writes, so you can see that random is worse than contiguous, although contiguous can also get pretty bad.
We've seen users run into this situation where you may have terabytes of data: you make a clone, you make a few small changes, you go and delete it, and it takes you 45 minutes. So it's a problem. All right: our algorithm.
The basic premise is to change this underlying issue that deleting a clone means caring about what the original dataset looks like and how big it is. What we want to do is keep track of the clone-specific writes and deletes as they occur, so that when it comes time to delete them, we only have to care about the actual changes that happened on the clone. The idea is to store them in a livelist: to delete a clone, we just have to process each element in this livelist, and the work is proportional to the number of writes to the clone, not to the size of the dataset it was based on. So here's some more detail on this livelist algorithm.
We need this data structure, and we are going to enqueue the block pointers that were both allocated and freed on the clone as the writes occur. We need both the ALLOCs and the FREEs because, if we only kept track of the blocks that were allocated over the course of the clone's lifetime, some of those blocks might have since been freed naturally by stuff that happened on the clone, and we don't want to double-free them in the end.
So when we want to delete, we need to determine which are the not-yet-freed blocks, the ones that are circled in red. The way we do this is: we step backwards through the livelist, insert FREEs into an AVL tree, check the ALLOCs for membership in the AVL tree, and based on that we know whether we have to delete them.
So, a little more of a step-through of how exactly that works. We start at the end of the livelist. We have this handy invariant that all frees of a block will happen after it's allocated; that's why we can go backwards. We see a FREE, and we keep track of it in our AVL tree. We see an ALLOC and we check if it's been freed; it hasn't, so we keep track of it as something we actually do need to free. We see another ALLOC: oh, it's already been freed, so we ignore them both. We store our FREEs and then check the ALLOCs for membership: this ALLOC has already been freed, so we ignore it; this ALLOC has not, so we free it; and this ALLOC has, so we ignore it. That's basically the core of the correctness of this algorithm.
We have this data structure, we process it, and we know what we have to delete; significantly, this is proportional to the things that have happened to the clone, not to its dataset.
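To make that concrete, here's a minimal sketch of the resolution step in Python. This is not the actual ZFS C implementation (which keys an AVL tree on the block pointer); a set stands in for the tree, and the entry format is made up for illustration.

```python
ALLOC, FREE = "alloc", "free"

def blocks_to_free(entries):
    """Walk the livelist backwards; return the ALLOCs with no matching FREE.

    entries is a list of (op, blkptr) pairs in write order, so every FREE
    of a block appears after its ALLOC -- the invariant that makes the
    backwards walk correct.
    """
    freed = set()    # FREEs seen so far (stands in for the AVL tree)
    to_free = []     # live ALLOCs: what we must free when deleting the clone
    for op, blkptr in reversed(entries):
        if op == FREE:
            freed.add(blkptr)        # remember the free
        elif blkptr in freed:
            freed.remove(blkptr)     # the ALLOC/FREE pair cancels out
        else:
            to_free.append(blkptr)   # never freed: delete it at destroy time
    return to_free

# Block 2 was allocated and later freed by the clone itself, so only
# blocks 1 and 3 must be freed when the clone is destroyed.
assert blocks_to_free([(ALLOC, 1), (ALLOC, 2), (FREE, 2), (ALLOC, 3)]) == [3, 1]
```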
So where are we now? This algorithm was implemented this summer by our Delphix intern, Serrano Gopal, and she saw really good performance improvements, and signs are looking good. There are still some things that we want to tweak in this algorithm to make it more scalable. One, as you might have noticed, is that the livelist can grow arbitrarily large.
This is also something that's tricky to incrementally destroy; I'll talk about that a little bit later. But the pattern of incrementally destroying something, and not doing all of your work in a single transaction group, is an important pattern within ZFS, to keep things flowing and to not spend so much time deleting that we prevent users from actually doing stuff.
So the next step, to try and address these issues, is to break the livelist into smaller, fixed-size sublists. The idea is that if we can make the guarantee that each sublist contains both the ALLOCs and the FREEs for a given block pointer, then we can process these sublists independently. In the first sublist, if we want to delete this, we don't have to care about anything that happens in the other sublists, because we only want to load one into memory at a time, instead of loading an arbitrarily large amount of stuff into memory.
So one question is: how do we guarantee that matching blocks are grouped together? The solution is to group them based on birth time. Basically, this sublist will be all the blocks that were born in this transaction group, and then the next one can be the blocks that were born between these transaction groups. One of the key ideas is that we create these extra sublists dynamically: as we run out of space in one, we create another, and that way we don't have to pre-allocate some number of initial sublists; see the sketch below.
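As a sketch of how that bucketing might look (illustrative Python, not the real on-disk structure; the names and the txg-boundary handling are my assumptions):

```python
class Sublist:
    def __init__(self, min_birth_txg):
        self.min_birth_txg = min_birth_txg  # covers blocks born >= this txg
        self.entries = []

class Livelist:
    MAX_ENTRIES = 500_000  # roughly the "1% of RAM" tunable from the talk

    def __init__(self, clone_created_txg):
        # The livelist only tracks blocks born after the clone was created.
        # Start with one sublist; more are created dynamically as needed.
        self.sublists = [Sublist(clone_created_txg)]

    def enqueue(self, op, blkptr, birth_txg):
        # Route by birth txg: a FREE carries the birth txg of the block it
        # frees, so it always lands in the same sublist as its ALLOC.
        target = next(sl for sl in reversed(self.sublists)
                      if birth_txg >= sl.min_birth_txg)
        target.entries.append((op, blkptr))
        # Grow on demand: when the newest sublist fills up, open another
        # for blocks born in later txgs -- nothing is pre-allocated.
        newest = self.sublists[-1]
        if target is newest and len(newest.entries) >= self.MAX_ENTRIES:
            self.sublists.append(Sublist(birth_txg + 1))
```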
Another question is: how big should an individual sublist be? Because of the transaction-group grouping, we know that the smallest it can be is the number of writes that happen within a given transaction group, so that's a lower bound, and obviously not a fixed one; the upper bound is the real question here.
This is still something that hasn't been fully decided on; it's one of the main tunables for this performance feature. But basically, what we're asking right now is: what's a smallish fraction of RAM to use? We're doing about 1% of RAM right now, and a sublist can have about half a million entries. So we've solved that problem of reading arbitrarily large amounts of memory in when we delete.
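(As a back-of-the-envelope check, and my arithmetic rather than the talk's: an on-disk ZFS block pointer is 128 bytes, so half a million entries works out to roughly 500,000 × 128 B ≈ 64 MB per loaded sublist, which is about 1% of RAM on a machine with ~6.4 GB of memory.)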
The other nice thing is that this presents a handy interface for incrementally destroying something: if we can process these sublists independently, then we can destroy them in different transaction groups.
The other handy thing to consider is the concept of asynchronous destroy. Mark Maybee mentioned this in his talk, but the idea of doing a lot of your destroy work in the background is pretty appealing. So what we wanted to figure out is: what part of this delete algorithm actually needs to be done in syncing context, and which part can be done by a worker thread in the background?
So when we call zfs destroy on a clone, it will basically enqueue the clone's ID as something to be deleted, and then a worker thread will do the work of loading in the livelist and running the algorithm with the AVL tree to figure out what we actually need to free; then it can call a sync task which will actually do those frees. This has the benefit of pushing the loading work to the background, making sure that we're only doing critical work in syncing context, and it lets us do it incrementally as well.
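Here's a rough sketch of that open-context/syncing-context split, reusing blocks_to_free from the earlier sketch; the queue, the worker loop, and both callbacks are hypothetical stand-ins for the real ZFS machinery.

```python
import queue

clones_to_delete = queue.Queue()  # clone IDs waiting to be destroyed

def zfs_destroy_clone(clone_id):
    # Open context, cheap: just record that this clone should be deleted.
    clones_to_delete.put(clone_id)

def delete_worker(load_livelist, dispatch_sync_task):
    # Background worker: all the expensive reading and resolving happens
    # here, outside of syncing context.
    while True:
        clone_id = clones_to_delete.get()
        entries = load_livelist(clone_id)   # disk I/O, done in open context
        to_free = blocks_to_free(entries)   # the AVL-tree resolution step
        # Only the actual frees run as a sync task, and they can be split
        # across transaction groups (e.g., one sublist per txg).
        dispatch_sync_task(to_free)
```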
So where are we now? Pros: we've limited how much memory is loaded in at once with the sublists, and we can delete things incrementally and more quickly. A con, as you might have noticed, is that the number of sublists is unbounded. We still have this problem of using arbitrarily large amounts of disk space throughout this algorithm: if a clone lives a really long time, we can have a ton of sublists. This gets expensive in two ways.
Every time you write to a clone, or delete a file, or move something around, we're going to have to interact with this livelist, and we don't want to make that interaction so costly that we're actually degrading the performance of writes to the clone. The more sublists we have, the greater the number of I/Os we have to do to propagate these changes. In the original case, back when we had one giant list, you just appended to the end of it, and that was it: there was only one place to put things. But when you have a bunch of sublists, the cost of inserting a bunch of different block pointers is greater, because they're spread over different areas. Cool.
So the next idea is condensing. After a block is freed, the livelist contains irrelevant information. For example, in this one we have block pointer 1: we know that we freed that block, but we're still keeping this information around, we're still keeping this history, even though we're not actually going to do anything with it when we actually delete the clone.
So the idea is that, before we actually delete, we can condense. We basically use the exact same algorithm we use for deletion to determine which block pointers are still alive, which here are 2 and 3, and we just get rid of the ones that are no longer alive. That reduces the amount of disk space we're using: we're no longer storing block pointers that we don't care about. Can we also reduce the number of sublists we're using? Well, since we've compressed these things into smaller sublists, if they're under half of the maximum size of a sublist, then we can merge them together and reduce the overall number of sublists.
So here we've deleted the unnecessary entries and then merged them together into one sublist that stores the information more compactly, and it's information we actually care about when we're deleting.
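A sketch of condensing and merging, again reusing ALLOC and blocks_to_free from the first sketch (the half-the-maximum merge threshold is the one mentioned in the talk; everything else is illustrative):

```python
def condense(entries):
    # Same resolution we use at delete time: every ALLOC/FREE pair is
    # dropped, leaving only the still-live ALLOCs.
    return [(ALLOC, bp) for bp in blocks_to_free(entries)]

def merge_small(sublists, max_entries):
    # The talk's heuristic: if two neighboring condensed sublists are each
    # under half the maximum size, their merge still fits under the cap.
    out = []
    for entries in map(condense, sublists):
        if (out and len(out[-1]) <= max_entries // 2
                and len(entries) <= max_entries // 2):
            out[-1].extend(entries)   # fold into the previous sublist
        else:
            out.append(entries)       # keep as its own sublist
    return out
```

Merging adjacent sublists is safe here because condensed sublists contain only live ALLOCs, so the ALLOC-before-FREE ordering invariant no longer constrains them.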
So, in general, with all of this we've made the work of deleting a clone proportional to the number of writes to that clone, as opposed to the size of the dataset. We've limited the memory that we load at once: we now have tight control over what happens when we delete, so we're not loading tons and tons of memory when we delete something, which was an issue with the original livelist algorithm, and we did that using sublists.
The sublist paradigm also gave us a strategy for incrementally deleting sublists and asynchronously destroying them, and then this condensing and merging strategy gives us the ability to limit the number of sublists by periodically condensing them. I should have mentioned this on an earlier slide, but the condensing strategy raises a lot of questions: in an ideal case we would never store redundant information, but loading everything in to get rid of redundant ALLOCs would be pretty costly to do forever.
So we probably want to have a kind of background worker thread that's periodically condensing when it can. How often should we condense? When should we condense? These are all questions that we don't know the answers to yet, and it will be interesting to figure out how the system acts under different workloads.
So here are some performance numbers. These numbers are from Serrano's internship, her implementation of the livelist, so these are from the vanilla one-big-livelist version. This is the contiguous write case, which is where we saw the least concerning behavior from the original delete strategy. This is basically just a small example, but it turns out we have actually reduced the number of I/Os with this livelist strategy, so deletion is a little bit faster even in this contiguous write case.
As I said, we've reduced the number of I/Os for the deletion case, but we've also added some I/Os that happen while we're writing to the clone. We've kind of spread out some of this work: instead of doing all the processing and all the deleting work when it's time to delete, we're paying it forward by keeping track of what we're going to delete, which is handy.
The improvement in the worst case is really dramatic. You can see in this case we have 500 GB, which is the underlying dataset, with 500 MB of sparse writes written to it, and it took 46 minutes to delete with the old method; with this method it takes 30 seconds. That's about a 100x improvement, which is pretty good. So this, again, is the sparse case, and that's where we see the biggest gains.
like.
It's
also
really
interesting
to
kind
of
look
at
this.
A
A
This
worst
case
scenario
can
show
up
when
people
have
really
big
really
big
data
sets
they're,
creating
clones
they're
deleting
clones
or
trying
to
make
more
clones
and
people
just
like
they
think
they've
deleted
something,
but
they
can't
actually
create
a
new
thing,
because
the
blocks
have
not
been
reclaimed
So speeding this up is really good for high-usage systems with really big datasets. And then we saw the tweaks that were made to make the algorithm more scalable; these are really interesting, because we're balancing things like how much memory we want to sacrifice to this livelist system, and how much work we should spend optimizing it throughout the course of the livelist's, or the clone's, lifespan.
So it's underway, and hopefully it will be coming soon to OpenZFS. I just wanted to say a quick thank you to Serrano, who did a lot of the original work on this project, and to Matt and Serapheim, who helped me out a lot at Delphix. So: questions?
[Audience question]
Yeah, so the question is: in the case of a clone promote, how do we handle this? Because you're basically changing the roles of these datasets and the relationship between them. Currently we are not supporting this feature for promoted clones.
So the question was: how do we resolve the fact that sublists need to be independent of each other, namely that all the ALLOCs and all the FREEs for a given block pointer should end up in the same sublist? The way we do that is we group the sublists by birth time: a block pointer that was created will have a birth time, and then when it's freed, the FREE will have the same birth time as the created copy, because they're the same block pointer.
[Audience question]
Yeah, so the question was: have we looked at the case where a clone has almost completely diverged from the parent, and whether or not this strategy is still beneficial? We haven't looked at that performance in detail, but we did see that the contiguous write case still gave us performance gains.