From YouTube: ZFS TRIM Explained by Brian Behlendorf
Description
From the 2019 OpenZFS Developer Summit
slides: https://drive.google.com/open?id=1Osc5IajVUqfrFlXFiz5m7p9UXa0iFV0f
... but your mileage may vary. So why is this, again? I don't want to dig too deep into the media details here, but with SSDs I think it is worth briefly mentioning the background, where this phenomenon comes from. Fundamentally it comes from the fact that on NAND SSDs, at least, you can only write pre-erased pages. These are typically 4K pages on the media side, and writing them is very, very fast.
A
You
can
write
them
very
very
quickly
when
they're
pre
erased,
but
you
can
never
overwrite
them
when
you
need
to
erase
them,
you
have
to
erase
full
erase
blocks.
These
are
much
much
larger
and
you're
gonna
erase
a
lot
of
pages,
so
this
operation
is
very,
very
slow
and
hurts
performance
so,
like
I,
say
a
couple
quick
examples
of
this,
because
I
think
we're
all
kind
of
up
to
speed
in
how
SSDs
work
they're,
not
exactly
new
technology
but
write
amplification
is
that
effect,
you're,
gonna
have
or
SSDs.
So
here's
a
really
simple
example.
Here we've got three erase blocks, basically, and we get a new write coming in; those are the green blocks. We can write it all very, very fast. But if you want to overwrite it, you just can't overwrite that one sector. What you're going to have to do is read all the previous data, likely, and write it to a new, unwritten spot, because you can't overwrite in place.
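To put rough, illustrative numbers on that (these figures are assumptions, not from the talk): with 4 KiB pages and a 256 KiB erase block, overwriting a single 4 KiB sector in a full block can force the drive to copy the other 63 valid pages elsewhere and erase the whole block, roughly 64 times more media writes than the 4 KiB the host asked for. That multiplier is the write amplification being described.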
This can obviously get much worse. Along with this, another effect people know about is garbage collection. So here we've got the same three erase blocks, and now we've got old data on the middle one. This is data that the SSD moved internally; it knows it's not valid anymore, and it knows where the real, good data is. Well, we can't reuse those blocks until... well, if you get another write coming in, you're going to need to erase that block, and erasing that block is really, really slow with SSDs.
A
So
you
come
in
your
you
in
a
race
you're
right
data
there,
this
full
erase
right
cycle
is
really
expensive.
So
that's
just
a
quick
issue
of
the
problem
with
SSDs
and
why
trim
is
important.
So
the
solution
to
this
needs
to
be
tackled
at
the
file
system
level,
and
this
is
why
people
are
keen
on
having
trim
in
their
file
systems,
because
only
the
file
system
knows
which
blocks
are
actually
allocated
and
which
blocks
are
actually
free.
The SSD internally can be very, very clever about making sure things get pre-erased and moving stuff around, but at the end of the day it can only do so much, and this gets a little bit harder the fuller the SSD gets. The more data it has to manage internally, the worse this problem gets, and it manifests itself as reduced write performance over time with your SSD. Originally file systems were optimized for hard drives, so this wasn't a problem: unlike SSDs, hard drives don't suffer this rewrite penalty.
A
You
can
pretty
much
expect
the
same.
Rewrite
performance
more
or
less,
for
your
drive
for
the
life
of
the
drive
for
SSD
is
trims
kind
of
more
requirement
right
because
you
expect
the
performance
degradation
it
internally.
Like
I
say,
the
drive
could
only
do
so
much
so
there's
some
nice
effects.
I
come
from
this
and
motivations
for
implementing
trim
support
in
ZFS.
Again,
that's
probably
why
it's
been
one
of
the
most
requested
features.
Typically you'll see improved write performance if you have pre-trimmed devices, because you've got a lot of these pre-erased cells that you can write to very quickly. Additionally, it's going to improve device longevity: if you can minimize erase-and-write cycles on your media, so much the better, because all of these flash devices have a limited lifespan. So what you've seen is that TRIM support has systematically been added to most Linux file systems at this point, and I believe this is true for most other platforms as well.
Ext4 added TRIM, Btrfs added TRIM; even FAT, GFS and XFS have all added TRIM support at this point on Linux. But the story doesn't stop there. It's actually a little bit trickier than that, because a file system can have TRIM support and still not behave particularly well. It's for this reason that it's disabled by default on a lot of Linux distros, just out of the box: if you make a new ext4 filesystem on some distributions, TRIM just isn't enabled, because there can be performance concerns with certain devices where TRIM performance isn't great.
So you may have noticed I skipped one file system before: ZFS. ZFS has actually had TRIM support for a long time now. Going all the way back to FreeBSD 9.2 there's been a version of TRIM in ZFS, and Nexenta added a version to their product too. So TRIM support has been understood to be a good thing for a long time, and there have been versions of it out there, but none of those made it into the Linux port. That was true until the last release, at least for ZFS on Linux.
A
Oh
wait:
we
went
and
took
it
as
an
opportunity
to
revisit
why
trim
support
had
never
been
added
and
to
look
at
how
we
really
wanted
to
do
it
in
the
in
the
pork,
so
design
goals
the
first
one,
probably
a
no-brainer
right
we
want
online
trim
support,
did
not
hurt
applications;
they
should
actually
help
application
performance.
We
don't
want
it
to
do
negatively
impact
anything
additionally
and
we
wanted
to
interoperate
seamlessly
with
all
of
the
existing
open,
ZFS
features.
This
is
one
of
the
things
that
held
back
getting
trim,
support
into
the
Linux
port.
For quite a long time, as you may have noticed, new features have been added to ZFS at a pretty steady clip, and one of the things we pride ourselves on is having all of those features work really well together. So porting one of the previous versions of TRIM was easier said than done: you couldn't pick up the old code exactly as it was and add it in; you'd have to adapt it to work with all the new functionality that was in ZFS.
This again goes to not adding a lot of additional code and keeping the design easy to reason about and relatively straightforward. We didn't want it to be overly complex, and because we wanted this to be used everywhere, I think it was important to keep an eye towards portability between platforms.
For TRIM in particular this is a little bit bigger of an issue, and something I'll touch on later, but we wanted the final implementation to be pretty portable. So these are all good things to strive for, but why didn't it happen until now? The reason it happened now, and was much easier, is a recent OpenZFS feature that got added in the last release.
A
V
dev
initialize
got
added
as
a
feature
also
known
as
the
eager
zero
anybody
here
familiar
with
Vita
initialize.
They
use
it.
Oh
good.
This
is
the
right
crowd
yeah.
So
if
you
dev
initialize
added
as
a
feature
to
initialize
on
allocated
space
in
the
pool,
basically
as
a
background
task
for
performance
concerns,
it
turns
out
that
on
some
systems
you
may
have
a
first
access
penalty.
A
When
you
do
your
first
right,
if
it's
like
a
thinly
provisioned
device
in
a
virtual
environment,
it
may
cost
you
a
fair
bit
to
perform
that
first
right
and
if
you
can
hide
that
first
access
penalty,
that's
great!
So
that's
what
be
dev
initializes.
Therefore,
in
the
background,
it
writes
a
pattern
zeros
to
all
the
unallocated
space
in
the
pool
when
you
think
about
it.
This
behavior
is
exactly
what
you
want
for
trim
right.
A
This
is
exactly
what
you
want
to
do
with
trim
you
want
to,
in
the
background,
not
issue
zeros
to
all
the
unallocated
space.
What
you
want
to
issue
trim
iOS
to
the
all
the
unallocated
space
so
building
on
vida
I've
initialized,
like
put
in
place
a
large
numbers
of
components,
we
already
needed
to
implement
trim,
and
we
could
do
it
in
a
way
that
leveraged
all
of
that
existing
functionality,
which
was
really
nice,
so
in
particular,
I'm
gonna
talk
about
all
these
components
in
a
little
bit,
but
just
a
touch
of
them.
Briefly: vdev initialize added really nice CLI options, along with good reporting, and the ability to enable and disable metaslab allocations, which is how we can safely initialize the unallocated space in the background. It also added some infrastructure for figuring out where to submit the I/Os to the physical disks for the unallocated space. But everything we needed wasn't quite there; some new work was required. Originally vdev initialize was written just for that sole purpose, so it needed to be extended a little bit and made more generic.
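For reference, the initialize feature described here is driven from the command line roughly like this (the pool name is a placeholder; check the zpool man page for your release):

    zpool initialize tank        # start writing the zero pattern to unallocated space
    zpool initialize -s tank     # suspend the initialization
    zpool initialize -c tank     # cancel it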
So what got added? Manual trim. There's a zpool trim command in the latest release, and it does exactly what you'd think it should: it initiates an on-demand trim, and it trims all the unallocated space in the pool. The command-line syntax is very much like vdev initialize: you run zpool trim and give it the pool name. Additionally, if you want to restrict it to particular vdevs, you can do that; just specify the ones you want. There are a handful of options it takes.
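As a concrete usage sketch (pool and device names are placeholders, not from the talk):

    zpool trim tank              # trim all unallocated space on every device in the pool
    zpool trim tank sdb sdc      # restrict the trim to particular devices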
If you want to rate-limit it, you can give it -r, or if you want to do a secure trim and your devices support it, you can do that too. But its purpose is basically to go through and efficiently issue TRIM I/Os for all of the unallocated space in the pool, and that includes merging ranges. The earlier talk today about range trees and metaslabs set me up great here: the range trees basically handle all the burden for ZFS trim, and we just walk that space.
Furthermore, a manual trim will skip very small ranges when issuing TRIM I/Os to disk. If you look back at the SSD background I mentioned, there may be no point in issuing really small TRIM I/Os to the SSD if they're smaller than an erase block, because the drive can't really do anything with them, and they may also be slow. So those are things that can often be skipped.
Additionally, with a manual trim you can cancel it, you can suspend it, and you can resume it. This is because we preserve the state for it on disk, so if your pool crashes, or whatever, or you export it and re-import it, we can continue on. So it behaves nicely from an administrative, user perspective, and all of this has been integrated with the existing zpool status utility, so you can get good reporting on what's going on. So that was a lot of what it can do, but what does it actually look like?
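The cancel, suspend and resume behavior maps onto the same command (placeholder pool name):

    zpool trim -s tank           # suspend an in-progress trim; its state is kept on disk
    zpool trim tank              # running it again resumes from where it left off
    zpool trim -c tank           # cancel the trim entirely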
So zpool status looks like this now in the current release. If you issue a zpool trim and then run zpool status, you'll see it broken down by individual vdevs, which ones are trimming, and there'll be a little "trimming" note appended after them if they're currently in the process of a manual trim. Digging a little deeper, you can give it the -t option if you want to know a little bit more about the actively running trim.
In this case, it looks like the raid-z devices here are about two-thirds of the way through and they're currently running; it'll show you their percentage complete. The devices in the mirror I've actually gone out of my way to suspend here. What you can do is run zpool trim -s and list particular vdevs, and it'll suspend the progress for them, so they won't advance until you resume them. So if there's a performance concern, or some other issue where you want to pause them, you can do that. And then the last one...
...here is done: the log device is smaller, it finished really quickly, we trimmed the whole thing, and it's complete. So you can always run zpool status -t, integrated with the existing tools, and get a good idea of what's going on with a manual trim. So that was a lot of high-level stuff, but let's go dig down into the internals a little bit.
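As an illustration of that reporting (abridged, made-up output; the exact wording varies by release):

    $ zpool status -t tank
      ...
        raidz1-0   ONLINE  0 0 0
          sda      ONLINE  0 0 0  (66% trimmed, started at ...)
          sdb      ONLINE  0 0 0  (66% trimmed, started at ...)
        mirror-1   ONLINE  0 0 0
          sdc      ONLINE  0 0 0  (suspended, 40% trimmed)
      logs
          sdd      ONLINE  0 0 0  (100% trimmed, completed at ...)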
I'm going to talk a little about each of the pieces I mentioned. Metaslabs again: we know a lot about metaslabs now, which is great, so I don't need to go over much of that background. What we're interested in on the metaslab side here is one range tree in particular. For manual trims, there's a range tree in each metaslab called ms_allocatable, and it's what's used to track the unallocated space on disk.
So this describes all the free space, and that's what we're going to need to trim from this particular metaslab. Building on the work that was done for vdev initialize, we added the concept of being able to enable and disable allocations on one of these metaslabs. When you disable a metaslab, it doesn't mean it's completely inaccessible; what it means is that it's unavailable for new allocations. This is the fundamental mechanism that was added to allow us to safely trim the metaslab.
What you never want to have happen is for someone to write a new block to that metaslab, have it land on disk, and then accidentally trim it, and that's what this prevents. Again, we don't want to have too many of these loaded concurrently, because they take a lot of memory and there are performance concerns with that, so we also want to limit the number of metaslabs we have loaded and disabled at any one time.
Vdev translate is the next important bit of the puzzle here. With the ms_allocatable range tree we know which logical ranges we need to trim, but it's a little trickier to map those to physical offsets on disk, and that's what we get out of vdev translate. Vdev translate is a helper that was added to the vdev ops structure.
It lets you do this logical-to-physical translation: you can call the vdev translate function and specify a particular leaf vdev and the logical range you're interested in, and it will return the physical ranges on that vdev, and those are what you can trim. The strategy for this is pretty straightforward; excuse me... starting at the child, you walk up to the parent.
For most vdevs there's not a lot of complicated mapping that needs to be done there, but for raid-z there's a custom one, which helps you figure out which part of that logical range exists on your particular leaf vdev, and then when it winds all the way down to the child it returns that range. So this provides a way to basically ask: which part of this logical range exists on my child vdev, and what are the physical offsets for it? That provides the second bit of this. Now...
...we know what we should trim, and we know where it is on each leaf vdev. The last bit is the bit that was added for trim; the previous two parts were added as part of vdev initialize because, like I say, it has basically the same problem. We needed the ability to actually issue trims, and this is where things diverge a little bit from the other ports.
One thing that bubbled up as interesting when looking at the FreeBSD and the Nexenta versions (I didn't realize this at all until I started looking at them) is that the implementations at the system level for trim are very, very platform-specific. How FreeBSD handles a trim is very different from how illumos handles a trim, which is very different from how Linux handles a trim.
The interfaces to do it are pretty different, and what we really wanted to do was prevent that kind of platform-specific detail from bubbling up into the core ZFS code, which it had in the other implementations, and it was awkward; things just weren't quite in the right place. So what we did in this version is make trim a first-class citizen of the ZIO pipeline.
Basically, you can now issue not just ZIO reads or ZIO writes, you can issue ZIO trims, and they're handled like reads or writes, like anything else in the pipeline. They get aggregated properly, they get added to the queues properly for issuing to the devices; they're normal, full I/O operations.
This way each platform can do the right thing to handle trim for their system; like I say, I was surprised how different they are. For a file vdev, again, you just implement the right thing for it: on Linux the file vdev's trim path is actually a wrapper around fallocate. So if you're familiar with hole punching on Linux, this really boils down to a hole punch: for a file vdev it turns the backing file into a sparse file by punching a hole for that range, if the underlying file system supports it, and ext4, XFS, ZFS...
...all of these support hole punching on Linux, so you even get trim for your file vdevs, which is nice. Most importantly, the higher, core components of ZFS don't know anything about this; all they know is that they're issuing a trim for a particular range that is unallocated.
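For a sense of what that Linux hole punch looks like outside of ZFS, the userspace equivalent is roughly this (illustrative, with a made-up file path):

    # deallocate 1 MiB at offset 0 in a backing file, on a filesystem that supports it
    fallocate --punch-hole --offset 0 --length 1M /var/lib/vdevs/file0.img

This is the same FALLOC_FL_PUNCH_HOLE mechanism the talk says a file vdev's trim gets translated into.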
So, putting that all together, this is what it looks like from an architecture standpoint. When you issue a zpool trim, we're going to spawn one thread per leaf vdev; those are the orange boxes here at the bottom.
These threads are relatively short-lived, and they're going to make one single pass over the metaslabs. The idea is that they go through each metaslab and disable metaslab allocations, so we know it's safe to trim that metaslab, since no new writes are going to be coming in, and then we issue the trim I/Os for the leaf in question.
There is some logic in there to rate-limit things and make sure only so many trims are outstanding at any given time, but basically you issue them all and you wait for them to complete. When that's completed, you re-enable metaslab allocations, and that metaslab has been trimmed. Then you repeat: go to the next metaslab and repeat the process. That, at a high level, is...
...what happens with a manual trim. While it's going along doing this, progress gets saved in a leaf ZAP: each one of these leaf vdevs has its own ZAP, and the state is stored in it. This is what's used for resuming if you suspend it, and for progress reporting, so it knows how far along it is. The state gets tracked on a per-leaf basis, and you can cancel, suspend and resume.
So what does this look like in practice? Here's what we expect. We expect performance to be good again for our SSD, and we expect it to drop off; but when you issue a trim, we trim all that free space and performance should be restored on the system. We've trimmed the device, performance is good, but it will drop off again. This is where a lot of file systems kind of stop, and they say, well, this is good.
You know, every day or so you put in a cron job to trim your file system, or whatever, and you'll have pretty good performance, but you still end up with this sawtooth behavior, depending on the activity of your file system and how busy it is. So it's good, it solves the problem, to have this kind of manual trim that you can do, but it's not quite what you really want.
What you really want is for this to happen continuously. That way, the underlying storage always has an up-to-date mapping of what the file system thinks is in use and what isn't in use, and it can much more efficiently manage that storage to deliver good performance. This was added to ZFS via the autotrim property: there's a pool property you can set, autotrim, on or off, and when it's on you're doing automatic background trimming. To make this work there's one more piece of the puzzle, and that's the free block life cycle in ZFS.
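The property itself is just (placeholder pool name):

    zpool set autotrim=on tank       # enable continuous background trimming
    zpool get autotrim tank          # check the current setting
    zpool set autotrim=off tank      # turn it back off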
When you have a block and you're really about to free it (it's not referenced by a snapshot anymore; it's really going to be freed), the block gets added to the ms_freeing tree. This is a tree that's also attached to each metaslab. As transaction groups get synced it migrates, and the block actually gets freed into these trees called the ms_defer trees. It'll hang out there for a while before we return it to the ms_allocatable range tree.
Once it's on the ms_allocatable range tree again, it can be allocated again for new blocks. This is the normal cycle that blocks go through in ZFS. When you turn the autotrim property on, what happens is that we get this other tree, the ms_trim tree, which gets added to the metaslab. The ms_trim tree is always going to be a subset of ms_allocatable, because it contains only the recently freed blocks on the system, so it'll be some subset of ms_allocatable, and it also only exists in core.
So this should look familiar: automatic trimming. This is the same diagram, but it's a little bit different for autotrim. In this case, when you turn on autotrim, you get one thread per top-level vdev instead of one per leaf. It can also be a long-running thread instead of a short-running one; in fact, it'll run as long as the property is enabled, just looping in the background. There are a couple of advantages to having one thread per top-level vdev instead of one per leaf vdev.
A
In
this
case,
the
big
one
is
probably
that
it
makes
it
easier
to
only
disable
one
meta
slab
at
a
time
when
you're
trimming
it
because
we
don't
really
want
to
disable
more
meta
slabs
and
we
have
to
and
disable
allocations
for
more
places.
We
want
to
have
minimum
impact
on
the
applications,
so
if
we
can
get
only
disable
one
meta
slab
while
we're
trimming
it
and
work
on
all
of
those
children
that's
ideal.
Also
one
thread
is
really
more
than
enough
to
issue
the
iOS
for
the
child
be
devs
here.
So we have one thread for that, and it works very similarly to the manual trim process, except that we continuously iterate over the metaslabs in this case. It'll start at the beginning, it'll disable one of the metaslabs (starting at the top, I guess), and then it's going to consume that ms_trim tree. What I mean by "swap" here is that when it consumes the ms_trim tree, what it actually does is remove it from the metaslab and insert an empty one in its place.
Basically, this allows new frees that happen on the system to be added back to the ms_trim tree, and we'll get to them on the next pass. It's not a big deal that they accumulate there for a while; it's fine. It just makes it easier to operate on the current ms_trim tree, so we can traverse the whole thing and be done with it.
We go through and issue trim I/Os to all of the children in this case and then wait for them to complete, like I said before. Since we're running at the top-level vdev here, we actually have to issue the I/Os to all the children under us, whether that's a mirror or a raid-z, and we use the translate function again to figure out where those offsets are, passing in the right children. We wait for the trim to complete, re-enable allocations, and we're good: that metaslab has been trimmed.
You also don't want to come back and re-trim the same metaslab too frequently, so the solution for that has been to group these metaslabs into metaslab groups, for lack of a better term. By default we group all the metaslabs into 32 different groups, and the idea is that at most we're going to process one of these groups per transaction group. That means it's going to take a minimum of 32 transaction groups before you get back to trimming the same metaslab, which gives you 32 transaction groups' worth of time to aggregate adjacent ranges together so they can be issued efficiently.
We found in practice that 32 works pretty well in testing. It's controllable, you can adjust it, but 32 works pretty well. Furthermore, the automatic trim is set up such that it never forces a transaction group sync. The whole process is driven by the normal transaction group syncs occurring on your system, but if you have pending frees in a transaction group, they won't force the transaction group to sync, because you really don't want them to: you want this process to be driven by writes.
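On ZFS on Linux that batching shows up as a module parameter; the name below is the one used in the 0.8-era code, so treat it as an assumption to verify against your release:

    cat /sys/module/zfs/parameters/zfs_trim_txg_batch       # defaults to 32
    echo 64 > /sys/module/zfs/parameters/zfs_trim_txg_batch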
Just because there are frees, you don't want to go deal with them; if your pool is idle, this way it stays idle, as long as you're not actually doing I/O to it. You can't see this in the zpool status output, but you can see it in zpool iostat. If you run zpool iostat -r while auto trimming is enabled, you can see there are additional columns on the right, and there's a trim column, which shows you the outstanding trim I/Os, in this case.
We've got, I don't know, some middle-sized ones in progress, but most of the outstanding trim commands are pretty large. So if you want to monitor it, that's an easy way to do it while it's running. With zpool iostat -w you can also get the request times for each of these trim I/Os that are outstanding.
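The monitoring commands being referred to are (placeholder pool name):

    zpool iostat -r tank 5       # request-size histograms, including the trim columns, every 5 seconds
    zpool iostat -w tank 5       # latency histograms, including how long trim I/Os wait and take on disk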
So what do we expect from autotrim? We start a pool with autotrim off; we expect performance to degrade again, and then at some point we turn it on, and then we expect performance to gradually improve as blocks are allocated and freed on the system, because eventually every block that we free is going to get trimmed on disk.
So here's the test case I ran, because I wanted to convince myself: does it actually work as intended? The test case is the time to copy the Linux kernel source. What I wanted to test, at least to make sure the trim was working as advertised, was to compare the performance of copying the Linux kernel, which is a couple of gigabytes and, I don't know, a couple of tens of thousands of files at this point.
A
How
long
does
that
take
on
a
pool
that
has
trimming
enabled
and
one
that
doesn't
have
auto
trimming
enabled
and
to
do
it
at
a
constant
pool
capacity?
All
right,
because
we
know
all
I
mentioned
before,
there's
this
cliff
past
about
80%,
we
could
totally
skew
your
results
so
at
a
target
pool
capacity
about
80%.
What
does
how
long
does
it
take
to
talk
copy
this
colonel?
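A minimal sketch of that kind of test loop, assuming a pool mounted at /tank and an unpacked kernel tree in /src/linux (the paths, iteration count and pre-filling to 80% are assumptions, not the exact script from the talk):

    for i in $(seq 1 2500); do
        /usr/bin/time -f "%e" cp -a /src/linux /tank/copy     # time one copy of the kernel tree
        rm -rf /tank/copy                                      # free it again so there is space to trim
    done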
So, details: here's the hardware config; mirrors and raid-z2 were tested. What does this look like? Here's the data, with the mirror and raid-z runs on here. These are the averages over five runs, and then I've just got a little scatter plot there, for each test point, of each of the runs. At the very beginning, on a pre-trimmed pool, performance is good; we're making quite a few copies of the kernel per minute.
A
It
looks
like
about
six
copies
per
minute,
something
like
that,
which
is
what
we'd
expect
it's
completely
empty
pool
performance
is
just
fine
as
we
fill
the
pool
performance
does
degrade
all
right.
It
gets
worse
over
time
and
eventually
it
bottoms
out
somewhere-
and
this
is
all
with
you
know-
no
auto
trimming,
nothing-
fancy
just
enterprise-grade
SSD
that
we're
testing
with
here.
So
that's
the
sustained
performance
without
it
there
at
the
bottom
to
convince
ourselves
that
really
this
is
not
because
the
pools
full
cuz.
You
can
imagine
that.
...maybe this is just a performance drop-off due to filling the pool, we issue a zpool trim. Sure enough, when you issue the zpool trim, just as you'd expect, performance pops back up, almost back to where it was before, which is great. But, also as expected, it mostly goes back down again pretty quickly; it doesn't take too many more copies of the kernel before performance goes off a cliff. We issue a zpool trim again, just to convince ourselves that's really...
...what's going on, and sure enough it is: performance recovers and then drops off again. Then the really good bit, here at the end, is that autotrim does work as intended. Once you set autotrim, and you're freeing and allocating a ton of blocks, it actually doesn't take particularly long for performance to recover, and once it does recover, it's able to maintain that performance almost as if it were a new pool. So the good news here is that all that theory actually works: trim does work as advertised, and performance is pretty good.
Doing something like removing a whole dataset and doing all those frees? Yeah, so I haven't run that exact test. I imagine it would work pretty well, because those frees are all handled like any other frees. We rate-limit how often we issue trims, so it might take a while to process all those frees as they aggregate, but it shouldn't affect the rate at which we're actually issuing trims, as long as you're talking about the autotrim case.
A
Well,
so,
yes,
they
will
be
broken
up
at
the
lower
level.
There's
a
threshold
where
I,
don't
I,
don't
know
what
it
is
offhand,
but
we
guarantee
that
we
own
tissue
trims
higher
than
I
want
to
say
it's
like
16
or
maybe
32
megabytes,
something
like
that.
But
yeah
there's
a
cut-off
where
we
say
no,
no.
This
is
just
a
bad
idea.
Don't
do
it.
That's a good question: they can absolutely both be run at the same time. Oh, the question was how manual and automatic trims interact when they're run at the same time. So a manual trim and an automatic trim can run at the same time. The manual trim basically runs as fast as possible; it's the administrative thing, and the intent is to trim all the free space in the pool as quickly as possible, so that will run normally, in parallel. The autotrim will run, but I don't think it...
We need a separate tree for ms_trim because we only care about blocks that have recently been freed, not all the blocks that are currently free in the pool; you only want to trim things that were recently released. We could go through ms_allocatable there, but that's a lot of extra work, and it's possible that a lot of those ranges have already been trimmed by a previous pass. You could, for example, run a manual trim over the pool and trim everything in ms_allocatable, and then all that stuff is already trimmed.
Yeah, so the question is about how I mentioned that we don't force transaction group syncs, so when the pool is idle we don't go do trimming in the background; basically, we wait for some new write to come in. The question is: isn't that actually a good time to do trimming, because the pool is otherwise idle? It seems like the perfect time to do trimming, because it's not going to impact any kind of application workload. Yeah.
You could imagine some kind of more complicated machinery, I suppose, where you go through and you trim everything once, you drain all the trees, and then once everything is fully empty you quiesce and stop. That could be left as future work; there might be value in it, but it didn't seem necessary in the first implementation, I would say.
Rate-limiting for the trim: the question was what kind of rate limiting was done for the trim. I didn't mention it; the trim I/Os are issued as part of those threads, and they rate-limit themselves. Basically, there's a control that limits how many outstanding trim bytes will be issued down to the pipeline, and then the pipeline itself does some limiting and breaking up of those I/Os before they get submitted to disk. But yeah, I glossed over that; there's more detail in how those get broken up.
The impact on read latency: I didn't gather careful data on the read latency, so I do not know. I'd hesitate to guess at that; like you say, it could vary widely between drives. From my personal experience, how SSDs behave does vary widely between consumer grade and enterprise grade, between manufacturers, between a lot of different things, so I don't know exactly how it behaved there.
So I think the question was: what is the memory impact of maintaining this additional range tree tracking the ranges that are to be freed? It's pretty minimal in our experience, mainly because the tree itself doesn't get that big, typically, because we're pruning it fairly aggressively; every 32 transaction groups you basically drain the entire tree. It's also a range tree, which happily has been nicely optimized now, so that helps: frees in contiguous ranges don't take up that much space. So we haven't found it to be a problem in practice.
I suppose it could be, if you suddenly free a lot of stuff that happens to not be contiguous and that makes really big range trees. It's not capped at the moment; it could be, I mean, if it turns out to be important we could cap it. Yes, the question was whether this is in the current release, in 0.8. Yes, it is.
So the question is: what's the x-axis on this? Right, I conveniently left it off. The x-axis here is test iterations, basically; each of these dots is one cycle of that remove-and-copy loop, from beginning to end. It's about 2,500 runs of the test, something like that, for perspective. When performance collapses here, it probably took maybe, I don't know, 30, 40, 50 copies, runs of the test, before performance was back down below that. That's going to vary based on how full your pool is.
In this particular case the pool was 80% full, so we weren't leaving ourselves a lot of extra free space, and that'll contribute to it. The trim time... yeah, I didn't mention: it must have been about 2 terabytes of usable capacity, something like that, to get about 200 copies of the Linux kernel in it. And how long did the trim take?
Yeah, so the question is: did I run the manual trim while the test was running, or did I run it in between? The answer is that I ran it in between, so you're not seeing the effect of a running manual trim and what it does to performance on the system; the scripts weren't set up that way. I'd be curious about that myself; I don't know what impact a manual trim would have.
Good question. They were close enough that I didn't investigate that; raid-z and mirror are about the same, right about what I'd expect. So no, I don't know exactly why it's a little bit slower.
Yeah, so one other nice thing to add about that, which I meant to mention at the time: because there's a generic helper for this, it's easy to extend the mechanism, so it works with things like dRAID. When dRAID gets integrated it'll be able to do that mapping and trim, and we won't need additional I/Os; it'll just more or less work with dRAID.
A
All
right
so
the
trying
to
be
summarized
what
Matt
just
said
is
it
it
doesn't
matter
because
the
MS
allocatable
were
walking
just
represents
like
what
blocks
are
allocated,
doesn't
care.
What's
in
them,
it's
just
what's
allocated
and
what's
frite
right
and
we're
just
trimming
all
the
stuff.
That's
nothing
is
using
right.
We have not done testing on a large number of... so the question was whether we're keeping a list of disks that we know are problematic. No. We try to just buy good disks for the most part, so we don't have a really long list of disks you should avoid. It'd be cool if someone had something like that, but no.
The question was: is there any additional delay between when a block is deleted and freed and when we actually go off and trim it? No, there isn't. Once it's added back to ms_allocatable it's considered eligible for trim. That might not mean we get to it immediately; it depends on when the thread revisits that metaslab to trim it, but yeah, it's immediately eligible, so there's not an additional deferred delay. Well, even for automatic trim you do get the two-transaction-group cushion, because it does go through the ms_defer trees. So you get two transaction groups, and you can rewind your pool as far as you normally would be able to safely rewind it, without any risk of your data being gone. But once a block was eligible to be overwritten, like in a normal file system, it's also eligible to be trimmed, and it may be trimmed at any time. So, no additional delay.
Yeah, so the point was that this does potentially eliminate or restrict your ability to roll back, because we're pretty aggressively trimming these things. That's absolutely true. If you don't want that, I would just not turn on automatic trimming, or we could think about extending it: if that's useful functionality you want to preserve and you wanted to delay it further, that's the kind of thing you could extend.
I mean, I think that would all be interesting to explore generically, as part of the ms_defer tree, like Alan was saying; maybe you want to let stuff last a lot longer. That's totally a reasonable thing to request, but I think it should apply uniformly to trim and to normal file system I/O, right.
So I guess the comment there was that this may behave a little bit differently on SATA drives, which have different limitations. That's right; this testing was done on SCSI drives, and hopefully those do a little better with trim, but your mileage may vary.
There was some trouble chasing data corruption bugs for a while. That was something we cared a lot about, obviously; it's one of the things that really delayed this work for quite a while: how do you convince yourself that this is working absolutely correctly and is never going to do anything wrong? Because if it does, it's bad. So there was a lot of runtime and testing and running down those kinds of bugs to convince ourselves that it was absolutely solid.
So the question is: if you're running a pool with all hard drives that don't support trim, what happens with the properties? At the moment, nothing, basically. The properties are still there and you can turn them on, but each drive is individually detected as to whether it supports trim I/Os or not. If the device doesn't support trim, we won't issue trims to that device; if it does, we'll issue trims, and we report which ones do and don't support it. So it just disables itself.
I think the throttle is based on the bytes outstanding, not the number of ops. Yes, it's throttled at two levels. One layer, for the manual trim, is throttled based on the number of bytes outstanding that we've issued to the lower I/O layers in ZFS, the I/O pipeline, and then the pipeline itself will determine when to issue those I/Os based on other activity.
A
On
the
system,
like
I
said
it's,
a
full
trim
was
added
as
a
full
class
member
for
trim
top
for
I/o
types
in
ZFS,
so
it'll
trade-off
between
outstanding
reads
and
oustanding,
writes
and
trim
and
trim
is
like
the
lowest
priority
thing.
So
if
there
are
outstanding,
I
think
scrub,
my
people
out
trim,
but
it's
below
reads
and
write.
So
if
there
are
outstanding
reads
or
writes
that
need
to
be
handled,
they'll
be
handled
first.
A
You,
yes,
it's
possible,
but
the
question
was
like:
is
it
what's?
How
does
the
throttle
work
and
the
throttle
works
based
on
bites
at
the
higher
level,
so
we're
going
to
issue
a
range
to
trim
a
certain
number
of
bites
through
lower
layers,
and
then
the
pipeline
will
issue
iOS
to
handle
those
trims
as
it
deems
appropriate
as
it's
been
tuned
right
to
avoid
impacting
performance.
So
if
there's
a
lot
of
reads
that
need
to
be
handled,
synchronous
reads:
the
trims
will
be
deferred
in
the
queue
and
they
won't
be
issued.
The question was: why have both an automatic and a manual trim? The thought process was that a manual trim and an autotrim serve kind of different purposes. With the manual trim you really want to return a device to a fully trimmed state as quickly as possible, and that's what it does: it runs through the entire pool and trims everything that could be trimmable. It may impact performance, but it will get you to that fully trimmed state as quickly as possible; like we said, it may just be a few minutes to get there.
The autotrim is more for maintaining performance on a pool and keeping things in a fully trimmed state, so they're slightly different use cases. I guess you can make an argument that the autotrim gets you most of what you need and you could just leave that enabled, but at that point it's easy enough to have the manual trim too, and it's useful.
So the question was: if you were allocating from a metaslab and we disabled it because we need to trim it? Yes, your allocations are going to shift to other metaslabs, and we may have to load new ones because of that, but we probably have a couple of metaslabs loaded already. So as long as we don't disable too many of them, which is how the autotrim works, it shouldn't be too much of an impact.