From YouTube: OpenZFS Office Hours with Matt Ahrens
Description
OpenZFS Office Hours with Matt Ahrens, October 11, 2013
Questions / topics:
Hole Birth Time
Send/Receive network buffer
NFS over RDMA
shareiscsi property
compacting ZAP objects
larger block sizes
Write smoothing & histogram changes in production
Linux port & cross-platform codebase
Device removal & bprewrite
Performance with lots of snapshots
Trivia question & t-shirt giveaway
A: Cool. So I'm coming to you guys live from San Francisco. I'm Matt Ahrens. I started the ZFS project with Jeff Bonwick back in 2001 at Sun Microsystems, and I'm here today to answer questions about ZFS, OpenZFS, and anything you might want to talk about. So it looks like we have just a few people here extremely promptly, so take it away, Derek, Karen, or Prakash: anything that I can tell you about OpenZFS?
A: So right now I'm working on a project with Max Grossman, who's another engineer at Delphix.

A: We've noticed a performance problem when you're doing a send of a stream that has lots of holes in it. For example, this happens a lot with sending zvols; we noticed this problem when we were sending zvols for remote replication from one machine to another. Zvols hold other file systems, and they start out basically totally sparse, but then the other file system, say NTFS, fills in parts of the volume.

A: The parts that it hasn't filled in are going to be holes. With OpenZFS currently, those holes need to be transmitted to the other machine every time that you do a send. So basically the problem is that we don't know whether a hole was newly created or has always been there.

A: So with the new changes that Max has done, we'll basically store the birth time of every hole. Just like when we write to a data block we store the birth time, and then use that birth time to know whether the block has been changed or not, and therefore whether we should send that data to the other system or not, we're going to do the same kind of thing with holes.

A: So when a hole is created, either by the application writing a whole run of zeros that we compress away to a hole, or by an explicit TRIM command, or by a truncate of a file, we will record when that happened.

A: Then we know whether we need to send it to the remote system or not. We've seen this give a drastic reduction in the number of holes that need to be sent, which has a big performance impact, primarily on the receiving system, which now doesn't need to go examine all of those blocks to figure out: are they already a hole? Do I need to punch a hole here or not? And this is really the result of several improvements to ZFS send and receive with holes.

A: We first discovered that there were a bunch of almost pathological performance problems with receiving these holes, and this is exacerbated by the fact that there are so many of them. So first we fixed the kind of pathological performance problems that you could sometimes see when receiving a file, like a zvol, with a lot of holes; that's in OpenZFS today. And then we realized, well...
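To make the scenario concrete, here is a minimal sketch of the replication pattern being described, with made-up pool, dataset, and host names; the hole birth improvement means the incremental send no longer re-sends (and the receiver no longer re-examines) every hole in the sparse zvol:

```sh
# Periodic snapshot-based replication of a sparse zvol.
zfs snapshot tank/vol@monday
zfs send tank/vol@monday | ssh backuphost zfs receive tank/vol

# Later, send only the changes since @monday. Before hole birth
# times, every hole was included in this incremental stream; with
# them, holes whose birth time predates @monday are skipped.
zfs snapshot tank/vol@tuesday
zfs send -i @monday tank/vol@tuesday | ssh backuphost zfs receive tank/vol
```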
A: Good. Any questions about that, or other questions about anything OpenZFS?

A: No? All right then, I'm going to start picking on you guys and asking you some questions. So Luke, I know you guys use ZFS at Hybrid Cluster, and you're using send and receive to send the data between the different nodes in the cluster.

A: How are you doing that? In other words, what are you using for a network transport? And have you noticed any kind of performance issues with that, or have you needed to do anything in particular to improve the performance?

B: Yeah, sure. So we've got a sort of group messaging protocol, which is built out of Twisted, a Python networking framework, that coordinates all of the nodes, but the actual zfs send and receive happens over an SSH transport. The group messaging code just sets up these SSH connections.

B: It actually uses a FIFO on the local machine in order to make one node start pumping data into another node, and then zfs receive gets initiated on the other node to start receiving it. Some fairly recent changes in FreeBSD improved the performance of SSH transfer over wide area networks, which has helped us, and we also use mbuffer in the pipe.

B: On both sides, and that's really useful, because mbuffer then reports to us what the speed of the transfer is, and we use that to detect stalled transfers and things like that.

A: mbuffer, as I understand it, is a utility that basically receives the stream into some fixed-size buffer and then sends the stream on. So it's basically just a very simple producer-consumer kind of thing, yeah?

B: Rather than getting stalled on the receive process needing to find some free space, which would stall the sender if the pipe size was very small, mbuffer allows you to smooth out that send performance over the network. mbuffer is a threaded C program that was in ports; I think it originated on Solaris and was intended for tape archives or something, but it does the job.

A: Yeah, I've seen that problem as well, and the fundamental problem there is that with zfs send the kernel is producing data and sending it into this pipe, but the pipes in the kernel have very small buffers, like dozens of kilobytes. So you end up with essentially one thread which is producing the data: it gets the data from disk, generates the send stream, and sends it into this pipe. Then, if it isn't being read extremely quickly by the network, it has to wait for those bytes to be sent over the network before it can get the next batch of data from the disk. So you end up either reading from the disk or sending over the network, and what you really want is to be doing both of them at once, which mbuffer achieves.
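A hedged sketch of the kind of pipeline Luke describes, with made-up host and dataset names and buffer sizes (mbuffer's -s and -m options set its block size and total buffer size):

```sh
# Buffer on both sides so that reading from disk and writing to the
# network can proceed concurrently; mbuffer also reports the transfer
# rate, which can be watched to detect stalled transfers.
zfs send tank/data@snap \
  | mbuffer -s 128k -m 1G \
  | ssh node2 'mbuffer -s 128k -m 1G | zfs receive tank/data'
```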
A: So I think this would be a pretty small project, to implement that buffering in ZFS itself. If anyone is going to be at the OpenZFS Developer Summit and hackathon, I think this would be a great hackathon project. It's probably doable in a day, or at least you could get it started in a day.

A: So we had a couple of other questions over on IRC. One was from Josh Simon, asking if there's any version of OpenZFS that supports NFS over RDMA. When I was at Sun I remember them working on this, but I don't know if it ever got integrated. George?

D: I think it actually did get integrated. I don't know if it was complete, but I think it should be in there; illumos has RDMA access.

A: Cool. Do you know, out of curiosity, is there also such a thing as iSCSI over RDMA?

A: Josh says yes, it's called iSER.

A: Cool. All right, so it sounds like that is in illumos; I don't know about FreeBSD or Linux.

A: All right, cool. And there's another question from someone on IRC, odk, asking: do we plan to implement the shareiscsi property? He recently checked the latest build of SmartOS and it was not there.
C: This is to actually do, like, zfs share...?

D: I don't know, because the current iSCSI share functionality, I think, is for the old iscsitgtd, which is the old implementation, not the COMSTAR implementation. So I don't think that anybody has gone in to actually look at what it would take to rip that out, because that's all pretty much dead as far as I know, and then re-implement it using the COMSTAR iSCSI kernel stuff.
A: Gotcha. So it seems like this would be specific to whatever platform you're running on. We're talking here about doing it on illumos. I imagine the way that it would work, hopefully, is that we would be able to do something where you could set shareiscsi=on on any platform, and it would hook up with that platform's specific way of sharing stuff over iSCSI. But that mechanism would probably be different on each platform, so it would need to be implemented separately.
A: Yeah, so Eric Sproul was also commenting the same thing that you said: that shareiscsi was removed when they switched to COMSTAR, which is the new iSCSI and Fibre Channel et cetera sharing mechanism that's in the illumos kernel.
A: So there's a question on IRC from Prakash asking how difficult it would be to support collapsing of fat ZAPs. In other words, when you have a ZAP object that has a lot of entries and then a lot of them are removed, it becomes really sparse, and it'd be nice to collapse the leaves into a more compact form.

A: So, interestingly enough, way back in the day, probably 10 years ago, ZAP objects would actually automatically shrink when the leaf blocks became sparse enough. This is a little bit tricky, especially in terms of the locking; I think back then I was trying to implement an even finer-grained locking strategy than we have today.

A: If you want to do it, I think it would be pretty doable, and probably easier to do today, with the infrastructure that we have now, than it was back when I removed that functionality a long time ago. The main tricky thing would be integrating with the locking of the ZAP. To give you an example, think about what happens when we add an entry to the ZAP.

A: If the leaf block is already mostly full, then we need to split the leaf block: take that one block, create another block to move half of its entries to, and then change the pointer table that points to it. Hopefully there are two entries in that pointer table pointing to this leaf block, and we change it so that one still points to the leaf block and the other points to the new block that we've created.

A: In terms of the locking, the fast path is that we take the lock on the entire ZAP object just as reader (it's a reader-writer lock), and then we lock the specific leaf block exclusively.

A: Now, if we discover that we need to split the leaf, then we need to change that pointer table, so we need an exclusive lock on the whole ZAP object; upgrading that lock and retrying is kind of the tricky part. We'd need to do something similar when shrinking: when you remove an entry, go and look at that leaf block. If it's only, say, 10% full, then look at its sibling and see whether this block and its sibling could be collapsed into one block that would still not be too full. Then I need to upgrade the lock on the entire ZAP object to writer, so that somebody else doesn't try to do the same thing at the same time. But I think that would be pretty doable.

A: So, on to larger block sizes. Currently the maximum block size in ZFS is 128k, and we might want to have larger block sizes, up to say one megabyte, four megabytes, maybe even bigger. The main motivation for this is to increase performance.

A: It would probably increase performance a tiny bit on mirrors or stripes, but there'd probably be a bigger performance improvement on RAID-Z. Currently, with a 128k block, you take it and split it up into a bunch of chunks, each chunk going to one of the devices in your RAID-Z group.

A: And so there are both space-efficiency issues there and performance issues, for example if you're doing a resilver, where we need to read every block of data off the disk.

A: If we have larger blocks, then we'll be able to read more contiguously from the disks, because resilver tends to be a very random-I/O type of workload, so having a smaller number of operations to do, because we have larger blocks, would potentially increase performance there a lot. The issues with actually doing this are mainly that, given the current infrastructure, every block, say a 128k block, is a single contiguous buffer.

A: I know I'm rambling a little bit here, but the performance issue is that doing a large allocation requires a lot of contiguous virtual memory in the kernel's address space, and the performance of that depends a lot on the kernel memory allocator, which has varying qualities across different platforms.

A: I know that on Linux, for example, they already have some performance problems as a result of having to allocate just 128k contiguously, so the Linux guys are looking at using the page allocator directly and basically creating a scatter-gather list, a list of pages, rather than a contiguous in-memory buffer.

A: It would be a list of particular pages of memory, and then, whenever we needed to copy in or out of that, the scatter-gather code would deal with accessing each page separately. That would enable some more scalability there, at least on the Linux side. So those are the kinds of issues that we need to address.
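For context, the block size under discussion is the per-dataset recordsize property; a quick illustration with a made-up dataset name (at the time of this recording 128k was the ceiling, and values above it assume the later large_blocks pool feature):

```sh
# Check the current block size ceiling for a dataset.
zfs get recordsize tank/fs
# NAME     PROPERTY    VALUE  SOURCE
# tank/fs  recordsize  128K   default

# On pools with the (post-2013) large_blocks feature enabled,
# larger records can be configured per dataset:
zfs set recordsize=1M tank/fs
```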
B: Yeah, I might as well ask it for the record, actually. So Andre has been working on merging two change sets from illumos into FreeBSD:

B: the write smoothing change that you described at EuroBSDcon, and also the metaslab histogram changes. I was just wondering how those are working out for you in production. Have you got them in production yet, and if so, what are the improvements, in particular with fragmented pools? How's that working out? And is there more work to come on that, or is that sort of the body of it?
A: First I'll talk about the write smoothing code, and then I'll ask George to answer the question about the histogram and the future of that. We're using both of those features internally at Delphix in production today. I don't think we have shipped that code to our customers yet.

C: It's actually at customers now.

A: Okay, so it is at a few customers. In terms of our internal use, we use it on a server that serves up virtual machine images for us to use for testing. That gets a lot of load, so it should shake out a lot of performance issues. We've seen basically no problems with the write smoothing code; it's been working very well for us in production so far, and I think that's basically done.

A: I don't think there's really more work that we have planned in that area right now. George, do you want to talk about the histogram and the fragmented-pool performance stuff?

D: Sure. So yeah, that's also going out to our customers; any of our new customers are actually deploying those changes.

D: What's out there today is kind of a fact-finding mission as well as some performance improvements. The histogram is meant to give us a good idea of how the space in a metaslab is actually composed.

D: There are some changes that I'm putting together that will take that same histogram and bubble it up, so that you'll be able to see a bigger view on a vdev and also on your entire pool. You'll be able to get an idea, across devices and across your entire pool, of that same kind of histogram: how your space is allocated, and how your free space actually exists.

D: The other changes that are part of that do some preloading of metaslabs, which we've seen has actually been beneficial at customers where there is a lot of fragmentation. It's tunable today; I think it defaults to three per device. At the end of every transaction group sync, when the spa sync completes, we asynchronously go and load the next three best metaslabs for that specific device.

D: The things we're working on now take some of the information that we're getting from the histogram: we're building up a fragmentation metric which we're going to expose to the user, so you'll be able to see how fragmented your devices are. And then we're adding additional logic so the code will actually select metaslabs that are, quote-unquote, better than other metaslabs, and we're also using that to determine how we select devices to allocate from.

D: The three main components of allocation have always been: select a device, select the metaslab, select a block. We have effectively made changes, or will be making changes, across all three of those. For the way you actually select the block, there's a new allocator that's out there today but isn't deployed, so that hasn't really changed, but there's new code that went in as part of that metaslab change.

B: That leads to my next question: is there any way that it would be helpful for us to submit data, and how can we do that, for instances where we have pools that have become badly fragmented and where we're getting performance problems? What's the format for packaging that up and sending it to you?

D: zdb now has the ability to dump out not only the on-disk histogram; you can also get a more accurate in-core histogram.

D: If you look at zdb with the -m option, m as in Mary, that will give you that information. Now, if your pool is live and heavily fragmented, it can be very difficult to get that information, because the pool is changing. So one of the next things that I'm doing, as part of my next change, is to actually be able to pull this out on illumos via mdb, and presumably on the other platforms you would also be able to pull it out from the kernel and dump it out that way. I recognize that zdb is great when the pool is pretty static, and when it's not, it's really hard to get some of that information out.
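A hedged example of pulling those space maps out with zdb (pool name made up; the exact output format varies by release):

```sh
# Dump each metaslab's space map; with histogram support this
# includes a free-space histogram per metaslab.
zdb -m tank

# Repeating the flag asks zdb to load the space maps and report
# more detailed (in-core) information.
zdb -mm tank
```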
B: Sure. One of the things that is helpful for us is that with Hybrid Cluster we can live-migrate all the applications off a node in the cluster and then use that tool to extract the data once the pool is quiescent, so that can help immediately. That's great. It would also be really awesome to have some instructions on the wiki, perhaps, on how to get that data, so you can follow along. That'd be awesome, thank you.

A: George, related to that: once your pool has already become fragmented, how much does the workload affect the kind of performance that you're going to see ongoing? In other words, when we gather the fragmentation information,

A: do we also need to know what the workload is going to look like in the future, in terms of the size of blocks that need to be allocated, in order to evaluate or predict what the performance will be like?
D: I know we've talked about trying to take some of the stuff that we did with the write throttle, and the knowledge we have ahead of time, to give an indication of what the allocator should do. So I think that's probably the next step: trying to get some guidance, based on the workload coming in, to affect how the allocator behaves when it's making allocation decisions. But it is definitely susceptible; different workloads are going to cause different issues depending on the fragmentation.

A: Yeah. One of the things that we have tried to avoid in ZFS so far is assuming that there's going to be some period where there's less load on the machine, and then going around and doing some activity during that time, like, for example, scrubbing. We've avoided doing that so far, but I know we've thought a little bit along those lines for the allocator stuff, in terms of asking what really is the best allocation algorithm. Because maybe I'm allocating a lot of small blocks, and I have a bunch of small holes and some big holes.

A: Should I just always be filling the biggest hole, getting the best performance, or should I somehow figure out that right now the load is really small, so now is a good time for me to be filling in all these little holes, even though the I/O will be more random and it'll take longer? It's okay that it takes longer, because the load is lower, because it's quiet at night, right?

A: Yeah, so it's very tricky to figure out what is a light workload and what is a heavy workload, and when I should be doing which. So there are some questions on IRC, a little discussion going on there, that I think I'll recap and address a bit for people on the video.
A: I think Prakash was asking about the Linux port, and whether there's anything we can do to help them keep more up to date on the changes that are happening on the other platforms, to speed up the process of Linux tracking the upstream. Well, I guess first the question would be: what are the problems facing people who are trying to pull changes into Linux, say from illumos, today?

A: One of the problems is that a lot of those changes don't apply cleanly, meaning that the diffs don't just apply in an automated fashion. An engineer has to go and look at a merge conflict and figure out which lines of code to take from which side, or make minor modifications, specific to Linux, to the code that's being brought in.

A: That problem is kind of mechanical. Another problem is being confident that the changes you've applied actually work and don't break anything. Hopefully we can trust that the changes work on illumos, but that doesn't necessarily mean they will work properly on Linux. One of the things brought up was the write smoothing, the write throttle, patch: something that's very performance-dependent like that is difficult to verify.

A: Regression tests can be used to get a lot more confidence around a set of changes, so that once the changes have been applied, you run the regression tests and at least have some confidence that they haven't broken anything too badly.

A: So I think one thing we need to work on is getting the tests that we have, which can be run on illumos and on FreeBSD, to also be able to run on Linux. John Kennedy just recently finished porting all of the ZFS test suite from the STF test framework to the new test-runner framework, which is much simpler and should be much easier to port to other platforms.

A: The other aspect is patches applying cleanly.
A: What we're working on is creating a common ZFS code repository. The idea would be that, rather than pulling and pushing changes between each of the different platforms, from illumos to Linux, from FreeBSD to illumos, from illumos to FreeBSD, we would have a common code repository that would be independent of all platforms. The goal would be that every platform would be able to pull the code from that repository directly, without any platform-specific differences to the files that are part of it. There are several challenges in accomplishing that. We need to be able to test the code that's in this repository on any platform, and be confident that it's actually going to work on every platform.

A: The way we would go about doing that is creating a userland framework, so that we can compile all of this code into userland and test it in userland on every platform. The idea would be that we create a definition of the ZFS kernel interfaces: functions that the kernel on every platform would need to provide. Then there would be a compatibility layer on every platform, including illumos, that would translate from those ZFS kernel APIs to that specific platform's way of implementing them, and we would have another kernel compatibility layer for running this code in userland. So that code would be able to run on any platform. What I'm working on right now is making more of the code able to be tested in userland.

A: Right now we have ztest, which tests mainly a bunch of SPA and DMU code in userland, and it's kind of a stress test. What I'd like to be able to do is run the full ZFS test suite against the userland implementation of ZFS. This would allow us to test libzfs, the zfs command-line tools, zfs send and receive, basically most of ZFS except for the ZPL. The ZPL is the POSIX layer.

A: That's the part that interfaces with each platform's VFS, the virtual file system layer, which tends to have a lot of differences between the different platforms, so it would not be a candidate for initial inclusion. But I think we could get to a point where the vast majority of the ZFS code can be compiled into userland, tested in userland, and then be taken verbatim to all the different platforms with confidence.

A: That will both increase test coverage of the code and make it mechanically easier to pull those changes in. Obviously the work to do there is both the kind of infrastructure that I'm working on right now, to make that code able to be compiled into userland, which is largely around creating an ioctl shim layer so that the zfs command-line tool, rather than talking to the kernel to do its tasks, can talk to this userland ZFS daemon,
A: which is the userland implementation. But the trickier, really more invasive, part of the work will be deciding what these ZFS kernel APIs should be, and then changing the code to use that abstraction layer. For example, we were just having a discussion on the mailing list yesterday about the use of, I think it was, cv_timedwait_sig_hires, or some combination of those words with underscores between them, which is an interface on illumos that allows a thread to go to sleep for a certain amount of time with a specified resolution. This is used as part of the write throttle code, and there's no exactly corresponding interface on Linux. So this is an example of something where what we should do is create a ZFS kernel interface that says: put this thread to sleep for this amount of time. On illumos it could be implemented using this cv_timedwait_hires mechanism, and on other platforms it could be implemented using whatever the platform-specific mechanism is.

A: Now, differences in how that's implemented would potentially have both correctness and performance implications. Obviously the routines need to actually do what they're expected to do, so we would need to document them very well, in terms of all the different error cases, side effects, things like that. For example, there's a question about what it means if the time we want to wake up at is in the past, or if we're sleeping for a negative amount of time; we'd need to define all those sorts of things. And there are also performance implications. If the amount of time that we can sleep has different resolutions on different platforms, then we would need to make sure that, on a platform with lower resolution, meaning you can only sleep in big granularities, say one-millisecond chunks at a time, the write throttle code still behaves the way we expect even at those coarser resolutions. That I actually tested when I was doing this on illumos, because you can just pass in a parameter for what the resolution is. But things like that are what we would need to keep an eye out for in these common interfaces.
A: I know I kind of rambled on there a bit; I hope that answers your question, Prakash. Ultimately, I think there's a bunch of coordination work to be done, and a bunch of just not very glamorous hammering-out of interfaces to make the code common, plus porting of test suites.
A: So, bp rewrite. This is a project that I was working on at Sun, and the idea was very all-encompassing. The idea was that we would be able to take any block on disk and manipulate it in whatever way we need to: allocate it somewhere else, change the compression, change the checksum, dedupe it or un-dedupe it, and then keep track of that change.

A: It's called bp rewrite because, in order to do this, we need to change the block pointer to point to some new block. You can imagine the straightforward implementation that you might do on, say, UFS, which would simply be to traverse all of the blocks: you look at each block, and if I want to reallocate it, I put it somewhere else, shove the new pointer in there, done. The tricky thing about doing this on ZFS comes in two parts. One is that there can be many pointers to a single block.

A: This is because of snapshots and clones: you can have one block on disk which is pointed to by, say, ten different clones. The reason we implemented ZFS that way is so that the performance of snapshots and clones is very good.

A: There's no difference in the performance of accessing a clone versus accessing the main file system; they're basically identical. There's just an administrative control which says which one is the file system and which one is the clone, and you can change that with the zfs promote command.
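For reference, a small sketch of the promote operation just mentioned, with made-up dataset names:

```sh
# Clone a snapshot; the clone shares all its blocks with the origin.
zfs snapshot tank/fs@snap
zfs clone tank/fs@snap tank/fs-clone

# Swap the administrative roles: the clone becomes the "real"
# file system and tank/fs becomes a dependent of it.
zfs promote tank/fs-clone
```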
A: So that's why we did it that way, but it creates this problem: if you have a bunch of instances of a block pointer, then when I traverse all the block pointers, I'm going to visit that old block pointer several times. So we need to remember that we changed from this old block pointer to this new block pointer. In other words, I moved this particular block from place A to place B, so that if I see another pointer to place A, I know to change it to place B.

A: This creates a performance problem, because you end up having to keep a giant hash table which maps from the old location to the new location. If you're familiar with dedup, then you're aware that it also involves a giant hash table, mapping from a block's checksum to the location on disk where it's stored, plus the refcount. And if you've ever used dedup in practice on very large data sets, then you're probably aware that the performance of that is not very great.
A: The second part of the trickiness is space accounting: the space used by a block is counted in a bunch of different places, which all have to be updated if the block changes size. For example, it's counted in the dnode, so that each file knows how much space it's using. That's used if you do ls -s, which tells you the number of sectors used, or df, which counts up the amount of space used by all the files. It's also used by user accounting, so if you have a quota on the files owned by a particular user or group, it's accounted for in both of those places.

A: You can access that with the zfs userspace commands; they'll tell you how much space each user is using, and you can set quotas on them.
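A hedged example of the per-user accounting commands mentioned here, with made-up names:

```sh
# Show how much space each user's files consume on a file system.
zfs userspace tank/home

# Limit the space that files owned by one user may consume.
zfs set userquota@alice=10G tank/home
```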
A: It's also accounted in the DSL layer, and in a bunch of different places. When you do zfs list, it tells you how much space is used by each file system, and the space used there impacts all the snapshots and then all the parent file systems, because the space used is inherited up the tree.
A: So there are a bunch of different places where space accounting needs to happen, and making sure those all get updated accurately when a block changes size is very tricky. For all of those reasons, as I was working on this, I was very concerned that this would be the last feature ever implemented in ZFS, because, as most programmers know, magic does not layer well, and bp rewrite was definitely magic. It definitely broke a lot of the layering in ZFS; it needed to have code in several different layers that had intimate knowledge of how this all worked. And on top of all that, I should mention,

A: we wanted to be able to do this live. In other words, while you're doing arbitrary things to your pool, you'd be able to do this bp rewrite, which is pretty essential when it's going to take weeks to accomplish because of the performance issues.

A: So it's also kind of like changing your pants while you're running: you're trying to move everything to different places while you're also looking at everything, and maybe you're deleting snapshots and creating snapshots while you're also changing what those snapshots reference.

A: I don't think anyone is attempting a full-on, do-everything bp rewrite implementation.

A: I know some people have looked at it from different angles: what if we just did this, what if we ignored that, which I'll get to in a moment. For example, Derek here is asking: would an offline bp rewrite be accepted into the upstream code base? I think it would depend on how exactly it was implemented.

A: A separate utility, potentially something based on libzpool but not adding more code to it, I think would be great. My concern with having even an offline bp rewrite in the codebase is the kind of far-reaching implications that it can have on all the different layering. So I would very much welcome something like a separate utility which allows you to, offline, bp-rewrite your stuff.

A: The issue would be how that is maintained: how much of the common code it needs to use, and how deeply its fingers are stuck into that common code. Certainly, doing it offline simplifies a good chunk of it; it simplifies all the interactions with the ARC and the DMU. But you still have the issues of the performance of having a giant hash table, and needing to update the accounting at every layer. It's certainly doable, in terms of streaming through the pool.

A: I think it depends a lot on how much it affects the other layers. Now, getting to something I talked about a little: someone on IRC is asking, does performance suffer as the count of snapshots grows over time for a given data set? In terms of the data path, reading and writing files,

A: the number of snapshots has no impact on performance. Performance is not going to get any worse as you create thousands or hundreds of thousands of snapshots, in terms of reading and writing your files. Obviously, operations that iterate over snapshots, like zfs list, are going to get a lot slower.

A: Deleting snapshots is going to get a little bit slower as you have zillions of them, because snapshot deletion is roughly proportional to the typical number of snapshots in your file system. But that's usually not a huge concern even so, because ten thousand is not a big number for computers.

A: So, getting back to device removal and bp rewrite: one of the things that bp rewrite was supposed to solve, or would have been able to solve, is device removal. I have a bunch of devices in my storage pool, and I want to shrink the size of the pool by removing one of those devices. bp rewrite would have allowed us to say: great, we'll find all of the blocks that are on that device and just allocate them onto a different device.

A: This is a problem that can also be solved in other ways, and we've talked a little bit about potentially doing this at Delphix. The idea would be, rather than changing the block pointers, to implement it as a virtual vdev. vdev stands for virtual device, so an actually-virtual vdev, or perhaps more sensibly named an indirect vdev.

A: The most naive way of thinking about this would be to take that device and just copy it into a file that's stored on the pool: create a zvol and then copy the device onto the zvol.

A: This basically says: okay, you can no longer allocate from this particular device that I'm going to remove. Now I'm going to create a new zvol, and I'm going to use dd to copy all the data from the device I'm removing onto the zvol. That will end up writing the data to all the other devices in the pool, and then I can essentially treat that zvol as a replacement.

A: You know how you can do a zpool replace to replace one device with another: basically, what it does is put them into a mirror, mirror all the data over, and then remove the first device. You would be doing something like that, but rather than mirroring to another actual device, mirroring to a zvol which is itself stored in the same storage pool.
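A heavily hedged sketch of that naive thought experiment, with made-up device and pool names. This is only the conceptual illustration from the talk, not a supported procedure; in particular, backing a pool's own vdev with a zvol from the same pool is exactly the kind of circularity the real indirect-vdev design avoids:

```sh
# 1. Stop allocating from the device being removed (conceptual step).
# 2. Create a zvol the same size as that device.
zfs create -V 1T tank/evacuate
# 3. Copy the old device's contents onto the zvol; the writes land
#    on the pool's remaining devices.
dd if=/dev/dsk/c0t3d0 of=/dev/zvol/dsk/tank/evacuate bs=1M
# 4. Swap it in, the way zpool replace mirrors data over before
#    detaching the original device.
zpool replace tank c0t3d0 /dev/zvol/dsk/tank/evacuate
```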
A: The essential trade-off is that with the bp rewrite scheme, once bp rewrite is done, the data structures are all as they were before. In other words, you could actually do a bp rewrite and then use that pool on old software, and you would have exactly the same performance that you would have had if the data had been laid out that way originally. So basically, bp rewrite is a huge performance impact while you're doing it, and then once you're done, no performance impact.

A: The indirect-vdev method of device removal would have much less performance impact while it's in progress, but then there would be some performance impact on reading data from that indirect vdev, perhaps forever. Obviously, as the data on that vdev is deleted, less and less of the data would have this additional layer of indirection.

A: We also think that for typical workloads the indirection table could be kept in memory, due to some tricks with how we keep track of where the space is allocated. And there are some additional tricks we can play if you access that data from a file system, as opposed to from a snapshot: when you access it from the file system, we can change the block pointer that's in that file system at the time you access it.

A: In other words, say the file system has a block pointer that points to vdev number three, which is the one that I removed and which is now an indirect vdev. We see: oh, I'm reading that one; I go through the indirection table and find that it's actually stored on vdev seven at this other offset. Well, let me just put that vdev and offset into the block pointer in the snapshot,
A: sorry, in the file system. We cannot modify snapshots, which is why those will still have to go through the indirection layer. But the takeaway is this: device removal is doable in a much lighter-weight way, both in terms of performance and in terms of layering violations, or the lack thereof. With the exception of those little tricks that I mentioned, it would be implemented totally in the SPA.

A: Upper layers wouldn't need to know about it, and there would be some amount of performance impact from continuing to access that data, but we think it would be small enough that it would not be measurable, at least in most use cases.
D: Just as an aside to that: for those that have been looking at bp rewrite as a solution for your specific problem, it might be best to look at what that problem is you're trying to solve and see if there are other ways to solve it. I know fragmentation comes up frequently, and things like rebalancing storage.

A: Yeah. So, for example, with LUN rebalancing, or device rebalancing, we would start by looking at why you need to rebalance the LUNs. Probably the reason is that performance sucks, so the question is: how can we improve performance when you have imbalanced LUNs?

A: One way is to rebalance them, which could be done using a variant of the indirect-vdev method. Basically indirecting: if you have three devices and then we add another one, so the pool is partially full, we can say, let me chop off the end of each of these three devices and indirect just the last third of them onto this new vdev. But in some use cases, like, for example, what our Delphix customers do,

A: these are problems that are created by ZFS and can be solved by ZFS, and George, I know, has done some preliminary work on that so far. The idea there is that we've added a new device and it's not as full as the others. If we just wrote to that device and ignored the ones that are really full, then performance would be fine, because all of those devices are actually LUNs that are backed by the same disks, or whatever, in some abstract storage fabric, and the problems we see are mainly with allocation trying to allocate all these little bits.

D: It's probably not as well known a feature, primarily because it is kind of a manual process, but there is, in the upstream, a tunable that you can enable that says: allow ZFS to allocate from those devices until there's no more than this much free space.

D: So the idea is, let's say you had four devices, where the first ones had 15% free space and the other device had 90% free space. You could set the tunable at 15, and it would start forcing all the writes to go to the one that is mostly empty, until they all come up to the same level, and then they all start writing again.

D: There's been code in ZFS for a long time to try to do that heuristically, but it moves the needle at such a very, very slow rate, and what we needed for our customers was to actually be able to quickly move to the devices that were empty. So I'm happy to talk more about that particular feature if it's something that you think might be useful. We've been looking at it as something that we intend to eventually make automatic,

D: so that it isn't a tunable. You'll see that coming forward probably in six months or so; that's when you should expect it.
A: Cool. There are a couple of questions about the discussion that we were just having. Luke was asking: what if I remove one device and it becomes an indirect device, and then I remove yet another device, which may itself have indirect mappings on it? Yes, that would work recursively, and we thought through the problems related to that. I remember that we thought about it, and I think we decided that it just works.

A: Basically, you would go through one indirection layer and find: okay, this device is indirect and the data is actually stored in this zvol. Then we go into that zvol at that offset and find, oh, it's stored at a block pointer which is on device number Y. Then we go look and find that device Y is also indirect, and we just keep repeating the process.

A: In practice, that shouldn't be necessary. One of the tricks that I alluded to would basically mean that if you do something like add a new device and then remove an old device (they could be different sizes or whatever), then what we can do is allocate a monstrous contiguous region, and the size of the indirection table is proportional to the number of these monstrous contiguous regions that we have to allocate.

A: So we use that one entry, plus the space map that we already have for that device, to figure out where the new location is. Cool. There's another question about the performance of snapshots.

A: He's asking: even if I'm taking a snapshot of a zvol every hour, and I'm doing lots of heavy reads and writes to it, will there really not be any performance impact due to all the accounting that has to happen?

A: Basically, no. When you do a write to that zvol, it's potentially overwriting an old block, and the block that gets overwritten is logically free, but it could still be referenced by some snapshots.

A: If it's referenced by snapshots, then its block pointer is put onto the zvol's dead list. The dead list is a list of blocks that are no longer referenced but were referenced in the previous snapshot.

A: That is independent of the number of snapshots that there are. You could have a bajillion snapshots, and it's still just a matter of: whenever I free a block, I write it to the zvol's dead list. Now, I made some enhancements to the dead list code.

A: If you go and read my really, really old blog post from 2005, it describes this process of how snapshots are implemented, and if you read the details of that, you'll see that it doesn't matter how many snapshots there are; we only care whether the most recent snapshot references the block. Which was true until I made some enhancements to the dead-list management code.

A: The change that I made makes the dead list a composite of many lists, and the number of lists is proportional to the number of snapshots there are. So it is true that when you have more snapshots, then as you logically free blocks from that zvol, those block pointers will need to be written, and those writes will be spread out over several different objects which hold the dead lists.

A: So there are some second-order effects there in terms of performance: rather than just appending to one object, when I free ten thousand blocks, instead of appending those ten thousand block pointers to the end of one object, I'm splitting them up across, say, a thousand different lists, putting ten things onto the end of each of those thousand lists. So there's some scalability cost; you don't get as good consolidation of the indirect blocks and things like that, so there's a little bit of impact there. I haven't measured it, but if you do see issues, I would definitely be curious to hear about it.
A: Yeah. So Richard Laager was mentioning that when you remove a bunch of different devices and have this chained indirection, it would be nice to eliminate that recursion. That would be eliminated by the trick I mentioned, of changing the actual block pointer in the file system.

A: When you read from the file system and we discover that the block was indirected, we'll go through all the layers of indirection to find the actual concrete location on disk, so that the next time you go and read that block from that file system, it will go directly to the location on disk.
A: There's a question about whether there is a ZFS internals book in production. Not that I know of. I think we're starting to try to document more of the ZFS internals, from an implementation point of view, on the OpenZFS website. Max has been doing a lot of that work; he actually just wrote up an article about dnode sync, which talks a little bit about freeing.

A: For example, when you truncate a file, or punch a hole in a file, we need to keep track of that and of how it gets written out, which is some stuff that he ran into while implementing the hole birth times for send and receive. Which kind of brings us back full circle to the beginning of this discussion.
A: Eric, George, do you know what the name of that tunable is? Oh, here, George has posted it. Cool. So I think we're kind of getting to the end of the questions; think about whether you have any more. And also, there's going to be a trivia question. Whoever answers the trivia question correctly, or whoever gets closest to the answer, is going to receive an OpenZFS t-shirt, which I will send to you through the mail.

A: So far, no one has gotten a t-shirt through the mail; a few people have requested one and I've been too lazy to actually send them out. Only people who don't have a t-shirt are qualified to enter this contest. But this is your opportunity to get a t-shirt without having to go to a conference that I'm speaking at.
A: Okay, all right, no questions. All right, so, the trivia question. You have to answer it quickly; you cannot go look this up. If I think that you have gone and looked it up, then you're disqualified.

A: The answer is in the form of a number, so you'll need to type in the number, and whoever gets closest to the actual number is going to receive a t-shirt in your choice of size, shipping included, to the United States. If it's outside the United States, then we'll have to work something out. All right.

A: So the question: I guess I should type this into IRC, because the people who are not actually in the hangout have a little bit of lag. All right.
A: So I just posted the question. In particular, I'm talking about the ZFS code in the .c and .h files that are part of the normal user-facing part of ZFS: the kernel code, libzfs, libzfs_core, the zfs command, the zpool command.
A: I'm going to give you guys a few more minutes, but if anybody looks it up, then I'm going to ridicule you. Jones says 440,000. Luke would have guessed thirty thousand, but he already has a t-shirt.

A: Anyone else want to guess? I know that most of you have not actually looked at the ZFS code, so you're definitely at a disadvantage. I was hoping that maybe some of the guys who have ported it to other platforms would have shown up, because I bet they would have a pretty good idea; they've kind of looked at all of it. All right, nobody else is going to guess.
A: So, this is what I measured on illumos; it was very similar on FreeBSD.

A: No, it's not 500,000, brewer. I meant to go and look up how many lines of code there are in some other file systems, but I didn't have time. But I know, George, when we were back at Sun, this is maybe 2007 or 2008, we actually broke 110,000 lines, and that was including all the code that we wrote.
A: So if you include kind of all the code that the ZFS team at Sun wrote, including ztest and zdb and stuff like that, it's now about 173,000. There was a time when we broke a hundred thousand, and I remember Jeff being really sad that our code had gotten so bloated that it was a hundred thousand lines of code.

A: You didn't guess that we would have broken that in the past.

A: Yeah, so, if I remember correctly, UFS was bigger, more than 110,000, at some point. Is that not right?

D: Yeah, I think that's correct. I think it was larger. I think there were even some device drivers that were larger.

D: I was trying to find it; I seem to recall that there was a thing that compared them.
A
And
just
the
dusty
files
in
zfs
is
is
ninety
nine
thousand
six
hundred.
So
if
you
just
compare
the
kernel
stuff,
then
we're
still
quite
a
bit
bigger
we're
more
than
double
the
size
of
of
ufs,
but
I
would
contend
much
more
than
double
the
functionality.
A: Well, I think what Jeff was comparing before was UFS plus SVM, the Solaris Volume Manager, which I'm sure is ridiculously huge.
A: Cool. All right, thanks, guys, for coming. We haven't determined when the next office hours will be, but I am proposing that it happen on October 31st, because that's always been kind of an auspicious date for ZFS; we've had a bunch of big milestones around that date. On October 31st, 2001,

A: we had the first tiny, tiny prototype, and then we open-sourced the code and integrated it into the Solaris kernel on October 31st, 2005. And I think some other big features also landed around October 31st, on Halloween.

A: It's also just a few days from my birthday. So hopefully in a few weeks we will get another ZFS expert to hold office hours. Thanks, George, for fielding like half of the questions that came at me today; hopefully you will have your own office hours at some point in the future.

A: So thanks a lot. And zsh, email me your mailing address and shirt size, and I'll send you a t-shirt. Thanks, guys.