From YouTube: August 2023 OpenZFS Leadership Meeting
Description
Agenda: OpenZFS Conference; ZED; RAIDZ Expansion; Fast Dedup; etc
full notes: https://docs.google.com/document/d/1w2jv2XVYFmBVvG1EGf-9A5HBVsjAYoLIFZAnWHhV-BM/edit#
A: All right, I guess it's time we can get started. The first reminder is that the OpenZFS Developer Summit is coming up soon. That'll be October 16th and 17th in San Francisco. The call for papers is open now; Matt would like everybody to get their abstracts in to him before September 5th, so that we can build the schedule for the conference. Okay, right now the agenda looks like it's all my stuff, but I hope other people have stuff to add.
A: The first item we have is some work we're doing on ZED, for Linux specifically, to extend the work we did a couple of months ago where you can configure ZED using per-vdev properties: how many I/O or checksum errors within a set amount of time trigger the auto-spare of the device. We're now extending that so that you can do the same thing for slow I/Os.
A: So if a particular disk has too many slow I/Os in the set amount of time configured with the vdev property, then ZFS will automatically replace it with a spare. I think we've all run into the situation before where one disk was dying, but not quite dead, and was dragging the performance of the entire pool down. So if ZED can detect that and automatically swap in a spare, if you have one, then that will get the performance of the pool back easily. Alan?
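A rough sketch of the N-events-in-T-seconds window this implies, in Python. The `SLOW_IO_N`/`SLOW_IO_T` names here are assumptions modeled on the existing `io_n`/`io_t` and `checksum_n`/`checksum_t` per-vdev properties, not confirmed from the meeting:

```python
from collections import deque
import time

SLOW_IO_N = 10   # hypothetical: fault after this many slow I/Os...
SLOW_IO_T = 30   # ...observed within this many seconds

class SlowIoWatcher:
    """Sliding window of slow-I/O timestamps for one vdev."""

    def __init__(self, n=SLOW_IO_N, t=SLOW_IO_T):
        self.n, self.t = n, t
        self.events = deque()

    def record_slow_io(self, now=None):
        now = time.monotonic() if now is None else now
        self.events.append(now)
        # Drop events that have aged out of the window.
        while self.events and now - self.events[0] > self.t:
            self.events.popleft()
        # True means: degrade the vdev and kick in a hot spare.
        return len(self.events) >= self.n

w = SlowIoWatcher()
assert not w.record_slow_io(now=0.0)   # one slow I/O is not enough
```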
A: Spares? That's a good question.
A: Anybody else see any gotchas with that, other than dRAID?
C: I'm thinking it would be nice to have the opposite protection if, for example, something affects not a single disk but the whole SAS fabric; that happens quite regularly and makes problems for many disks at the same time. It would be good to not activate spares in that case, because an attempt to rebuild while we are not sane in general would just make things worse. I'm not exactly sure how to formalize it, but it would definitely be good, so that we would not try to go beyond the pool redundancy and beyond pool stability.
A: And then, in a related one about ZED, we're also looking at improving the logic around when you pull a disk out of the enclosure and then put a replacement disk in the same slot. On Linux, depending on how you had the device named, say if it was named by the WWN or anything like that, ZED says this replacement disk isn't the same and doesn't do anything. So we're looking at setting it to look at the enclosure path.
A: That is, the enclosure path property, if it's not null: if the new disk is in the same enclosure in the same slot, then consider that a replacement for the disk that's missing and start the replace automatically, rather than having it be triggered by a person.
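A sketch of that decision logic, assuming hypothetical record layouts standing in for what ZED would see (the field names and paths here are invented for illustration):

```python
def should_autoreplace(missing_vdev: dict, new_disk: dict) -> bool:
    """Match a newly inserted disk to a missing vdev by physical
    location rather than device identity, per the logic above."""
    enc_path = missing_vdev.get("enclosure_path")  # e.g. a sysfs SES slot
    if not enc_path:
        return False   # no enclosure info recorded; stay hands-off
    # Names like the WWN will always differ for a brand-new disk,
    # so compare the enclosure/slot path instead.
    return new_disk.get("enclosure_path") == enc_path

missing = {"guid": 0x1234,
           "enclosure_path": "/sys/class/enclosure/0:0:6:0/Slot 3"}
arrived = {"wwn": "0x5000c500aaaa0001",
           "enclosure_path": "/sys/class/enclosure/0:0:6:0/Slot 3"}
assert should_autoreplace(missing, arrived)
```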
C: It would be good to have the logic be the same, I think. On the other side, I'm just slightly worried about extra replacements. We never used that logic, and it was just normally blocked by the fact that we are partitioning disks ourselves, and FreeBSD ZFS never partitions disks, so it effectively protected us from unwanted activity. I'm just thinking how it would be controlled on Linux; I don't want things to be replaced arbitrarily.
C: It reminds me, at some point there was a question whether the path on FreeBSD and Linux is actually the same. Originally they came from illumos, and some people complained that the Solaris format is pretty weird, and there was a wish to improve it. I don't remember where it ended up; on FreeBSD I had a feeling somebody was working on changing that format.
C: A readable string, yeah. There was some activity recently to change it on FreeBSD, to make it more reasonable and readable, but it would break compatibility with previous pools. From that perspective, that makes me wonder what we are going to do about those paths. Do we already have them different between different platforms? Because originally FreeBSD mimicked Solaris/illumos, so there is compatibility on that front, but I never knew how that looks.
C: Because if you take any enclosure with SES, you could get that path, right? If you have AHCI, FreeBSD emulates a virtual enclosure for AHCI ports, yeah, so in that case any modern system should have a proper physical path; you should find it without problems.
B: Sure, yeah. I'm looking for reviews still, so if anybody out there wants to contribute and help get this across the finish line, that would be helpful. There are a few ZFS Test Suite asserts that I'm still chasing down with Fedor, but they're kind of rare, so I'm not as concerned about those. And then I had a question about the FreeBSD side of things: when I do a commit, it doesn't look like any of the test bots run against FreeBSD.
A: I think that was part of the problem, because I noticed this on, I forget which, recent pull request; it didn't end up getting run against FreeBSD either.
B: Yeah, because I did a recent change which is FreeBSD-specific, so I really want to see a bot run the test suite. I added detection for the BTX bootloader on MBR.
B: There's a header there, so it's easy to detect, but I want to make sure I have a ZFS Test Suite test for that, to make sure that you get the negative test on that. And then the other question, about code coverage: we don't do code coverage anymore on Linux. I know that FreeBSD has code coverage capabilities, so I don't know if there's a way to opt in on the FreeBSD side to do code coverage. It would be nice to get code coverage numbers, but...
F: For the buildbot stuff, you may check with Tony Hutter; he's been doing some work on that infrastructure, so maybe he accidentally broke something.
C: Oh yeah, recently we had FreeBSD 12 and 13 there, to test at least two versions, but I haven't checked lately. Speaking about RAIDZ expansion, I wonder about timelines, like whether it's approaching; it's been implemented. I guess it's already too late for 2.2, and it probably won't be merged since it really introduces new feature flags. It's just that my sales team asks me regularly: when, when, when will we see it, yeah.
F: That seems likely. Brian, who isn't here, could probably comment more definitively, but I would probably recommend that we not try to stick it in at the last minute into 2.2. But I will take a look at the code review, probably later this week.
B: Okay. I noticed Pablo's on the call; I don't know if you would be available to look at it, or look at portions of it.
B: Okay. It's fairly straightforward, actually, and if you watch one of Matt's talks, it sort of outlines the high-level approach.
A: So I have an update on the fast dedup work we've been doing; maybe we'll spend a few minutes on that. The kind of results we have so far are actually pretty interesting, so I'll just quickly share that.
A: So this is all just normal dedup in master. Notice that with block shift equal to 12, you can see the blue line there: as we wrote more and more data to the pool with dedup, the amount of inflation, the amount of writes we had to the dedicated vdev, went up as we got more objects. But obviously you see that with the higher block size of 32K or 64K, we see much higher inflation, almost double.
A: So that change, while making the DDT itself quite a bit smaller and more compressible, actually has the effect of making the write inflation, especially with small blocks and dedup, much worse. But that also made it easier to compare with log dedup and see what the mitigations for that were. The other main difference that explains some of the reduction in amplification you see with the work we're doing is the change to the actual on-disk structure of the DDT. It uses a new set of ZAPs with a feature flag, and will only activate if you didn't have traditional dedup on. In traditional dedup, each entry in the DDT had four ddt_phys_t's, so it could store the list of DVAs for a normal single-copy dedup, for copies=2, and for copies=3, and then it had the deprecated slot for the ditto block.
A: So that means it actually had room for 12 full DVAs, plus refcounts and birth times, making each entry quite a bit bigger and mostly full of zeros; that's why each entry is separately compressed with ZLE before it's actually put into the DDT. We've replaced this with some new code that has just the one set of three DVAs, but has the ability to update the existing dedup entry.
A: If the new block coming in has a higher number of copies, then we'll allocate just the one extra DVA and update the entry to have the existing single DVA and the new second DVA. I don't think people really change copies= around that much for it to be a big impact, and if you ask for two copies at some point, then you asked for this and that's just how it's going to be. But this way we save three quarters of the biggest part of each DDT entry.
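To make the "three quarters" concrete, a back-of-the-envelope sketch; the field sizes are simplified approximations of the structures described here, not the actual OpenZFS definitions:

```python
DVA_SIZE = 16     # a DVA is two 64-bit words
REFCOUNT = 8
BIRTH_TIME = 8

def phys_size(ndvas: int) -> int:
    """One ddt_phys_t-like slot: DVAs plus refcount plus birth time."""
    return ndvas * DVA_SIZE + REFCOUNT + BIRTH_TIME

# Traditional dedup: four slots (single-copy, copies=2, copies=3, ditto),
# each with room for three DVAs -- 12 full DVAs, mostly zeros.
old_entry = 4 * phys_size(3)

# Fast dedup: a single slot with up to three DVAs, grown on demand.
new_entry = phys_size(3)

print(old_entry, new_entry, 1 - new_entry / old_entry)
# 256 64 0.75  -> three quarters of the biggest part of each entry saved
```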
A: And then we implemented the first version of the dedup log. As new changes come in, we write them to an append-only log and maintain the list of what those changes are in a separate, in-memory AVL tree. Then, once that log reaches some criteria (we're expecting a maximum size or age), it will condense the log: basically truncate it after writing out all of those changes to the ZAP. The idea is that this would hopefully amortize the cost of using a larger leaf and indirect block size, like we saw in the first slide there.
A: Right now the prototype isn't that smart; all it does is condense the log every N transaction groups, but it was enough to be able to show what the performance impact of that is.
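A minimal model of that flow, with plain Python containers standing in for the AVL tree, the append-only log, and the on-disk ZAP; the interfaces are assumptions for illustration:

```python
class DedupLog:
    """Changes accumulate cheaply in a sequential log plus an in-memory
    map, then get folded into the (expensive-to-update) ZAP every N txgs."""

    def __init__(self, condense_every_n_txgs=32):
        self.pending = {}   # stand-in for the AVL tree, keyed by hash
        self.log = []       # stand-in for the on-disk append-only log
        self.n = condense_every_n_txgs

    def record_change(self, block_hash, entry):
        self.log.append((block_hash, entry))  # sequential, cheap write
        self.pending[block_hash] = entry      # lookups never touch disk

    def lookup(self, block_hash, zap):
        # Anything still in the log is guaranteed to be in memory.
        if block_hash in self.pending:
            return self.pending[block_hash]
        return zap.get(block_hash)            # only here do we read the ZAP

    def sync(self, txg, zap):
        if txg % self.n != 0 or not self.pending:
            return
        # Condense: fold all batched changes into the ZAP at once, so a
        # leaf block rewritten here carries many updates, then truncate.
        for h in sorted(self.pending):        # hash order groups leaf blocks
            zap[h] = self.pending[h]
        self.pending.clear()
        self.log.clear()
```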
A: To test this, we ran fio on a dataset with a record size of 8K, chosen mostly to allow us to get a large number of dedup entries without having to write a huge amount of data, but with better performance than we got when we tried to do it with 4K, so we get better throughput. All the blocks are completely randomized, so they fill up the unique DDT; there's no actual dedup happening here. This is just the worst case of 100% unique data being written. So we make a new dataset and write out four 2-gigabyte files to it, using a couple of threads to write basically eight gigabytes of data, which creates about 1 million DDT entries. Then we stop and measure how much data has been written to the dedicated vdev in the pool, export and import to reset that to zero, create another dataset, and write eight more gigabytes. We did this a number of iterations to show how it affects performance over time as the dedup table gets bigger and bigger.
A: So here we see in blue and red the first two lines: those are the existing traditional dedup code, red being the previous case where each leaf is 4K, and blue being where each leaf is 32K, and you see, like in the first graph, that big difference in the amount of write inflation. Then at the bottom, the green line is the fast dedup code, where it's writing into a log and then updating the ZAP; both the log and the ZAP live on the dedicated dedup device.
A: So all the log writes are included in these numbers. Just batching up 32 transaction groups' worth of writes and updating the ZAPs only once every 32 transaction groups, you can see, greatly reduces the amount of write amplification. Doing it every 256 transaction groups, you can see, has this kind of spiky effect: every time we do it, there is a bunch of extra data to be written out, and something smoother might make more sense.
A: This is a small one, yeah.
F: I don't understand why we see it, though. I know you have more work in mind, but I would think that, even aggregating 32 transaction groups at a time, given that you're doing random writes, you would still have only one update to every leaf block of the ZAP, right? Because it's fully randomized in terms of where it's being updated, unless the DDT is very small.
F: So do you think that what's happening is that the DDT is so small that you're actually fully rewriting the DDT, and you're getting multiple entry updates in each leaf block of the DDT each time? Or is it one entry updated per DDT leaf block, and it's only in the indirect blocks that we're seeing a benefit here?
A: We're definitely seeing the benefit in the leaf blocks as well, because there aren't that many of them in the end when you're at the smaller size. When we looked with zdb at the DDT ZAP at the end, after the 10 million, there was a decent skew of birth times, where not every block in the ZAP had to be updated each time, yeah.
F: I guess maybe it would be interesting to see how this scales. 80 gigabytes is probably not that big compared to the challenging use cases, so it'd be interesting to see this go up to, like, terabytes, for sure, and see if there are some discontinuities in those lines.
A: Yeah, this was just the first cut, to tell if our prototype was doing what we thought it was doing. Because, yeah, we've got some data from a couple of real-world dedup customers that have, you know, 320 million objects in their dedup table, and at that scale things are going to be quite a bit different.
A: One of the things this work is hoping to do, by using the sharding of the ZAPs, is to further concentrate updates and make them slightly less random. So part of that is sharding on the physical block size, doing that in 4K increments or something, because even just doing every 4K increment from 0 to the 16-megabyte max record size would be, you know, 4,000 shards. But it means that if you have a block that's 1.4 megabytes, then you only have to search the ZAP of blocks of that size to find if the hash is already there. And also, if you're concentrating your updates, then when you're flushing one of these logs that contains only the updates to 16K blocks, you're writing a much smaller ZAP, and hopefully it's less random; you get more concentration, so that, again, you get more of this amortization effect.
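A sketch of the deterministic shard choice being described; the 4K granularity comes from the discussion, while the helper name and exact rounding are assumptions:

```python
SHARD_GRANULARITY = 4096            # 4K increments, as discussed
MAX_RECORD_SIZE = 16 * 1024 * 1024  # 16M max record size

def ddt_shard(physical_size: int) -> int:
    """Pick which per-size ZAP shard a DDT entry lives in.

    Because the shard is a deterministic function of the block's
    physical size, a lookup or log flush only ever touches one
    (much smaller) ZAP instead of one giant fully-random one.
    """
    assert 0 < physical_size <= MAX_RECORD_SIZE
    return (physical_size - 1) // SHARD_GRANULARITY

# A 1.4 MB block only needs its own shard searched:
print(ddt_shard(int(1.4 * 1024 * 1024)))   # -> shard 358 of 4096
```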
F: I think the goal here is, when you're flushing out the changes, to have multiple updates go into each of these blocks, right? Yeah. So I think the optimal way to do that would be to have all of your changes sorted by hash value, keep the most that you can, and then write them out into the DDT by hash value. That's great.
F: And since you can do this over multiple TXGs, you can kind of do it continuously. So you're like: okay, my memory limit is, I can use a gigabyte of RAM for these pending changes, and then, once I get close to that, I just start eating through it, writing out by hash value and keeping a cursor of where I am along the way.
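A sketch of that cursor idea under the same assumptions as above (plain Python containers standing in for the in-memory tree and the DDT ZAP):

```python
import bisect

class CursorFlush:
    """Keep pending DDT changes sorted by hash, and each txg walk a
    cursor forward through hash space, writing out one slice.
    Neighboring hashes share DDT leaf blocks, so each rewritten
    block absorbs several updates."""

    def __init__(self):
        self.pending = {}   # hash -> entry (the in-memory tree)
        self.cursor = 0     # last hash value flushed

    def flush_some(self, zap, budget=1000):
        keys = sorted(self.pending)
        # Resume just past the cursor, wrapping around hash space.
        start = bisect.bisect_right(keys, self.cursor)
        batch = (keys[start:] + keys[:start])[:budget]
        for h in batch:
            zap[h] = self.pending.pop(h)
        if batch:
            self.cursor = batch[-1]

zap = {}
cf = CursorFlush()
cf.pending = {0x30: "e1", 0x10: "e2", 0x20: "e3"}
cf.flush_some(zap, budget=2)   # writes 0x10 and 0x20; cursor is now 0x20
```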
A: Exactly, and that's kind of what I'm saying: kind of like we do with scrub, we want to spend, you know, this much of each transaction group flushing these out, or whatever, yeah.
F: I think if you do that, that's going to give you kind of the optimal behavior, and you don't really need any additional sharding on top of that. Because, you know, you mentioned that we separated it out so we don't have to search; but since it's a hash table, you're not searching at all, right?
A: So yeah, we'll look at that. This is basically the same data, but looking at the write throughput, and we see that, as the ZAP gets bigger, that does have more and more effect. Part of this is that we were purposely exporting and importing the pool between each of these runs, so we're starting with an empty ARC each time. Also, this was a relatively small VM whose ARC was not going to be able to cache the whole DDT anyway, but in seeing the effects of the different approaches, we still see quite a big improvement. I think, like Matt said, some of the ways we can amortize most of these costs can end up making a big difference and allow us to take advantage of being able to have larger leaf blocks, so that it's less fragmented and we get better prefetch and so on for dedup as well.
A: The other thing we noticed is that right now we store all the DDT ZAPs as copies=3, which makes sense, especially for the duplicates ZAP: you could never free a block ever again if it was damaged, without knowing whether you could decrement it properly and so on. But the uniques ZAP is different. With the prune concept, we'll be able to go through the unique ZAP and delete some entries to make room, to keep the size of the DDT from getting too big, especially to make it fit in memory, or at least constrain it to the size of your dedicated vdev.
A: So with that, it means that if we happen to damage the unique ZAP table, it wouldn't be catastrophic to the pool. If we lost the whole ZAP, it would mean that new writes wouldn't have a good chance of deduping against the old blocks, but the pool wouldn't lose any data or anything.
A: Because, yeah, just looking at real-world use cases we've seen: even with customers getting dedup ratios of like 3.5 to 1, they still had like 240 of their 320 million records unique, and that's a lot of stuff to store three times. It just takes space, but also, every time you're writing one, you're writing it out three times.
F: When you're pruning entries, or presumably could lose them here: have you changed the way that scrub works so that it can still scrub these blocks that have the dedup bit set? Because normally, when scrub is traversing the block pointers, it ignores ones that have the dedup bit set, and it scrubs them by finding them in the dedup table.
F: So I think that even to handle the freeing, or evicting things from the unique ZAP table, you need to change how scrub works: presumably have scrub actually look at the blocks with the dedup bit set, and then go look them up in the dedup table.
A: Yeah, because that would also have a big effect on the dedicated vdev: you can fit more DDT if you're not writing three copies of all the blocks to your dedicated vdev. And three copies on the same vdev is not as helpful as the way copies normally works for regular blocks, where it would try to spread those copies across different vdevs; when you're using an allocation class, it's going to put all three copies on the special vdev.
G: Question: I like the idea of not doing copies=3 for the unique ZAP, but also, the biggest problem from my tests was how much we have to read during transaction group sync, because the hash is random.
G: So we have to read all the indirect blocks. I was actually wondering if it wouldn't be better to have a single DDT, one large one, and just read once, instead of looking into two different ones for the entry, because that multiplies the number of reads, and reads are, of course, synchronous during transaction group sync, so that slows down the transaction group.
G: Because to look up each entry in a large DDT we need like four or five reads per entry, right? And we are doing this during transaction group sync. We cannot prefetch; we can only prefetch for frees, because for a free we do have the hash, but for writes we cannot do prefetching because we don't have the hash yet. So during transaction group sync, if we have two ZAPs, we will have to do the reading twice, to try to find the entry in each of them, yeah.
F: I mean, you might consider maybe reducing copies to two, just for all DDTs, because I feel like it's not super necessary, given how people configure their pools. The copies stuff was neat when you're thinking about a non-redundant pool and adding a little bit of redundancy for very little cost, but in this case the cost is high, and, you know, probably people are using this with RAIDZ or some type of redundant pool.
F: They care about their data and they've configured it to be redundant enough that they aren't going to lose it. In the single-block case, you'd only worry about this one scenario: I've lost the maximum number of disks, I haven't reconstructed yet, and then I also lost one random block, and that happened to be a dedup block which was important, and now the additional copy really saves me; which is hopefully not that common. So I would think that reducing it to copies=2, just for now, would be fine, and then maybe think about: if you reduced it to copies=1 for everyone, all the time, what would the implications be?
F: Could you not lose data, but just leak stuff, right? Because we're only really worried about losing access to one, or a small number, of blocks of the dedup table. So if we could write the code such that, if you lose one block of the dedup table, then what's going to happen is that whatever is referenced there is leaked forever.
A: Yeah, that conflicts with the concept of dealing with "we couldn't find this hash in the dedup table, so it must have been a copies=1, or a refcount of one, that we purged."
F: You'd have to know. If we can't read this, if we can't determine whether it's in the dedup table or not, then we have to assume that there are still more copies of it and not forget it, right? But if we can read the block that it should be in, and we find it is definitely not in the dedup table, then you can still do the write.
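A sketch of that decision rule under the hypothetical reduced-copies DDT; the interfaces here are invented for illustration, not the actual code paths:

```python
class DDTUnreadable(Exception):
    """Raised when the DDT block that should hold an entry is lost."""

def on_free(block_hash, ddt):
    try:
        entry = ddt.lookup(block_hash)   # may raise DDTUnreadable
    except DDTUnreadable:
        # Can't prove the block is unreferenced: leak it forever
        # rather than free something another reference still needs.
        return "leak"
    if entry is None:
        # Definitely absent: either never deduped, or a refcount-1
        # entry that pruning purged, so the free is safe.
        return "free"
    entry.refcount -= 1
    return "free" if entry.refcount == 0 else "keep"
```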
F: Anyway, I think it would be nice to preserve only needing to look in one ZAP, for performance reasons, because that's like double your performance right there, versus having to look in two ZAPs.
A: Yeah. The only other use case we've done before, for a different customer, because DDT prune didn't exist at the time, was to drop the entire unique ZAP at import with a flag. If you set a flag while importing the pool, we just dropped that entire ZAP, truncated it and made a fresh one, so you'd only have your blocks that were already deduped to dedup more against.
A: If we're going to have real DDT prune, where we can walk through the tree and clean stuff up, then I don't think we need that separation anymore, but we'll get familiar with it and see if we find out why it was originally split up. Because, yeah, the advantage with the sharding is that, because it's deterministic, you only ever have to look in one specific shard, so you're never having to look at more.
A: I think, yeah, especially if we're going to shard, not having to have two of every shard would help on our side as well, and make the code a little bit easier to deal with.
G: Do you have any numbers on how this dedup log impacts reads during transaction group sync?
A: I'm not concerned; there are no reads, yeah. While entries exist in the log on disk, they also exist in the AVL tree in memory, so you would never have to read from disk: if the hash you're looking for is in the log, it will always be in the AVL tree in memory, yeah.
G: But I'm more concerned that, because you have the log, you are not really writing the updates to the DDT every transaction group. So every 32 transaction groups, when you try to sync the AVL tree, you will need a large amount of reads; you will have spikes in the reads.
F: I think if you take the approach that I outlined, where you have some limit for the size of the AVL tree, and then you have a cursor of the last hash value that you wrote out, then every txg you walk forward a little bit, writing some of the entries; and those are the entries that all have the most similar hash values.
F: That way, you're the most likely to have consolidation: multiple entries being modified in the same block of the DDT, right?
G: That's interesting, because we could also limit how much AVL tree we keep by monitoring how many reads we did, and how much we slowed down the transaction group sync.
A: That's why I was thinking, in particular, like scrub, of limiting the flushing of the log to some number of milliseconds per transaction group.
G: All in all, it's great to see DDT, dedup, being worked on, so I think everyone is happy, yeah.
C: Yeah, okay. Just from the experience of replaying the metaslab spacemap logs during pool import, I'm thinking it would be good, if load is low, to somehow flush more of the dedup log; if we could flush it more often, we would not accumulate too much in the log to replay during pool import, yeah. So, just a wish.
A: Yeah, that's definitely one of the things on my radar: making sure, especially for failover and so on, that the import time can't get too out of hand, because that'll not be useful to anyone. And so, yeah, again, that's why I was thinking we'd spend up to this much time, or at least this much time, each transaction group flushing things out.
F: I don't know if you already covered this, but the Developer Summit is coming up soon, and we would love to hear about all the great work that folks are doing. The deadline to submit a talk is in three weeks, September 5th.
C: Yeah, I would like to bring attention once more to my second attempt to refactor ZIL writing and locking. In the previous attempt, which is currently committed and merged to 2.2, George found a deadlock caused by the fact that there is no way the ZIL can sleep or block or anything like that after it has allocated the next block, or it may deadlock in the end.
C: It also, in a way, makes the code cleaner, because I previously tried to avoid those scenarios explicitly, one by one, until I found it's impossible to do it for all of the scenarios.
C: So it's a slightly bigger refactoring and a slightly more complicated state machine, but I believe it's much cleaner in the end, and unlike the previous implementation there are no deadlocks. I'd like this to be, if possible, reviewed and merged sooner, and hopefully get into 2.2. Even so, George so far is the only one who reproduced that deadlock on his workloads; still, I am not very comfortable about that.
C: Compared to the previous one, this also moves a few more operations out of the ZIL issuer lock, so it should be even more scalable than the previous one. It does a couple more atomics, a couple more lock/unlocks, but I haven't seen any more contention, more or less the same, so it looks good to me, much better than the previous one, but it needs a look. George promised to turn around some testing this week.
A: The one other one I had on my list here was: we already have per-user quotas, but there's some interest in per-user reservations, so that you can make sure that, if you have a dataset for a class at a university or whatever, while the students can use up the whole quota for the dataset, the professor has a reservation so that they will always be able to write.
C: It just comes to my mind that somebody also asked for the possibility to do reservations and quotas not on a physical level but on a logical level, because if you are giving somebody a promise that they can write 100 gigabytes, they probably need 100 gigabytes of logical size, not physical.
A: Yeah, I've definitely seen people asking about logical quotas instead of physical quotas, so that if the data happens to compress, that's to the advantage of the owner of the pool, not the user; you can only write 100 gigs of logical data or whatever. But yeah, I was trying to think about the interactions between a user quota, a user reservation, and then maybe a project quota and the dataset quota and reservation.
E: A lot of questions. Not currently, but historically, when I ran a number of multi-user systems, we ended up working around not having that by basically just having a dataset per user for their home directories, and then giving each of those a reservation. So I think there's certainly probably demand for it; I don't know if there's enough demand to warrant the complexity, but...
E: ...so far, on what should be done there, I was curious what other people's thoughts might be, because I'm going to go collect a bunch of data on when this would be useful, but the data I'm going to get will only be as good as how constructive the thoughts are that I go in with.
E: I figured that just starting with that much of an upper bound would be a reasonable start, because that way you don't have to, say, save two megs to compress a 16-meg block. Yes.
A: Oh, that reminds me: I have to clean it up and upstream it, but we've developed compression=slack, which just trims off trailing zeros.
A: When you have record size equal to 16 megs and your file is 20 megabytes, and you have compression off, because, say, the file's already encrypted or whatever, currently you end up wasting: you're writing a whole bunch of zeros at the end of that block, because compression is off. Compression=slack just takes the trailing zeros off the end of the last block, and can really improve the write throughput of files...
A: ...that are not divisible by the record size. Especially if you're writing a bunch of 17-megabyte files to a dataset with record size equal to 16 megs, you end up writing practically 40-something percent zeros to your disk, and it really hurts your throughput.
A: Well, it doesn't even do ZLE; it just sets the physical size to end where the logical data does, and then it doesn't have to inflate the zeros back, because the logical data is only that long, right.
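A byte-level sketch of the idea; the real feature would sector-align the physical size and track the logical size in the block pointer, so this is only the shape of it:

```python
def slack_compress(block: bytes) -> bytes:
    """Trim trailing zeros; the physical size ends where the data does."""
    return block.rstrip(b"\x00")

def slack_decompress(payload: bytes, logical_size: int) -> bytes:
    """Re-inflate by zero-padding back to the logical size (even this
    can be skipped if the consumer only wants the logical bytes)."""
    return payload.ljust(logical_size, b"\x00")

RECORD_SIZE = 16 * 2**20                  # recordsize=16M
tail = b"x" * (1 * 2**20)                 # last 1 MiB of a 17 MiB file
block = tail.ljust(RECORD_SIZE, b"\x00")  # the padded final record

stored = slack_compress(block)
assert len(stored) == 2**20               # ~15 MiB of zeros never hit disk
assert slack_decompress(stored, RECORD_SIZE) == block
```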
G: What about embedded blocks? Because we currently also try to compress embedded blocks.
G: Yeah, yeah, but because of the metadata of the compression algorithm, you can actually store much less, well, much less than you could otherwise.
E: If you're going to do that, you probably also want to use the various options for, like, LZ4 and zstd to skip a bunch of the header data, if you know you don't care about it, because it's a tiny frame anyway. But I don't know how much that would save you in practice, when you're only getting 112 bytes or something.
A: We tried, yeah. With LZ4, even outside of the compression header, we have a ZFS header with the original logical size; that's like the first eight bytes, or four bytes, of the data, and so on. So, yeah, there could be an opportunity to get more data into embedded block pointers, especially if people had, you know, 100-byte files.
A: Basically, I don't remember all the details now, because it was just a couple of months ago when we wrote this, but I think part of it is avoiding some of that inflating to the full size, because why copy a bunch of zeros around in memory if we're not going to write them down? In particular, we're trying to avoid doing that as well when recreating it on the other side.
A: So when we decompress, we don't need to fill memory with a bunch of zeros just to throw them away when we return the data to user space, and it's only going to be the original logical size.
A: Right now it's just a special slack setting, because the customers said: we want compression=off, because we don't want to waste CPU time on LZ4 on encrypted data that's not going to be compressible. But when we switched our record size from 128K to 16M, suddenly we were losing a lot of space, and we were seeing worse performance because we were spending all this time writing out zeros on all of our randomly sized objects.
A: I think that threshold definitely dates from when 128K was the max record size, yep.
A: I think we're at time here. So, thanks, everybody. When's the next one, Matt?
F: Four weeks; it should be four weeks from now, September 12th, yeah.