From YouTube: November 2022 OpenZFS Leadership Meeting
Description
Agenda: quota performance; ARC MRU/MFU; BRT details
full notes: https://docs.google.com/document/d/1w2jv2XVYFmBVvG1EGf-9A5HBVsjAYoLIFZAnWHhV-BM/edit#
A
I have a quick update from the conference, although I think most folks here were at the conference. I had a great conference; I thought there were a lot of great talks and a lot of great socializing with other developers that I haven't seen in several years. We're posting the videos now, so I'm getting about one a day uploaded to YouTube, so that content will be available to folks who couldn't attend in person.
B
Speaking about topics, I'd just like to bring attention to my pull request for review, for some ARC refactoring. It started unexpectedly for me: I was going to work on uncached prefetch during the hackathon, but ended up first digging into the ARC, got stuck there, and ended up refactoring a few things that got ugly over the years. So there's a number of refactorings, and I'd like some people to have an eye on it, because at this point I'm just fixing unrelated test failures which are not even mine.
A
Yeah, I took a quick look at your description, and you know, it all sounds good. I didn't look in detail at the code yet.
B
While looking at it, I found a few more bugs around that logic, separate from it, but still. For example, right now we first evict data from both lists, and only then start evicting metadata. I bet on a system with a lot of metadata our balance between MRU and MFU is heavily biased toward metadata. I'm not sure whether that was intentional or it's another bug.
A
Yeah, I've seen that kind of imbalance in the past on real systems. I suspect that there are some issues with the logic there.
A
Cool. By the way, Muhammad, I was just responding to your request for a Slack invite. For some reason your email address didn't come through on the mailing list, so if you could send me your email address, I'll get you your invite to Slack. And that applies to anybody else who's watching or listening later on: if you want to join the OpenZFS Slack, just send me your email address and I'll give you an invite.
B
So, lastly, I saw that by default quotas are still enforced, so I can see that that's good. At least it's not bad.
E
I think it depends how much you're writing, but yeah. It was mostly just to make sure we never slow down unnecessarily on a zvol or...
E
The quota is purposely kept relatively small, and you know, the amount of outstanding writes could be larger, resulting in you still waiting, but that's why it's not on by default.
B
No
I
still
hope
that
what
I
I
don't
remember
what
was
the
amount
of
over
quota?
But
if
it's
in
percent
I
think
before
any
long
too
small
right,
it
should
be
sufficient
or
well
and
enough
to
not
slow
down
too
much.
But
it
just
looks
much
less
invasive
to
me
yeah
and
be
good
about
that.
E
On vdev properties on the root vdev and how you access them: Rob will push an update to that pull request today or tomorrow, and then I think that one should be good to go. Then Marius should be able to finish the last update to the fail fast one, to make the Linux-specific tunable, or module parameter, for setting the mask; and we've changed the dataset property back to being simple, so that it'll be compatible with something if we add a similar mechanism to FreeBSD.
E
I think that's all of the progress we have open. We just started working on the force export one again, so it's not really ready to get approved or merged or anything just yet, but we're working on that one actively again. I think, Matt, that will include, based on your feedback, the small semantic change to failmode=continue where we will actually fail out rather than wait on an fsync that started before we suspended; but then we suspended, and you know, we should fail there.
E
Instead of waiting forever. And likely add a new failmode=export that will, you know, force export a pool that gets suspended, so that it can't jam up any other pools that happen to be on the system.
A
Cool, thanks, Allan. I see a couple other folks have joined. There wasn't anything on the agenda that I saw in the agenda doc, so it's just open discussion of any questions or topics that folks have. So what would folks like to talk about?
E
I guess we don't have Ryan today, but there have been more questions around when we would be doing a 2.2 release.
E
One thing we were looking at is the per-dataset I/O throttling stuff that I talked a bit about at the conference. Based on that, we've got some interest from folks in actually getting that done, and we've looked at a couple of different places where you might implement it; in particular, whether it makes sense to do it near the top, on the VFS side, or lower down, more towards the vdev side.
E
The
pros
and
cons
to
both
of
those
a
bit
obviously
doing
it
closer
to
the
vdev
side
means
that
really
large
rights
are
already
broken
down
into
more
reasonable
chunks,
and
we
have
it
accounts
for
the
extra
things
that
happen
like
the
metadata
updates
that
are
caused
by
what
the
user
does,
so
that
those
count
against
their
quota,
whereas
if
you're,
just
at
the
VFS
level,
maybe
doesn't
foreign
things
like
that,
but
also
you
know
once
we're
too
far
down
towards
the
video
level.
E
how much do we still know about which dataset this write is associated with, so we can charge them for it? We're kind of looking at all of those; just interested if anybody has opinions or experiences to share.
F
If you look at it, and I don't recall the specific places since I didn't write it, but on SmartOS this was never upstreamed; there it's per zone, so like a container.
E
So there's a pull request where someone ported it, mostly the Linux side; it's a pull request somewhere in the 9000s. I've looked at it, and it looks kind of interesting, although it's mostly about ensuring fairness between the different zones, not so much strictly limiting a specific zone or dataset, yeah.
F
I
mean
it's
not
about
I,
don't
think
you
can
set
like
a
specific.
I
o
rate
as
much
as
you
can,
whatever
the
IRA
rate
the
pool
can
sustain,
you
can
divide,
you
know
proportionally
divide
it
between
things,
because
it's
kind
of
it's
kind
of
in
this
on
a
lumos
or
something
called
the
fair
share.
F
which can kind of work the same way for scheduling between zones, or things called projects, which have nothing to do with ZFS projects but are essentially groups of processes. So I believe it was kind of modeled after that; I'd have to ask Jerry for sure, but that's kind of how it works.
F
A
similar
idea,
where
it's
kind
of
pro
you
know
a
relative
proportion
to
the
other
stuff,
but
at
least
some
of
the
techniques
in
terms
of
you
know,
slowing
you
know
essentially
I
believe
it
basically
adds
latency
to
certain
iOS
to
kind
of
achieve
the
fairness
with
the
others.
I
don't
know
if
at
least
some
of
those
techniques
or
bits
there
might
be
useful.
A
Yeah,
depending
on
like
the
semantics
that
you're
trying
to
achieve
that,
might
it
might
be
better
to
do
it
at
the
you
know
the
like
the
SPL
layer
versus
at
the
over,
I
o
layer
like
if
something
maybe
what
you
want
to.
Maybe
what
you
want
to
expose
is
like
a
quota
of
you
know,
this
data
set
can't
do
more
than
x
megabytes
per
second.
A
If that's the case, then I think you could do it at the ZPL layer pretty straightforwardly, because you know how many bytes are being written, and also read; and also, you know, you're exposing the megabytes per second to the user.
A
So, you know, if there's an overhead of, like, reading metadata, then maybe they shouldn't get charged for that, because they don't really have as much control over that. Similarly, they don't have as much control over compression, like the compression ratio or whatever. So maybe they should be charged for the, you know, pre-compression or uncompressed size, which is what's easily available at the ZPL layer.
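
(A minimal sketch of the kind of ZPL-level accounting described above, assuming a hypothetical per-dataset structure and limit; none of these names exist in OpenZFS today, and the back-pressure here is deliberately crude.)

```c
/*
 * Hypothetical per-dataset bandwidth quota, charged with logical
 * (pre-compression) bytes at the ZPL layer.  All names are made up
 * for illustration; this is not existing OpenZFS code.
 */
typedef struct ds_bw_quota {
	kmutex_t	q_lock;
	uint64_t	q_limit;	/* bytes per second, 0 = unlimited */
	uint64_t	q_used;		/* bytes charged in this interval */
	hrtime_t	q_start;	/* start of the current interval */
} ds_bw_quota_t;

static void
ds_bw_charge(ds_bw_quota_t *q, uint64_t nbytes)
{
	boolean_t over;

	mutex_enter(&q->q_lock);
	if (gethrtime() - q->q_start >= NANOSEC) {
		/* Start a new one-second accounting interval. */
		q->q_start = gethrtime();
		q->q_used = 0;
	}
	q->q_used += nbytes;
	over = (q->q_limit != 0 && q->q_used > q->q_limit);
	mutex_exit(&q->q_lock);

	/* Crude back-pressure: stall the caller briefly when over budget. */
	if (over)
		delay(hz / 10);
}
```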
E
Yeah, we've even gone back and forth on, you know, if it's an ARC hit, should we charge them for it or not? Because there are advantages to not: they're not actually using up a precious resource in that case. But does that then feel weird to the user, where sometimes it's fast and sometimes it's not? But do we want to limit them to a lower speed for no reason? And yeah.
E
Yeah,
especially,
you
know
if
it's
partly
like
artificial
scarcity,
trying
to
do
an
upsell
or
something
to
charge
more
for
apps.
Maybe
you
want
to
limit
everything,
whereas
you
know,
if
you're
just
trying
to
make
sure
that
noisy
neighbors
don't
steal
all
of
the
the
disc
bandwidth,
then
it
makes
sense
to
to
charge
for
what's,
you
know
actually
consumed
rather
than
it's.
A
Well, you know, the one that's issuing a million at once is going to saturate your storage back end, and the one issuing one at a time is going to experience much higher latencies because of the higher queue depths. That's the kind of worst-case noisy neighbor, so maybe you're just interested in preventing that, yeah.
F
Don't
know
the
other
part
too,
if
you're
trying
to
specify
certain
you
know
bandwidth,
you
know
the
the
difficulty
is
I
guess
you
know
in
terms
of
you're
dealing
with
over
subscription
because
even
with
a
given
pool
you
know,
I
haven't
really
done
much
testing
with
this,
but
I'm
guessing
that
the
actual
quote,
unquote,
Max
bandwidth
of
the
pool
is
probably
gonna
vary
depending
on
the
I
o
pattern.
You
know
if
you're
talking,
you
know
physical
disks,
you
know
random
versus
sequentials
gonna.
You
know
impact
the
I
o
rate.
F
You
know
if
you're
talking,
you
know
SS,
you
know
D
or
if
you
have
log
devices
in
there,
that
all
kind
of
makes
it
hard
to
say.
Oh
well,
this
pool
can
sustain,
you
know,
can
do
you
know
xio.
You
know
megabytes
or
gigabytes
a
second
of
I
o
and
I'll,
even
the
IR
rate.
So
that's
the
other
thing
with
I.
Guess,
waiting
I,
guess
is
kind
of
you
know.
I
think
was
probably
the
other
rationale
behind
at
least
the
approach
that
was
taken.
I,
don't
know
if
that
helps
any,
but.
A
Yeah
I
think
that's
that's
definitely
true
and
the
like.
If,
if
the
quota
is,
if,
if
you're
leading
a
quota
of
megabytes
per
second,
then
the
issues
that
you
raised
fortunately,
are
not
relevant
to
that
right,
because
you're,
just
saying
like
well
like
I'm,
going
to
prevent
you
from
doing
more
than
this
other
things
might
prevent
you
from
doing
other
amounts
of
I
o,
which
might
be
less
than
the
quota
just
like
with
the
disk
usage
quota.
E
But I think we are leaning towards basically I/O quotas of IOPS and megabytes, and probably doing that closer to the physical level, just to...
B
And there I had my idea of how we could improve arc_p behavior in the case where the same data should be, at the same time, MRU and MFU, because it is recent and it is frequently accessed. Right now it behaves slightly weird: we move it to MFU, but arc_p is not decreased, so we can accumulate more MRU data. As a result, some of the data currently left in MRU becomes very, very old, surely much older than it should be, and I think we could improve that.
B
If,
if
we
explicitly
track
buffers
which
are
more
promoted
from
mreu
to
mfu
for
proper
tracking
and
Mario
depths
like
you,
we
would
know
that
X
megabytes
of
recent
data
should
be
kept
no
matter.
How
much
do
we
promote
Let's
issue,
14
120?
If
somebody
wanna
go
on
command
there,
I
haven't
tried
to
implement
it
yet,
but
it
seems
not
so
complicated.
It
makes
sense,
but
maybe
I'm
wrong
in
understanding
the
concept.
A
Oh
yeah,
I
haven't
looked
at
that
one,
but
I
think
like
I
agree.
There
probably
is
a
lot
of
weirdness
about
how
like
mru,
mfu,
split,
Works,
the
way
I
always
think
about
it
is
it
feels
like
those
are
kind
of
misnomers
and
at
least
from
how
it's
implemented,
because
it's
really
like
one
list
is
things
that
have
been
accessed
exactly
once
and
the
other
list
is
things
have
been
accessed
more
than
once
right,
so
it
kind
of
makes
sense
how
that
accomplishes
the
arc's
goal
of
being
scan
resistant
right.
A
So
if
something
is
scanned,
it's
accessed
only
once
and
then
it
can
like
fall
out
of
the
cache
faster
than
things
that
are
accessed
have
been
accessed
multiple
times
where
we're
like.
Well,
it's
been
accessed
twice
at
least
twice,
so
it's
like
more
likely
to
be
accessed
again
versus
something
that's
only
been
accessed
exactly
once,
and
maybe
that
kind
of
thought
process
will
help
with
the
when,
like
trying
to
figure
out
what
it
should
really
be
doing.
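
(A toy sketch of the rule described above, not the real arc_access() code: one access lands a block on the recently-used list, a second access promotes it to the frequently-used list, which is what gives the ARC its scan resistance.)

```c
/* Toy model only; the real ARC tracks this via states on its buffer headers. */
enum cache_list { ANON, MRU, MFU };

static void
on_access(enum cache_list *state)
{
	switch (*state) {
	case ANON:
		*state = MRU;	/* first access: "recently used" */
		break;
	case MRU:
		*state = MFU;	/* second access: "used more than once" */
		break;
	case MFU:
		break;		/* further hits: stays on MFU */
	}
}
```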
B
Well,
I
actually
read
original
paper
about
our
cash,
but
I
haven't
found
a
that
part
in
there
like
it
looks
like
it's
based
on
assumptions
that
mru
and
them
a
few
lists
shouldn't
overlap
too
much.
But
the
fact
is,
if
we
are
doing
a
lot
of
mru
hits
like
we
end
up
with
buffers
that
are
pretty
recent,
but
they
are
in
a
few
lists,
but
while
potentially
there
should
be
territory
in
Boss
because
they
are
still
recent
and
problems
that
RFP
practice
is
practically
measured
in
bytes
lengths
of
time.
B
How
long
shall
we
kept
recent
data
in
cash
with
hope
for
them
to
be
reused,
and
from
that
perspective,
when
we
are
promoting
something
to
a
few
RFP
should
remain
the
same,
because
distance
haven't
changed,
obviously
from
the
fact
of
promotion.
But
on
the
other
side
a
science
we
promoted
Arc,
but
there
are
just
less
data
ended
up
in
in
mru,
RP
should
be
reduced
and
there
is
controversy-
and
this
is
like
I
described
mechanism.
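
(A sketch of the second option Alexander weighs: reduce the MRU target when a buffer is promoted, so that arc_p keeps meaning "bytes of recent data to hold". This is the proposal under discussion, not existing code, and the target is passed in loosely here.)

```c
/*
 * Proposal sketch: when a buffer of 'size' bytes moves from MRU to MFU,
 * shrink the MRU target by that amount, since that much "recent" data
 * just left the MRU list.  Not existing OpenZFS code.
 */
static void
on_promote_mru_to_mfu(uint64_t *arc_p_target, uint64_t size)
{
	*arc_p_target = (*arc_p_target > size) ? *arc_p_target - size : 0;
}
```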
A
Yeah, I read the original paper a long time ago, and I totally forgot that that's how it was originally proposed, but yeah, that's definitely not how it was implemented, in the sense that a block can only be on one list at a time in the actual implementation.
B
That's
just
my
assumption,
because
idea
of
mru
is
just.
We
have
the
X
amount
of
data
which
are
recent,
we
that
we
may
reuse,
but
when
we
promote
from
data
from
our
mru
to
mfu,
we
remove
them
from
mreu
and
just
the
distance
if
we
still
try
to
measure
it
in
bytes
of
of
cache
size
just
by
itself.
Sequentially
read
from
disk
are
not
the
same
as
byte
as
currently
in
cash,
because
we've
removed
things
out
of
mru
list
and
after
that,
if
it
started,
do
more
reads
and
add
more
data
to
mru.
A
Yeah, it does seem like the MRU/MFU split can become very far out of balance, and it can get stuck there, and then the weird behaviors you're talking about are even more accentuated. Mm-hmm.
G
So, Matt, you mentioned what I'll call sensitivity to a scan. If data is passed over n number of times to be potentially cached, is that adjustable in any way, or is it just hard-coded?
A
I
mean
it's
not
about
it's,
it
only
cares
about.
Is
it
scanned
over
once
versus
multiple
times?
Okay,
so
there's
no.
Like
n,
you
know,
there's
no
hit
count
in
the
arc,
like
per
block
hit
count
in
the
in
the
implementation.
B
No, it's just a basic, simple fact: okay, we are now in MRU, we got another hit, so we are promoted. It doesn't care about counters. But we do have counters for every state; we keep a counter for the number of hits for every state, for L2ARC because they are L2 hits, and for the other states they are L1 hits. So we are spending some memory on that, but it's reported only through some stats interfaces, not used for any math, yeah.
A
Yeah, I think there's probably a lot of improvement that could be done here. I think it's tricky because we don't have some reference workload that we can feed into it and then measure the hit rate, and then try a different algorithm, feed in the reference workload again and measure the hit rate.
B
Well, for media data you should never get ghost hits; as a result, you should never grow MRU much, so your MRU should stay pretty small and not care about wiping out the ARC, while MFU should be sufficient to keep all your frequently used data. That's the idea, but I'm not sure. I think, as I mentioned, there could be a mess between the distribution of data and metadata that we have in addition to MRU and MFU; that's where it could be messy.
B
Think
it's
heartbroken
now
I'm
outside
I,
used
to
think
in
which
way
to
better
solve
it
either
I
think
it
could
be
tracked.
The
distribution
data
metadata
could
be
handled
through
the
same
mechanism
of
ghost
ghost
caches,
because
here
there's
or
do
we
need
to
at
least
unify
and
don't
track
separately
data
metadata
distribution
for
those
States
at
all
for
proper
eviction,
because,
right
now
it's
is
broken.
There
was
PR.
Actually
somebody
created
to
completely
remove
Distribution
on
data
metadata
in
Arc,
but
somebody
was
screaming
loud.
No,
it's
bad!
E
Looking at it quickly, the L1 buf header might actually have a bunch more stuff in it than maybe we need.
B
I'm not sure; do you ever wonder why the distance between hits which are counted as MFU is so small? It's now something like 60 milliseconds, 16... yeah, 64 milliseconds, something like that. Why so small? Because I think on a sequential read you may end up reading indirect blocks multiple times within that window. Considering some typical slower disks and a thousand block pointers per indirect block, it works out to something like more than 100 megabytes per second.
B
Yeah, if you access it quickly, it's still in MRU, and it should be fine to stay there. So I was thinking about whether a bigger value could be better. What's actually confusing is that the comment talks about 128 milliseconds, while the actual value, if I read it correctly, means 64, like 1/16 of a second, so it's not even self-consistent. So I was thinking a bigger value might be better.
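
(For reference, the arithmetic behind the inconsistency being pointed out, assuming the window is expressed as a power-of-two fraction of the clock rate; the exact define in arc.c may differ.)

```c
/* With hz = 1000 ticks per second: */
/*   hz >> 4 = 62 ticks  ->  ~62.5 ms (the "64 ms" actual value)        */
/*   hz >> 3 = 125 ticks ->  ~125 ms  (closer to the comment's 128 ms)  */
```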
E
Right, and it depends how it's measured. I know with the old scrub timing stuff, the illumos clock ticked at 100 hertz and the FreeBSD one at a thousand hertz, and it meant the tunables did very different things with the same value.
A
I think we fixed all those several years ago, but I remember that.
B
As one possible solution for evicting data that will never be demand-accessed again, we could add one more ARC state, something like "uncached" or so, not sure about the name, but the idea is just to drop into it all the buffers that should be freed after, like, one second, and create a separate reclamation thread that just runs a couple of times per second and frees everything that remains there. That's one of my ideas. An alternative idea I had is to just put them into the dbuf cache, since
B
I
expect
them
to
be
physically
contiguous
and
there
should
be
no
duplicate
memory
and
then
on
eviction
from
debuff
Cache
that
could
be
evicted
also,
but
from
the
professional
perspective,
science
doesn't
use
debuff
cache
right
now.
That
would
be
a
bit
awkward,
but
I
think
it
would
be
good
to
allow
uncachable
data
to
still
to
stay
in
debuff
Cache.
That
would
allow
like
sub
block,
reads
to
be
still
cachable,
because
right
now
it
happens.
Weird
like
if
you
have
block
128k
but
I,
read
64k
at
the
time.
A
That might be reasonable. Do you think it would be sufficient to say, if you did a sub-block read, then we'll keep it in the dbuf cache, but if you read the whole thing, then we'll kind of assume that you got everything you could need and we would not keep it in the dbuf cache?
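
(A minimal sketch of the policy Matt is asking about, with a hypothetical helper name: keep a buffer in the dbuf cache only when the read covered part of the block, on the theory that the rest may still be wanted.)

```c
/* Hypothetical policy helper; not existing OpenZFS code. */
static boolean_t
keep_in_dbuf_cache(uint64_t read_off, uint64_t read_len, uint64_t blksz)
{
	/* Sub-block read: the caller did not consume the whole block. */
	return (read_off != 0 || read_len < blksz);
}
```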
A
Cool, that's good brainstorming. Pawel, I saw you unmuted there for a sec. Did you have another topic to discuss?
C
Yes. I would love to get some people to do the final review of block cloning. I think everything is addressed. Not everything is addressed in an optimal way, but I think we can definitely move forward and basically work on some optimizations later.
C
This
is
mostly
stuff
like
cloning
across
multiple
and
across
two
different
data
sets.
When
we
have
sync
such
cloning
operation
now,
we
will
just
wait
for
transaction
group
to
to
be
synced
instead
of
using
zeal,
but
this
can
be
added
later
exploring
your
idea,
math
to
using
Zeal
claim
yeah.
B
I
I,
don't
remember,
was
it
committed
comment
that
I
was
going
to
but
I
think
full
transaction
commit
on
every
copy
request.
So
we
think
is
is
too
much
what
the
files
are
small
and
we're
just
for
file
of
100,
kilobytes
or
even
a
few
megabytes
will
commit
transaction
Group
which
will
meet
many
megabytes
of
rides
and
cash
flushes
and
quite
expensive.
Maybe
there
could
be
some
threshold
before
which
we
just
report.
We
can
do
it
or
something.
C
I
think
the
best
idea
is
to
just
do
zero
claim,
but
but
it
needs
some
work.
Oh.
B
Yeah
yeah,
obviously
I'm
just
at
this
point
like
I
understand
that
transaction
committee
is
easy
to
do,
but
it
may
be
unacceptably
slow
like
you.
Wouldn't
it
be
better.
Just
return
error
and
let
fallback
code
to
do
manual
copy
I
think
it
would
be
much
faster
than
waiting
for
commits.
C
Yeah
well
in
in
that
case,
you
can
just
simply
don't
change
the
code
in
the
VFS
to
not
allow
to
not
call
the
the
file
system,
copy
file
range
and
just
use
generic
file
range
generally
copy
file
range,
so
this
can
be
disabled
for
now,
until
the
Zeal
claim
is,
is
used
for
that.
Oh.
C
But for now the VFS prevents that, at least on FreeBSD; on Linux I think it will allow it and will call into ZFS. We could return an error then.
A
You
could
argue
like
me,
maybe
the
first
thing
that
gets
integrated
is
like
there's
a
flag
that
says
that,
let's,
like
a
tunable
that
lets
you
choose
between,
you
cannot
you
know
zero
copy
between
data
sets
or
you
can
zero
copy
between
data
sets,
but
it's,
but
you
get
a
txua
synced.
If
you
have
sync
it,
and
maybe
the
default
is
just
you
can't
zero
copy
between
data
sets
in
the
first
implementation.
A
That
way
like
it
was
less
of
a
sharp
edge
for
these
initial
users
and
then
like
once
once
we
do
the
claiming
or
some
other
solution
to
for
the
f-sync,
then
we
can
make
that
be
enabled
by
the
you
know.
Zero
copy
across
file
systems
be
enabled
by
default.
C
I
would
actually
even
opt
for
a
third
option
where
you
simply
ignore
fsyncs
for
for
cloning,
because
I
think
that
the
the
the
most
common
use
case
would
be
to
clone
large
files
and
not
small
ones.
So
in
this
case,
I
I
personally
I
would
just
do
that
on
my
tools.
B
Well, it's okay: it's a double write, it's extra space usage, but it should be so incomparably faster that it should be better. I see. I just had a thought: what about the case of cloning from a snapshot, which was mentioned as a good use case? Would this be counted as separate datasets or not? Because it would probably be safe to copy data out of a snapshot, because snapshots are stable.
B
The
only
way
for
the
space
to
be
freed
is
actually
the
Legion
of
the
snapshot
which,
which
means
a
transaction
can
meet
by
definition,
I
think
and
then
we
should
be
safe.
A
Yeah... unfortunately, I don't think that's the case, because of what he mentioned: you could do the copy_file_range from the snapshot and then lose power, then come back up and not mount the filesystem and so not replay the ZIL, then delete the snapshot, and then later on mount the filesystem, and then it replays the ZIL. The ZIL says, oh, refer to this block, which happens to be from the snapshot that was deleted. That would be bad.
C
Yeah,
thank
you.
I
can
definitely
do
some
tunable
I
can
also
look
into
Zeal
claim.
There
was
I
think
one
complication
with
Zeal
claim
it's
that
you
that
maybe
it's
not
a
huge
complication.
I
I
didn't
write
into
the
code
too
carefully,
but
there
is
some
handling
of
rewinding
the
pull.
If
you
import
the
pull-
and
you
want
to
rewind
to
some
other
transaction
group,
you
don't
want
to
replay
or
claim
some
of
the
Records.
A
Should
be
fine
right
like
if
we're
not
in
in
those
cases,
we're
we're
discarding
the
cell
so
but
like
when
you
re
rewind
to
an
old
txg,
we're
also
discarding
the
Zill
and
and
not
using
the
Zill,
and
so
the
fact
that
we
don't
claim
or
we
don't
claim
it
and
we
don't
replay
it.
So
the
fact
that
there's
entries
in
those
the
records
that
refer
to
blocks
that
we're
not
sure
if
they
really
exist
doesn't
matter
because
we're
never
going
to
play
them
right.
C
But for me the rewind operation was mostly a way to recover the pool when the pool is in some kind of corrupted state, right? So I would think that you don't want to replay any data beyond some point, because you don't really trust any data beyond this transaction group, or something like that.
A
The behavior out of the box should be correct; we shouldn't just be ignoring an fsync. And it would be nice if the behavior out of the box didn't have extreme performance sharp edges where, oh, you know, if you fsync it, most fsyncs take on the order of milliseconds but some take on the order of seconds, because you happened to have done a copy_file_range, right?
A
So
the
thing
I
was
suggesting
where
it's
like
you
know
normally
the
behavior
is,
is
that,
like
you,
you
can't
copy
file
range
across
data
sets?
Then
you
don't
have
to
worry
about.
You
know
fsyncs
of
copy
five
ranges
across
data
sets
because
they
don't
exist.
B
Think
if
we
would
set
a
tunable
to
like
one
gigabyte
and
copies
above
one
gigabyte,
we
always
do
transaction
commit
that
I
think
should
be
acceptable
and
for
anything
below
this
crosses
it
should
return
error
and
do
manual
copy.
That
would
be
okay
for
me
and
should
be
trivial
to
implement.
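
(A sketch of the threshold Alexander suggests; the tunable name and helper are made up for illustration. Small clone requests get an error so the caller falls back to an ordinary copy, while large ones accept the txg commit.)

```c
/* Hypothetical tunable and check; not existing OpenZFS code. */
static uint64_t zfs_clone_sync_min_bytes = 1ULL << 30;	/* 1 GiB */

static int
clone_check_size(uint64_t len)
{
	if (len >= zfs_clone_sync_min_bytes)
		return (0);		/* large: worth waiting for the txg */
	return (SET_ERROR(EXDEV));	/* small: let the VFS copy instead */
}
```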
C
Semantically it's still easy enough, yeah, okay. But just so you know, there is one more case where we wait for a transaction sync: it's when we try to clone blocks that were created in the same transaction group.
C
So
I
just
wait
for
a
transaction
group
to
sync
with
this
operation
we
could
also
fall
back
to
the
copy,
but
well
again
it
depends
on
the
perspective.
I
know
that
IX
have
a
lot
of
storage
available,
but
I
would
guess
there
are
some
use
cases
where
you
want
to
save
as
much
storage
as
possible.
So
no.
B
Saving
is
good,
but
but
again
what
happened
if
we
have
a
large
Z
wall
with
multiple
concurrent
operations,
some
of
one
VM
doing
a
lot
of
rides
as
a
VM.
Doing,
who
knows
what?
But
some
other
like
VMware
itself
will
try
to
do
copy.
The
employing
and
science
like
granularity
of
the
decision
is
a
full
the
wall.
It
may
end
up
sinking
them
for
every
few
megabytes,
that's
also
quite
expensive.
B
It
would
be
nice,
I
think
if
the
code
could
differentiate
it
more
fine-grained
like
whether
this
data
is
copies
actually
overlapped,
because
having
offset
dirty
is
quite
big.
One.
C
But
if
you
copy
stuff
in
parallel,
I
think
that
you
are
okay,
only
one
one
stream
would
be
frozen
for
a
bit.
B
Like, what if one process is doing a lot of writes while another process is trying to do block cloning on a completely unrelated part of the object, so they're not overlapping, not conflicting, but since the first one always creates dirty blocks... or maybe I misremember how it's implemented; I think it was that if we have a dirty record for anything in the object, practically for the whole zvol itself... Right.
C
So,
actually
for
civil,
it's
not
implemented
yet.
So
it's
not.
A
It's probably unusual, and if you hit that unusual workload, let's fall back on copying the data rather than fall back on txg_wait_synced, because the performance implications of txg_wait_synced can be pretty extreme. In my mind, at least, I'm thinking that copy_file_range is kind of advisory: there's no guarantee that it's going to save you space. Most of the time, in most circumstances, if you're using it kind of the way that we expect you to, then you're going to save a lot of space.
A
But
if
you
know
if
the
stars
didn't
align
the
dirt,
the
data
that
you're
coming
from
is
dirty
or
what
is
different
data
set
or
whatever,
then
we're
going
to
fall
back
on
some
layer
here
is
going
to
fall
back
on
just
copying.
The
data
that
that's
kind
of
my
mindset
does.
Is
that
reasonable
for
the
use
cases
that
you're
thinking
of.
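
(A userspace illustration of that advisory mindset: try copy_file_range(2) and quietly fall back to a plain read/write when the filesystem declines, e.g. EXDEV across datasets. This is generic userland code, not OpenZFS internals.)

```c
#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>

/* Copy up to 'len' bytes from src to dst; returns bytes copied or -1. */
static ssize_t
copy_chunk(int src, int dst, size_t len)
{
	char buf[64 * 1024];
	ssize_t n = copy_file_range(src, NULL, dst, NULL, len, 0);

	if (n >= 0)
		return (n);
	if (errno != EXDEV && errno != EOPNOTSUPP && errno != ENOSYS)
		return (-1);

	/* Fall back to an ordinary copy, one buffer at a time. */
	n = read(src, buf, len < sizeof (buf) ? len : sizeof (buf));
	if (n <= 0)
		return (n);
	return (write(dst, buf, (size_t)n));
}
```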
C
Who
me
yeah
yeah?
It
is
reasonable,
of
course,
actually
copy
file
range
does
have
a
place.
You
can
provide
some
Flags.
Currently
the
standard
actually
doesn't
Define
any
Flags
I
think
there
are
some
ways
on
Linux
to
or
I'm
not
sure
if,
if
there
is
but
but
I
could
imagine
a
flag
which
which
says
that
just
clone
at
all
cost,
so
don't
care
about
performance
and
the
default
Behavior
would
be
to
to
be
just
performant
and
do
whatever
is
quicker
right.
C
Yeah,
that's
that's
reasonable,
like
I
was
actually
coming
from
the
perspective
of,
but
this
was
also
different
that
my
initial
implementation
was
my
perspective,
was
to
always
clone
the
data
and
save
the
space,
but
then
I
was
also
implementing
dedicated
system
calls,
so
you
could
so
that
was
the
purpose
of
the
system
calls
right
to
to
save
the
space,
but
now
copy
file
range.
As
you
mentioned,
it
is
advisory
right.
C
So,
if
it's
this
system
call,
is
it's
not
there
for
to
guarantee
space
savings
right
it
it's
there
to
actually
speed
up
copying.
That's
the
purpose
of
this
system
call.
A
Yeah
so
then,
in
that
case,
I
mean
I
I
feel
like
getting
this
integrated
where
it
you
know
getting
this
integrated
sooner
rather
than
later,
with
the
perspective
of
like
it's
going
to
speed
up
a
lot
of
file
copies,
but
not
all
of
them
would
be
reason
would
be
a
reasonable
approach
and
then,
like
later
on,
we
can
add
more
circumstances
where
the
copies
will
be
accelerated
by
this
depending
on
you
know,
customer
needs.
C
Yeah
definitely
there
is
also
another
case
that
should
be
optimized,
where
you
do
unalign
copy
file
range.
So
for
now,
I
just
returned
an
error,
but
we
can
definitely
Implement
copying,
just
the
the
underlying
fragments
and
cloning.
The
rest.
C
The implementation is a bit more complex than that, because we probably want to do that under one vnode lock, and the generic copy_file_range, at least the one we have in FreeBSD, assumes it's not called as part of a different operation. So it will lock the vnodes on its own, so I cannot really, atomically, copy the fragments using the generic copy_file_range and clone the rest without dropping the locks.
A
Yeah, it looks like even indirectly mentioning the possibility of an early end has caused us to use the full time, but I'm glad we had this good discussion. Thanks, everyone, and we'll see you in four weeks.