From YouTube: Scrub/Resilver Performance by Saso Kiselkov
Okay, so I guess this talk is going to be a little bit about frustrations people have had, and one of the primary reasons why I work on ZFS is that I'm basically trying to solve problems that I'm encountering myself. Just a quick question:
Raise your hand if you're running a ZFS pool of substantial size, let's say 50 terabytes. Okay. Which of you are running scrub or resilver regularly? Okay, about half the hands. And which of you enjoy the experience? You guys are crazy.
Anyway, so we understand that the problem, basically, is that these operations, scrub and resilver, take a really long time, and to many people it would seem that they take a little bit too long to be in any way sensible.
So first I'm going to do a quick design recap for people who are maybe not quite familiar with the object model and the general model of ZFS. As a quick overview: basically, the way we can view ZFS at the object level is that it's a flat database.
The hierarchy of the filesystem itself is built up, as was previously discussed, using an upper layer, the ZPL, but the actual base structure of the file system is flat, and it's based around the notion of constructs that we call objects. You can think of an object as basically a flat plain file: it's just an array of blocks, identified by a number, and we sometimes group these things together.
The key thing to note here is that the DMU doesn't really care about the structure of the objects inside. It only understands enough to be able to find all the objects and read their contents, but it doesn't really care whether there is any linkage between them or any kind of structure to them. It understands only how to get to the objects, how to get to all their parts, and how to checksum them to make sure that they are all consistent.
So everything that you see in ZFS as a user is built as an object: any plain file, any directory, even symlinks, attributes on objects, and zvols. Everything is built out of these objects, each with a specific binary format inside, but basically that's what we're looking at. And so, graphically,
this is about what it looks like. It's a little bit of a simplified view, but you can view these dnodes as essentially your objects, and each of them has a specific data type.
You know that the only sensible way to design such a data structure is as a kind of tree, with lots of indirection in between. This allows us to not have a fixed size for an object, to be able to punch holes in it (have pieces of it that are missing), and to support things like our copy-on-write structure, so we are able to reallocate a file. You don't have to have a file on disk as one contiguous thing.
You can just move bits and pieces of it around, and the only way to do that sensibly is to have it as a tree of indirect blocks that refer further and further down. These indirection levels in ZFS are numbered, and they're numbered from the bottom up.
So your lowest level, level zero, is where the object's contents actually live. This is frequently referred to as user data, the actually useful data in ZFS, and anything above that, higher than level zero, is your metadata.
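To make that fan-out concrete, here is a minimal standalone C sketch; it is not from the talk, and epb (block pointers per indirect block) is an assumed parameter:

    #include <stdint.h>
    #include <stdio.h>

    /* Number of level-0 (user data) blocks covered by one block at "level". */
    static uint64_t
    blocks_covered(uint64_t level, uint64_t epb)
    {
        uint64_t span = 1;

        while (level-- > 0)
            span *= epb;    /* each indirection level fans out by epb */
        return (span);
    }

    int
    main(void)
    {
        /* e.g. a 128K indirect block holding 1024 128-byte block pointers */
        printf("%llu\n", (unsigned long long)blocks_covered(2, 1024));
        return (0);
    }

With epb = 1024, a level-1 block covers 1024 data blocks and a level-2 block covers about a million, which is why the metadata layers stay so small relative to the data they describe.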
So that's the thing: how to locate the actual contents of the objects. The important part here is the disparity in amounts of data, and this little picture does not really capture that all that well. You have to understand that the upper part, which is all the indirections, the levels of the tree that
do not contain your user data (the user data is down here), is just indirections into the lower-level blocks, and you've got to keep in mind that this is a tree that fans out very, very quickly. The upper layers are really a fairly small amount of data volume, usually only about 1% or thereabouts. Most of your blocks on disk are going to be these very bottom-level blocks.
So that's where pretty much all of your space usage lives. The DMU is also the thing that implements the ability to scrub and resilver, and fundamentally scrub and resilver do the same thing. It's pretty much the same algorithm: all they do is go through all the objects, read all the blocks, and check the checksums. Scrub and resilver do not care about the structure inside; they don't know anything about directories, symlinks, hard links, or anything.
So that's why some people who think that ZFS has a sort of built-in consistency checker ("I'll just run scrub, that'll check the file system") are mistaken. It really doesn't. Scrub only ever verifies that the stuff on disk checks out against its checksums. It doesn't care about link counts, and it doesn't care about loops in directories.
A
You
can
create
all
that
if
you
wanted
to
with
the
appropriate
editing
tools
and
so
resilvered,
just
a
variation
of
the
idea
of
scrub
in
that,
if
you
have
a
broken
drive,
we
will
simply
reconstruct
the
data
on
it
and
yeah.
So
that's
that's
the
basic
idea
behind
these
two
things
now.
The
important
thing
note
here
is
that
these
operations
are
performed
in
order
on
a
given
object.
A
So
if
you
have
a
file
that
consists
of
two
terabytes
data
in
sequence,
it'll
just
start
at
the
start
of
the
object
and
just
pretty
much
work
itself
work
its
way
forward,
and
so
the
algorithm
to
summarize
it
in
a
really
quick
way,
is
just
grab
an
object.
Read
through
all
of
its
logical
blocks.
In
sequence
and
logical
I
mean
here,
biological
I
mean
the
the
way
that
the
block
is
represented
inside
of
the
object
and
then
just
grab
the
next
one,
in
repeat,
repeat,
until
you've
passed
over
everything
on
the
file
system.
So you would jump up to a very high level, where you're basically looking at the top level of indirection, and then you recurse down, recurse down, recurse down, and once you reach the very bottom layer you issue a zio read that gets sent right away to the drives, and you keep going. You just keep following this tree structure until you've consumed everything in the object in sequence, issuing I/O as you go.
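As a rough illustration of the walk just described, here is a minimal C sketch; the blkptr type and the helpers issue_scrub_read and read_indirect are hypothetical stand-ins, not the real DMU interfaces:

    /* Hypothetical stand-ins for the real DMU interfaces. */
    typedef struct blkptr blkptr_t;
    extern void issue_scrub_read(const blkptr_t *bp);   /* zio read + checksum */
    extern const blkptr_t *read_indirect(const blkptr_t *bp, int *nptrs);

    /*
     * Depth-first walk of one object's block tree, from the top indirect
     * level down to level 0, visiting logical blocks strictly in sequence.
     */
    static void
    scan_visit(const blkptr_t *bp, int level)
    {
        if (level == 0) {
            issue_scrub_read(bp);   /* leaf: user data, read it right away */
            return;
        }
        int nptrs;
        const blkptr_t *child = read_indirect(bp, &nptrs);
        for (int i = 0; i < nptrs; i++)
            scan_visit(&child[i], level - 1);
    }

The improvement described below leaves this walk alone; what changes is what happens to the level-0 reads it produces.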
Now, this all works very well on an initial layout. This is the sort of scenario where you've just written an object for the first time. Let's say this object has seven blocks in it, written in two batches: blocks one, two, three, and at some point later blocks four, five, six, and seven. Usually what ZFS will do, if you've written chunks at around the same time, is try to write them out in sequence.
A
After
some
time
has
passed,
maybe
there's
some
a
little
bit
of
data
extras
accumulated
on
the
pool,
and
then
you
write
out
the
subsequent
portion
and
so
from
the
point
of
view
of
the
resilvered.
What
it'll
see
is
a
little
issue
reads
for
blocks,
one
through
seven,
pretty
much
in
sequence
and
the
disk.
What
else
the
the
well?
What
they
do
we'll
see
is
they'll
see
blocks
1
2
3
skip
4
5
6
&
7,
so
that
works
reasonably
well.
This
is
on
your
initial
writing.
A
When
you've
just
filled
up
your
filesystem
and
you're
gonna
be
running
scrub
already
silver.
Now
the
problem
is
what
happens
if
you've
rewritten
lots
of
data
if
you've
got
a
gun
in
so
your
typical
database
use
case
database
comes
in
modifies,
a
block
goes
away,
goes
in
again
modifies
another
block
goes
away,
and
these
things
tend
to
basically
spray
out
the
physical
layout
of
the
file
over
the
disk
and
pretty
much
random
order.
A
So
now,
when
you
scrub,
when
you
read
the
object
in
sequence,
you'll
you'll
end
up
seeing
blocks:
1
2,
3,
4,
5,
6,
7,
you'll
issue,
all
the
reason
sequence.
But
what
the
disk
will
actually
see
is
it'll
see:
okay,
I
got
a
read
block
1,
then
it'll
do
a
seek
block,
2,
C
and
so
on,
and
so
forth
and
Paul
is
just
currently
checking
the
layout
there.
It's
completely
good,
but
yeah
pretty
much
it'll
just
jump
around
a
little
bit.
A
Yeah
disks
are
able
to,
to
a
certain
degree
reorder
reads,
but
this
will
basically,
if
you
have
a
large
enough
object,
little
overpower
the
ability
of
a
disk
to
reorder
anything
and
you'll
end
up
you'll
end
up
converting
your
initial
you'll
think
that
you're
doing
sequential
reads
on
the
object.
Because
of
all
this
rewriting
what
you'll
end
up
hitting
the
drive
ass
is
random
reads.
Now
it's
obviously
a
little
bit
of
a
problem
for
performance,
so
the
improvement
here
is
obviously
to
try
and
not
do
that.
A
So
basically,
the
question
is:
how
do
we
get
I
Oh
back
into
order
and
the
way
we've
done
it?
Is
we
split
up
this
scrub
in
Ori
silver
into
two
sections?
We
first
essentially,
we
scan
the
as
much
of
the
data
set
and
these
things
are
not
they're,
not
sequential.
They
are
pretty
much
operating
in
parallel,
but
we
split
up
the
process
into
two
sections.
We
first
scan
through
and
try
and
discover
as
much
of
the
end
user
data,
which
remember,
is
99%
of
your
data.
A
We
try
and
discover
as
much
of
the
locations
the
blocks
on
disk.
Then
we
do
something
to
them
and
then
we
set
them
off
to
be
you
read
or
resilvered
or
anything.
So
in
order
to
do
that,
we
have
introduced
a
per
top-level
vida,
reordering
queue,
because
when
you're
scrubbing
or
riesling
really
don't
care
and
what
sequence
the
blocks
are
being
rebuilt
as
long
as
it's
all
done
by
the
end,
so
we
just
reorder
everything
and
the
the
iOS
are
being
the
IO
that
have
been
generated.
A
Doing
scanning
face
will
get
queued
up,
reorder
aggregated,
so
we
know
which
ones
are
close
together,
which
ones
aren't
and
then
we'll
issue
it
in
a
relatively
sensible
sequence.
So
this
is
pretty
much
how
it
looks
like
after
after
we
introduce
our
changes
here
for
the
improvements
you
come
in
from
the
top
you
again,
the
algorithm
for
scanning
is
unchanged.
So
from
the
point
of
view,
if
the
code
base,
it's
really
not
that
much
of
a
change.
A
So
this
is
a
kind
of
view
that
we
have
from
a
system.
Topology
point
of
view
we
have
a
top-level
Vida
and
each
topple
v-dub
gets
its
own
key,
because
each
top-level
Vida
is
pretty
much
the
thing
that
matters
about
disk
layout
each.
So
we
track
at
perb
top
level
Vida
and
then
reorder
for
it
for
the
particular
disk,
this
sequence.
A
So
how
are
the
queues
implemented?
The
queues
primarily
track
a
keep
track
of
two
things.
They
keep
track
of.
The
individual
reads
to
be
issued,
so
the
CIOs,
although,
as
I
said,
it's
not
the
actual
ciot
structure
and
the
reason
why
we
keep
that
around
is
because
the
CIO
is
contained,
the
block
pointers
or
they
tell
us
about
the
block
lenders
and
the
block
owners
tell
us
that
check
sums
of
the
data
that
we
got
to
check.
The second thing is extents. An extent can actually be a fairly tight collection, but it's a collection of reads that are close together and in sequence, so that we know they represent a good target of opportunity where we could go in and issue a lot of I/O in a good sequence. The sorting queue itself consists principally of three trees, and they all sort of interact with each other. We have a tree that tracks each extent, sorted by the aggregate size of the constituent I/Os.
A
So
we
know
how
large
an
extent
is
and
how
much
of
it
is
actually
filled
with
CIOs.
So
we
know
roughly
how
valuable
it
is,
and
at
the
front
of
the
tree
we
have
sort
of
the
biggest
chunk
is
targets
that
we
want
to
that.
We
might
want
to
work
with.
Then
we
keep
obviously
extensive
tracked
by
address
on
this,
so
we
know
how
to
assign
CIOs
to
them
as
they
come
in
and
the
CI.
Then,
of
course
we
keep
track
of
the
individual
CIO
senator
to
be
issued.
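In structural terms, the queue as described might look roughly like the following C sketch; the type and field names are illustrative guesses rather than the actual dsl_scan.c declarations, and avl_tree_t is the AVL tree type used throughout illumos/ZFS:

    #include <sys/avl.h>
    #include <stdint.h>

    typedef struct scan_io {            /* one queued read, not a full zio_t */
        uint64_t    sio_offset;         /* on-disk address */
        uint64_t    sio_size;
        /* plus enough of the block pointer to re-derive the checksum */
        avl_node_t  sio_addr_node;      /* tree 3: zios by address */
    } scan_io_t;

    typedef struct scan_ext {           /* a run of nearby queued reads */
        uint64_t    se_start;
        uint64_t    se_end;
        uint64_t    se_fill;            /* bytes actually covered by zios */
        avl_node_t  se_size_node;       /* tree 1: by size, weighted by fill */
        avl_node_t  se_addr_node;       /* tree 2: by address */
    } scan_ext_t;

    typedef struct scan_queue {         /* one per top-level vdev */
        avl_tree_t  sq_exts_by_size;    /* juiciest targets at the front */
        avl_tree_t  sq_exts_by_addr;    /* to place incoming zios */
        avl_tree_t  sq_zios_by_addr;    /* the individual reads */
        uint64_t    sq_mem_used;        /* charged against the memory cap */
    } scan_queue_t;

The comparator on the by-size tree is where the fill weighting described below would come in.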
A
So
this
is
sort
of
a
layout
roughly
what
it
looks
like
from
top
to
bottom,
be
extents
by
size
by
address
and
then
finally,
the
CIOs
by
address,
and
as
you
can
see
a
little
bit
over
here,
we
can
see
that
one
of
the
extents
we
call
them
scan
X
second
segment
R.
They
don't
have
to
be
necessarily
completely
filled
with
CIOs.
We
do
allow
for
a
little
bit
of
inter
zio
gap
because
drives
are
pretty
efficient.
A
That's
skipping
that,
but
if
the
gaps
are
too
large,
we
just
consider
that
to
be
a
separate,
cio,
a
separate
extent,
and
so
obviously
we
aggregate
those
into
these
larger
extent
structures,
and
then
we
sort
them
by
size
so
that
we
know
which
ones
are
the
juiciest
ones
and
the
algorithm
for
that
is
a
little
bit
more
complicated
than
just
we
sort
by
size.
We
actually
also
do.
We
do
also
consider
how
well
filled
they
are
so,
for
example,
this
guy
over
here,
even
though
he's
fairly
large,
he
has
a
few
chunks
missing
in
it.
A
So
we
could.
We
do
wake
that
a
little
bit
in
the
algorithm.
So
do
we
know
that,
for
example,
somebody
who's,
a
couple
bites
shorter,
but
completely
filled
up.
We
do
consider
that
one
to
be
more
valuable
than
the
ones
that
are
like
full
holes,
so
say
somebody
so
say
for
existen
is
EAJA
that
comes
in
so
over.
Here
we
got
a
new
CIOs
being
about
two.
That
was
a
request
to
be
queued.
You
can
see
that
it
bridges
the
gap
between
two
extents.
So
obviously,
then
we
got
a
reconstruct.
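A sketch of that insertion rule, reusing the illustrative types from the sketch above; max_gap and the ext_* helpers are hypothetical names for the behavior just described:

    extern scan_ext_t *ext_ending_within(scan_queue_t *, uint64_t, uint64_t);
    extern scan_ext_t *ext_starting_within(scan_queue_t *, uint64_t, uint64_t);
    extern void ext_merge(scan_queue_t *, scan_ext_t *, scan_ext_t *, scan_io_t *);
    extern void ext_extend(scan_queue_t *, scan_ext_t *, scan_io_t *);
    extern void ext_create(scan_queue_t *, scan_io_t *);

    /* Insert a new read: extend, merge, or start a fresh extent. */
    static void
    scan_queue_insert(scan_queue_t *q, scan_io_t *sio, uint64_t max_gap)
    {
        /* Look up neighbors in the by-address extent tree. */
        scan_ext_t *before = ext_ending_within(q, sio->sio_offset, max_gap);
        scan_ext_t *after = ext_starting_within(q,
            sio->sio_offset + sio->sio_size, max_gap);

        if (before != NULL && after != NULL)
            ext_merge(q, before, after, sio);   /* zio bridges the gap */
        else if (before != NULL)
            ext_extend(q, before, sio);         /* grow an extent forward */
        else if (after != NULL)
            ext_extend(q, after, sio);          /* grow an extent backward */
        else
            ext_create(q, sio);                 /* too far from anything */
        /* Either way, re-place it in the by-size tree: size/fill changed. */
    }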
The way it works: many of you will probably recognize that this is a classic sorting problem, in that you can sort as well as your memory allows. As long as you have a good amount of memory, you can sort anything, right? But the problem is that you usually have a lot more data than you have memory to sort it with, and because this is essentially a metadata problem, you're trying to cache metadata in RAM in order to do the sorting.
A
So
we
track
all
the
queues
and
we
do
understand
how
much
memory
they
take
up
and
we
try
to
limit
it
to
a
reasonable
value,
although
its
tunable
by
default.
We
limit
it
to
5%
of
your
physical
memory
and
if
the
queues
bureau,
just
too
large,
will
start
to
issue
your
largest
extents
at
the
front
of
the
queue
and
during
while
issuing
the
zio
reads:
we'll
actually
pause
the
scanner
part,
because
the
scanner
is
essentially
doing
random,
read
I/o
and
it
really
impacts.
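In sketch form, again with hypothetical helpers layered on the structures above, and with the default 5% cap the talk mentions:

    extern uint64_t physmem_bytes(void);
    extern void pause_scanner(void);
    extern void resume_scanner(void);
    extern void issue_extent(scan_queue_t *, scan_ext_t *);

    /* Default cap: 5% of physical memory (tunable). */
    #define SCAN_MEM_CAP    (physmem_bytes() / 20)

    static void
    scan_maybe_issue(scan_queue_t *q)
    {
        if (q->sq_mem_used <= SCAN_MEM_CAP)
            return;             /* keep scanning and queueing */

        pause_scanner();        /* its random reads would fight the issuing */
        while (q->sq_mem_used > SCAN_MEM_CAP) {
            /* Best target first: the front of the by-size tree. */
            scan_ext_t *best = avl_first(&q->sq_exts_by_size);
            issue_extent(q, best);      /* near-sequential zio reads */
        }
        resume_scanner();
    }

(As mentioned later in the Q&A, the real behavior drains somewhat below the cap before resuming, rather than stopping exactly at it.)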
A
Obviously,
the
more
memory
you
have,
the
better
it
works
or
the
more
memory
you
dedicate
to
the
thing,
the
better
it
works
and
I'll
show
that
in
a
couple
of
benchmarks
in
a
moment
now,
one
thing
that
this
really
impacts
on
is
the
ability
to
resume
after
reboot
or
you
have
a
machine
crash,
and
you
want
to
resume
your
resilvered
or
scrub.
So
what
do
you
do?
This
kind
of
design,
that's
kind
of
out
of
order
thing
really
tends
to
mess
with
the
old
way.
A
The
CFS
thought
the
the
progression
records
essentially
on
disk
and
for
scrub.
It's
really
not
that
big
of
a
problem,
but
if
you,
if
you
miss
some
some
transactions
from
your
trim,
it's
from
yuri
silver,
that's
gonna
be
a
bit
of
a
problem,
so
we
do
it
by
essentially
in
certain
periodic
intervals.
We
by
default,
have
it
set
to
about
once
an
hour
once
an
hour,
but
it's
tunable.
We
basically
just
pause
the
scanner.
A
And
then
we
can
update
the
DSL
scan
fisty
structure
on
disk
and
you'll
be
able
to
resume
from
this
point
on
for
work,
and
then
you
can
restart
the
scanner
and
keep
going
filling
up
the
queues
and
issuing
CIOs
out
of
sequence,
CIOs
reordered
and,
as
is
necessary.
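A sketch of that checkpoint cycle; the hourly default and the drain-before-write ordering are from the talk, while the function names are stand-ins:

    extern void drain_all_queues(void);     /* issue every queued zio */
    extern void write_dsl_scan_phys(void);  /* persist cursor in syncing ctx */

    /* Runs roughly once an hour by default (tunable). */
    static void
    scan_checkpoint(void)
    {
        pause_scanner();
        drain_all_queues();
        /*
         * Only now is everything up to the scan cursor truly verified,
         * so the on-disk resume record can safely be advanced.
         */
        write_dsl_scan_phys();
        resume_scanner();
    }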
The point is that scrubs and resilvers typically do take a fair amount of time. So if you lose about an hour's worth of progress on a 12-hour job, it's somewhat of a deal, but it's not that big a deal.
A
If
you
were
to
loose
all
12
hours,
that's
bigoted,
that's
kind
of
a
nasty
thing.
So
that's
why
we
we've
both
in
this
kind
of
mechanism
to
be
able
to
continue
and
in
in
the
end.
It
doesn't
take
that
much
of
a
toll
and
performance.
It's
maybe
5%
of
performance
loss
and
the
important
thing
is:
it
avoids
any
kind
of
disk
format
changes.
So
this
is
complete
a
complete
code
change.
It
doesn't
really
change
anything
on
the
disk,
so
you
could
just
install
it.
A
Try
it
out
if
you
don't
like
it,
roll
back
be
happy
with
it.
So
in
terms
of
numbers,
what
are
we
looking
at?
So
this
is
a
completely
synthetic
test,
the
best
kind
of
or
I
guess,
worst
kind
of
situation.
You
could
look
at
you
could
look
at
So this is basically a 5-drive raidz running on 2.5-inch 10K RPM 300 GB drives, and it's filled up; I basically thrashed it completely with vdbench. It's randomized to the point where the regular stock resilver was running at about 8 megabytes
a second; yeah, it would have taken about a day and a half to complete. With this improvement, the queue size parameter that you see here is basically the value that I limited the queue to: it was allowed to grow to a maximum of 262 megabytes.
The reason that's kind of a weird number is that the machine I was testing on had 256 gigs of RAM, so I just set it to one percent, or actually 0.1 percent.
Everybody else is probably going to be a little bit closer to this kind of number. This is a general-purpose file server that we have at Nexenta: about 26 terabytes of data running on ten 8-terabyte drives, and the regular resilver takes about two days, or scrub, rather.
Actually, in this case it's easier to test scrub than resilver, because you don't have to pull drives out. But yeah, this test was pretty much done to check what the queue size parameter affects.
A
So
if
you
set
it
on
a
twenty
six,
terabyte
drive,
I
think
the
general
recommendation,
by
the
way,
in
terms
of
memory
to
data
storage,
I,
think
we
recommend
about
one
was
that
0.1%,
so
per
terabyte
of
storage.
We
recommend
about
a
gig
of
ram,
so
this
thing
had
had
about
twenty
six
terabytes
of
data
only
took
at
one
point
three
gigs
of
ram.
It
was
still
three
point
four
times
faster
than
stock.
SSDs: SAS SSDs, six of them in there; I put some data on and randomized it. SSDs generally take a lot better to randomized data being stored on them, but still, if you reorder their reads into sequence and give them a little bit of leeway, you still get a bit of a performance win, even though it's not quite as much. So yeah, that's pretty much it.
If you have any questions and/or rotten tomatoes, feel free to fling them my way. Sure, yep.
Yes. No, this is all protected by a single lock. It's one data structure; the one queue is always locked as one piece, because, as you can see here, on a modification, as soon as you add a zio you've got to modify the middle part and then the top part. So all of that happens in one step, under one lock.
So the question is: how badly trashed was this data, and how un-trashed did we get it? This was completely trashed. This is a 32K block size, and I think I let vdbench run 100% random I/O on this for about a day and a half, so it's completely gonzo. And the critical part here is that the reordering does not happen all in one go.
A
It
basically
tries
to
do
as
good
a
job
it
can
with
reordering
reads
until
it
hits
the
memory
cap
and
then
it'll,
sort
of
okay.
Well,
we
just
gotta
pick
the
best
target
and
we
just
clears
out
one
of
the
the
very
largest
ones
until
it
gets
about
I,
think
about
50
megabytes,
underneath
the
limit
and
then
again
gross
and
basically
keeps
that
around
now.
The
nature
of
this
beast
is
that
such
that,
as
you
progress
further
down
as
you
get
basically
to
the
end
of
the
pool
with
the
scanner.
A
At
that
point,
you
got
to
start
again
grabbing
the
lower
value
target
ones
that
are
shorter.
So
the
final
push
to
get
the
thing
finalized
is
gonna,
be
a
lot
slower
than
the
average
up
until
that
point.
So
this
took
about
two
hours
May,
the
main,
maybe
95%
of
the
data
volume
is
gonna,
be
done
about
in
about
an
hour
and
a
half
and
the
remaining
5%.
Is
there
just
a
really
scramble
around
bits
that
you're
gonna
be
the
elevator
in
through
once
at
the
end?
That's
gonna
be
taken
about
a
half
hour.
A
So
in
the
ends
this
this
is
the
final
average
that
you're
seeing
here
the
the
initial
thing.
When
you
first
run
it,
and
then
it
hits
the
queue
size.
It
starts.
Issuing
you're,
just
gonna,
be
some
all
smiles,
because
Lola
Ron
had
pretty
much
drive,
drive
speed,
but
that's
only
because
you've
just
picked
out
the
largest
contiguous
chunks
that
you
can
find
and
you'll
start
issuing.
Those
first
sure.
A
Yeah,
so
the
question
is
whether
the
iOS
are
issued
out
of
or
inside
of,
sync
in
context,
they're
issued
out
of
sync
in
context,
so
the
waiter
works
normally
GFS,
resilvered
and
scrub.
They
issue
I/o
only
inside
the
syncing
context,
and
they
only
scan
ahead
in
in
sync
in
context.
The
way
this
works
is
once
we
hit
the
memory
cap
or
we
hit
the
the
time
limit
for
a
basically
a
checkpoint.
The reason why it's per top-level vdev is that that's where the ordering of your data becomes important on disk. Even a raidz, which seemingly scrambles things around: if you have contiguous blocks on a raidz vdev, a top-level vdev, it'll still break them apart in sequence onto the individual constituent drives. So that's why it was put there. Changing the queueing strategy at the leaf level, I'm not sure that would be easy to do.
It's pretty much a straight-up out-of-my-sleeve number. It's tunable, so it's not hard-coded. I've tested out a number of values on a number of pools, but basically it's tunable at this point. We might fix it if we determine that there's really no reason for it to be tunable, but I don't really see a need to remove that tunability. I've got to take a question from up there somewhere.
As soon as I fix one particular bug in it. It's pretty much feature-complete now; the only thing left is a little bit of a data corruption bug, where if you interrupt a resilver and then boot into an old machine, it could just cook the pool. But it's really just about finishing up that one thing; the bulk of the algorithm is done, and we could go over it tomorrow in the hackathon.
The question of how it compares to line rate on the drive is pretty much a question of how much memory you can throw at it. If you have a very large data set and you give it only a little bit of memory to implement the queues, then we cannot do as good a job of reordering the zios, because you're just going to have a whole bunch of very large extents with lots of holes in them. So obviously you're still going to get some skips.
A
The
second
question
was
the
second
question
about
again
right.
If
you
still
do
I
owe
to
the
pool,
how
much
does
it
affect
it?
It
affects
it,
obviously,
in
a
fairly
negative
manner,
in
the
same
way
to
affect
it
be
affecting
regular
scrub.
We
issue
the
CIOs
in
exactly
the
same
priority
and
queuing
status
as
regular
scrub.
So
it's
more
of
a
question
of
how
much
does
other
I/o
happening
on
the
pool
effects
scrub
in
general.
It
affects
it
to
the
point
where
you
allow
how
much
you
allow
it.
A
Basically,
in
how
you
how
you
set
up
your
priorities
at
this
point,
scrub
is
fairly
heavily
affected
because
scrub
is
a
low
priority.
Resilvered
are
in
a
much
higher
priority,
but
still
I
mean
it's.
It
is
gonna,
suffer
you're.
Gonna
have
two
concurrent
workloads.
It's
it's
not
gonna,
be
maybe
I,
don't
know,
dropped
out
of
5
percent,
but
you're
gonna
get
a
drop
if
I
don't
know,
maybe
50
percent,
because
other
areas
taking
priority
any
questions.
Sure.
Yes, that's the hope, yeah. So the question is whether this is going to require fewer IOPS than doing it conventionally. Yes, it'll require a lot fewer IOPS. It'll still do the same volume, but it'll do it in a different sequence, preferably completely in sequence; that would be best. Sure.
A
Yeah,
the
question
is
whether
this
can
be
used
to
rebalance
data
I
guess
across
ETF
members
I
was
waiting
for
when
this
year
the
obvious
BP
rewrite
question
is
gonna
come
up?
No,
it
cannot
be
used
because
scrub
does
not.
Maybe
people
imagine
that
scrub
arre
silver
when
it
discovers
a
bad
block,
it'll
just
reallocate
it
somewhere.
It
doesn't
really
cannot
move
data
around.
A
Yes,
but
but
but
keep
in
mind
that,
as
I
said
in
the
initial
part
of
the
talk,
we
don't
look
at
the
structure
of
the
data
inside
so
I,
don't
know
if
what
I'm
looking
at
is
a
bunch
of
block
pointers
or
it's
just
your
Shakespearean
novel,
so
I
cannot
move
it
around
and
if
I
try
to
move
it
around,
I
might
break
checksum
or
I
might
break
block
pointers
that
are
in
a
completely
different
object
pointing
to
this
one.
So
that's
why
we
wouldn't
want
to
do
that.
This
BP
rewrite
essentially
you're
implementing.
It's just done as part of the spa sync, in spa syncing context, and the read threads are all individual threads, one per top-level vdev. They just sit in a cv_wait or a cv_timedwait, and once they have enough data to be kicked off, either to consume the topmost extent or to start elevatoring through everything, that's when they get woken up. So there's no blockage, no obvious blockage, at least not that I'm aware of.
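A sketch of that per-vdev thread loop, assuming scan_queue_t additionally carries a kmutex_t sq_lock and a kcondvar_t sq_cv alongside the fields above; cv_timedwait and ddi_get_lbolt are the real illumos kernel primitives, the rest are stand-ins:

    #include <sys/ksynch.h>
    #include <sys/ddi.h>

    extern int hz;                          /* clock ticks per second */
    extern int enough_work(scan_queue_t *);
    extern void issue_best_extents(scan_queue_t *);

    /* One of these threads runs per top-level vdev. */
    static void
    scan_issue_thread(scan_queue_t *q)
    {
        mutex_enter(&q->sq_lock);
        for (;;) {
            /* Sleep until the scanner has queued up enough work. */
            while (!enough_work(q))
                (void) cv_timedwait(&q->sq_cv, &q->sq_lock,
                    ddi_get_lbolt() + hz);
            issue_best_extents(q);  /* drains while the scanner pauses */
        }
        /* NOTREACHED */
    }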
Yes, so the question is how we do progress reporting. There's a new value. The way the current progress reporting works, if I remember right, is that you get scan stats in your zpool configuration from the kernel. We put another value in there, which is basically the reading progress, and kept the old one as the scanning progress. There's also a modification to the zpool status printout, and you'll get two lines there.
A
It's
one
line,
I,
don't
remember
exactly
which
one
you'll
get
basically
two
printouts
there.
One
of
them
is
scanning
progress
and
the
other
one
is
reading
progress.
So
you
can
easily
end
the
time
estimated
it's
based
on
the
reading
progress.
That's
been
changed
so
that
you
don't
get
like
bogus
values
like
I'm,
remaining
zero
minutes,
0.
Second-
and
it's
still
not
done
so.
A
No
question
is:
where
do
we
store
basically
persistent
state
of
this?
How
do
we?
How
do
we
keep
track
of?
What
do
we
resume?
That's
already
being
done
by
regular
zpool
scanning
and
scrubbing
and
Riesling.
It's
part
of
a
structure
called
DSL
scan
this
t,
+
/,
/,
sync
and
contacts.
You
basically
issue
this.
A
This
kind
of
you
basically
write
up
this
write
out
this
record
that'll
tell
you
I've,
scrubbed
or
resold
ur
up
to
this
TX
G
and
then
tap
TX
g
and
that
TX
g,
it's
basically
keeping
that
kind
of
progress
nd
it
does
a
little
bit
more.
Does
dxg
and
object,
set
number
and
object
number
and
offset
within
the
object
and
so
forth,
and
once
it
gets
around
back
to
the
next
syncing
context,
it'll
pick
that
up
again
look
at
where
it
was
and
keep
going
from
there.
A
So
it
does
it
it
sort
of
stops
and
starts
the
only
chip.
This
only
changes
it
changes
it
insofar
as
we
don't
write
out
the
DSL
scan
feste
every
transaction
group
number,
but
only
once
we
are
done
with
the
checkpoint
and
we
have
cleared
out
all
the
queues.
So
we
are
certain
that
we've
gotten
up
to
this
very
point:
that's
when
we
issue
the
right
in
syncing
context,
and
then
we
keep
going
with
the
scanner.
Yes, you're basically describing the common problem called block pointer rewriting. The issue is that checksums are not stored in the blocks themselves but in the pointers to them. So whenever you rewrite a block (I've got to finish up with this last question), that means you've changed a checksum, but there's a myriad of places that can still refer to it.
So for pretty much any rewrite of a block that's already committed, you would have to scan the entire file system, find all the places that refer to it, rewrite those, and then go back again. It's basically an exponential kind of problem, and it makes file system people's heads explode.