Description
From the 2022 OpenZFS Developer Summit https://openzfs.org/wiki/OpenZFS_Developer_Summit_2022
Slides: https://drive.google.com/file/d/1bz1IGuzKdEPze4uLm8Cp6jEtogryv_u1/view?usp=sharing
Good morning, everybody. Yes, as was said, my name is Alexander Motin. I am an OS team leader at iXsystems, working on FreeBSD for many years, and on ZFS also for many years, in practice applying ZFS to FreeNAS and now TrueNAS storage appliances on both FreeBSD and Linux. Today I'm going to talk about several aspects of my work earlier this year: faster ZFS scrub, faster pool import, and adaptive speculative prefetch.
Let's start. I began looking at this earlier this year, just from trying to fix a minor bug, but I couldn't stop myself from doing some benchmarks, and I benchmarked scrub in several possible directions. The first direction was the case of a highly fragmented pool. I set up a quite beefy system with a lot of NVMe drives and filled the pool with four-kilobyte blocks. I thrashed it for a couple of hours with random rewrites, which caused pool fragmentation of about 70 percent, and then I ran CPU profiles of the two different stages. Scrub first has its scan stage, where it passes over all metadata and processes it, and second the issue stage, where it actually issues the I/Os.
Here you may see what I first found on the scan stage. You may see here the AVL-tree sorting, where for each vdev the scan process tries to order all requests in offset order to speed up execution, and on the right a lot of CPU time is spent in the B-tree, where ZFS only tries to find sequential chunks; so that's kind of a counterproductive workload.
But what hit me the most: you may see here about a quarter of all CPU time spent in memmove operations inside the B-trees. I dove deep into the trees, how they look, and started optimizing them, only later noticing that memmove for the insert operation takes incomparably more time than it takes for the remove operation, while in this workload inserts and removes are actually quite comparable in number, because the I/Os are getting aggregated.
So those should be comparable. That made me look into the trivial FreeBSD memmove code, where I found that sometimes it's better not to be too clever. In practice it appears that on modern CPUs the memmove operation is not so efficient if you try to move data in the descending direction, and just removing a bunch of code improved memmove in that case by several times. Obviously it's always workload-specific, but that was an unexpected find.
After that I got these results; you may see it got much better. Maybe, had I seen it in that state from the start, I wouldn't even have started the project, but I was already halfway there.
So what the B-tree does: it has a tree of elements, like an AVL tree, but it stores multiple elements within each tree node. The most interesting part is the leaf nodes; that's where most of the activity happens. They are four-kilobyte chunks full of elements, and if the B-tree tries to insert something in the middle of a node, it obviously has to shift all elements after that point to the right; on removal it has to shift elements to the left. So with a full leaf size of four kilobytes and an average fill of 75 percent, it means for each insert or remove operation we need to move one and a half kilobytes of memory on average. Obviously that's not great, but I found it's quite easy to also allow empty elements in front of the list.
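The leaf-shifting idea above can be sketched as follows. This is an illustrative model only, with invented names and a tiny capacity, not the actual OpenZFS zfs_btree code: keeping a gap at the front of the leaf as well as at the back lets an insert shift whichever side is shorter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 4 KB-style B-tree leaf with free space at BOTH ends. */
#define LEAF_CAP 16	/* tiny capacity for illustration */

typedef struct leaf {
	uint64_t elems[LEAF_CAP];
	size_t first;	/* index of first used slot (gap in front) */
	size_t count;	/* number of used slots */
} leaf_t;

/*
 * Insert val at logical position pos (0..count) so that
 * elems[first..first+count) stays ordered, moving the smaller half.
 */
static void
leaf_insert(leaf_t *l, size_t pos, uint64_t val)
{
	size_t left = pos;		/* elements before the insert point */
	size_t right = l->count - pos;	/* elements after it */

	assert(l->count < LEAF_CAP);
	if (l->first > 0 &&
	    (left <= right || l->first + l->count == LEAF_CAP)) {
		/* Shift the (shorter) left side into the front gap. */
		memmove(&l->elems[l->first - 1], &l->elems[l->first],
		    left * sizeof (uint64_t));
		l->first--;
	} else {
		/* Shift the right side one slot toward the back gap. */
		memmove(&l->elems[l->first + pos + 1],
		    &l->elems[l->first + pos],
		    right * sizeof (uint64_t));
	}
	l->elems[l->first + pos] = val;
	l->count++;
}
```

With a leaf that is 75 percent full, moving the shorter side roughly halves the average memmove size compared to always shifting right.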
The second part I tried to improve is the scrub process itself. Scrub actually has not one B-tree but two B-trees, used for slightly different purposes. Both B-trees were using 24-byte elements, including the start of the segment, the end of the segment, and a field with the number of bytes filled within the segment. For the first B-tree that makes sense; for the second, much less. The second is used only to track candidate segments, so we practically need only the start and some score, which is calculated based on all three values.
For the sorting, I found that the B-tree scrub was using had a quite complicated comparison function, which in practice required two 64-bit divisions for each comparison operation. That obviously is a terrible waste of time even on a modern system, not talking about some older ones where divisions are even worse. So what I was able to do is squeeze those 24 bytes into a single eight-byte value, where I put the score in the upper 8 bits, actually using only five or six of them.
That's enough because I use just an exponential scale; the score is computed once on insertion, and then elements are compared as simple 64-bit values. I squeezed the start into the lower bits, because we know that the shift is always at least 9 bits (offsets are multiples of 512), which gives us the required space. After that, you may see what happened: dramatically better results. memmove usage halved, as expected from the element-size optimization, and the comparison function got about four times cheaper because of the removed divisions. So now the time is practically spent on cache misses.
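A minimal sketch of the kind of key packing described, with an invented name and score formula; the real OpenZFS element layout differs. The score lands in the top 8 bits on a rough log2 scale, and the start offset, always a multiple of 512 since ashift is at least 9, fills the rest after a 9-bit shift.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch (not the actual OpenZFS code): squeeze a scrub
 * segment's 24-byte sort key into a single uint64_t.  The score goes
 * into the top 8 bits on an exponential (log2) scale, computed once
 * at insertion; the segment start offset fills the remaining bits.
 * Because device offsets are multiples of 512 (ashift >= 9),
 * shifting the offset right by 9 loses nothing.
 */
static inline uint64_t
seg_key(uint64_t start, uint64_t fill)
{
	uint64_t score = 0;

	while (fill > 1) {	/* log2(fill): 5-6 bits are plenty */
		fill >>= 1;
		score++;
	}
	return ((score << 56) | ((start >> 9) & ((1ULL << 56) - 1)));
}
```

Comparing two such keys is a single unsigned 64-bit compare: higher score wins, ties break by start offset.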
That's all, and the AVL tree now consumes a significant chunk of CPU time. It would be great if it could be somehow optimized, but the code was already pretty much optimal, and the biggest problem is just cache misses, of which there are plenty in an AVL tree. With my recent optimization experiments with B-trees, I think maybe we could consider moving some more hot code paths to B-trees, just to reduce the number of cache misses and uncached pointer dereferences, because a B-tree is more cache-dense: it has bigger leaves and may not require so many pointer dereferences. But again, it works best for very small elements, which I was able to achieve here by reducing them to just eight bytes. The next part I took a look at was the issue stage of scrub.
You may see a huge amount of time spent on lock contention, like 72 percent, and a closer look showed there were actually three different contentions. One contention was caused by using a shared ZIO, one pool-wide for all scrub I/Os over the whole pool, all the vdevs.
Obviously each addition and removal of a child ZIO required a lock and unlock, and it was terrible, so I just introduced one intermediate ZIO for each vdev. A very small patch, and it solved the problem immediately.
Second thing: I found that scrub calculates statistics for all blocks in a pool, which were originally accessible only from the illumos debugger; on neither FreeBSD nor Linux is there a way to use them, and they are not used anywhere.
So I did two things. I moved the statistics from the issue stage, where they required a global lock, to the scan stage, which is single-threaded anyway and doesn't require any locks. And second, I disabled them until somebody needs them and has an idea how to export as much data as they collect. They particularly include the number of blocks at each level of indirection and the sizes of blocks; it just collects a lot of data. I just haven't found a good way to represent it nicely in user space.
So if anybody has an idea: it's just disabled with a loader tunable or module parameter or whatever. And the last thing that remained, as we have in other cases, is the spa config enter/exit lock, which was already micro-optimized several times. Probably not much more can be done, except maybe we could replace it with some other primitive that is much more suitable for concurrent accesses, maybe something OS-specific.
But that's where it ended up after optimization: you may see lock contention reduced from 72 to 44 percent, and IOPS tripled at this point. Here are the total results of scrub time; it was reduced in half. In green you may see the scan stage, in red the issue stage, and yellow is actually mixed, where the scan code doesn't currently properly separate the time where it actually scans from the time where it issues. But you may see that all of them got reduced; the issue stage reduced the most, but the scan stage also got better.
That's it for the first part. The second part of my investigation was about large blocks. It's a significantly different problem, because there we don't care about IOPS, but we care about the efficiency of the process. I used the same configuration, just added a few more NVMes and changed it from stripe into mirror and RAIDZ, and I filled the pool with one-megabyte blocks and ran scrub. This is the case of a mirror, and you may see that only 24 percent of CPU time is actually spent on checksumming.
The rest is spent on memory copies, like half of the CPU time, and you may also see lock contention caused by the fact that some of the memory copies are done under a lock and in parallel, so there is a fixed amount of lock contention no matter what you do. Investigating that, I found that scrub was doing a quite weird thing. If we have a two-way mirror, each disk is read into its own buffer; that's predictable, checksums are calculated. But then, for everything that successfully passed the checksums, the data is copied into the parent buffer.
So it may be copied twice, three times, four times, whatever width of mirror you have. And then we have the ditto blocks support, which is also a mirror inside, and that copies the data one more time. So for an N-way mirror we always had N plus one memory copies. I found that's completely unneeded. I was able to share the buffers: the original buffer is shared with the ditto buffer and then with one of the mirror
buffers downstream. I tried to choose the most promising vdev, but if that one fails, returns some failure or checksum mismatch, only then is the data copied from the other vdev. Otherwise this process completely removes the memory copies, and here you may see that checksumming now takes 76 percent of CPU time. Probably not much more can be done past that point; I can only wish.
The results for RAIDZ are not as bad as they were originally for mirrors, but there is still one memory copy actually caused by the same mirror code, because ditto blocks in case of RAIDZ are still a mirror, and that is one memory copy. The second memory copy was used for RAIDZ parity verification: it took the original buffer, allocated a new one, copied the previous parity there, recalculated, compared. I just replaced that copying with a buffer swap: I allocate a new buffer, put it in place, take the other one out for the comparison, then free it. Just some trivial optimization, and here's the result.
We may still see a memory copy; here it is, hiding under a slightly different name, but that memory copy is part of the RAIDZ parity calculation, because all the parity functions are single-argument functions using an accumulator buffer. If we have a three-wide RAIDZ, we first copy the data and then do the parity with the second vdev. If we would introduce a dual-argument parity function, that could be avoided; that's like 15 percent of CPU time. It would be a nice project for somebody to touch; I haven't bothered, at least yet.
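The suggested dual-argument parity primitive can be sketched for simple XOR (RAIDZ1-style) parity; function names are invented here, and the real vectorized OpenZFS routines look quite different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void
xor_into(uint8_t *dst, const uint8_t *src, size_t n)	/* dst ^= src */
{
	for (size_t i = 0; i < n; i++)
		dst[i] ^= src[i];
}

static void
xor2_into(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
	for (size_t i = 0; i < n; i++)	/* dst = a ^ b, no copy needed */
		dst[i] = a[i] ^ b[i];
}

/* Accumulator style, as described in the talk: memcpy plus XOR passes. */
static void
parity_copy_then_xor(uint8_t *p, const uint8_t **cols, int ncols, size_t n)
{
	memcpy(p, cols[0], n);
	for (int c = 1; c < ncols; c++)
		xor_into(p, cols[c], n);
}

/* Proposed style: the first pass consumes two columns at once,
 * eliminating the initial memcpy (requires ncols >= 2). */
static void
parity_two_arg(uint8_t *p, const uint8_t **cols, int ncols, size_t n)
{
	xor2_into(p, cols[0], cols[1], n);
	for (int c = 2; c < ncols; c++)
		xor_into(p, cols[c], n);
}
```

Both produce identical parity; the second version simply touches each byte of the first column once instead of twice.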
The results got dramatically improved. You may see that scrub time reduced a lot and CPU usage dropped a lot, especially in the case of mirror. This bandwidth is a total summed over all the devices; we increased it from 20 gigabytes to 30 gigabytes per second. I also measured memory bandwidth from hardware performance counters. You may see that in the case of mirror it even dropped, from 373 to 109 gigabytes per second. That is still a lot, I know; it means we consume several times more memory bandwidth than we have data bandwidth.
We should investigate what actually happens in the case of data reads and writes, how many times we hit memory there, but at least in the case of scrub I was able to reduce it in half.
That should help a lot when we go to faster systems, faster pools, where we may not have as much memory bandwidth as on this system with 12 memory channels; there are a lot of systems with one, two, four, whatever. So that's the result for scrub. The next project I had earlier this year is pool import time. For our TrueNAS appliances we need high-availability solutions, where we need to guarantee failover in case of a controller fault, or just a routine update, preferably within like 30 seconds.
A bit more than 30 seconds is generally acceptable, but we also need to detect that the controller failed, we need to do SCSI reservations, we need to restart services, networking, reconnect everything; so the pool import itself must be as fast as possible.
Obviously what we had was not acceptable even for non-HA, and for HA it's not even close, not anywhere. Even if we take an SSD pool, import still takes five to ten minutes, also not even close, but there we can't even blame the SSDs; it's not a problem of the SSDs. I went investigating and found the biggest problem was the log spacemap replay. I'm not going to dive too deep.
There was a presentation a few years ago about the log spacemap, but to put it briefly, the idea of the log spacemap is to avoid per-metaslab spacemap updates on every transaction group, because each disk has several hundred metaslabs, and so spacemaps. If we have a thousand disks, multiplied out that's 200,000 or 300,000 metaslabs, and if the pool is written randomly, each of them updates, which creates a lot of traffic. So instead ZFS first writes a single sequential pool-wide log, and then flushes the updates out to the distributed spacemaps later, to reduce the number of IOPS.
Originally the growth of the log was handled by limiting the number of blocks in it. It's limited in two ways. First, it's limited to four blocks per metaslab in the pool, a value selected to guarantee acceptable space efficiency for the log records, so that we have enough data to fill whole log blocks rather than writing into the metaslab spacemaps directly. And second, it's limited to 256,000 blocks, which bounds the maximum import time; there the assumption was that we should import within 10 minutes.
That's exactly the same 10 minutes I mentioned for the SSD pool import; it's practically hard-coded into ZFS, into its default tuning. Maybe it was fine for somebody, but not for us. So I went investigating and found two problems. First, log replay is inherently sequential: all the records have to be processed one after another, sorted, put into B-trees in memory, and only one CPU can do that.
Well, that's one side; the log blocks also have to be read sequentially and processed sequentially, so consider the case of hard disks. In the worst case it may happen that 256K blocks mean 256K transaction groups, and each transaction group's log is a separate object, which means the speculative prefetcher of ZFS can't do anything; these are practically objects of a single block, and there is nothing to prefetch ahead. And if we just divide 256K by a disk's seek rate, it will be 40 minutes by itself, just for the reads. So what have I done?
I made the import prefetch the logs of up to 16 transaction groups at once, so that it always becomes CPU-bound, for hard disks the same as for SSD pools. It reduced import time to like 5 to 10-15 minutes, somewhere there, but then it's just single-core bound, and it's not a problem of the hard disks anymore. Here you may see the CPU profile I got. Originally the code could process about 300 blocks per second, in some benchmarks 400; it probably depends on how fragmented the pool is and on the specific workload.
Either way it means like 10 to 15 minutes of processing time, and most of the time is again memmove and the B-trees, old friends. Just after the B-tree optimization from the previous part, you may see memmove reduced dramatically, and there are no big issues left; there is obviously still the comparison, but in this case the comparison function is trivial, so not much can be done. It improved the block rate almost twofold and reduced import to maybe five minutes. Better, but still not great.
We need to reduce the log size. As I said, the original design was to achieve the best possible space efficiency, but if best possible means five-minute pool import times, it's not acceptable, so I tried to reduce it further, to make it not the best possible but an acceptable, efficient value. The most prominent example of why it is needed: consider the pool of a thousand disks which I mentioned. If we start writing to it sequentially at a slow speed, without rewrites, it will fill one metaslab after another.
Well, maybe several at a time, but it's pretty sequential, and most of the metaslabs during this process are no longer modified after they are already full; they are complete and they are done after that point. It would be good to flush those metaslabs and not touch them anymore, ever, forget about them, but the current code doesn't do that.
So that's what I was trying to do. Instead of scaling the limit on the number of log blocks by the total number of metaslabs in the pool, I'm practically using the number of unflushed metaslabs, which scales with how dirty the pool is, how active it is. If we are starting to write a pool from scratch, from empty, there are no dirty metaslabs, and ZFS will actively start flushing them.
It will keep some minimum, like a thousand unflushed metaslabs, just to not overdo the flushing, but as we keep writing more and more, it will start flushing more and more and will keep up over time; the log will not grow. But there is one more scenario where this still may backfire. If after that we delete a lot of objects randomly from the pool, it creates a lot of holes through all the pool; that is actually why the log spacemap was implemented.
With only this limitation, the pool will see that there are a lot of unflushed metaslabs and will start flushing them slowly, one after another, but it will consider it a normal situation and will not try to shrink the log actively. So I introduced another limitation: each metaslab must be flushed at least once every thousand transaction groups.
It means that after a massive deletion we'll get a lot of metaslabs dirty, and then ZFS will immediately start flushing them quite aggressively. If there is no further massive deletion, after like a few hundred or a thousand transaction groups it will get cleaned out, stable and quiet again. If the pool is really fragmented and a lot of random operations go on and on and on, it will still keep all metaslabs dirty or unflushed and will constantly try to flush them.
But again, each of the metaslabs will only be flushed once per thousand transaction groups, so about 500 on average. It's better than it was before the implementation of the log spacemap, and at least it now has some constraint, so that on pool import we would never have to replay more than a thousand transaction groups of log. It will never grow to 256 thousand, so it will stay compact and try to adapt to the workload.
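The two limits described above might be sketched like this; all constants and names here are illustrative, not the actual OpenZFS tunables. The log block budget scales with the number of unflushed metaslabs, i.e. with how dirty the pool is, and every metaslab is force-flushed at least once per thousand transaction groups regardless.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCKS_PER_UNFLUSHED_MS	4	/* space-efficiency factor */
#define MIN_UNFLUSHED_KEPT	1000	/* keep some unflushed, don't overdo */
#define MAX_UNFLUSHED_TXG_AGE	1000	/* force a flush at least this often */

/* Log block budget scales with unflushed metaslabs, not the total count. */
static uint64_t
log_block_limit(uint64_t unflushed_metaslabs)
{
	uint64_t ms = unflushed_metaslabs;

	if (ms < MIN_UNFLUSHED_KEPT)
		ms = MIN_UNFLUSHED_KEPT;
	return (ms * BLOCKS_PER_UNFLUSHED_MS);
}

/* The second limitation: age out metaslabs that stayed unflushed too long. */
static int
metaslab_must_flush(uint64_t current_txg, uint64_t last_flushed_txg)
{
	return (current_txg - last_flushed_txg >= MAX_UNFLUSHED_TXG_AGE);
}
```

An idle or sequentially-written pool keeps the budget near its floor, so replay on import stays short; only a pool that is genuinely dirty everywhere earns a larger log.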
Just an idea, if somebody wishes to find a project to work on: I was thinking maybe in some cases, when we see that some metaslab was previously flushed but within one transaction group it received a lot of updates for some reason, then instead of writing into the pool-wide log we could write directly to the metaslab spacemap itself, so that we would avoid the double copy; right now it is a double copy.
And the second offender I found during pool import is that import tries to scrub the last three transaction groups written before the crash or export or whatever, and that may mean like a dozen gigabytes of traffic or even more, depending on the settings and the activity of the pool. And it may take time, because we have nothing cached at all; the metadata has to be traversed, sometimes sequentially, and it takes a lot of time.
Looking through the code, I found that errors during that scrub of data do not affect the import process; it's just like a regular scrub, where ZFS tries to recover them, but if it fails, it just fails. So what I've done: I've disabled the scrub of data during import, which significantly reduced the amount of data that needs to be scrubbed. The one exception that still remains significant is the case of dedup.
We have to practically scrub all the dedup table, because within three transaction groups it's quite likely all of it has been updated, or a significant part of it, and it's all in four-kilobyte blocks, a huge number of random read operations. That's not great, so I was thinking, maybe one more small project: maybe we could reduce those three transaction groups to something smaller, because right now the number three comes from the number of transaction groups
for which ZFS keeps the previous data before freeing space. But those two values, I don't believe, are related in any reasonable way; it's pretty arbitrary and makes no sense. So it would be good if somebody has ideas why we need to replay, to scrub, more than one transaction group.
That number doesn't mean anything to me, so I think some investigation could be done there. With all those optimizations, we measured up to a 95 percent reduction of pool import time: from the 45-minute worst case we got to like one minute, plus or minus, depending on the situation, which is incomparably better. I've heard some responses from other people; they were happy.
The last topic is the adaptive speculative prefetcher. For many years our prefetcher has analyzed up to eight streams: it tries to detect up to eight sequential read or write streams, and keeps the detected streams for up to two seconds. The problem appears if we try to mix sequential and random workloads into the same object, most prominently for zvols. In the case of zvols we have, say, an iSCSI target on top; one process reads sequential data, while another process reads data randomly through the pool.
In the end we get no prefetch at all. The problem is that all the random accesses immediately fill all the eight streams, and after that the prefetcher is blocked for the next two seconds; nothing goes on. So what I have done: streams that never saw a second hit, that were just a random read at some point, those can be reused by later accesses immediately.
They are reclaimed in order of aging, the oldest unused first; and streams that had some hits are kept for the two seconds, to still benefit from prefetch and not be wiped out by random accesses, and only then can they be reused, in order of arrival.
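The stream-reclaim policy might look roughly like this sketch; the structure and field names are invented for illustration, not the actual dmu_zfetch code. A stream that never saw a second (sequential) hit is just a random read and may be recycled immediately, oldest first; a stream with hits is protected for the two-second lifetime.

```c
#include <assert.h>
#include <stdint.h>

#define NSTREAMS	8
#define STREAM_TTL	2	/* seconds a hit-bearing stream is kept */

typedef struct stream {
	uint64_t last_access;	/* seconds */
	uint64_t hits;		/* sequential hits after creation */
	int in_use;
} stream_t;

/*
 * Return the index of a stream we may recycle for a new access at time
 * `now`, or -1 if every stream is still protected.
 */
static int
stream_reclaim(stream_t *s, int n, uint64_t now)
{
	int best = -1;

	for (int i = 0; i < n; i++) {
		if (!s[i].in_use)
			return (i);	/* free slot: take it */
		if (s[i].hits == 0) {
			/* Never-hit (random) stream: reclaimable now,
			 * prefer the oldest one. */
			if (best == -1 ||
			    s[i].last_access < s[best].last_access)
				best = i;
		}
	}
	if (best != -1)
		return (best);
	for (int i = 0; i < n; i++) {
		/* Streams with hits only expire after the TTL, oldest first. */
		if (now - s[i].last_access >= STREAM_TTL &&
		    (best == -1 || s[i].last_access < s[best].last_access))
			best = i;
	}
	return (best);
}
```

Under a mixed workload this keeps random reads recycling the same few slots while the genuinely sequential streams survive and keep prefetching.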
That just makes me wonder why we have a limit of eight streams; shouldn't it be increased for large zvols? Maybe it's not so much needed for files, but for zvols it could be; it's a very old limit. Maybe the streams could be allocated as an array instead of a list of elements, to reduce the number of pointer dereferences. Another project for somebody to work on.
It's a pretty small and compact chunk of code, not invasive. And here is a simple benchmark I ran: random I/O in a few streams plus a strided workload, reading two megabytes out of each 100-megabyte chunk, a hundred here, a hundred there. You may see that while the random IOPS haven't changed, the strided throughput improved by five times, just because previously there were absolutely no prefetcher hits, and after this change there are. Maybe the algorithm could be improved further; probably it can be, but it's still much better than it was. The second part I addressed is prefetch depth.
Previously we had a default that limited prefetch to up to eight megabytes: prefetch started from the I/O size and doubled on every successful hit, so after like 16 reads prefetch reached the maximum of 8 megabytes and stopped there. Well, it just continued to keep that eight-megabyte advance region. I found from my tests that for an NVMe pool, no matter how wide it is, we barely ever need a prefetch of more than four megabytes, just because NVMes are so fast and low-latency that a single reader thread is unable to consume more data.
Anyway, it's limited by just the memory operations, so we don't need more than four megabytes. But if we use a hard disk pool, even 64 megabytes is not the limit; the more you increase the prefetch depth, the more it improves. I'm thinking that at some point it should be investigated how sequentially we are actually writing the data. I have a strong suspicion that our space allocator on a system with many cores reorders the data in the pool quite a lot; maybe that's why we benefit so much from prefetch on hard disks.
Maybe it should be improved from the other side, but still, that's where we are, and the prefetch does really help for such pools. What I have done is split the growth into two stages. Up to the first four megabytes (one more new tunable) the prefetch distance grows exponentially, the same as before; at that point it stops and grows further only when it's needed: it grows by one eighth every time a prefetch for a new read didn't complete in time.
So if we have a pool which is faster than our consumer, the prefetch distance only grows to the point that is sufficient to satisfy the bandwidth, to cover the latency at that bandwidth, and then it stops growing. This allows avoiding extra prefetches that are not needed in the case of strided access, where the consumer doesn't need the data and we would drop it. It shows good results, and I was able to increase the maximum from eight megabytes to 64 megabytes without dramatically bad consequences in the amount of extra read data.
One downside of this algorithm is that if we have a too-slow pool, for example some USB stick which can't handle more than one request at a time anyway, it will pretty quickly reach the maximum prefetch distance of 64 megabytes. If we could set more, it would reach more; whatever we set, it will reach, but it makes no sense. So, one more project.
Even for 64 megabytes the device must be really slow to reach that point, but maybe we could limit it by latency, I don't know, 100 milliseconds or something, or maybe we could implement some fancier logic: if none of the prefetch requests were sent to disk immediately but were queued, then it makes no sense to increase the prefetch depth further, or something like that. Ideas are welcome.
Obviously there is space for improvement, projects for somebody to play with. Sorry I've been so fast, but the time was constrained, and I'll be open for any questions during the day, to discuss those topics or any others. Thanks.
No, I'm not sure how it would help. Obviously we have the flag, but right now the prefetcher doesn't use it for anything. We may have a non-rotational USB stick, or we may have non-rotational NVMe storage, or...
Yes, the question is what to use as the threshold. It would be easy if we could analyze the I/O and see what's the average, and if we go beyond the average we automatically understand that we are no longer keeping up with the prefetch. That's why I was thinking about another algorithm, like analyzing the queue: if what we put on the queue is immediately executed, then we are benefiting from prefetch. But there appear to be two cases, when we have different types of vdevs.
On measuring memory bandwidth: there are tools, at least for FreeBSD, from Intel, in ports: PCM, intel-pcm, something like that. I bet there should be the same for other platforms. It's just a set of convenient tools; on FreeBSD they're using the same performance counters as are used for profiling. All modern CPUs collect a bunch of microarchitectural things, and Intel nicely wraps that into tools that collect per-channel bandwidth, per-socket QPI bandwidth, all the NUMA effects, power consumption; they collect a lot of different information.