From YouTube: OpenZFS Developer Summit Part 6
http://www.beginningwithi.com/2013/11/18/openzfs-developer-summit/
Performance on full/fragmented pools (George Wilson)
So our customer base is very well known for having imbalanced pools. This is the case where you start off with your pool configuration — you have, say, four LUNs — and you start loading a bunch of data, and then sure enough: I'm out of space, I need to add more space. So now I have four LUNs that are full and four new ones. I have this big disparity between LUNs that are full and LUNs that are completely empty, and performance can nosedive from there.
That's a problem our customers face a lot, just because they're always adding more and more data. Somebody with a storage appliance may start off with massive capacity — lots of terabytes, or a petabyte of storage — and maybe they don't see the issue because they've had it all at once. Our customers don't do that; they tend to add it piecemeal.
If you're, you know, 39% full and the other device is at 38.75% full, then it's fine, because new writes kind of bring them back into balance and they're both at 39%. But when one is at 90% and the other one is at zero, you can't get there and balance them out fast enough. So we came up with this thing, which is in OpenZFS.
I don't think we've really talked very much about it, but it's called zfs_mg_noalloc_threshold, and the whole idea behind it is that it sets a threshold of free space that you want to maintain on your device. As long as there's that much free space on the device, the device is eligible for allocations; if the free space on that specific device drops below the threshold, then it's no longer considered eligible.
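The shape of the check is simple. Here is a minimal C sketch of the idea, assuming a percentage-based comparison — the helper name and layout are illustrative, not the actual OpenZFS metaslab.c code:

    #include <stdint.h>

    /* Tunable from the talk: percent of free space to preserve per device. */
    int zfs_mg_noalloc_threshold = 0;

    /* Hypothetical helper: is this device (metaslab group) allocatable? */
    static int
    mg_allocatable(uint64_t free_space, uint64_t total_space)
    {
            uint64_t free_pct = free_space * 100 / total_space;

            /*
             * A device stays eligible for normal allocations only while
             * its free space is above the threshold; once it drops below,
             * the allocator skips it so emptier devices fill up first.
             */
            return (free_pct > zfs_mg_noalloc_threshold);
    }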
What we did was make this change with zfs_mg_noalloc_threshold so that, again, I can fill the pool up, and once I get to a certain point and add my additional disks, this time when I actually start to write, I write much, much less to the full devices. The writes that we still allow to go to those devices are the minimum size that you can accept — 512 bytes. That's what you get on those devices, simply because if we fail a 512-byte allocation, we can't gang it.
A failed 512-byte allocation turns into an error, so we at least allow that size to go through. But then we write the big data onto the other devices, and we keep writing until they all come up to the same threshold. Once that happens, you're allowed to write the same amount of data to all devices — you start striping across them all once again. That was a change we made because we needed to force things to get to these new devices much, much faster.
The old behavior just was not aggressive enough. You would see instances where — with no RAID-Z or mirrors, just simple stripes — we have a stripe width that effectively says you do, I think, 512 meg per device before you round-robin to the next, and it was like 400 to one and 600 to the other. It wasn't balancing.
So in this case, the small writes are metadata writes that happen to be 512 bytes, which I can't fail. If you've spent any time in the metaslab allocation code: if you try to write the smallest unit that ZFS allows — 512 bytes — and that ever fails, then it's considered an I/O error, and that causes the entire pipeline to stall. So if you ever try to write 512 bytes to that device, it had better succeed, because we can't gang it.
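Gang blocks are ZFS's fallback when a full-size allocation fails: the block gets split into smaller pieces tracked through a gang header. A 512-byte block is already the minimum block size, so there is nothing smaller to split it into. A rough C sketch of that reasoning, with hypothetical stand-ins for the real allocator entry points:

    #include <errno.h>
    #include <stdint.h>

    #define SPA_MINBLOCKSIZE 512    /* smallest block ZFS will allocate */

    /* Hypothetical stand-ins for the real allocation routines. */
    static int try_allocate(uint64_t size) { (void)size; return (-1); }
    static int allocate_as_gang(uint64_t size) { (void)size; return (0); }

    static int
    alloc_or_gang(uint64_t size)
    {
            if (try_allocate(size) == 0)
                    return (0);             /* plain allocation succeeded */

            if (size > SPA_MINBLOCKSIZE)
                    return (allocate_as_gang(size));    /* split it up */

            /*
             * A minimum-size allocation cannot be ganged: there is no
             * smaller piece to fall back to, so the failure surfaces as
             * an I/O error and stalls the pipeline. That is why even a
             * full device must keep accepting 512-byte writes.
             */
            return (EIO);
    }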
B: So it sounds like this is an ease-of-implementation issue, rather than something you intentionally made because allocating 512 bytes there is the right thing to do? It's kind of an artifact of wanting to implement it in an easy, straightforward way.
A: It is a bit of a crutch, yes. Again — as Adam kind of alluded to earlier, somebody was asking questions like: how do we do these performance changes, and how do we roll them out?
By default on OpenZFS it's there as a tunable; we've enabled it on some of our customers so that we can see how it behaves and how it performs. What we're learning from all this is that as we run across these different scenarios and discover something that's effective, we can find ways to actually leverage it and enhance it — and I'll get to that.
For us, when you're in this situation, read performance is typically fine: most of your data is actually on the full devices because it hasn't been striped across, so for reading it's not that big of a deal. Where you're losing is write bandwidth.
With the new enhancement, you're writing most of your data to the new devices, so instead of getting four spindles of bandwidth you're technically getting two. But for our customer base — and I would assume anybody who's been in this situation, where they start off with some disks and keep adding more — they were already getting two spindles' worth of bandwidth, and now they're still getting two spindles' worth of bandwidth.
The other thing that we've been working on is trying to understand free space. Again, Adam asked the question about 30,000 segments — was that a large number?
We never looked at the number of segments that actually made up a space map. Everybody was happy, everybody was running fast, and nobody really complained about it. Well, you look at the number of segments and that doesn't really tell you very much, because you don't know how the free space is laid out. You could look at one metaslab and it may have 30,000 segments, or it may have 250,000 segments, and it has, you know, four gigs of free space.
What does that mean? I have no idea. I have a lot of segments on one, fewer segments on the other — but still a high count — and the same amount of free space. So the thing we wanted to do was create a histogram and start storing it on the pools. There's a new feature flag out there that you enable, and it upgrades your space map objects to start storing histograms.
These are just power-of-two buckets of how the free space is comprised on disk, so for the first time it gives us a view of how that free space is made up. Now we can figure out that those four gigs might be nothing but a bunch of 2K segments, and that might explain why we have 250,000 segments: they're all really, really small, and performance is going to be really, really bad on that device.
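Mechanically, a power-of-two histogram just counts each free segment in the bucket of floor(log2(size)). A minimal sketch of that idea — the bucket layout and names are illustrative, not the on-disk space map format:

    #include <stdint.h>

    #define HIST_BUCKETS 64

    /*
     * Count a free segment in the power-of-two bucket for its size:
     * bucket 9 covers 512..1023 bytes, bucket 10 covers 1K..2K-1,
     * and so on up the scale.
     */
    static void
    histogram_add(uint64_t hist[HIST_BUCKETS], uint64_t seg_size)
    {
            int bucket = 0;

            while ((seg_size >>= 1) != 0)
                    bucket++;               /* floor(log2(size)) */
            hist[bucket]++;
    }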
The other thing — today, what's available is that you can get histograms for metaslabs (which is misspelled on the slide).
We've now enhanced this to have it for devices, so you get histograms at the device level, and you're also going to be able to pull this at the pool level. We've been spending a lot of time adding more observability into the way that free space is laid out, so we can have a better understanding of how to deal with fragmentation — especially how you allocate in a highly fragmented pool. The other thing we added was a new block allocator, and again, this is not enabled by default.
It keeps track of the largest segment on a metaslab, simply consumes everything within that segment, and then goes to find another one. The current allocator, which has been shipping for a while, actually has two modes. It has a mode where it tries to find everything based off offset — it's trying to keep things in offset order. So if you last allocated at offset 1000, it's looking for allocations that are close to 1000, but slightly greater, that will satisfy the allocation. Once you get to the point where the space in that metaslab is mostly consumed, it starts going into a best-fit mode, where everything is going to be random. So it has this two-mode property. The new allocator does not: it always goes to the largest segment and consumes that segment in its entirety before moving on. This may help some workloads out there.
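A minimal sketch of the largest-segment behavior, assuming a flat array of free segments and a cursor — the real allocators walk the metaslab's range trees, so these names and structures are illustrative:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct seg {
            uint64_t start;         /* offset of the free segment */
            uint64_t size;          /* bytes remaining in it */
    } seg_t;

    static seg_t *cursor;           /* segment currently being consumed */

    /*
     * Pick the largest free segment once, then carve every allocation
     * out of it sequentially until it is exhausted; only then hunt for
     * the next-largest one. One behavior regardless of how full the
     * metaslab is — no offset-order mode, no best-fit fallback.
     */
    static uint64_t
    alloc_from_largest(seg_t *segs, size_t nsegs, uint64_t size)
    {
            if (cursor == NULL || cursor->size < size) {
                    cursor = NULL;
                    for (size_t i = 0; i < nsegs; i++)
                            if (cursor == NULL || segs[i].size > cursor->size)
                                    cursor = &segs[i];
            }
            if (cursor == NULL || cursor->size < size)
                    return (UINT64_MAX);    /* nothing big enough */

            uint64_t off = cursor->start;
            cursor->start += size;          /* consume from the front */
            cursor->size -= size;
            return (off);
    }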
We're also changing the way that we handle metaslab selection. How many people are familiar with the way that ZFS does allocations on disk? I'll just briefly go into it.
There are three different phases to an allocation when you're trying to allocate a block. The first thing it does is determine which device to allocate from — or a metaslab group; for those of you familiar with metaslab groups, you can think of them as devices. So it's first going to pick a device; then, within that device, it's going to look at any one of its 200 metaslabs and pick the best metaslab to allocate from; then, within that metaslab, it's going to find the best region to actually satisfy the allocation. So it's device, metaslab, block, effectively — those are the three layers that it goes through.
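A compressed C sketch of those three layers — the types and selector functions are hypothetical stand-ins with bodies elided, not the real metaslab_alloc() call chain:

    #include <stdint.h>

    typedef struct pool pool_t;             /* opaque stand-ins */
    typedef struct device device_t;
    typedef struct metaslab metaslab_t;

    /* Hypothetical selectors, one per layer; bodies elided. */
    extern device_t   *pick_device(pool_t *, uint64_t);
    extern metaslab_t *pick_best_metaslab(device_t *, uint64_t);
    extern uint64_t    pick_region(metaslab_t *, uint64_t);

    /*
     * The three phases of a block allocation:
     *   1. pick a device (metaslab group),
     *   2. pick the best of that device's ~200 metaslabs,
     *   3. pick a region inside the chosen metaslab.
     */
    uint64_t
    allocate_block(pool_t *pool, uint64_t size)
    {
            device_t   *dev = pick_device(pool, size);        /* 1 */
            metaslab_t *ms  = pick_best_metaslab(dev, size);  /* 2 */
            return (pick_region(ms, size));                   /* 3 */
    }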
A
The
problem
that
we
had
is
that
the
meta
slabs,
the
way
they're
ordered
only
took
into
account
free
space
and
again,
you
know,
we've
just
talked
about
how
we
didn't
know
what
free
space
you
know
how
free
space
was
comprised
of
that
metaslab,
because
all
we
knew
was
a
number
well
now
we
actually
have
some
histogram
information.
We
have
the
way
that
the
meta
slabs
are
how
the
free
space
is
broken
up.
A
It
gives
us
the
advantage
to
actually
come
up
with
a
new
weighting
mechanism,
so
we've
come
up
with
a
metric
to
measure
fragmentation
at
a
metaslab
level,
and
we
can
now
use
that
as
a
weighting
mechanism
to
weight
metaslabs
that
are
highly
fragmented
down
weight,
metaslabs
that
are
less
fragmented
up.
So
this
gives
us
the
ability
to
actually
pick
the
best
metaslab,
truly
the
best
metas
lab
to
go
do
allocations.
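A minimal sketch of one way such a weighting could look — scaling free space down by the fragmentation percentage. This is purely to illustrate the down-weighting; the actual weight calculation in OpenZFS differs:

    #include <stdint.h>

    /*
     * Weight a metaslab by its free space, scaled down by its
     * fragmentation percentage (0 = pristine, 100 = fully fragmented),
     * so a badly fragmented metaslab sorts below a cleaner one with
     * the same amount of free space.
     */
    static uint64_t
    metaslab_weight(uint64_t free_space, int frag_pct)
    {
            return (free_space * (uint64_t)(100 - frag_pct) / 100);
    }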
Well, if the demand isn't there, then maybe we shouldn't pick the best one, because all we're going to do is Swiss-cheese it up. Let's wait and save the best one for when there's actually a lot of work to be done. That isn't implemented yet — we've been kicking it around — but it's an idea we've had of trying to take everything we've discovered so far: we now know how much workload is coming in, based on the write-throttle
work that's been done, so we can keep track of how much dirty data there is to be written. If the amount of dirty data to be written is actually very low, then maybe we don't try as hard to go find a really pristine, awesome metaslab. Maybe we just say: okay, you're going to get this one that's mostly fragmented, because we can waste a few cycles there — simply because nobody is really requesting a lot of work from us. We think that could be a win for us.
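This tie-in between the write throttle's dirty-data accounting and allocation effort is, per the talk, an idea rather than implemented code. A sketch of the shape it could take, with an invented threshold:

    #include <stdint.h>

    typedef enum { EFFORT_LOW, EFFORT_HIGH } effort_t;

    /*
     * Scale how hard the allocator hunts for a pristine metaslab by
     * how much dirty data is pending. The 10% cutoff is invented for
     * illustration.
     */
    static effort_t
    allocation_effort(uint64_t dirty_bytes, uint64_t dirty_max)
    {
            if (dirty_bytes < dirty_max / 10)
                    return (EFFORT_LOW);    /* idle: any metaslab will do */
            return (EFFORT_HIGH);           /* busy: find a clean one */
    }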
D: Just another point on that: we can actually use that idleness in the system — just to draw a finer point on it — to optimize our space maps. Knowing that, rather than having 30,000 segments, if we can go find exactly the right place to drop that puzzle piece, then we only have 29,999 segments. If we do that enough, we can actually make it a little more manageable.
So slowly we're trying to take the information that we're learning and be much more intelligent about how we pick the right block, pick the right metaslab — and, you'll see, even pick the right device. The other thing we've done is add preloading of metaslabs.
What we've learned is that any time you actually have to load a metaslab while you're in the allocation code path — so you're in write context — it's very, very painful. So we now figure out which metaslabs on each device are the best ones that we anticipate we'd need to load, and we go load N of those every time we finish a transaction, so they're already loaded in memory and in hand when the next transaction comes through.
When it goes to do allocations, everything's already there; we don't have to spend I/O cycles to go read them and load them in. Hopefully that's enough. That's another area for future work, I think, where we can determine what N is based on the information that we're getting about incoming dirty data: if we know we're getting a gig's worth of dirty data coming in, then N had better be large enough to handle the gig that's going to get synced out in the next transaction.
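A short sketch of the preload loop, assuming hypothetical helpers for ranking and loading candidates — OpenZFS does this as part of metaslab group preloading at the end of sync, but the code below is illustrative:

    typedef struct metaslab metaslab_t;

    /* Hypothetical helpers: rank and load candidate metaslabs. */
    extern metaslab_t *best_unloaded_metaslab(int devidx);
    extern void        metaslab_load(metaslab_t *ms);

    /*
     * After each transaction group finishes, pre-load the N best
     * metaslabs on every device, so the next txg's allocations find
     * their space maps already in memory instead of paying a read
     * while in the write code path.
     */
    void
    preload_metaslabs(int ndevices, int n)
    {
            for (int d = 0; d < ndevices; d++) {
                    for (int i = 0; i < n; i++) {
                            metaslab_t *ms = best_unloaded_metaslab(d);
                            if (ms == NULL)
                                    break;          /* all loaded */
                            metaslab_load(ms);
                    }
            }
    }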
Yeah — it's going to vary based on what record size everybody has, but once you kind of have an idea of what blocks are coming in, you can make assumptions. Say, even with compression, I'm going to be looking for, like, an 8K block; if what you're actually recording is an 8K block, maybe I can fill, you know, a 4K segment — but at least I can get an idea of what I have coming in.
No — this is more about trying to understand the types of I/Os that are coming in. I think this is more than finding the right puzzle piece; it's how do we know which puzzle piece to go after. If I know that I have, you know, a hundred 8K segments coming in, and that's my load, then I'm looking for metaslabs that potentially have a hundred open spaces of at least 8K — but not too much more.
It gets updated every txg. If you've spent any time in the metaslab code — it's not going to be 100% accurate every single time, because there are cases where we actually do frees where the space map isn't loaded.
A
So
what's
interesting
here
is
just
as
we
were
using
this
fragmentation
metric
to
weight,
a
meta
slab
and
try
to
find
the
right
one.
Now
we
can
do
a
similar
thing.
If
you
go
back
when
I
was
talking
about
the
imbalanced
line,
we
had
this
hard-coded
value.
Well
now
what
if
this
value
become
kind
of
becomes
the
metric
of
which
devices
do
I
go
after
to
actually
look
for
allocations,
ones
that
are
more
fragmented?
A
B
We
can
we
can
assert
100,
fragmented
means
that
it's
all
in
the
smallest
chunk
size
possible,
so
like
all
512
by
three
chunks
or
one
k.
Actually
one
k
one.
And again, this is kind of — it's a little nebulous, as Matt was saying. We're trying to come up with a relatively good metric that makes sense to us here, and the idea was that a 16-meg segment is probably large enough to be able to satisfy most allocations of any workload that you generate. We may find that 16 meg isn't quite right, but at least from some of the data we've seen, 16 meg seems to be satisfactory and does not cause performance problems.
A
So
if
we
can
say
that
anything,
that's
16
meg,
any
segment,
16,
meg
or
larger
is
considered
not
to
be
fragmented.
We
use
that
as
our
first
metric
and
then
we
can.
We
step
it
down
as
you
go
from
16
meg,
all
the
way
down
to
512
bytes.
So
1k
is,
I
think,
at
1k.
If
your
segments
are
nothing
but
1k
is
100
fragmented,
so.
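A sketch of the table-driven metric this implies: each histogram bucket contributes a fragmentation percentage, and the metaslab's score is the space-weighted average. Only the endpoints (1K = 100%, 16M and up = 0%) come from the talk; the intermediate percentages below are invented for illustration and are not the table OpenZFS ships:

    #include <stdint.h>

    /* Per-bucket fragmentation, for segment sizes (1K << i). */
    static const int frag_pct[] = {
            /* 1K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M 2M 4M 8M 16M+ */
            100, 95, 90, 85, 75, 65, 55, 45, 35, 25, 20, 15, 10, 5, 0,
    };
    #define FRAG_BUCKETS (sizeof (frag_pct) / sizeof (frag_pct[0]))

    /*
     * Space-weighted average over the free-space histogram, where
     * hist[i] counts free segments of size (1K << i).
     */
    static int
    metaslab_fragmentation(const uint64_t hist[FRAG_BUCKETS])
    {
            uint64_t space = 0, weighted = 0;

            for (unsigned i = 0; i < FRAG_BUCKETS; i++) {
                    uint64_t s = hist[i] * (1024ULL << i);
                    space += s;
                    weighted += s * (uint64_t)frag_pct[i];
            }
            return (space == 0 ? 0 : (int)(weighted / space));
    }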
So then we factor all that in. This gives you an idea that you have things that are probably somewhere in the four-to-eight-meg range of segments — actually, probably less than that: I think 50% might be 128K segments, so most of your segments are 128K — although, if you look at the real histogram...
So we've added to zdb the ability to actually dump out the metaslab information — the space map information — and that's zdb -m. So maybe what we need to do is just remove it so it never prints under -d, to solve your problem; I don't know why.
There were several solutions, and this was the one that seemed to fit the nicest: being able to take into account the entire range of all the free-space segments in the metaslab and still represent it as one number. And that one number then gives us the opportunity to make better decisions when we go to actually select devices.
A
So
one
thing
to
note,
too,
is
if
you're
upgrading
to
this
feature,
the
space
map
objects
have
to
actually
get
reallocated
in
order
for
you
to
in
order
for
them
to
use
the
larger
bonus
buffer,
so
older
objects
won't,
have
it
newer
objects,
will
you'll
see
if
you
upgrade
fragmentation,
won't
get
won't
actually
show
up
in
z
pool
list
until
like?
I
think
I
have
it
right
now.
At
50
percent
of
the
mena
slabs
within
a
device
have
been
upgraded,
so
if
fewer
than
50
have
been
upgraded,
then
you
just
get
a
dash
there.
A
It
doesn't
show
anything
until
over
time.
They
upgrade
we're
also
adding
logic
to
actually
kind
of
speed
that
process
along
just
because
we
definitely
want
to
get
more
information
about
free
space.
So
we
can
understand
how
to
improve
things.
F
Okay,
question
for
sure.
We hit a point where we actually fan out to a bunch of threads and go through the allocation in this kind of parallel fashion. The problem with that is, if you're allocating data, data, data, data and maybe metadata, metadata, metadata, when it all gets fanned out it kind of all meshes together. So you can get into a mode where you get allocations on the same device of larger block, smaller block, larger block, smaller block — and over time this can add to the Swiss-cheesing.
So every device now has an allocation threshold — effectively, the number of allocations it has pending. When we go to allocate the next block and issue it down to the vdevs, if a device has already reached that threshold, we just skip over it and find one that has already processed its allocations. More than likely, that device either doesn't have as much workload or has run through its queue much faster, meaning it probably doesn't have as much fragmentation.
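A minimal sketch of that per-device throttle — skip any vdev whose pending-allocation count has hit the limit. The struct and the linear scan are illustrative:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct vdev {
            uint64_t pending_allocs;        /* allocations in flight */
    } vdev_t;

    /*
     * Pick the first device with room in its allocation queue;
     * saturated devices are skipped, which naturally steers work
     * toward devices that drain faster (and so are likely less
     * fragmented).
     */
    static vdev_t *
    pick_unsaturated_vdev(vdev_t *vdevs, size_t nvdevs, uint64_t limit)
    {
            for (size_t i = 0; i < nvdevs; i++)
                    if (vdevs[i].pending_allocs < limit)
                            return (&vdevs[i]);
            return (NULL);          /* all saturated: caller must wait */
    }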
I think that's all I had.
I don't think so — the thing that we've discussed, and the thing that makes more sense to me, is metaslabs that are actually size-based. The problem you have there is that — I think the original intent of there being 200 of them was to allow you to have devices as small as 64 meg, so that you could carve that up 200 ways. If we made our metaslab sizes 2 gig, then obviously you wouldn't be able to have that. But I think that a fixed number doesn't make sense.
In my opinion, rather than saying we're going to pick a number, I think we're better off picking a size — which may limit things — and saying the smallest device that we'll ever support is, you know, a 4-gig device or an 8-gig device, or whatever that size is. But I think that's an area for investigation.
There are actually a lot of benefits to something like this for the ZIL, because the ZIL, even though it uses metaslabs to do its allocation — it makes no sense there. All you really want to do on a ZIL device is scan and do, effectively, plus-plus, cursor-based allocation. That's really what you would like to see: you'd like to say, I start here, I keep allocating until the end, and then I wrap around and start over. And on a solid-state device —
sizing, too, because today we would try to chunk up that, you know, 4-gig device 200 ways, and you end up with small metaslabs that you end up having to toss back and forth, and it doesn't make as much sense. But Adam, to your point, I think that's an area we need to go and really look at; we have not looked in that area to see what it should look like.
And part of it is that the weighting factor that's in there tries to provide a bonus to metaslabs in the lower LBA range. So, in a way, you have to really, really work hard to fill up the first half of your device before it even starts to consider something at the very end.
Agreed — and I think that with the new metaslab fragmentation weighting, that's actually going to come into play much earlier. You're going to find that, even though I'm still allowing a higher weighting for lower LBA ranges, as those get consumed their fragmentation metric goes way up and you start losing that doubling effect — it starts to go away, which makes the metaslabs toward the end more enticing to go do allocations on.
C: So why are we biasing toward one part of the device? It doesn't mean anything in terms of how it impacts real disks.