A: Okay, hello everybody, welcome back. I hope everyone is suitably caffeinated. I've got great news: the lunch has arrived, so there's only one small presentation from George before we can have lunch. Just a quick reminder for everybody who's here: we are doing a hackathon tomorrow, same location, from ten o'clock until, I think, as long as we last, of OpenZFS hacking. In light of that, we wanted to generate some ideas from inside the room. We've got a free t-shirt to give away, which is modeled by my able assistant. Yeah, I think we need a collective "ooh". OK, so we're going to give it out to the best hackathon idea. Raise your hands, I'll throw you the mic, and you can come up with the best hackathon idea. It can be super complicated, it can be one I've already nominated you for whether you like it or not, anything like that. So the floor is open.
B: Thanks, Ryan. So, out of curiosity, how many people here have really dug into ZFS performance?

The question was how many people have run into performance problems? OK, cool. So what I want to go through is — since this is a community forum, this is going to be a community presentation, so I expect a lot of participation. We're all going to, for the next however long, start looking at a performance problem, figure it out, and come up with proposed solutions. So we'll look at how pool performance can be impacted.
What is out there today? What are the things that we've already done, and how do they improve things? And then we'll take a closer look at performance problems and talk a little bit about this thing that Matt mentioned, which is the allocation throttle. So let's start — everybody get excited — we're going to start looking at a performance problem, or maybe there isn't a performance problem. If somebody came to you with this type of information — and this is actually a real system — what things could you glean from it?
Right, so sometimes good, sometimes bad. We do have quite a bit of space, and we can also take a look and see that fragmentation doesn't look too bad. So at first glance we might look at this and say: okay, maybe this pool will actually perform well. What if we take a closer look at it? What do we see here? As we start looking at this pool, now we're looking at each of the individual devices that make it up — what are the things that we notice?
So it looks like devices have been added, so some devices have more free space than others. It's true, some devices are bigger than others. This is actually a very common thing that we see with our customer base. How many people configure the system correctly the very first time, every time? Okay, there's at least four of you. Well, unfortunately, not all our customers do that. Most often, what you see is you create a pool, you start off with some configuration, and then later you decide to change your mind.
B
Primarily
you
change
your
mind
because
you
need
more
space
or
maybe
you
can't
buy
that
drive
anymore,
so
you
get
something
larger
well.
So
in
this
case
we
have
quite
a
number
of
disks
that
are
actually
twice
as
big
as
the
original
ones
that
were
first
added.
We
also
see
that
there's
a
bunch
of
devices
that
are
over
the
eighty
percent
threshold
and
those
that
have
been
using
ZFS
for
quite
some
time.
Eighty
percent
has
been
a
number
that
has
been
out
there
for
quite
some
time.
As
like
the
known
cliff
of
performance
problems.
B
Oftentimes,
you
don't
get
to
eighty
percent
your
to
get
to
eighty
percent.
You
might
be
lucky
I'll
tell
you
that
this
particular
system
has
actually
run
close
to
ninety
percent
at
times
barely
run
at
times,
and
we
also
have
kind
of
a
big
disparity
on
free
space.
So
we're
not
expecting
this
pool
to
perform
great.
So what are the things that we've seen with that particular pool when we started looking at this? This has been an ongoing performance investigation for us. We see this not only in an internal system, which this happens to be, but also in our customer base, which goes through these scenarios quite frequently.
We notice that, as devices start to get full, they take a lot longer to allocate. Oftentimes that's because they're fragmented, so the actual finding of blocks takes a long time and uses a lot of CPU. Writes that we would intend to be sequential turn out to be a bunch of random writes, because the free space is scattered all over these devices.
Okay, a few of you. Just a quick synopsis of metaslabs so that everybody understands them: you can think of them as regions on a disk. The way that ZFS does allocations is it takes an individual device and carves it up into approximately 200 equally sized regions that we refer to as metaslabs. The question is, why 200? Because 100 didn't seem right and 300 seemed like too much. The reality is, 200 is a number that has existed from the beginning of the ZFS days.
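As a rough illustration of that carving step, here is a minimal sketch (not the actual ZFS code) of how a device could be split into roughly 200 power-of-two-sized regions. The constant and the shift arithmetic mirror the idea described above; names like metaslab_size and the 16 MB floor are illustrative assumptions.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative only: pick a metaslab size so a device of 'asize'
     * bytes ends up with roughly 200 regions. */
    #define METASLABS_PER_VDEV 200
    #define MIN_MS_SHIFT       24   /* assumed floor of 16 MB per region */

    static uint64_t
    metaslab_size(uint64_t asize)
    {
        uint64_t shift = MIN_MS_SHIFT;

        /* Grow the region size (always a power of two) until the
         * device holds no more than ~200 of them. */
        while ((asize >> shift) > METASLABS_PER_VDEV)
            shift++;
        return ((uint64_t)1 << shift);
    }

    int
    main(void)
    {
        uint64_t asize = 4ULL << 40;          /* a hypothetical 4 TB vdev */
        uint64_t ms = metaslab_size(asize);

        printf("metaslab size: %llu MB, count: %llu\n",
            (unsigned long long)(ms >> 20),
            (unsigned long long)(asize / ms));
        return (0);
    }

For the hypothetical 4 TB device this prints 32 GB regions, 128 of them — in the same ballpark as the "approximately 200" described in the talk.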
But anyway, when you have these different regions, in order to be able to allocate from a region you have to load it first. So we carve up the disk into approximately 200 equal-sized regions, and when you have a lot of fragmentation and devices are full, you're having to load these various regions throughout the disk looking for space. We also found that when you have a configuration like we just saw, where the devices are imbalanced and of different sizes...
B
You
are
not
getting
the
full
efficiency
of
the
the
devices
themselves,
so
we're
leaving
a
lot
of
performance
on
the
table
when
we
actually
start
doing
rights
and
again
we
don't
recommend
people
have
a
lot
of
imbalance
luns,
but
we
have
found
that
in
our
customer
base-
and
this
may
be
true
of
system
administrators
throughout
the
world-
nobody
gets
the
pool
right,
the
first
time,
or
maybe
the
second
or
third
or
twentieth.
We
always
kind
of
start
adding
and
that's
the
life
cycle
of
ZFS.
It's
one
of
the
beauties
of
ZFS.
B
We
can
actually
add
devices
to
an
existing
system,
but
we
want
performance
to
kind
of
at
least
be
on
par
as
we
do
this
okay.
So
so
we
listed
a
bunch
of
problems.
We
actually
came
up
with
some
solutions,
so
we're
not.
We
don't
have
to
solve
all
those
problems.
During
this
talk
today,
we
are
going
to
try
to
solve
a
couple
of
them,
so
I'll
do
a
shameless
plug
for
last
open
developer
summit,
we're
actually
presented
on
the
dynamic
mediswipe
selection.
B
This
was
one
of
the
key
things
that
that
we
discovered
and
a
big
improvement
on
performance
when
you
have
these
types
of
pools,
highly
fragmented,
where
you're
seeing
lots
of
loading
and
unloading
of
meta
slab
regions
and
when
you're,
actually
running
low
on
space
and
I'll,
show
you
some
charts
on
where
we.
Actually,
you
know
some
of
the
performance
gains
that
we
got
from
that
those
slides
by
the
way
are
available
on
open,
ZFS
org,
as
well
as
the
videos.
B
That's
been
around
for
some
time
how
many
people
are
aware
of
ZFS
mg,
no
Alec
threshold,
a
couple
people
I'll
talk
a
little
bit
about
this
and
why
it's
just
a
partial
solution,
and
then
we
have
another
problem
that
actually
doesn't
have
a
solution
today,
and
that
is,
we
just
aren't
very
efficient
when
it
comes
to
writing
and
getting
full
bandwidth
when
you
have
configurations
such
as
what
we
just
saw
with
regards
to
ZFS
mg,
no
Alec
threshold.
This
is
a
very
coarse
brain
switch.
B
The
idea
behind
it
is
that
when
you
have
these
these
pools
with
big
disparities
of
free
space,
you
can
actually
set
this
and
say
once
a
device
gets
to
a
certain
capacity
level.
So
so,
when
it
has,
you
know
less
than
say,
ten
percent
free
stop
allocating
from
it.
The
idea
is:
switch
everything
over
to
devices
that
have
more
free
space
that
way
I,
don't
have
to
pay
the
penalty
of
loading
and
unloading,
Metis
labs.
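Roughly, the decision that coarse-grained switch encodes looks like the following sketch. This is a simplified illustration, not the real metaslab.c logic; the ten percent value is just the example used above, and mg_allocatable and its arguments are hypothetical names.

    /* Simplified sketch of the zfs_mg_noalloc_threshold idea: skip a
     * metaslab group (top-level vdev) for normal allocations once its
     * free space drops below the threshold, as long as some other group
     * in the pool still has more room. Not the actual OpenZFS code. */
    static int zfs_mg_noalloc_threshold = 10;   /* percent free */

    static int
    mg_allocatable(int pct_free, int best_pct_free_in_pool)
    {
        if (pct_free >= zfs_mg_noalloc_threshold)
            return (1);
        /* Below the threshold: only allocate here if nothing better exists. */
        return (best_pct_free_in_pool < zfs_mg_noalloc_threshold);
    }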
First, let's talk about the existing solutions that I just mentioned. When we set out to improve write performance, we were looking at making sure that write performance, as you approach the eighty percent cliff that everybody is so fond of, stayed relatively even. We knew that as we started going beyond eighty percent, that performance was going to be tough to achieve. What I'm showing here is random IOPS using a benchmark we call frag.
B
The
baseline
is
the
blue
line
and
we
can
see
this
very
linear
cliff
that
happens
starting
at
about
sixty
percent
and
actually,
if
you
were
to
graph,
this
out
is
even
starts
before
sixty
percent,
but
we
wanted
to
improve
on
this
and
by
doing
some
smarter
selection
of
métis
lab
regions,
we
actually
were
able
to
flatten
this
out
to
the
eighty
percent
mark.
So
you
can
see
here.
This
is
the
top
line
is
actually
showing
us
what
we
were
able
to
achieve
using
the
new
algorithms.
B
To
the
baseline,
until
you
get
up
to
the
ninety-five
percent
mark
and
there's
a
direct
correlation
between
fragmentation
and
performance
once
we're
completely
fragmented,
we
can't
get
any
I
ops
really
out
out
of
the
system
at
all.
So
that's
what
we
have
today
and
that's
actually
maybe
up
streamed.
So
we
think
that's
up
streamed
will
verify
definitely
available
on
the
dell
fixed
repo
that
Matt
referenced
earlier.
A few people. So when you actually create a pool — in this case, this is showing something like a three-wide striped pool — ZFS has always done round-robin allocations. It tries to select a starting point on one device, it'll go to that device, and from there it'll allocate a pre-selected amount of space. Once it reaches that threshold, it then says: okay, I can switch to the next device. What that means is that if I start off and allocate 512K from this device, I'll do that 512K allocation.
B
As
soon
as
that
I
reached
my
threshold,
I
can
now
go
to
the
next
one
and
so
forth
and
I
just
keep
round-robin
across.
That
means
that
we
can
keep
everything
evenly
distributed,
we're
always
doing
approximately
the
same
amount
and
there's
some
very
obscure
logic.
If
you
look
in
the
depths
of
like
the
Metis
lab
code,
where,
if
one
device
happens
to
be
a
little
bit
more,
you
know
has
a
little
bit
more
free
space
than
the
other
than
it
tries
to
give
it
some
fraction,
above
and
beyond
what
the
normal
limit
would
be.
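To make the round-robin scheme concrete, here is a minimal sketch of the rotor-style selection described above. The 512K per-vdev quantum comes from the talk; the structure and names (rotor, ALIQUOT, pick_vdev) are illustrative, not the OpenZFS metaslab allocator itself.

    #include <stdint.h>

    #define NVDEVS   3
    #define ALIQUOT  (512 * 1024)   /* bytes handed to a vdev before rotating */

    struct pool {
        int      rotor;              /* index of the vdev currently being filled */
        uint64_t written_this_turn;  /* bytes allocated from it so far */
    };

    /* Pick the vdev for the next allocation of 'size' bytes. */
    static int
    pick_vdev(struct pool *p, uint64_t size)
    {
        if (p->written_this_turn >= ALIQUOT) {
            /* This vdev has had its share; move on to the next one. */
            p->rotor = (p->rotor + 1) % NVDEVS;
            p->written_this_turn = 0;
        }
        p->written_this_turn += size;
        return (p->rotor);
    }

Every vdev gets roughly the same number of bytes per pass regardless of how full or how fast it is — which is exactly the behavior examined next.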
What does that look like when we actually run a system that way — in particular the system we just showed the zpool list for? I'm going to go ahead, and you're going to see this update. I want you to observe what happens to the pool, and then we're going to collectively try to figure out what is happening and make some sense out of it.
F: So it appears we're writing more to the more full ones.

B: Yeah, interesting, right? I mean, what does this mean? What is actually happening here?
We write the same amount of data, and it takes longer to write to these because they're fragmented. Exactly. If we go back: we write the same amount, we're going to allocate the same amount of data to all these devices, but we have no idea what the characteristics of the devices are when we're doing these allocations — other than we know that if we try to write to this device and it's mostly full, it's going to take us longer to do the allocation part, which is finding the free space.
B
But
it's
also
going
to
take
us
longer
to
actually
write
to
that
free
spay
write
those
blocks
once
we've
done
the
allocation,
because
it
might
write
a
little
bit
here
and
a
little
bit
down
here
and
we're
seeking
all
over
the
place.
So
we
pay
the
penalty
twice.
We
pay
the
penalty
of
trying
to
find
free
space.
We
pay
the
penalty
of
actually
trying
to
get
this
on
disk
to
stable
storage.
And
that's
that's
exactly
what
we
see
here.
B
When
we
start
running
everybody's
got
about
the
same
number
of
allocations,
then
all
of
a
sudden
some
devices
start
going
to
the
point
where
they're
not
busy
at
all
and
the
ones
that
remain
are
the
ones
that
tend
to
either
be
most
full
are
most
fragmented.
So
we
end
up
with
these
devices
that
are
sitting
there
and
we're
not
utilizing
the
bandwidth
of
them
during
the
entire
cycle
of
our
rights
and
those
happen
to
be
the
fastest
devices.
B
So
if
you
were
to
kind
of
observe
this
from
the
perspective
of
a
fast
device
versus
a
slow
device
over
the
course
of
say
an
entire
sinking
of
a
transaction
group,
this
is
kind
of
what
you
would
observe.
You'd
see
that
the
slow
device
kind
of
takes
a
while
it
gets
a
bunch
of
outstanding
iOS
and
over
the
entire
course,
it's
still
working
on
it.
I: Yeah, there's kind of common knowledge that if you add parallel vdevs, you add parallelism. But here we have something that behaves more like a RAID-Z, where the slowest one has to finish for the transaction to finish. I know it's not RAID-Z, but I mean...
B: Yeah, we would like it to give us the bandwidth, and we would like to believe that, yes, adding devices gives us more bandwidth. And in practice, if you add them when the pool is relatively empty and your devices have about the same amount of allocated space, you will see that performance gain.
B
The
problem
is
that
most
people
don't
look
at
adding
space
to
their
pool
until
they're
out
of
space
right
I
mean
you
know,
we
try
to
get
our
customers
to
think
that
think
about
like
not
what
you
have
today,
but
what
you're
going
to
have?
You
know
three
years
from
now
and
I'm,
not
very
good
at
it,
and
I
can
tell
you,
our
customers
aren't
very
good
at
it
and
if
you
know
for
those
of
you,
you
know
that
are
in
the
storage
industry.
B
What
you'll
find
is
that
customers
tend
to
like
put
a
little
bit
on
there,
try
it
out
and
then,
if
they
really
like
it,
then
they
throw
a
bunch
more
crap
on
it
more
than
they
probably
anticipated.
They
might
have
told
you.
Yes,
I
only
need
five
terabytes
sure
give
me
five
terabytes.
That's
all
I'll
ever
need
until
they
like
your
product,
then
the
next
thing
you'll
realize
is
that
they
have
30
terabytes
on
there
and
you're
like
okay.
B
That's
not
what
we
planned
for
and
I
now
have
five
terabytes
that
are
of
devices
that
are
totally
full
and
twenty
five
terabytes
that
are
completely
empty,
but
every
single
time
you
try
to
write
I'm
having
to
do
even
allocations
across
that.
That's
exactly
what
we
have
in
this
pool.
This
is
an
internal
system
for
ours,
which
happens
to
be
a
way
that
we
deploy
a
bunch
of
development
boxes.
A great question: growing an individual device is actually an improvement, to a degree. Now, every single time you grow one of these devices, we actually create more of those 200 metaslabs. So you went from 200, maybe you go to 300; do this enough times and eventually you may find that you have thousands of regions that are all equally sized.
B
So
when
you
get
to
a
point
where
you're
running
low
on
space,
now
you
have
thousands
of
métis
labs,
you're
going
to
be
looking
at
trying
to
load
and
unload
to
see
if
you
actually
have
every
space,
so
you
get
kind
of
some
initial.
You
know
performance
gains,
but
you
may
find
that
in
the
long
run
you
may
be
suffering,
just
as
you
would
with
this
implementation,
and
and
to
be
honest,
we
actually
before
we
had
solutions
here.
B
That
was
our
recommendation
for
customers
is
grow
your
lungs
rather
than
expand,
because
at
least
when
we're
doing
even
allocations,
as
we
saw
the
way
the
of
them
worked.
If
we
do
even
allocations-
and
you
expand
all
your
lungs
evenly,
then
you're
fine,
everybody
just
got
more
free
space,
but
there's
limitations
on
how
far
you
can
actually
expand,
which
led
people
to
to
add.
Yes,.
...one terabyte, to bring them to the exact same size — okay, yeah. So that's another possible solution, because then, as soon as the resilver completes and I remove the old, smaller device, I get that space. It's kind of going back to the expansion logic once again; you're just doing it with a physical device.
But it's possible that might give you a little bit of gain. Now you could round-robin through much faster, but you're still having to pay the penalty of looking for whatever that small amount is, and the devices that do have free space are now only getting a trickle, because it's kind of a global policy. So you may find that performance actually dives down, simply because we're now spending more time just cycling through all the devices looking.
F
All
these
are
like
great
ideas
of
little
tweaks,
but
they
don't
really
address
the
the
key
problem
that
you're
talking
about,
which
is
like
some
disks,
are
faster
than
others
and
we're
allocating
the
same
number
to
each
of
them.
So
we
have
to
wait
for
this,
though
it's
disk
to
do
it's.
You
know
we
allocate
one
fifteenth
of
the
data
to
each
of
these
disks,
so
we
have
to
wait
for
the
slowest
one
to
do.
It's
one
fifteenth
of
the
amount
of
work.
B: Yeah. It's actually data that's not in the ARC, but it's got its own kind of cache. Still, you're able to look through it, because it's just an AVL tree, so walking through it trying to find regions of a particular size is actually pretty fast. Where it takes a hit is that every single time you load a metaslab today, if you don't allocate from it you're going to unload it, which means the next time you come back around to it...
B
B
Yes,
so
there's
now
a
way
for
you
to
keep
some
of
that
data
for
much
much
longer,
because
anticipating
that
you're
going
to
ask
a
very
similar
question
and
also
the
way
we
select
those
meta
slabs
is
very
different,
based
on
the
changes
that
the
the
graph
that
I
showed.
That
has
changed
the
way
that
the
algorithm
works
today,
but
for
this
that
doesn't
help
us.
So
I
heard
several
things
wit.
So maybe a way to store how fast devices actually respond, and use that as a way to select devices. All great ideas. We'll talk about how we addressed this, and you'll see that what you're hitting on can actually be solved with a slight variation of that.
B
So
one
of
the
goals
is,
we
really
think
that
we
should
allocate
less
from
these
devices
because
they're
mostly
full
right
and
if
we
allocate
less
from
them,
then
we're
not
spending
and
using
you
know,
cycles
doing
that.
And
so
then
that
means
that
we're
going
to
allocate
more
from
these
devices
to
have
more
free
space.
B
But
really
what
we
care
about
is
how
do
we
ensure
that
we
utilize
all
the
available
bandwidth,
because
we
don't
really
care
as
long
as
these
guys
are
busy,
then
we
know
that
we're
taking
full
advantage
of
whatever
hardware
and
devices
you've
given
us
and
as
long
as
these
guys
stay
busy
and
we
keep
them
both
busy
for
the
entire
duration
of
the
transaction.
Sync
time,
then,
we
can
solve
this
problem
and
we
can
actually
solve
it
by
kind
of
measuring
how
you
know
how
long
it
takes
for
these
devices
do
allocations.
B
So
that
leads
us
to
what
we're
introducing
now,
which
is
the
allocation
throttle,
so
the
way
that
the
allocation
throttle
works
is
and
again
I'll
talk
a
little
bit
about
how
ZFS
does
things
today
and
how
this
differs.
So
today,
when
you
go
through
and
do
all
your
allocations,
you
end
up
getting
this
onslaught
of
iOS
that
get
created.
So
every
time
you
sync
out
a
transaction
group,
we
create
thousands
of
iOS
and
they
just
get
thrown
into
the
system
and
they
get
handled
by
a
bunch
of
task
queues.
B
These
task
queues
will
actually
create
like
this
big
fan
out.
So
we
start
off
with
an
ordered
type
of
of
right.
Where
we're
writing,
you
know
from
a
particular
file,
the
first
block,
the
next
block,
so
forth
kind
of
writing
it
out,
but
because
we
handled
it
we
hand
this
these
out
of
task
queues.
We
end
up
actually
kind
of
mixing
them
up
in
order
and
the
the
whole
reason
that
we
do.
That
is.
We
want
to
make
sure
that
they
go
through
the
compression
cycle
as
fast
as
possible
with
much
parallelism.
People familiar with top-level vdevs? The concept of a top-level vdev is: if you do a zpool status, it's the first device you see under root. Typically, in a pool, you'll see root, and then you may see mirror, you may see raidz, you may see a disk — but that's your top-level device, and that's where we make allocation decisions: based off top-level vdevs. So in this scenario we have four top-level vdevs.
B
So
when
we
start
off,
we
may
start
off
with
a
certain
amount
of
work,
because
we
only
have
a
limited
number
of
slots.
Every
device
will
be
given
the
same
amount
of
work
as
a
starting
point.
That
kind
of
keeps
us
in
a
point
where
we
get
all
the
devices
busy
right
off
the
bat
each
device
now
has
an
allocation
Q.
You
can
think
of
this
as
like
the
queue
depth.
If
you're
familiar
with
the
scuzzy
world,
they
each
maintain
one
of
these.
B
Those
allocation
devices
get
turned
into
children
iOS,
which
are
actually
going
to
go
out
to
the
physical
disks.
So
if
this
is
a
raid
Z,
there
may
be
multiple
child
owes
that
are
actually
writing
to
the
physical
disks
behind
it.
In
this
case,
this
is
depicting
a
mirror
where
we
have
one
top-level
io
gets
turned
into
two
child
iOS
they're,
going
to
do
the
work
on
behalf
of.
If we had the telemetry, we could just make our decision from that, but instead we're going to rely simply on the completion of the I/Os. As the devices complete, that tells us which ones are actually faster, and we simply give them more work. So if everybody started off with 50 units of work, and these devices can only handle 50 units of work for the entire txg, the rest of the work comes over here: as soon as they complete, they simply get another allocation.
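A minimal sketch of that completion-driven feedback loop might look like the following. It assumes a fixed number of starting slots per top-level vdev and simply hands the next allocation to whichever vdev has a free slot; the structure names and the 50-unit starting point are taken from the example above, not from the actual OpenZFS implementation.

    #define NVDEVS         4
    #define STARTING_SLOTS 50   /* equal work handed to every vdev at the start */

    struct vdev_queue {
        int outstanding;        /* allocation writes currently in flight */
        int max_slots;          /* how many we allow at once */
    };

    static struct vdev_queue vq[NVDEVS];

    static void
    throttle_init(void)
    {
        for (int v = 0; v < NVDEVS; v++) {
            vq[v].outstanding = 0;
            vq[v].max_slots = STARTING_SLOTS;
        }
    }

    /* Called when a vdev finishes one of its allocation writes. The freed
     * slot is what lets a fast device keep pulling new work while a slow,
     * fragmented device still sits on its original queue. */
    static void
    allocation_done(int vdev)
    {
        vq[vdev].outstanding--;
    }

    /* Hand the next allocation to any vdev with a free slot, or return -1
     * if every queue is full and the caller must wait for a completion.
     * Faster devices free slots sooner, so they naturally get more work. */
    static int
    next_vdev(void)
    {
        for (int v = 0; v < NVDEVS; v++) {
            if (vq[v].outstanding < vq[v].max_slots) {
                vq[v].outstanding++;
                return (v);
            }
        }
        return (-1);
    }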
F: In the common scenario, this is actually going to improve that imbalance of reads too, because if you look at what we were doing before, a bunch of disks were more full and some were less full. Reads, if we're reading from somewhere random, are going to go more to those more-full disks and less to the less-full disks. But with this algorithm we're going to end up writing more to the less-full disks, because the less-full ones are faster — the writes aren't scattered.
B: I think one could — I don't know if you would do this, but one can envision leveraging this to have, say, these as flash devices and these as spinning disks, even if they all started completely empty. If these are faster, they may receive more allocations over a period of time and become full much faster, which means reads are going to be targeting those devices — but in fact they are the faster devices to begin with.
F: You're optimizing it for being able to write to all disks at the same time — we're keeping all the disks busy while writing. If reads basically have the same performance characteristics as writes, then reads would also keep all the disks busy at the same time, right? Like, if you wrote a file and it went sixty percent here, sixty percent here, forty percent and forty percent there, then that means we can write to these ones faster than these.
B: It used to be max pending, and it's based off of that. It's a percentage, so you can actually say: I want to start off with — if that's 10, I want a hundred, you know, a hundred per each of my top-levels. So in this case, four hundred slots that can go out, each top-level doing a hundred units of work at a given time. But that's tunable.
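To make the arithmetic explicit, using only the figures from the answer above: 100 slots per top-level vdev x 4 top-level vdevs = 400 allocations that can be outstanding across the pool at any one time.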
I: ...quickly, while the other drives will be much less full. So I understand the reasoning when there's an imbalance, but if that imbalance is intrinsic to the performance of the device, then you just do that: you completely fill the SSDs, and then all you're left with is free space on the slower devices, right?

B: Correct.
I: I don't claim you would do that, but what might happen is: you put in, I don't know, three-terabyte drives, and three years later you put in other drives, and now those drives have gotten faster. Now you have that intrinsic imbalance because of device performance — not because you have done something as crazy as mixing vdevs made of SSDs and physical drives, but because of a generation difference with different performance profiles. Yeah.
B: You're going to see a tapering off: as a device becomes more and more full, it's intrinsically going to slow down. Okay, so if you have devices that are extremely fast — and I would imagine in the case of actual spinning disks you're going to see maybe a marginal difference between generations...
B
But
let's
take
the
SSD
case.
What
the
expectation
is
SSDs
would
fill
up
and
could
fill
up
much
faster
you're
still
doing
allocations
to
the
spinning
disk,
so
you're
getting
allocations
starting
off,
and-
and
this
is
where,
having
that
tunable-
determining
how
many
allocations
I
want
to
give
off
like
for
the
entire
system
at
any
go
might
be
relevant
because
you
can
say
I
want
500
units
of
work
to
go
across
every
single
device
that
might
keep
things
not
as
it
won't
create
as
big
of
a
disparity.
B
So
you
may
still
be
growing
both
devices
at
a
relatively
even
rate.
But
let's
say
you,
you
know
you
say
something
like
a
hundred
and
you
allow
SSDs
to
start
filling
up
they're
going
to
reach
a
point
where
they're
not
going
to
be
performing
the
way
they
did
when
they
were
empty.
So
the
device
is
now
that
remain
for
right
performance,
they're,
going
to
start
creeping
up
and
you're,
going
to
start
seeing
a
switch
in
the
amount
of
allocations
going
from
one
area
to
another.
I.
So the chunk size itself isn't changing — yeah, the chunk size still remains 512K, and that algorithm is still the same. The only difference now is which device is actually going to be serving more I/Os. You're still chunking them up as they come across, in the same way; you're just doing the distribution slightly differently. Any more questions?
This is a comparison, and there's quite a bit of data here. The top devices are the slower devices, the bottom devices are the faster devices — that's what we saw from the previous zpool iostat. The left graphs are showing you average latencies for doing the complete allocation and write for that device.
B
The
right
graphs
are
showing
you
how
much
data
was
written
and
allocated
to
that
device.
So,
as
we
see
here,
we
look
at
these.
They
tend
to
be
averaging
about
80
milliseconds
to
do
an
allocation
and
a
right
as
a
result
they're
getting
somewhere
around
10
meg
over.
That
course
you
know
10
mega
second,
over
the
course
of
that
spa
sync,
the
faster
devices
which
are
averaging
about
maybe
15
milliseconds
are
doing
somewhere
around
25
megabytes
per
second
across
the
course
of
that
transaction
group.
It's actually quite significant for what we saw on our system. We've seen the performance benefit in two different ways. One is, we've actually been able to drive this pool — you saw it at about 71 percent — up to eighty-seven percent without people complaining, which is very rare for our engineers, because they're very quick to complain when performance problems get to that point.
B
But
I
don't
have
like
a
specific
number
but
I
think
that
we're
we're
at
least
like
twenty
percent
faster
in
most
like
over
a
time
period
of
spa
sink
and
in
some
cases
more
and
we
see
kind
of
a
variation
because
it
depends
on
like
which
device
actually
starts
the
allocation.
So
because
you
have
a
round-robin
type
of
scenario,
so
you
may
end
up
where
you
start.
You
start
allocating
from
say
the
emptier
devices
first,
which
means
that
by
the
time
you
get
to
the
slower
devices,
you've
already
processed
quite
a
bit.
F: You'd want to measure it in terms of how many IOPS you can sustain, and this is a production system, so we don't want to just throw an unlimited number of IOPS at it and see how many it can take before performance sucks. So what we have is how many IOPS it happens to get, but that depends on the load, and the load is very variable. And then, you know, there's also...
F
If
you
look
at
it
over
several
days,
the
amount
of
free
space
could
be
very
variable,
so
you
know
we
kind
of
see
like
oh
jeez,
it's
like
ninety
percent
poll
and
people
aren't
freaking
out.
So
that's
like
we've
never
seen
that
before
this
is
great,
but
we
don't
have
like
really
hard
numbers.
Unlike
you
know,
you
can
do
X
I
ops
at
y
%
full
with
vs.
without
this
change,
which
would
be
great
to
get
on
like
a
synthetic
system
and.
B: I think part of the problem — and the main reason we tackled this particular system — was just the fact that it already was an aged pool that had gone through several iterations. We had probably four different occasions where we added devices to that pool. Whereas if we just created a lab configuration where we fill one device up, I can't get the same fragmentation on that device, at least not in a way that feels like I'm reaching the actual problem at hand.
Yeah — so what's the best way to create fragmentation, to try to look at some of these systems? We have kind of a worst-case scenario that we look at, which is what we call the frag benchmark, and then there's something that Matt has put together which is also very — yeah, it's an even worse case, but it's based off customer data.
F: The general story is: basically, we just create a couple of big files with an 8K record size and do random writes to them, and then wait — initially performance is going to be good, then performance gets worse and worse and worse, and we just wait until performance doesn't get any worse; that's as bad as it gets. The variables here are things like how much of the pool you fill up...
...are you saying the pool is going to be ninety percent full versus fifty percent full — and then the block size, and then the distribution of compression ratios. Most of the tests that you showed were with a constant size and no compression, where it's just all 8K blocks, and then we've also done tests where we make each block compress by a different amount, so you have all these different physical block sizes that we're trying to allocate, which is really, really horrible for fragmentation.
B: As mentioned, it uses fio. So again, you figure out, say, you want your pool to be sixty percent full: you create a file that consumes sixty percent of the space, and then it runs fio over top of it using random writes. Then you monitor the throughput, look for a steady state, and that's when you know you've reached that threshold.
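For anyone who wants to reproduce that kind of aging without fio, a minimal sketch of the same idea — random, record-sized overwrites of one big pre-created file until throughput levels off — could look like this. The 8K record size matches the discussion above; the file path is just a placeholder, and in a real test the buffer should be filled with incompressible data.

    /* Illustrative fragmentation workload: random 8K overwrites of a large,
     * pre-created file on a dataset with recordsize=8k. Run it while
     * watching throughput; stop (Ctrl-C) once it reaches steady state. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define RECORDSIZE 8192

    int
    main(int argc, char **argv)
    {
        const char *path = (argc > 1) ? argv[1] : "/pool/fs/bigfile";
        int fd = open(path, O_WRONLY);
        if (fd < 0) { perror("open"); return (1); }

        off_t filesize = lseek(fd, 0, SEEK_END);
        off_t nrecords = filesize / RECORDSIZE;
        char buf[RECORDSIZE];
        memset(buf, 'x', sizeof (buf));   /* placeholder payload */

        for (;;) {
            /* Pick a random record-aligned offset and overwrite it. */
            off_t rec = (off_t)(drand48() * nrecords);
            if (pwrite(fd, buf, RECORDSIZE, rec * RECORDSIZE) != RECORDSIZE) {
                perror("pwrite");
                break;
            }
        }
        close(fd);
        return (0);
    }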
B
Don't
go
away
just
yet,
there's
actually
more,
which
we
don't
I.
Don't
have
a
lot
of
slides
on
this,
but
I
know
Matt
mention
this,
but
we
did
want
to
kind
of
announce,
compress
dark
which
is
actually
functional
and
in
our
internal
repo,
and
that
has
been
completed
as
Matt
kind
of
mentioned
earlier,
compressed
dark,
effectively
mimics
what
the
compression
that
you're
using
on
disk.
So,
if
you're
using
gzip
nine
on
disk,
you
effectively
have
a
gzip
nine
compressed
version
of
the
block
in
memory.
B
Just
some
preliminary
tests
that
we
did
I
had
a
small
system.
20
gig
of
Arc
had
a
creative
35
gig
file
on
a
using
LZ
for
compressed
file
system,
I'm,
getting
about
2.6
4x
compression
ratio
for
that
file,
I'm,
actually
able
to
read
that
entire
file
into
the
ark
and
keep
it
completely
cached.
So
all
subsequent
reads
come
directly
from
the
ark,
even
though
it's
15
gig
larger
than
the
existing
arc.
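To spell out the arithmetic behind that example: at a 2.64x compression ratio, the 35 gig file occupies roughly 35 / 2.64 ≈ 13 gig once compressed, which fits comfortably inside the 20 gig ARC even though the uncompressed file is 15 gig larger than the cache.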
Turning it on and off: there's a big switch which turns the compressed ARC all the way on and off. If you want to try it on a per-dataset basis, then it's going to be based off the compression algorithm, and therefore the compression ratio, you use for that dataset. So if everything is uncompressed except one dataset, then only that dataset's blocks would be compressed in the ARC.
C: I can see that as a bit of a problem, potentially. What if you have a pool that has a mixture of some stuff that's compressible in real time — LZ4, very cheap to decompress — where obviously we want to keep that compressed in memory because it's cheap to decompress, but at the same time you also have some stuff in there that's gzip, or maybe some future archival algorithm we come up with, and for performance reasons you want to keep that uncompressed in memory but compressed on disk?
B
So
so
today
that
is
not
possible
in
depending
on
your
workload.
So
if
your
workload
is
one
where
you're
going
to
be
accessing
that
frequently,
then
for
frequent
accesses
things
will
stay
uncompressed,
but
if
it's
one
of
those
where
you're
accessing
it
once
and
you
want
that
initial
access
to
be
fast
and
then
you're
going
to
wait,
you
know
hours
before
you
access
it
again
or
days,
and
you
want
that
access
to
be
fast,
then
that
isn't
possible.
Today,
George.
Okay, yes — whatever form the block is in on disk is the form that it will take in memory. The advantage this gives us, as Matt also mentioned, is that compressed send now becomes simpler, because you already have the compressed block in memory. We actually have a design for compressed send/receive that allows us to send this compressed block from memory without decompressing it at all, and to write it in a compressed fashion all the way to disk.
I don't know the blog post, so I'm not sure if that's exactly it. I'm wondering if it was referring to the fact that when you had very large L2ARC devices, the amount of memory consumed to store the pointers into that L2ARC was actually pretty high — which has changed. So I don't know if that's what they were referencing.
F: That's happening on Linux, where we will be able to — rather than having different kmem caches for 128K blocks and 16K blocks and 8K blocks, and then having to shrink the ARC and change which ones are being used — compose a 128K block in memory from a bunch of 4K pages. Then, when you are shrinking the ARC, you're just freeing those pages and you aren't involving kmem at all. I have a prototype of it; it needs some more work.
C: I mean, don't quote me on this — this is going by what I remember off the top of my head — but the general consensus is that somewhere between 256 and 512 is the sweet spot. Above that you might run into some issues, but I've seen systems hurt even below that; it really depends on your workload.
I: If I have to make a purchase decision today for a big pool — you know, a petabyte pool — I want to get as much memory as I can, right? But I can't go with a terabyte of memory, because you're telling me that I'm going to hit this problem if I have too much RAM and therefore too much ARC, right?
B: I think your mileage will vary based on workload, so it might be that one terabyte for your case would be fine, but we've definitely seen issues as you go above and beyond 512 gig, at least on illumos — there are things that need to be addressed. I think our largest customers are running around 384; we may even have some at 512, but that seems to be about where there's still some work to be done.
That's going to vary for every system you build. The work that Matt alluded to, that he's got a prototype of, would be more ZFS-specific, and presumably it would carry over to FreeBSD. Primarily, the reason a lot of that work is being done is that kmem reap behaves very differently on the various platforms, and inevitably has problems at different points. We don't know what FreeBSD's might be; we're definitely happy to tell you all about the problems we've seen, though.
I: Another little one. A couple of years ago, at OpenZFS Day — 2013, maybe 2012, but I think it was '13 — I think it was you: you ended a talk with a very enigmatic remark (I was on the live stream that day), something like, on 4K disks you really shouldn't do RAID-Z. Is that still a concern? Is that...
F
Is
a
valid
thing
depending
on
the
workload
so
do
it?
Can
you
bring
it
by
blog
post?
I
wrote
a
blog
post,
which
is
directly
addresses
this
issue
that
the
the
issue
is
essentially
like
if
you're
using
4k
disks
with
raid
z,
you're
using
small
record
size
like
4k
right,
hey,
record
size,
you're,
probably
shooting
yourself
in
the
foot
like
if
you're
using
for
kak
record
size,
don't
use
raids,
irregardless
of
its
4k
or
not,
but.
Even if you have a million small files, it's usually going to be those hundred big files that are actually consuming most of the space, and those are using the 128K block size. I mean, for most files, 128K is tiny nowadays, right? Even a picture file is, like, ten 128K blocks. So, no.
If you're doing streaming and you're serving up a lot of streams at the same time, it kind of looks like random access, but you can afford to cache quite a bit of it. So rather than random access of 128K, you're doing random access of one meg or more, and you're able to get many more megabytes per second by doing it in one-meg chunks than in 128K chunks.
F
Potential
benefit.
The
quality
would
only
be
that
those
one
meg
reads
and
writes:
can
kind
of
stopped
up
the
pipeline
right
like
if
you're
using
a
disk,
then
a
one
Meg
read
is
going
to
take
longer
than
128
k
read.
So
if
you
have
other
latency
sensitive
operations,
then
the
latency
is
going
to
go
up
so.
C: For instance, say you have a workload, a dataset, which you've not previously tuned to a specific record size, and your workload seems to handle, say, 8K. If you do a smaller write to it — not a full-block write, some sort of workload that does partial writes to the blocks — what we do in memory is a read-modify-write: we read the block off the disk, modify it, and then write a new copy.
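As a rough illustration of that read-modify-write path for a sub-record write (a sketch only — the record size, function names, and read/write callbacks are hypothetical, not the actual ZFS DMU code):

    #include <stdint.h>
    #include <string.h>

    #define RECORDSIZE 8192

    /* Illustrative read-modify-write for a partial-record update: fetch
     * the whole record, splice in the new bytes, write the whole record
     * back out as a new copy (copy-on-write). Assumes the write does not
     * cross a record boundary (offset % RECORDSIZE + len <= RECORDSIZE). */
    static void
    partial_write(uint64_t offset, const void *data, size_t len,
        void (*read_record)(uint64_t recno, void *buf),
        void (*write_record)(uint64_t recno, const void *buf))
    {
        uint8_t  record[RECORDSIZE];
        uint64_t recno = offset / RECORDSIZE;
        size_t   off_in_rec = offset % RECORDSIZE;

        read_record(recno, record);              /* read: old block from disk */
        memcpy(record + off_in_rec, data, len);  /* modify: apply the sub-record write */
        write_record(recno, record);             /* write: allocate and write a new copy */
    }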