From YouTube: dRAID, Finally! by Mark Maybee
Description
From the 2020 OpenZFS Developer Summit
Slides: https://docs.google.com/presentation/d/1uo0nBfY84HIhEqGWEx-Tbm8fPbJKtIP3ICo4toOPcJo/edit?usp=sharing
Details: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2020
A: dRAID, finally! Well, almost finally. I'm going to talk about dRAID and where we are with it, give you an idea of when you're actually going to see it, and why it isn't here already: we were hoping for 2.0, but we didn't quite make it.
So, the history again: this is a feature that was originally developed at Intel by Isaac Huang. He gave talks on it at the OpenZFS Developer Summit in 2015, 2016, and 2017.
A: The feature was picked up by Cray in 2018, because they wanted to use it as a key capability in a ZFS version of their distributed storage product. Brian Behlendorf adopted the Cray version of the feature in early 2020 and created what is now the current PR for the feature, and I have been working on it for a number of months, trying to get all the pieces pulled together.
A: All right. I want to introduce a little bit of terminology before I start, mostly because there's a lot of confusing terminology here that can easily get conflated between RAID-Z and dRAID. First, group size: the group size is the number of columns that data is partitioned into, plus parity.
A: Obviously, the number of data columns defines how much overhead your redundancy is going to cost, and the amount of parity determines how much redundancy you have. The dRAID size, then, is the number of drives that are actually used for storing data within a dRAID configuration.
A: A dRAID row is a 16-megabyte chunk of space allocated at the same offset across all the drives in the dRAID configuration; for example, row zero is offset zero through offset 16 MB across all drives in the config. You'll see why that's important in a little bit. And a permutation slice is one or more rows which are permuted based off of the dRAID permutation array; the actual number of rows per slice is derived from the least common multiple of the group size and the dRAID size.
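(As an aside: a minimal sketch of that rows-per-slice rule, in Python rather than the actual OpenZFS code. The example values are taken from the eleven-drive layout discussed next, assuming one distributed spare leaves a dRAID size of 10.)

```python
from math import gcd

def rows_per_slice(group_size: int, draid_size: int) -> int:
    # A permutation slice must contain a whole number of groups, so it
    # spans lcm(group_size, draid_size) columns; dividing by the number
    # of drives (the dRAID size) gives the number of 16 MB rows.
    lcm = group_size * draid_size // gcd(group_size, draid_size)
    return lcm // draid_size

print(rows_per_slice(5, 10))   # -> 1: group size 5 divides 10 evenly
print(rows_per_slice(4, 30))   # -> 2: groups of 4 need two rows to tile 30 drives
```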
A: So, keeping all that in mind: what is dRAID? dRAID is RAID (in our case, in ZFS's case, RAID-Z) declustered.
A
So
what
that
means
is
on
the
left.
You
see
here
an
example
of
what
a
raid
z
layout
might
look
like
where
you
have
two
raids
top
levels
with
hot
spare
allocated
for
your
pool
and
the
equivalent
or
the
contrast
of
that
to
as
a
d
raid
would
be
to
say.
I
want
a
d
array
defined
with
five
a
group
size
of
five
across
these
eleven
drives,
with
one
of
the
the
drives
being
or
drives
with
a
spare
capacity.
A: On both sides, you see I've divided the tables into rows, and in this case these rows represent a group, a permutation slice, and a dRAID row all at once. Each row is permuted from a traditional layout across drives (RAID-Z, on the left) to a semi-random layout (on the right), based off of a computed permutation, which lets us spread our data relatively randomly across the entire pool.
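(A minimal sketch of the idea, assuming a simple seeded shuffle, which is not the exact permutation scheme dRAID uses: each slice applies its own permutation of the drive indices, so the same logical column lands on different physical drives from slice to slice.)

```python
import random

def drive_for(slice_idx: int, column: int, ndrives: int, seed: int = 42) -> int:
    # Derive a per-slice permutation of the physical drives. Illustrative
    # only: real dRAID uses precomputed, balance-checked permutations.
    perm = list(range(ndrives))
    random.Random(seed * 1_000_003 + slice_idx).shuffle(perm)
    return perm[column]

# The same logical columns map to different physical drives in each slice:
for s in range(3):
    print([drive_for(s, c, ndrives=11) for c in range(11)])
```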
A: Okay, let's talk in a little more detail about RAID-Z versus dRAID.
A: dRAID uses a different map allocation function, but its pipeline and its functions are almost identical to the RAID-Z code, and it actually calls into the RAID-Z layers to do its I/O. In a RAID-Z layout, each group is constructed from a set of physical drives, and its columns extend the entire length of those drives.
A: So the only constraint you have is that allocations into those RAID-Z groups must be a multiple of parity plus one. This is to prevent stranded space, space so small that you can't actually allocate from it; anything smaller than parity plus one is not allocatable in a RAID-Z layout.
A: In dRAID, groups are divided into rows of 16 MB. Rather than a column consuming the entire disk as in RAID-Z, each column is chunked into 16 MB pieces, and contiguous 16 MB chunks don't necessarily reside on the same drive (this is the permutation), so group rows are non-contiguous on physical drives. And here allocations must be a multiple of parity plus data, not just parity plus one.
A: That means that, for the same example, 1K of data is going to require 2.5K of space in a group of size five; and if the group were larger, say an eight-disk group, you would have to add another three 512-byte chunks to that allocation in order to align it to the dRAID constraints. I'll explain exactly why that is in a minute.
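(The rounding rule, as a hedged sketch; group_size here includes parity, matching the terminology above.)

```python
def draid_alloc_bytes(data_bytes: int, group_size: int, parity: int,
                      sector: int = 512) -> int:
    # dRAID rounds every allocation up to whole group rows: a multiple
    # of (data + parity) sectors, not just (parity + 1) as in RAID-Z.
    data_sectors = -(-data_bytes // sector)            # ceiling division
    needed = data_sectors + parity
    rows = -(-needed // group_size)
    return rows * group_size * sector

print(draid_alloc_bytes(1024, group_size=5, parity=1))  # -> 2560: 1K needs 2.5K
print(draid_alloc_bytes(1024, group_size=8, parity=1))  # -> 4096: 3 more 512-byte chunks
```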
A: So why decluster? There are a couple of very good reasons. One: spare drives are leveraged. In the first example I showed, we had our hot spare, and that hot spare is a spindle that sits idle most of the time, until there's actually a problem and it needs to be utilized; it's never used except to be ready to go.
A: In a declustered configuration, we distribute the spare capacity throughout the dRAID, so that actual spindle is leveraged with a combination of data and spare capacity, and every spindle is the same. That way, each group's data is randomly distributed across all the drives. This decouples the redundancy group from the number of data drives, so that as we are allocating, reading, and writing, we can consume from all the spindles, rather than just the set of spindles which make up that particular group.
A: This allows us to use all the spindles all the time. And, most importantly, sequential resilver works in this configuration, because of that constraint that all rows must be full rows; i.e., we always have minimal allocations of parity plus data, so we always fill the entire row. This way we can construct our pseudo-blocks for the dRAID, because we know exactly where the parity is going to be laid out and where the data is.
A: So how do you go about creating a dRAID in ZFS? We added a new top-level vdev type called draid, very similar to a raidz vdev. We support draid1, draid2, and draid3 for the various parity levels.
A: You can specify a redundancy group size at creation: the colon-5d example here, draid1:5d, would be a group size of six, five data plus one parity. The default for the data portion of the group size is eight.
A: In general, though, the group size has to be no larger than the number of data drives in your config. This may not seem obvious, but if you had, for example, three drives and tried to define a group size of four on that, it doesn't work, because you don't have enough drives to actually provide the level of redundancy you're asking for: with only three drives, you can't get four drives' worth of redundancy. But you can define any group size smaller than the dRAID size.
A: You can specify the spare capacity as a drive count. So in this example, draid1:5d:2s says: I want two drives' worth of spare space.
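(A toy parser for that vdev notation, to make the fields explicit. This is a sketch of the syntax as described in the talk, not the OpenZFS implementation; the released grammar also takes a child count, as in draid1:5d:11c:2s.)

```python
import re

def parse_draid_spec(spec: str) -> dict:
    # draid[parity][:<data>d][:<children>c][:<spares>s]
    m = re.fullmatch(r"draid([123])?(?::(\d+)d)?(?::(\d+)c)?(?::(\d+)s)?", spec)
    if not m:
        raise ValueError(f"bad dRAID spec: {spec}")
    parity = int(m.group(1) or 1)
    data = int(m.group(2) or 8)          # default data width is eight
    return {"parity": parity, "data": data, "group_size": data + parity,
            "children": int(m.group(3)) if m.group(3) else None,
            "spares": int(m.group(4) or 0)}

print(parse_draid_spec("draid1:5d:2s"))
# -> {'parity': 1, 'data': 5, 'group_size': 6, 'children': None, 'spares': 2}
```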
A: That means that, out of the total drives specified for the dRAID config, we will reserve two drives' worth of capacity as spare space. And we now expose in the status as much of the actual configuration as we can. Because the groups are now pseudo-groups within the config, there's no easy way to say these four drives belong to this group and those four drives belong to that group, as you would see in a RAID-Z; instead, it's random. So we present it as a top-level draid vdev over all the drives, but in the draid name we tell you how it's been partitioned logically. And at the end we have our spares; these are pseudo-spares, again, because we haven't reserved two physical drives, we have reserved two drives' worth of space, and these two entries represent the pseudo-handles used to access that drive space, which is distributed across all the drives.
A: All right. I want to move on now to talk about the various issues that we ran into, that Brian and I had to address as we were trying to get this code ready for integration.
A: In general, you have to have a permutation array that defines the permutation for each slice of the dRAID configuration. That had been generated at pool creation and stored in the label, and it was critical information that had to be present when you first load the pool, because all access to the pool has to go through figuring out the permutation. It was put in every label because it was critical data: if you lost it, you could not access any data in the pool anymore, because you wouldn't know how to reconstruct the chunks of groups and slices; if you lost a permutation, you wouldn't know how to fit all that data back together again. So there were sort of two issues there.
A: One was the fact that we were using up a lot of label space for this, and two was the fact that losing it would be really, really bad. So the answer we came up with was to essentially predefine the permutations for a configuration. Brian figured out a means of coming up with a predefined set of permutations, or at least of seeds for calculating permutations, that would always produce the same permutation, and a relatively optimal one. That is how it's now handled inside the dRAID code: there's a table of these permutation seeds that is used to generate the permutations on the fly as soon as you bring the pool up, so there's no risk of losing the permutation.
A: It's part of the implementation, and it's basically instantaneous, because it's all there to start with; you're not going through a computation phase to figure out what your permutation array is going to look like. And there's no need to store it in the labels anymore, so we've freed back up the space we had been consuming there.
A: All right, the second issue we addressed was group size constraints. In the original implementation, there was the notion that the dRAID slice was always equivalent to the dRAID row: you only ever had one row in your slice. Basically, that meant your groups had to divide evenly into the number of drives in the dRAID, your dRAID size.
A: If you wanted even group sizes, that meant that, for example, in a 30-drive configuration you could only support three-drive groups, five-drive groups, ten-drive groups, or fifteen-drive groups; it had to be an even multiple. You couldn't, for example, support a group width that didn't divide evenly into those 30 drives.
A: We did work out a variation, an enhancement to dRAID, which allowed you to define a configuration that used different-sized groups to fill out the set of groups that went into the row.
A: This gave us more flexibility and let us find group sizes that were closer to what you were looking for as a user of this capability. The problem was that it was not optimal, obviously: you'd end up with some groups one size and some groups another size, and it's hard to reason about a pool, in terms of performance at least, when you have different-sized groups like that.
A: So, instead of requiring that each slice have just a single row in it, we say the number of rows in the slice is actually going to be derived from the least common multiple of the group size and the dRAID size. This allows us to tile in as many groups as needed to get an even multiple, so we evenly fill up the space in that permutation slice. That basically decouples the group size from worrying about the number of groups.
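(To make the tiling concrete, a sketch in the same spirit as the earlier ones: lay groups end to end across the slice; the least common multiple guarantees the last group ends exactly on a row boundary, even when individual groups wrap across rows.)

```python
from math import gcd

def tile_groups(group_size: int, draid_size: int) -> None:
    # Groups are placed end to end; lcm columns make a whole number of
    # groups AND a whole number of draid_size-wide rows.
    lcm = group_size * draid_size // gcd(group_size, draid_size)
    for g in range(lcm // group_size):
        start, end = g * group_size, (g + 1) * group_size - 1
        print(f"group {g:2}: row {start // draid_size} col {start % draid_size}"
              f" .. row {end // draid_size} col {end % draid_size}")

tile_groups(group_size=4, draid_size=30)  # 15 groups tile 2 rows; some wrap rows
```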
A: The next issue we had to address was stranded space.
A: For each group row, you have a chunk of basically 16 megabytes, but the next row is not necessarily on the same drives, so as you're allocating into it, you actually have a different set of drives representing the next row in that group.
A: This did not work well with the logic in RAID-Z, because that's never an issue there: in RAID-Z, you always have the entire drive to deal with in the columns, so you never have to switch drives in the middle of a block. What that meant for dRAID, because we had this constraint, was that you could fill up a group's 16 MB chunk and get to a point where: all right, I'm trying to allocate this block and there's no space, so I'm going to move over to another group that has more space. You can end up with bits of space at the end of groups which are not easily allocated; you end up stranding that space. Now, in theory, you could eventually use that space with smaller block allocations.
A: But it was an awkward model: particularly in environments where you're doing a lot of large-block allocations, which is typical for a dRAID configuration, you could end up with a lot of these small chunks of space which you're never really going to make use of, and given the large number of rows in a big configuration with many large drives, that could be a problem. So the answer we came up with here was to leverage the multi-row allocation maps that were developed for the RAID-Z expansion project (thank you very much, Matt). Those allow us to define a block allocation that can span two separate groups: the first row says these columns reside on this set of drives, and the second row says the rest of the columns reside on this other set of drives.
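(Conceptually, borrowing the idea rather than the actual data structure, such a map records which physical drives hold the block's columns in each row; note how the drive set changes when the allocation crosses a row boundary.)

```python
def span_columns(start_col: int, nsectors: int, perms, draid_size: int) -> dict:
    # perms[row] is that row's (made-up) permutation of physical drives.
    # An allocation running past the end of one row continues on the next
    # row's differently-permuted drives: two rows, two drive sets.
    rows: dict = {}
    for i in range(nsectors):
        row, col = divmod(start_col + i, draid_size)
        rows.setdefault(row, []).append(perms[row][col])
    return rows

perms = [[3, 0, 4, 1, 2], [1, 4, 2, 0, 3]]       # hypothetical 5-drive layout
print(span_columns(start_col=3, nsectors=4, perms=perms, draid_size=5))
# -> {0: [1, 2], 1: [1, 4]}: the block spans two rows on different drives
```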
A: Next, and finally, we had to deal with the issue of space inflation.
A: This is a problem similar to the RAID-Z one, where we have to allocate additional sectors to fill out the constraint on the allocation size (parity plus one, in the case of RAID-Z), but it gets worse with dRAID, because we're allocating entire group rows, so we are now potentially filling a number of sectors that are not part of the data itself. This can be particularly significant if you're writing a lot of small-block data.
A: Let's go back to our example: say we have an eight-wide stripe with two parity, at a 512-byte sector size. Six of those eight sectors are data, so 3K of space is our minimum allocation size: if you start saving blocks that are smaller than 3K, you are going to consume 3K of data space for those blocks regardless. And the minimum allocation size grows as the stripe width gets bigger.
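(That arithmetic as a sketch, using the eight-wide, double-parity example: six data sectors per row, so every block is charged at least 3K, and a million 1K files consume roughly three times their logical size. Simplified accounting; parity and other overheads sit on top.)

```python
def charged_bytes(size: int, data_width: int, sector: int = 512) -> int:
    # Data sectors rounded up to whole rows of data_width sectors each.
    data_sectors = -(-size // sector)
    rows = -(-data_sectors // data_width)
    return rows * data_width * sector

print(charged_bytes(1024, data_width=6))          # -> 3072: a 1K block uses 3K
print(charged_bytes(1024, data_width=6) * 10**6)  # ~3 GB for a million 1K files
```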
A: What's also important to realize here is that in dRAID we explicitly zero-fill these extra sectors: when we write them, we actually write out zero-filled data to fill those sectors. That's important from the sequential-resilver perspective, because we need to be able to evaluate parity from arbitrary rows where we don't know where the data is and where the fill blocks are; having that fill be zero-filled data is critical to us.
A: So the answer here, and it's not a perfect answer: allocation classes go a long way towards helping. You can simply define some allocation classes, some additional capacity in your pool, and say: small-block data and metadata should be written out to these drives, which are configured more optimally for small-block data. The other answer is to use large blocks; dRAID is definitely tailored, in certain ways, for large-block data.
A: The larger the block, the less overhead you're going to see from this kind of fill inflation; percentage-wise, per block, it's very small. We at Cray tend to write 1 MB, 2 MB, up to 16 MB blocks, so we use large blocks all the time, and even when our blocks don't divide evenly into our group width, the overheads are minimal for us. That helps, but you always have to be aware of these overheads: you can see situations where your space is being used up faster than you realize. We're not stranding any space here, but we are inflating your allocation sizes, and the way that manifests is: my pool seems to fill up much faster than I anticipated, because I wrote out a whole bunch of small files.
A: So what about drive replacement performance? Again, one of the main focuses of dRAID was figuring out how to improve our device replacement and device rebuild process, and a dRAID configuration uses sequential resilver.
A: The spare capacity, again, uses all the drives, so we're reading from all drives to pull in the data necessary, reconstructing using data from all drives simultaneously, and writing across all the drives to fill in the hot-spare data. For small groups in a large configuration (in this example here, we have 90 drives), if you have a lot of small groups, you're going to see some amazing performance: as you see here, we're getting a 10x faster replacement. And even with a relatively large group size, 15-wide stripes, you're still seeing significantly faster rebuild rates. That's just a win, and it's the most important feature, I think, of dRAID.
A: Even a dRAID configuration with just a single group in it still gives us a benefit, because, one, we've gained the spare as an extra spindle in the config, and two, even though the reads are not any faster (we read across all the drives either way), the writes happen faster, because we're actually writing to all the drives for the spare space, and writes tend to be the limiter in device reconstruction.
A: All right, so when is dRAID going to be available? As I mentioned, we originally targeted the ZFS 2.0 release, and Brian and I worked really hard to make it.
A: At least, I twisted his arm hard when it got down to late August and he wanted to cut a release, but it just wasn't quite there: we still hadn't finished all the issues I just covered, and there was a bit of work to harden the code and fix up various issues that showed up as we were doing our work. So we ended up getting pushed out of the 2.0 release, and we're now targeting the 2.1 release.
A: I think we're doing well for that. The current status of the pull request is that it's converging. I think Brian said he has almost everything he wants in there, in terms of addressing all the issues he's seen with ZTS and the zloop issues he's encountered while doing a lot of testing.
A: So I think it's pretty hardened from that perspective. We have been doing a bunch of testing at HPE Cray, hammering it with our workloads, and it looks very solid at this point. I think we're largely waiting on the final code reviews for the release, so, fingers crossed, in the next few weeks, or at least by end of year, we'll have it in, and the 2.1 release out in a late-this-year or early-next-year timeframe. Brian can correct me if he's decided otherwise.
A: As for next steps: obviously, we need to get this thing integrated. As I said, there are a couple of small issues being closed out and the final code reviews getting done, but it's almost there. The next thing we're looking at beyond that is dRAID expansion.
A: The idea is: could you add a drive to a dRAID configuration? I think the answer may be yes, and I think this could be an interesting thing to look at. So if anybody's interested in thinking about that, maybe working on some prototype code, let's get together at the hackathon and see what we can do. All right, with that...
B: All right. So, yeah, we have about six questions now. There are a few from Jan; first one: what makes a permutation bad or optimal for dRAID?
A: The whole idea of the permutations is to derive a way of laying out the data such that you randomly distribute your data across all your drives, so that as you do your I/O, you're hitting as many drives as possible, all the time. And it has to be done in such a way, of course, that you don't have any overlaps that would destroy your redundancy: you have to be able to preserve the data given your parity. So if you have single parity, you should be able to lose a drive and not have a problem; if you have dual parity, you need to be able to lose two drives and not have a problem. The permutation algorithm is essentially going through making sure the permutations do not destroy the redundancy and, at the same time, give you as random a layout as possible for your data.
B: All right. Jan has two more questions, but I'm going to jump to someone else for now, so that everybody gets a chance. In practice, how has dRAID affected compression? That is, how much of the compression gains are lost to padding in your large-block workload?
A: In general, I haven't seen a huge impact from compression there. I mean, if you're compressing your large data down to nothing, or down to a very small block, of course that's going to change the dynamics of the situation.
A: Your logical throughput is going to remain very, very good. Your physical throughput may drop, because you're doing a lot of smaller I/Os, but that's not really an issue. What you can tend to see when you start layering compression on top of a dRAID is that the compression is obviously going to consume processor time to do that work, and that may end up slowing down your throughput, because you're now spending a lot of time doing the compression work itself.
A: But I don't think it materially impacts your layout, beyond the fact that, as you're laying down smaller blocks, you may end up with some more padding than you would have had if you had laid down larger blocks. But since you're compressing, you're saving space, so it doesn't seem like a problem to me.
C: Can I give an example? If you're using the default of eight wide, an eight-wide group, on 4K disks, then that means your allocation unit is 32K.
C: So if you're taking a 128K block and compressing it down, then you're probably adding, on average, 16K extra, which is, what, 15 percent or so more space. So maybe, instead of 2:1, your ratio is more like 1.7:1 or something.
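(Spelling out that back-of-envelope arithmetic; the 16K figure assumes compressed sizes land uniformly within the 32K allocation unit, so the average padding is half a unit.)

```python
def effective_ratio(logical: int, ratio: float, alloc_unit: int) -> float:
    # Compress, then pad by half an allocation unit on average.
    compressed = logical / ratio
    return logical / (compressed + alloc_unit / 2)

# 128K records, 2:1 compression, 8 data drives x 4K sectors = 32K unit:
print(round(effective_ratio(128 * 1024, 2.0, 32 * 1024), 2))  # -> 1.6, in the 1.7:1 ballpark
```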
C: And then maybe you want to go to the question about small block sizes, because that ties into this as well. There's a question from Jan: is there a script to calculate optimal dRAID layouts for small block sizes, like 8K or 16K?
A: So, there's no script for calculating optimal dRAID layouts. I think, as a sysadmin, you could say: all right, my workloads are going to be comprised of a lot of 8K I/O. Matt actually raised this question to me earlier: what's the recommendation, for example, if you're creating zvols, and should you maybe change your default block size in that case? And as you just said, if you're using an eight-wide stripe and your average block size is 8K while the minimum allocation is 32K, you're not going to be happy. So there's definitely a situation where you have to balance and say: given the average block size, a very large stripe width is not going to be a win for me. You need to account for your minimum allocation size in your calculations.
C: Yeah, so you don't really need a script, because it's so simple; it's a lot simpler than RAID-Z, actually. If you're using a 16K record size, not using compression, on 4K-sector disks, then you have four sectors per block, so you want to use a dRAID whose group data width is four or a factor of four; basically, four or two will lead to no additional padding there.
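(That rule of thumb as a tiny checker, a sketch where zero padding simply means the block's data sectors divide evenly into the group's data width.)

```python
def padding_sectors(recordsize: int, data_width: int, sector: int = 4096) -> int:
    # Sectors of padding added when a block is rounded up to whole rows.
    data_sectors = -(-recordsize // sector)
    return (-data_sectors) % data_width

print(padding_sectors(16 * 1024, 4))  # -> 0: 4 sectors fit a 4-wide group exactly
print(padding_sectors(16 * 1024, 2))  # -> 0: a factor of four also works
print(padding_sectors(8 * 1024, 8))   # -> 6: 8K records on an 8-wide group waste 75%
```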
C: But, you know, you might want to use compression.
A: Yes, because the database files are compressed.
C: So, what are you thinking? Yeah, since there are so many questions, I would say: why don't we keep it going here. If folks need to drop off for lunch, go ahead, or if folks want to go to the breakout rooms, you can go ahead, but it seems like there's a lot of interest in this, so why don't we keep going here for those who want to continue.
B: All right, so I will continue the questions in chronological order. From Jan: could RAID-Z expansion work on dRAID?
A: So, that's exactly it: RAID-Z expansion. Matt, I assume people are familiar with your RAID-Z expansion work; we are essentially leveraging the same ideas from that for the dRAID expansion concept. It's a little more involved and more complicated, and a little more space needs to be preserved, we believe. I mean, we've thrown around some ideas, and I have a sort of preliminary design in my head around it, but...
C: Yeah, come to the hackathon tomorrow, because I think Mark is hopefully going to be digging into this some more. And I think it's conceptually simpler in a lot of ways than RAID-Z expansion, because dRAID already has the concept of: this is the logical width of my thing, the data-to-parity layout, and then you have more drives than that, right? Your group width is eight, but you have, you know, 37 drives, and so when you're moving stuff around, you're taking those eight and kind of...
B: Okay, next question, another one from Jan: how much does dRAID improve read latency during resilver or rebuild; read latency peaks, more precisely?
A: So, the read latency of other processes is what he's interested in there, I think. Compared to RAID-Z, from the work that we've done measuring it, I don't know. Obviously, you can tune your resilver rates to try to compensate, for whatever you want in terms of other workload latencies, but we have not done enough work there to be able to tell you definitively: here are the right settings for this much impact, or that little impact, on your workloads during the resilver process. In general, your reads are competing with the resilver reads across all the spindles, but you are also using all the spindles all the time. So it's going to be an even match, and it's just a matter of saying: how do you want to balance your resilver reads versus any other reads that might be ongoing?
B: Yeah, makes sense to me. A question from Ryan: since this is built on the existing RAID-Z code, does that mean it already has vectorized math operations?
A: So, you know, you obviously still have physical drive issues here. When you have an issue with a drive failure, it's a physical drive failure: you don't logically lose drives, you physically lose drives. And the difference is that in RAID-Z, when you physically lose a drive, you see: I lost this drive, which is this column out of this particular RAID-Z group.
A: In dRAID, I've lost this physical drive, which is impacting all of my groups pretty much evenly, because you have a random data distribution. So when I replace it with a new physical drive, I'm going to pull from all the groups to rebuild onto that drive; or rather, if you have a distributed hot spare, you're going to rebuild onto the distributed hot spare first, and then, when you actually physically replace the drive, you resilver from that distributed hot spare onto the replacement drive. So you're again leveraging all of your distributed data.
B: Right. A question from, sorry for your name, Jim Leon: is the ClusterStor product at Cray going to be on ZFS, the Cray distributed storage you mentioned?
A: Yes. The base product at Cray, now HPE, is the ClusterStor storage product, and the next-generation version will have the option of using ZFS as its underlying storage.
A: Reduction, by the way, is, I think, conceivable but complicated. With expansion, you have the advantage that you're growing your space, so as you're rewriting, you always have space available to write the new version of the data, once you've copied aside some small amount of it. With reduction...
C: Yeah, I think you'd essentially have to have no allocations past the point that you're removing, so that you could just trim off the end. Which means you'd have to change either the way the allocator works, or you'd have to add the devices as less than their full sizes, or something like that, so that you preserve the end of the device as unallocated, just in case you want to remove a disk.
B: All right, a question from Stuart: can you talk about how resilver load affects performance? Don't you find you need to throttle resilver?
A: Yeah, so that's related to an earlier question, and yes, there are definitely going to be trade-offs. This is the situation whenever you talk about resilver performance and its impact on ongoing workload: every customer, everybody, has a different feeling about it. Typically, some people say: I don't care, I want to restore my full redundancy as quickly as possible, and I don't care what impact that has on my ongoing workloads.
A: Others will say: no, I'm willing to take a risk and let it go a little longer, just don't have too much impact on my ongoing workloads; that's more critical to me. So there are some tunables in that space; you can tweak them and play with them. And, as I said earlier, I don't know that we have the complete answers, like: here are exactly the right tunings for this amount of impact versus that amount of impact. I think there is actually space in the project for some follow-on work; we may end up tweaking or adding some extra tunables to allow more fine-grained control. But until we have more experience with it, that's the best answer I can give.
B: All right, thank you. I think we have time for the last two questions; if there are any more, they can be done at the breakout sessions. So, one from Becky: do you recommend, or not, using dRAID with NVMe drives?
A: I absolutely recommend it. This is an example where we do use dRAID on NVMe at HPE, on our product, when we add in the direct I/O work, which is forthcoming as a new feature in ZFS: the actual combination of dRAID with direct I/O. We see some very good performance numbers off of NVMe, and it can benefit; I mean, it's not as critical in terms of rebuilds, because NVMe is so much faster, but it still works well, and it's a good choice.
B: All right, last question, from Jan: does the dRAID rebuild trigger a resilver?
A: Does the dRAID rebuild trigger a resilver... so, back to our sequential resilver discussion earlier: the dRAID rebuild is just a sequential resilver, and after a sequential resilver we always trigger a scrub, so that part happens as a scrub, not a resilver. But yes, I think the answer you're looking for is yes.
B: Okay, thank you. And it looks like Becky wants to squeeze in one follow-up, so this is going to be the last one for real: does it require a particular type of NVMe?
A: So, our experiments to date have been on some pretty high-end NVMe drives, 3.8-terabyte drives that are capable of transfer rates on the order of four gigabytes a second, and those work well for us. But I don't think it necessarily depends on a particular NVMe drive type. Obviously, you want to factor in drive-writes-per-day calculations, that kind of stuff, but dRAID isn't necessarily any different, in terms of its I/O use patterns, from any other ZFS configuration you might choose. So I don't think there's a particular NVMe type that would make more or less sense.