Description
From the 2021 OpenZFS Developer Summit.
Slides: https://docs.google.com/presentation/d/1kiWrpqvl8T9gHjZmG7Pk0f7mE5LEPCmZnbtUFUsuM9g/edit?usp=sharing
Details: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2021
As most of you are probably aware, the ZFS on-disk structure is a tree of block pointers, and whenever we want to update a file in that structure, we are going to modify a leaf block within an object. Since block pointers checksum what they point to, we have to propagate the change up the tree, all the way to the uberblock, and to make all of this crash-safe, ZFS uses a copy-on-write mechanism.
So, as a concrete example, if we modify a level-zero block, we actually allocate a new level-zero block that contains our changes, and then we need to modify the block pointer for that block in the parent level-one indirect block over here to point to the newly allocated block. So we don't modify that parent block in place either, but create a copy of it as well.
But this time we reuse most of the block pointers, except for the one for the block that we just modified, and this goes on up the tree to the uberblock. Once the new uberblock has been written, it is the new root of the on-disk structure, and some parts of the old tree are now obsolete, so we can get rid of them.
This is quite beautiful to look at in and of itself, but we can't really afford to do this dance for every single VFS operation. Instead, ZFS batches many changes into transaction groups (txgs).
However, with these transaction groups we just got ourselves a new problem, because to make the batching work we must wait for changes to accumulate in DRAM, but at the same time it's not reasonable to block every VFS operation until the batch is large enough. So what we do is let the VFS operation return to user space immediately, and the txg will then be written out in the background.
So the solution that ZFS uses for this is the ZIL. The idea is to extend the on-disk structure with a linked-list head for each object set. The nodes on that list are self-checksumming, and that means we can append to the list independently of transaction groups being synced, because we don't need to update any parent block pointers that point to these nodes. Now, when a VFS operation needs immediate durability, it appends a log record to that list.
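To see why self-checksumming nodes allow appends without touching the main tree, here is a hedged sketch; the field names are illustrative and do not match the actual zil_chain_t layout:

```c
#include <stdint.h>

/* Illustrative on-disk layout of one self-checksumming ZIL list node. */
typedef struct zil_node {
    uint64_t zn_next_blk;    /* location of the next node in the list */
    uint64_t zn_txg;         /* txg this record belongs to */
    uint64_t zn_cksum;       /* checksum embedded in the node itself */
    uint8_t  zn_payload[];   /* variable-length log record */
} zil_node_t;

/*
 * Replay walks the chain from the list head stored in the objset and
 * stops at the first node whose embedded checksum does not verify:
 * that marks the end of the valid log. Because each node validates
 * itself, appending never requires updating a parent block pointer.
 */
void
zil_walk(uint64_t head,
    zil_node_t *(*read_blk)(uint64_t),      /* hypothetical block reader */
    int (*cksum_ok)(const zil_node_t *),
    void (*replay)(const zil_node_t *))
{
    for (uint64_t blk = head; blk != 0; ) {
        zil_node_t *n = read_blk(blk);
        if (n == NULL || !cksum_ok(n))
            break;                          /* torn or unwritten tail */
        replay(n);
        blk = n->zn_next_blk;
    }
}
```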
That log record describes what happened in that operation at the logical level. If everything goes well and the system doesn't crash, then the change is going to be written out in a transaction group, as an update to the on-disk tree, and as part of that new transaction group we write out a new list head that pops off all the log records that are now obsolete, because their changes are part of the tree structure proper.
We need those LWBs (log write blocks) for block alignment, because the log records themselves have variable length, as indicated in this sketch here. The batching of multiple log records into LWBs can also be used to do some tricks on high-latency hardware to make this a little less expensive.
The second thing that complicates things are itxs. The VFS operations actually don't write the log records directly to the on-disk list, because often a VFS operation doesn't even know whether it's synchronous or asynchronous; that is determined later, when we call fsync. So instead, the VFS operation only creates DRAM log record payloads, which we call intent log transactions, or itxs for short.
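In OpenZFS terms, the flow looks roughly like the following simplified sketch (see zfs_log.c and zil.c for the real code; error handling and payload setup are omitted):

```c
/* Simplified: inside the VFS write path, e.g. zfs_log_write(). */
itx_t *itx = zil_itx_create(TX_WRITE, sizeof (lr_write_t) + len);
/* ... fill in the lr_write_t payload describing the logical change ... */
zil_itx_assign(zilog, itx, tx);  /* queue the itx in DRAM, per txg */

/*
 * Only later, if the application actually asks for durability
 * (fsync, O_SYNC, ...), do we persist the queued itxs to the ZIL:
 */
zil_commit(zilog, zp->z_id);
```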
Now let's talk about the performance on modern hardware. The basic principle is that the latency for a synchronous I/O operation is the time spent doing the VFS work plus the time spent on the itx assign and the commit: latency = T_vfs + T_itx_assign + T_commit. The VFS and itx assign parts are purely CPU- and DRAM-bound, but the commit is a little more complicated.
The commit does the work of figuring out which itxs need to be written out to disk, which means building the commit list; then it has to take those itxs from the commit list, convert them into log records and pack them into LWBs; and then it uses the ZIO pipeline to write those LWBs out to the actual storage hardware.
Now, all of these are software steps that add overhead to the actual hardware latency of each LWB that is written. Historically, software overhead wasn't really a problem, because the hardware latency dominated every other component in this equation. But with modern hardware, for example the 3D XPoint technology that is used in the Optane NVMe and PMEM drives,
we get single-digit-microsecond latencies for 4k synchronous random writes, and this means that even a few microseconds of processing time can easily become the performance bottleneck. We can actually observe this by configuring Optane PMEM as a SLOG device with today's ZIL. In the experiment we did here, we used fio to generate 4k synchronous writes onto a zpool with a separate dataset for each fio thread, and the pool had three enterprise NVMe drives configured as top-level vdevs,
so plenty of I/O throughput, and a single PMEM DIMM configured as the SLOG device. What we measured was the wall-clock time spent in each of the latency components, and what we could observe is that the vast majority of the time spent per IOP goes to LWB and ZIO overheads.
Only about 20 percent of the time is spent on DMU and itx work, and only 14 percent of the wall-clock time is spent on the actual interaction with the hardware, waiting for it to store the data; the remaining roughly two thirds are LWB and ZIO software overhead.
There's a lot more nuance to this analysis that I'm not able to present here due to time constraints, and it's also certainly not a workload that is representative of every use case for ZFS, but it's a good example of what is wrong with the current ZIL and why I believe we should re-architect it for modern hardware. To summarize, there were two conclusions that I drew from this experiment. The first was that batching log records into LWBs is not necessary, at least not always: on PMEM, which is byte-addressable,
we don't need to adhere to block boundaries at all. And even where we do, because we are on NVMe drives for example, tricks like the batching and the LWB timeout and all the stuff that we do for high-latency storage hardware don't really buy us a latency advantage anymore; they are more of an overhead at this point.
The second conclusion is that the ZIO pipeline adds much overhead for very little benefit. The problem is that all the context switching that's going on in there adds latency and latency jitter, and most likely it's also not particularly helpful for data locality. Essentially, the entire design is geared more towards high throughput than low latency.
So, given these observations, I think that a new ZIL design is needed, and it should have the following properties. First, it should abandon LWBs as a concept and store individual log records instead. It may or may not do some batching under the hood to optimize things, but to get the lowest latency for fully synchronous workloads, we should just store individual log records.
Second, we should no longer have pointer chains on disk like we do with LWBs today. Instead, we should defer the serialization work, at least for the I/O operations, as much as possible to the time of replay; this will enable more parallelism on the write path for independent operations. And third, we should bypass the ZIO pipeline to avoid its overheads and write directly to the storage hardware.
We don't give the SLOG device's space to the SPA; instead, we let the ZIL consume the space of the hardware directly. The ZIL then constructs a storage substrate on top of it, which is used to store all the log records of all datasets in the pool. That storage substrate behaves like an unordered set of log records: you can put records into it, you can iterate over it in an arbitrary order, and it will automatically garbage collect itself
in the background, to avoid running out of space; it uses the log records' last-synced transaction group to determine which log records need to be garbage collected. On top of this very, very minimal interface, we then implement the actual ZIL functionality. The idea is that each dataset adds a bunch of metadata to each log record, and if we should actually crash, the replay code will use that metadata to figure out which records need to be replayed, and in what order.
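As a sketch, the substrate interface described here could be as small as the following (the names are illustrative, not the prototype's actual API):

```c
#include <stddef.h>
#include <stdint.h>

typedef struct substrate substrate_t;   /* opaque; one per pool */

/* Durably insert one log record (a crash-consistent append). */
int substrate_put(substrate_t *s, const void *rec, size_t len);

/* Iterate over all surviving records in arbitrary order (replay). */
void substrate_iterate(substrate_t *s,
    void (*cb)(const void *rec, size_t len, void *arg), void *arg);

/*
 * Tell the substrate which txg has last synced; records belonging to
 * that txg or older are obsolete and may be garbage collected in the
 * background.
 */
void substrate_set_last_synced_txg(substrate_t *s, uint64_t txg);
```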
Here's a quick visualization. At the time we write the ZIL, zil_commit will be writing the log records in some logical order, but the storage substrate is free to organize its log space however it sees fit; it just has to ensure that it will find those log records again if we crash. And while we are writing new log records, the garbage collection will kick in in the background.
This will just be how the thing operates all the time, and we hope that there is free space at the physical level at all times. Now, let's assume that we crash. The replay code will scan through the storage substrate's contents to find the records of the dataset in question, filter out obsolete log records, and then reconstruct the replay sequence.
That replay sequence is then applied to the dataset to recover the committed state, just as it is with the current ZIL today. And if the storage substrate loses some log records, for example due to bit rot or data corruption, then the replay algorithm has to deal with the fact that those log records won't show up when it scans the storage substrate.
So data integrity is also covered here. Now, why is this more performant? There are two main reasons. The first is that we've eliminated write-after-write dependencies on the I/O path for independent writes. Of course, if the writes actually depend on each other, they must have a mechanism to wait on each other so that replay can succeed, but at least we now have the option to do independent writes fully in parallel.
So now that I've established the high-level idea, let's make things a little more concrete with an example. What we are seeing here is a visualization of the storage substrate's contents and the metadata of each log record. On the x-axis we have the generation number, which we use to encode logical dependencies between records; on the y-axis we have the transaction group of the individual records. The name of each record, for which we use a letter, is also a piece of metadata.
It identifies the entry uniquely within a generation, and for clarity we are using unique names for all log records in this example. At the beginning of our example the storage substrate was empty, and now we've added a log record, which we call A; it is for a change in transaction group 4, in generation 11.
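Put as a data structure, the per-record metadata in this example might look like the following sketch (field names are hypothetical; the counters are explained further below). The constant 3 matches the ZFS invariant that at most three transaction groups can be unsynced at any time:

```c
#include <stdint.h>

#define TXG_CONCURRENT_STATES 3    /* open, quiescing, syncing */

typedef struct record_meta {
    uint64_t rm_txg;       /* txg of the change (y-axis above) */
    uint64_t rm_gen;       /* generation number (x-axis): encodes
                            * logical dependencies between records */
    uint64_t rm_id;        /* unique id within the generation (the
                            * letter in this example) */
    uint64_t rm_counters[TXG_CONCURRENT_STATES]; /* see below */
} record_meta_t;
```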
Then we write another record, C, into the next generation, 13. And while we are writing new records, garbage collection might kick in in the background and remove the records for transaction group 4, because transaction group 4 has been synced in the background. So this will happen, but the garbage collection is fully independent of the write path. So even while garbage collection is going on, we are still writing another record, D, and D is actually for the same generation as C.
A
Now
we
continue
to
write
records,
e
and
f,
which
also
share
the
same
generation
number
14,
so
e
and
f
don't
depend
on
each
other,
but
e
and
f
both
depend
on
c
and
d
for
replay
and
finally,
we'll
write
records,
g
and
h
into
a
new
generation,
and
it's
also
starting
some
some
new
txts.
So
the
number
is
going
up
there
note
that
at
this
point
we
can
observe
that
garbage
collection
is
going
to
kick
in
pretty
soon
because
at
any
given
point
there
can
only
be
three
unsung
transaction
groups.
So at this point it's clear that transaction group 5 has synced out, because otherwise there couldn't be an entry for a log record in transaction group 8. Now let's recap what metadata we've observed here: we have seen the transaction groups, we have seen the generation numbers, and we have seen the unique IDs for each entry, which are represented by the letters. However, this is not really sufficient to detect lost entries at replay time.
So for this we use a counter per transaction group, and to explain how these are computed, we're going to run through the example again and show which counters each individual log record carries. For the records of the first generation, all the counters are set to zero, and after we are done writing a generation,
we sum up how many records were written in each transaction group, and this running sum is kept in a table called the counters table. For the next generation's records, we then use the counters table's contents as the counter values for each individual log record. We can see this with entry B: we've copied the table into the entry B, and once B's generation, generation 12, is done, we do the accounting again and account for the fact that we have written another log record in transaction group 5.
Remember, this is a running sum, so we don't reset the table after a generation. Now, if there are multiple records in a generation, like in generation 13, we use the same table in every record of that generation, because, as we'll see later, these don't depend on each other; they only depend on the last generation and every generation before it.
A
So
again
we
just
copy
over
the
table
and
at
the
end
of
the
generation
we
account
for
the
records
written
and
this
time
we've
written
two
records.
So
we
have
to
bump
two
counters
here
now
and
this
this
will
be
like
again.
It's
the
same
procedure
for
generation
14.
We
do
the
accounting
again
and
we
write
records
for
generation,
15
15.
Now, there's one interesting property here: the table that we store for generation 15 only contains the counters for transaction groups 8, 7 and 6. We've dropped the counters for transaction groups 5 and 4, and the reason is that we know that these have synced, so we know that we won't have to validate those counters later during replay.
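A hedged sketch of this write-path accounting, reusing the hypothetical record_meta_t and TXG_CONCURRENT_STATES from the earlier sketch:

```c
#include <stdint.h>

typedef struct counters_table {
    uint64_t ct_base_txg;                     /* oldest unsynced txg */
    uint64_t ct_count[TXG_CONCURRENT_STATES]; /* records written per txg */
} counters_table_t;

/*
 * Every record of a generation carries the table as it stood when
 * that generation started.
 */
static void
stamp_record(record_meta_t *rm, const counters_table_t *ct)
{
    for (int i = 0; i < TXG_CONCURRENT_STATES; i++)
        rm->rm_counters[i] = ct->ct_count[i];
}

/* After a generation has been fully written: update the running sum. */
static void
account_generation(counters_table_t *ct, const record_meta_t *recs, int n)
{
    for (int i = 0; i < n; i++)
        ct->ct_count[recs[i].rm_txg - ct->ct_base_txg]++;
    /*
     * When a txg syncs, ct_base_txg advances and the synced txg's
     * slot is dropped, as with generation 15 above.
     */
}
```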
Okay, so we've seen how the counters are computed; let's now put them to use and exercise the replay code path.
Let's assume that we crashed after transaction group 5 was written out, but before it was garbage collected. In that case, the records that need replay are D, F, G and H. We ignore B and E, even though they are still present, because their changes are already part of the main tree structure; if we attempted to replay them, the replay callback would fail, because the replay actions encoded in these log records are not idempotent.
A
So
our
plan
is
to
replay
d,
f,
g
and
h
and
then
use
the
counters
to
detect
lost
records
along
the
way.
Let's
first
cover
the
happy
case
where
we
haven't
lost
any
record.
In
that
case,
we
so
we
initialize
our
counters
table
to
zero
during
replay
and
then
look
for
the
first
records
counters
and
for
each
counter
that
is
greater
than
the
transaction
group
at
which
we
crashed.
The
counters
must
match
what
we
have
in
the
replay
table,
so
these
counters
are
all
from
generations
that
are
5
or
older.
so there's nothing to check, and we can just replay D. After doing the replay action, we update the counters: once we've replayed all entries of a generation, we update the counters table just as we would do on the write path.
Now, moving on to F: the counters for transaction groups 5 and 4 can be ignored again, but the counter for transaction group 6 must match what we have in the table, and that is in fact the case. So we can replay F, do the accounting, and move on to entry G. Again we compare the counters and observe that they match, so we can replay G as well, and H we can replay too. Great, so that was easy. Now let's look at the case of actual data corruption.
Suppose some bit rot has corrupted records E and F. Then the storage substrate would not show them to us when we scanned it to construct the replay sequence; we wouldn't even know that they existed in the first place, because the storage substrate doesn't tell us about them. So for now, our replay sequence is going to look like D, G and H.
Now we want the replay algorithm to replay record D, but it mustn't replay records G or H, because they depend on F, courtesy of the generation numbers. So let's see how this works out: we initialize the counters table and compare the counters for record D. They match, or rather they can all be ignored, so we can replay D and do the accounting.
When we get to G, however, its counter for transaction group 6 doesn't match our table, because we never accounted for F, so we mustn't replay G; and H we cannot replay either, and we'll stop replay at this point, because we've reached the end of our tentative replay sequence. The end result is that we've replayed as much as possible, given the constraints of the generation numbers. Great. And the important thing is that we can actually present witnesses for a missing record.
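The replay-side check can then be sketched as follows (again hypothetical code, reusing the structures from the earlier sketches): rebuild the counters table while replaying, and treat any mismatch in a txg newer than the crash txg as a witness for a lost record:

```c
/* Returns 0 if rm may be replayed, -1 if a lost record was detected. */
static int
replay_check(const record_meta_t *rm, const counters_table_t *ct,
    uint64_t last_synced_txg)
{
    for (int i = 0; i < TXG_CONCURRENT_STATES; i++) {
        uint64_t txg = ct->ct_base_txg + i;
        if (txg <= last_synced_txg)
            continue;    /* that txg synced; its counter is irrelevant */
        if (rm->rm_counters[i] != ct->ct_count[i])
            return (-1); /* e.g. G above: F was never accounted for */
    }
    return (0);          /* safe to replay this record */
}
```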
But what we are going to focus on right now is the concrete implementation, and later on, benchmarks. The short name for this entire project is ZIL-PMEM, and it's the product of my master's thesis.
The goal of the thesis was to design a system that makes synchronous I/O in ZFS as fast as possible using persistent memory, so this was really a no-compromise approach to making synchronous I/O fast. We've already covered the high-level ideas and the algorithms, so the rest of the talk is about the implementation. First, though, a few words about persistent memory itself.
You might know persistent memory under a different name: there is non-volatile main memory, or storage class memory; the naming depends on which branch of industry or academia you're following. In the case of "persistent memory", it's a concrete product name, branded by Intel.
The idea is generally always the same: instead of speaking a storage protocol like NVMe, you map the PMEM directly into the address space and then use normal load and store instructions, plus maybe some cache flushes and so on, to perform I/O to it. So there is no storage protocol anymore.
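As a minimal user-space flavored sketch of that technique (plain stores, then cache-line flushes, then a fence; this is the general pattern, not the prototype's code, and it assumes a CPU with CLWB, e.g. compiled with -mclwb):

```c
#include <stdint.h>
#include <string.h>
#include <immintrin.h>

#define CACHELINE 64

static void
pmem_persist(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);                  /* normal stores to the mapping */

    uintptr_t p = (uintptr_t)dst & ~(uintptr_t)(CACHELINE - 1);
    for (; p < (uintptr_t)dst + len; p += CACHELINE)
        _mm_clwb((void *)p);                /* flush each dirty cache line */

    _mm_sfence();                           /* order flushes before "done" */
}
```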
The second advantage of PMEM is that it's byte-addressable. This is ideal for the ZIL, because ZIL log records have variable length and are often very short, so with a ZIL on PMEM we don't really have to worry about padding, block alignment, or the space wastage that might result from padding up to a certain alignment.
What we haven't covered yet is how we can actually reach that hardware in an operating system like Linux. First of all, there are several operating modes for PMEM, and we're going to use the App Direct mode in the fsdax configuration here; you can just ignore those details if you're new to the topic. What's important is that in that mode the PMEM shows up as a block device node in the device fs, with a kernel driver that provides this device.
Block device consumers can then use the corresponding APIs to check whether the block device is actually PMEM, and if that is the case, they can establish a direct memory mapping to the PMEM. Once the mapping is established, the consumer can issue load and store instructions and cache flushes and so on directly to that memory mapping, and the operating system is completely out of the picture.
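In kernel code, the shape of this is roughly the following (simplified; the exact DAX function signatures vary across kernel versions, so treat this as an illustration rather than a copy-paste recipe):

```c
#include <linux/dax.h>
#include <linux/blkdev.h>

/* Returns a kernel virtual address for nr_pages of PMEM, or NULL. */
static void *
map_pmem(struct block_device *bdev, pgoff_t pgoff, long nr_pages)
{
    struct dax_device *dax_dev = fs_dax_get_by_bdev(bdev);
    void *kaddr;
    pfn_t pfn;

    if (dax_dev == NULL)
        return NULL;    /* not PMEM: fall back to regular block I/O */

    /* After this, plain loads/stores (plus flushes) hit the PMEM. */
    if (dax_direct_access(dax_dev, pgoff, nr_pages, &kaddr, &pfn) < 0)
        return NULL;
    return kaddr;
}
```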
What we see in this screenshot here is an example from the ext4 source code, where there is a conditional optimization for PMEM: a different implementation of how we read data from a file, if the file is on an ext4 instance that is deployed on PMEM. Great. So, given this framework, the goal for ZIL-PMEM was to make it fully transparent to the user.
Now, we can't throw away the old code for a bunch of reasons, so the first step was to refactor the ZIL so that different persistence mechanisms can coexist at runtime; the result is a concept called ZIL kinds, and I'll give more details about that in a minute. After that, I actually implemented the high-level ideas that I presented earlier.
If you remember, we needed a storage substrate for PMEM and an implementation of the higher-level algorithms. The storage substrate on PMEM is called PRB, the high-level algorithms are implemented in a code module called the handle, and the handle data structure exists once per dataset instance. Now, to make these data structures easier to test, I implemented PRB and the handle as standalone modules, so there is some glue code necessary to integrate them into the ZFS code base.
So now, with this refactoring in place, I could introduce a vtable that decouples the persistence API from the general ZIL API, and we can have different implementations of this vtable coexist at runtime; the name for these different implementations is ZIL kinds. Now, any kind will need some place to store per-dataset information, like the LWB list head for ZIL-LWB; ZIL-PMEM, for example, stores some metadata there so that it can find its log records again.
A
So
basically,
we
needed
a
place
for
this
and
the
place
the
ideal
place,
for
this
is
the
zil
header
and
so
with
the
kinds
this
header
now
becomes.
A
tech
union
and
the
union
tag
is
the
enum
value
that
represents
the
silk
kind.
So
this
also
means
that
when
we
decide
we
need
to
decide
which
v
table
to
use
at
runtime.
We
just
refer
to
the
union
tag
that
we
find
in
the
zill
header.
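A sketch of what this looks like (the names are illustrative, not the actual patch):

```c
#include <stdint.h>

typedef enum zil_kind {
    ZIL_KIND_LWB  = 1,          /* the classic LWB-based ZIL */
    ZIL_KIND_PMEM = 2,          /* ZIL-PMEM */
} zil_kind_t;

/* The on-disk ZIL header as a tagged union. */
typedef struct zil_header {
    uint64_t zh_kind;           /* the union tag: a zil_kind_t value */
    union {
        uint64_t zh_lwb_meta[16];   /* e.g. the LWB list head */
        uint64_t zh_pmem_meta[16];  /* ZIL-PMEM's per-dataset metadata */
    } zh_u;
} zil_header_t;

/* Per-kind persistence vtable. */
typedef struct zil_vtable {
    void (*zlv_commit)(void *zilog);
    void (*zlv_replay)(void *zilog);
} zil_vtable_t;

extern const zil_vtable_t zil_lwb_vtable, zil_pmem_vtable;

static const zil_vtable_t *
zil_vtable_for(const zil_header_t *zh)
{
    switch ((zil_kind_t)zh->zh_kind) {
    case ZIL_KIND_LWB:  return (&zil_lwb_vtable);
    case ZIL_KIND_PMEM: return (&zil_pmem_vtable);
    default:            return (NULL);  /* unknown kind: refuse to use */
    }
}
```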
A
Now
I
also
want
to
spend
a
few
minutes
on
the
storage
substrate
implementation,
because
I
really
think
it
highlights
how
simple
that
layer
can
be
when
prb
is
initialized.
It
takes
a
pima
mapping
and
a
partition
and
it
partitions
it
into
equal
sized
chunks.
Each
of
those
chunks
is
then
an
append,
only
sequence
of
log
records
so
that
when
we
want
to
write
a
record
to
prb,
we
can
just
pick
any
record
that
has
sufficient
space
and
insert
the
lock
record
at
the
tail
of
that
sequence
in
a
crash,
consistent
manner
and
for
garbage
collection.
And if we have a sufficient number of chunks, this can be a quite performant implementation. For example, we can have one open chunk per CPU, so that there is no contention between parallel writers for access to the chunks, and if we make the chunks large enough, we also minimize how often different writers need to coordinate, when they need a new chunk or when the garbage collector runs and so on. And this is really all there is
to this design. Of course, there are some DRAM data structures for bookkeeping and garbage collection and so on, but these are really boring details and they don't have much overhead. The key observation is that, at least for PMEM, the storage substrate implementation is very, very thin: easy to understand, easy to audit and, most importantly, very low overhead.
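A sketch of the chunk scheme (illustrative, not the thesis code), reusing the pmem_persist() helper from earlier:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct prb_chunk {
    uint8_t *ch_base;   /* start of this chunk in the PMEM mapping */
    size_t   ch_size;   /* all chunks are equal-sized */
    size_t   ch_used;   /* append offset: the tail of the sequence */
} prb_chunk_t;

/*
 * Append one log record to an open chunk. Each CPU owns one open
 * chunk, so in the common case there is no contention; on -1 the
 * caller grabs a fresh chunk (the rare, coordinated path).
 */
static int
prb_chunk_append(prb_chunk_t *c, const void *rec, size_t len)
{
    if (c->ch_used + len > c->ch_size)
        return (-1);

    /*
     * Persisting the record before advancing the in-DRAM tail keeps
     * the append crash-consistent: a torn record is simply not found
     * when the chunk is scanned after a crash.
     */
    pmem_persist(c->ch_base + c->ch_used, rec, len);
    c->ch_used += len;
    return (0);
}
```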
Sorry about the bullet points here. Okay, so the next step was to wire this up into a prototype that could be used to run the actual benchmarks. If you remember, the goal was to activate ZIL-PMEM automatically when we add a PMEM SLOG device. The problem with this is that when the pool is already instantiated, we'd have to switch over vtables while they are potentially in use, and I didn't have the time to cover this during the thesis. So the workaround was to determine the ZIL kind ahead of time, when we create the zpool.
So we have this ugly module parameter here: when we set it to pmem and then create the pool, we check that the vdev config matches, i.e. that it contains exactly one PMEM SLOG device, and once that's done, we set the root dataset's ZIL kind to ZIL-PMEM.
And now, whenever we import the pool, we look at this root dataset ZIL kind and recover the pool-wide ZIL kind from that field; of course we check that the vdev config still matches, and then we instantiate PRB on top of the PMEM SLOG's space, and then the individual zilog instances and the commit routine
just use a pointer in the zilog to access the PRB. Obviously we also need to prevent operations like zpool remove of the SLOG device, because we don't want to pull the PMEM out from underneath PRB while it's still in use. All of this is quite hacky, I will freely admit that, and it probably needs more refactoring, but it was sufficient to get the job done, in the sense that we could run benchmarks on top of it.
The first thing that the commit routine does is acquire a mutex that is per dataset, and then it uses the itx code to get the commit list. Then it walks over the commit list, and for each itx on the commit list it converts it into a log record representation and writes those log records into the handle's PRB data structure, so into the storage substrate, and we pick a new generation for each record.
Picking a new generation per record gives us exactly the same kind of dependencies that we have with LWBs, and this was just the safest choice to use here. Now, when we are done with this, we release the mutex and the next commit call can start writing. And of course, if the datasets are independent, then they only need to coordinate at the substrate level, and we've seen that this can be made very efficient.
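Put together, the commit path described here looks roughly like this sketch (the dataset type and the helpers are hypothetical; only the general itx machinery exists as such in OpenZFS):

```c
static void
zilpmem_commit(dataset_t *ds)   /* dataset_t, helpers: hypothetical */
{
    mutex_enter(&ds->ds_commit_mtx);    /* one commit per dataset at a time */

    itx_list_t cl;
    get_commit_list(ds, &cl);           /* reuses the existing itx code */

    for (itx_t *itx = list_head(&cl); itx != NULL;
        itx = list_next(&cl, itx)) {
        log_record_t lr;
        itx_to_log_record(itx, &lr);          /* DRAM conversion */
        lr.meta.rm_gen = next_generation(ds); /* new generation per record:
                                               * same ordering as LWBs */
        prb_chunk_append(cpu_open_chunk(), &lr, lr.lr_len);
    }

    mutex_exit(&ds->ds_commit_mtx);     /* next zil_commit() may proceed */
}
```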
Now, there are some points where we can improve this. One is to use something like the commit waiters, so that we can get a little more parallelism: we can do this by pre-computing the generation ranges that each writer will use and then letting the writers do the writing in parallel. The catch is that we'd need to change the APIs a little under the hood, so this can be done, but just hasn't been done yet.
We start with the primary workload of the thesis, which was 4k synchronous random writes with a separate dataset per thread. Again, I know this is not really representative of most ZFS workloads, but it's surely a torture test for the ZIL. In this experiment I compared the performance of four different configurations.
They were all on a zpool with three enterprise NVMe drives and one persistent memory DIMM as SLOG device. The LWB and PMEM configurations use the respective ZIL kinds, and the async configuration has sync=disabled set, so it is meant as an estimate for the upper bound of what we could achieve if the persistence code were maximally efficient.
It's just a software change, and we can observe that ZIL-PMEM scales up fairly well, to 400,000 IOPS with four threads, which is still a 5.5x speedup over what we can achieve with ZIL-LWB.
Now, for higher thread counts, we see it align with the fsdax curve over here, and the async curve actually shoots up way higher. What this means is that we are reaching the PMEM throughput limit at this point, and if we had more PMEM bandwidth available, we could potentially land even higher IOPS.
If we look a little deeper at how the latency is distributed, we can observe that with ZIL-PMEM the ZIL persistence now only takes about 25 percent of the wall-clock time of each IOP, whereas it was about 80 percent with the LWB-based ZIL. In turn, this means that the asynchronous part, the DMU and the itx work, now becomes the dominant component in the latency equation, and we need to optimize there.
I also did some more realistic benchmarks. Most of you would probably call those still fairly academic, and I would agree, but it's really the best I had. I'm not going to go into each of them in detail; the gist is that they are all doing synchronous writes in one way or another, either through write-ahead logs or metadata-heavy sync operations and so on, and, in contrast to the previous workload, they all run on one dataset.
One workload saw a 5.8x speedup over the LWB ZIL, Redis a 2.7x speedup, MariaDB a 2x speedup, and there were some workloads that didn't benefit as much, but in general it's a pretty good result. If you crank up the scaling factor, so the number of threads that are simultaneously making requests to these kinds of servers, doing put operations and so on,
what we could observe is that ZIL-PMEM doesn't really scale linearly; if it did, the orange bar for scaling factor four would have to be four times as high as the bar for scaling factor one. But there's still a substantial improvement, so there is some scalability there, and ZIL-PMEM still performs better than the LWB ZIL in most of the workloads.
In this one we also see a big advantage for ZIL-PMEM in these workloads over here, and I think the reason is that we have less write amplification with ZIL-PMEM: the block-based ZIL sees a block device underneath and blows everything up to 4-kilobyte blocks, whereas ZIL-PMEM uses the PMEM natively and writes the small log records directly.
Now, these numbers are quite impressive, but we should talk about some of the drawbacks of ZIL-PMEM before we wrap up, although we're running short on time, so I'll skip over some of those. First, the prototype that I developed in the thesis has a bunch of weaknesses; in particular:
We only have one implementation of this architecture, so we don't really know whether it's a leaky abstraction. Then there's a problem with workloads that only do occasional sync operations; we haven't really looked at those in the benchmarks, but there are a bunch of inefficiencies in the implementation there, and we could work around them, but haven't done that yet. There are also unaddressed performance issues with parallel fsyncs on the same file, because we could get some performance improvements there if we used something like commit waiters.
Then there are some features that are missing: support for native encryption would be a must, I think, if we consider upstreaming something like this, and mirroring of PMEMs, so that we get some redundancy for this log, is also not implemented yet. Also, the glue code is quite hacky, as you may have noticed, so we should probably revisit some design decisions there. And, more importantly, the design also has some inherent weaknesses; I would like to thank Alexander Motin specifically for his feedback on this.
But if we were fine with relaxing those guarantees, we could potentially get a lot more performance, and maybe we can discuss in the breakout room whether relaxing the guarantees is an option for us. Another aspect is the amount of DRAM allocations and DRAM-to-DRAM copies that are happening in the ZIL:
we don't have more memcpys than the LWB ZIL, but we don't have fewer either. There are also some maintenance concerns: if we have this custom space allocation going on in the storage substrate, then we need to think twice every time we do any tricks with space allocation elsewhere. For example, zpool checkpoint is a candidate that generated some headaches during the design phase.
There is also no graceful fallback mechanism if this log is full, so if datasets cannot be replayed immediately. This is an actual problem on small persistent memory devices like NVDIMM-N; it's not so much a problem on Optane, because the smallest unit for Optane is 128 gigabytes. And the last thing is that the Linux DAX APIs are GPL-only, so we cannot actually use those APIs in the ZFS module upstream unless you set the module license to GPL.
Regarding my personal commitment to all of this: as I said, this is all content from my master's thesis, and my employer is not involved with any of this at this point. So currently I'm only able to contribute to this in my spare time, but I'm quite eager to explain stuff to people and to help if there's any interest in upstreaming some of the work. So yeah, thanks for your attention, and I'm looking forward to the discussions in the breakout room.
[Q&A] Yeah, we have this option: upstream ZFS has dedicated allocation classes right now, so in theory we should be able to change this so that we pre-allocate space on any vdev; that should work. The problem is that the space needs to be directly addressable, so essentially we would need to do large allocations, spa-max-blocksize allocations, and then we'd need some way to ask: given this block pointer to this spa-max-blocksize allocation, please give me the direct memory mapping, if we want to do this for PMEM. So yeah, at some point we need to bypass the I/O pipeline.