From YouTube: Shared L2ARC by Christian Schwarz
Description
From the 2022 OpenZFS Developer Summit https://openzfs.org/wiki/OpenZFS_Developer_Summit_2022
Slides: https://docs.google.com/presentation/d/1Fp3yLnpFKyPyG_4lViqMbg-q1p9o2DwW5lPjOfcvwrc/edit?usp=sharing
So we are a software-defined, scale-out file storage solution, and the core functionality that we provide is NFS, SMB, and multi-protocol shares that are accessed by clients. Under the hood, those shares map to ZFS datasets that are spread across many zpools, and the architecture works like this: there is a compute part, meaning ZFS and the protocol servers, which run inside VMs on top of the Nutanix hyper-converged infrastructure, and there is a storage part, meaning the vdevs of each zpool.

Each zpool that we have in the system is imported in one VM at a given time, but the zpools can move around between the VMs for load balancing. Moving those zpools around is cheap, because we just need to coordinate which VM owns a given zpool at a given time. We don't need to move the data, because the data resides somewhere in the hyper-converged infrastructure.
We just consume it via iSCSI, so ZFS's role in this architecture is really not that of a physical volume manager; instead, it's a POSIX-compliant file system with nice enterprise-level features that we use to provide useful functionality to users. Now, on to the actual project that sparked this shared L2ARC idea.
We had a project called Files extended buffer cache, and the goal of this project was to accelerate read-heavy workloads whose working set exceeds the size of the ARC of an individual compute VM. So the plan was to take a local disk of the VM's current host system and attach it to the VM, and then we would use that host-local disk as an L2ARC inside the VM, which would allow us to serve the reads from that local disk.
That approach had two problems. The first one was that if we add the host-local disk as an L2ARC to the zpools, then we can no longer move the zpools around, because we now have this host-local device attached to them, and what we wanted to avoid, for various reasons, is having a bunch of L2ARC add/remove operations in the load-balancing path. But the even more severe problem was that the L2ARC is a per-zpool construct, and in our product we cannot really predict which share or zpool will need the acceleration the most.
A
So
that
would
lead
to
underutilization
in
those
cases
and
where
it
was
would
actually
be
more
attractive
to
just
put
the
entire
capacity
and
as
dynamically
shared
among
the
important
zippoids
and
then
the
Zippo
that
is
currently
hot
could
use
sort
of
the
capacity
Before we get to the solution, let me give a quick recap of how the ARC works internally. The ARC, from a high-level point of view, maps hash keys, which consist of the SPA load GUID, the data virtual address (DVA), and the birth transaction group of a given piece of data in the normal vdevs. It maps those to a structure called arc_buf_hdr_t; usually we refer to that structure as the ARC header. The ARC header points to the storage location of the cached data in the L1 and the L2 ARC. In the L1 ARC, the storage location is identified by a pointer to the DRAM buffer.
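To make that mapping concrete, here is a minimal sketch in C of the relationship just described. It is an illustration only, not the actual OpenZFS definitions; the real arc_buf_hdr_t and its L1/L2 sub-headers carry many more fields, and every name prefixed with example_ is made up for this sketch.

```c
#include <stdint.h>

/* Illustrative sketch of the ARC header mapping described above; not the
 * real OpenZFS arc_buf_hdr_t, which carries many more fields. */
typedef struct example_dva {
	uint64_t	dva_word[2];		/* encodes vdev and offset of the block */
} example_dva_t;

typedef struct example_arc_hdr {
	/* Hash key: pool (SPA load GUID) + DVA + birth transaction group. */
	uint64_t	b_spa;
	example_dva_t	b_dva;
	uint64_t	b_birth;

	/* L1 location: DRAM buffer; becomes NULL once the L1 copy is evicted. */
	void		*b_l1_data;

	/* L2 location: which cache device holds the copy, and at what offset. */
	struct example_l2arc_dev *b_l2_dev;	/* NULL if there is no L2 copy */
	uint64_t	b_l2_daddr;
} example_arc_hdr_t;
```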
There's a kernel thread called the L2ARC feed thread, and that thread iterates over the L1 buffers that are eviction candidates of the L1 ARC, meaning they are at the tail of the most-recently-used or most-frequently-used lists that the ARC maintains. So we have our regular L1 buffers, and then there are a bunch of eviction candidates, and this L2ARC feed thread iterates over those eviction candidates and applies the following rule: it writes each candidate to a cache device of the same zpool that the buffer belongs to, and records the L2ARC offset in the header. Now, when it comes to eviction of the L1 buffer, we keep the ARC header structure in DRAM and we remember the offset in the L2ARC, so that when we later read that location, we will get an L1 miss, because there is no longer an L1 buffer, but the read can still be served from the L2ARC location we remembered.
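In rough pseudocode, the per-pool behavior just described looks like the sketch below. It is a simplified model with made-up types and a hypothetical example_write_to_cache_vdev() helper; the real logic lives in the L2ARC feed/write path and additionally handles write-size limits, headroom, prefetch exclusion, and locking.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative model only; these are not OpenZFS definitions. */
typedef struct example_hdr {
	uint64_t	spa_guid;	/* pool the cached block belongs to */
	void		*l1_data;	/* DRAM copy, dropped on L1 eviction */
	int		has_l2;		/* does a copy exist on the cache device? */
	uint64_t	l2_daddr;	/* offset of that copy on the device */
	struct example_hdr *next;	/* next eviction candidate */
} example_hdr_t;

typedef struct example_pool {
	uint64_t	spa_guid;
	example_hdr_t	*evict_tail;	/* tail of the MRU/MFU lists */
} example_pool_t;

/* Hypothetical helper: copy a buffer onto the pool's cache vdev, return offset. */
uint64_t example_write_to_cache_vdev(example_pool_t *pool, const void *data);

/*
 * Upstream rule: the feed thread walks one pool's eviction candidates and
 * writes them to that same pool's cache device, recording the offset in
 * the header. When the L1 copy is later evicted, only l1_data is dropped;
 * the header stays in DRAM, so a later read misses in L1 but can still be
 * served from l2_daddr.
 */
static void
example_feed_one_pool(example_pool_t *pool)
{
	for (example_hdr_t *h = pool->evict_tail; h != NULL; h = h->next) {
		if (h->spa_guid != pool->spa_guid)
			continue;	/* a cache device only caches its own pool */
		h->l2_daddr = example_write_to_cache_vdev(pool, h->l1_data);
		h->has_l2 = 1;
	}
}
```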
That's how the L2ARC works from a high level. Now, let's see how this system behaves if we have multiple zpools: each zpool requires its own cache device.
That is an invariant that upstream ZFS currently imposes, so we have to supply a cache device per zpool, and then the rule from the previous slide still applies: a cache device in a given zpool will only host L2 buffers for that zpool, because that's what the L2ARC feed thread does right now.
Again, we just don't know which zpool will need the acceleration, so it's better to pool all the cache device capacity that we have and share it among all the zpools, and that's what we did. With shared L2ARC it looks like this: we have the vdisk-based zpools, which just have normal-type vdevs, and then we have a special zpool called nutanix fsvm local L2ARC. That name is subject to review, obviously, and that pool only consists of the host-local devices that we attach to the VM.
We took the L2ARC feed thread and changed it so that it no longer applies the rule where it partitions the buffers per pool; instead, we changed it so that it feeds the buffers from any zpool in the system to the L2ARC zpool's cache vdevs. So effectively, the buffers are now spread over those cache vdevs, and that was really all we needed to do to solve our problems and make this work.
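A minimal sketch of that change, reusing the illustrative model from the earlier sketch (made-up names, not the actual patch in the draft PR):

```c
#include <stdint.h>
#include <stddef.h>

/* Same illustrative model as before; not OpenZFS definitions. */
typedef struct example_hdr {
	uint64_t	spa_guid;	/* still records the buffer's own pool */
	void		*l1_data;
	int		has_l2;
	uint64_t	l2_daddr;
	struct example_hdr *next;
} example_hdr_t;

typedef struct example_pool {
	uint64_t	spa_guid;
	example_hdr_t	*evict_tail;
	struct example_pool *next_imported;	/* list of imported pools */
} example_pool_t;

uint64_t example_write_to_cache_vdev(example_pool_t *pool, const void *data);

/*
 * Modified feed: walk the eviction candidates of every imported pool, but
 * always write to the cache vdevs of the dedicated L2ARC pool. The header
 * keeps its original spa_guid in the hash key, so tagging and cache
 * validation behave exactly as before.
 */
static void
example_shared_l2arc_feed(example_pool_t *imported_pools,
    example_pool_t *l2arc_pool)
{
	for (example_pool_t *p = imported_pools; p != NULL;
	    p = p->next_imported) {
		for (example_hdr_t *h = p->evict_tail; h != NULL; h = h->next) {
			h->l2_daddr =
			    example_write_to_cache_vdev(l2arc_pool, h->l1_data);
			h->has_l2 = 1;
		}
	}
}
```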
Now, the obvious question is: is this correct? I believe it is, because we do not make any changes to the ARC or the L2ARC invariants themselves. The tagging remains the same, cache validation remains the same; everything I can think of basically remains the same. There is one case where we need to make some minor changes.
That is the case where a read from the L2 device fails, for example because the L2 device is dead, or there's a checksum error, or the buffer got evicted after we started the read but before we finished it, something like that. In those cases we fall back to the primary pool and read the data from there, and that is fully transparent to the user; that's how the ARC is supposed to behave.
The problem here is that the primary pool might be exported while we're doing the L2ARC read, but to make the fallback work we need to guarantee that that doesn't happen. The solution here was pretty simple: we just hold the SPA config lock (SCL_L2ARC) of both pools instead of only the pool where we do the L2ARC read, and that solves our problem, because it prevents the primary pool from going away as well.
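As an illustration of that locking idea, here is a sketch that would sit inside the OpenZFS tree; it is not the actual patch, and the function name is invented. Upstream already takes the SCL_L2ARC config lock on the pool it is reading for before issuing an L2ARC read; with a shared L2ARC the cache vdev belongs to a different spa, so the idea is to hold the lock on both pools for the duration of the L2 read and its fallback path.

```c
#include <sys/spa.h>

/*
 * Sketch only: take the SCL_L2ARC config lock on both the pool that owns
 * the cache vdev and the primary pool that owns the data, so that neither
 * can be exported while the L2 read (and a possible fallback read from the
 * primary pool) is in flight. Both locks are dropped with spa_config_exit()
 * when the I/O completes.
 */
static boolean_t
example_l2arc_read_enter(spa_t *primary_spa, spa_t *l2arc_spa, void *tag)
{
	if (!spa_config_tryenter(l2arc_spa, SCL_L2ARC, tag, RW_READER))
		return (B_FALSE);
	if (!spa_config_tryenter(primary_spa, SCL_L2ARC, tag, RW_READER)) {
		spa_config_exit(l2arc_spa, SCL_L2ARC, tag);
		return (B_FALSE);
	}
	return (B_TRUE);
}
```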
The primary risk that I had in mind was that we would have headers from the primary pool that reference structures associated with the L2ARC pool. That is new: previously, all of these would be for the same zpool and they would have the same lifetime, but now that's no longer true. They have different lifetimes, because you can export the pools independently. So when we export the L2ARC pool, we will need to make sure that we invalidate all those ARC headers. Well, it turns out that the code is already structured this way, so my understanding is that the existing code and locking are sufficient to deal with this, but that comes with a quick disclaimer: it is just my understanding.
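The invalidation requirement can be pictured with the following sketch (illustrative names only; as said above, the existing L2ARC device-removal and eviction handling is what actually provides the equivalent behavior today):

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative model only; not OpenZFS definitions. */
typedef struct example_hdr {
	const void	*l2_dev;	/* cache device holding the L2 copy, or NULL */
	uint64_t	l2_daddr;
	struct example_hdr *next;
} example_hdr_t;

/*
 * When the shared L2ARC pool is exported, every ARC header from any primary
 * pool that still references one of its cache devices must forget that
 * reference, because the device can now disappear independently of the
 * header's own pool.
 */
static void
example_invalidate_l2_refs(example_hdr_t *cached_headers,
    const void *exported_cache_dev)
{
	for (example_hdr_t *h = cached_headers; h != NULL; h = h->next) {
		if (h->l2_dev == exported_cache_dev) {
			h->l2_dev = NULL;
			h->l2_daddr = 0;
		}
	}
}
```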
So that was it about where the project is right now. Let's talk about the future. Right now, the project is a proof of concept; it hasn't been productized yet. What I did was publish the rebased code to GitHub, so it's available as a draft PR, and every one of you can look at it and check it out. There are a bunch of to-dos.
First of all, I did the rebase, but I didn't take the new features into consideration, so for a general design we'll need to think about how we handle those. And then, obviously, we cannot use this hard-coded magic name; we need some more dynamic and generic representation of whether the L2ARC devices should be shared or not. A property seems like the right choice for this.
So probably we should also have a property that controls whether we want to use the shared L2ARC in a given zpool, or whether we only want to use that zpool's own cache devices for datasets in that pool. So that's another property we could throw in; that is all subject for debate.
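To make the two knobs concrete, here is a purely hypothetical sketch; neither property exists today, and the names are invented for illustration only.

```c
#include <stdbool.h>

/* Hypothetical per-pool properties; invented names, nothing like this exists yet. */
typedef struct example_pool_props {
	bool	share_l2arc;		/* "my cache vdevs may serve other pools" */
	bool	use_shared_l2arc;	/* "cache my datasets on the shared L2ARC" */
} example_pool_props_t;

/* The feed loop could then skip pools that opted out of the shared cache. */
static bool
example_should_feed(const example_pool_props_t *data_pool,
    const example_pool_props_t *cache_pool)
{
	return (cache_pool->share_l2arc && data_pool->use_shared_l2arc);
}
```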
I would be happy about comments on the PR or in the Q&A right now. And that was my talk.
If you're interested in the code, have a look at the PR, or we can look at it together in the breakout session or during the hackathon; and if you want a demo, maybe we can also do that in the breakout room after the talk. If you have questions or comments on the design, now is the time. Before I hand over to Matt, I would like to say thank you to my team at Nutanix and the ZFS community at large, in particular [inaudible] and George Wilson. Both of you answered lots of questions that I had while I implemented this. Thank you.
Yeah, so that was actually one of the alternative designs. It was just too fiddly in, like, the device-management code, which I wasn't particularly familiar with. Actually, under the hood, the vdev auxiliary type, which is like the, I don't know what it's called, abstract base class or whatever, is basically a piece of code that is shared between spare vdevs and cache vdevs.
Also, you have to think about what the different subcommands would do. So, for example, zpool status or zpool iostat: will it show the I/Os for the device per pool, or will it, like, distribute the statistics based on which pool did how many accesses to this device? It seems simpler to just have one zpool.
Yes, as I said, we are on 0.7, so we don't have persistent L2ARC, and so we don't have that particular use case. In general, when we export the pool, it's going to be gone for several minutes, and by then the L2ARC contents are probably irrelevant, so we didn't need to think about that. We did think about whether, if we streamed it persistently to the L2ARC, we would be interested in persistent L2ARC for the product, and yeah, we concluded that we aren't.