Description
From the 2021 OpenZFS Developer Summit
slides: https://docs.google.com/presentation/d/1r2P2kQHozr_RDLLf5JgPXqJOdVE7VlxH9B92jGAODLI
Details: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2021
A: All right, let's talk ZettaCache. As you heard a little bit this morning about the ZettaCache, mostly about the object store, I want to start by reviewing a little bit about the object store to set some context here.
A: As we saw this morning, conventional wisdom says that's not a good model for highly transactional, latency-sensitive workloads such as what we want it for, mostly because of latency. But we know how to deal with latency in ZFS, right? You can mask your synchronous write latency with a SLOG.
A: You can add a cache to mask your read latency, like the L2ARC, and you should be good to go, except for those troublesome workloads, and that happens to be the Delphix use case. We have a small block size, an average of 3k, which means a high proportion of metadata to data. We're talking about a random I/O workload, which means little I/O consolidation.
A: We're not talking streaming; we're talking large working sets. For example, a 50 terabyte working set would mean upwards of 18 billion blocks that we need to try to keep track of, and we want to operate in a constrained memory size, something on the order of 128 gigabytes. That means we can't keep all that metadata in memory, so the L2ARC is simply not going to do the job for us. It can't support the target working set, at least not in 128 gigabytes of memory.
A: That was 18 billion blocks. At an average of almost 100 bytes of metadata per block you're tracking in the cache, that means you're going to need on the order of almost two terabytes of memory just to manage the blocks in the cache.
A: In addition to that, it has a FIFO eviction policy, which can be, and is, very ineffective when trying to size your cache. So even if you get a lot of data into the cache, you can't depend on unused data not clogging things up; it doesn't get evicted effectively. And in fact you can have frequently used data in your cache that will still be evicted eventually, just as it works its way through the FIFO queue.
A: So you really can't size the cache to guarantee that your working set is going to stay there. So we want something a little different. Our goal here is to get performance comparable to ZFS on EBS, and that means ZFS backed by S3, with a ZettaCache based on EBS storage. We want to reduce that 20 millisecond latency to a two millisecond latency, so that the high latency from S3 gets averaged out to a nice two millisecond latency, hopefully, from our ZettaCache.
A: We want to support very large caches, and we want to do that with a small memory footprint, on the order of two gigabytes, or less than two gigabytes, for that 50 terabytes of cache. That means it has to be persistent by design, not as an afterthought, and we need to improve the eviction policy: rather than FIFO, we're looking at LRU. We also want to be able to understand the cache size and its effectiveness.
A: So this is what the object agent looks like in ZFS. As George talked about, the ZFS object vdev communicates through to userland to talk to the ZFS object agent, and within that we have the ZettaCache, which is where the I/Os will be looked for first; if a block is not found there, it goes off to look in the cloud.
A: Why did we put the ZettaCache up in userland, as opposed to embedding it in the kernel as our other caches are? A few good reasons. One, in this architecture we wanted to keep the ZettaCache very close to the object agent for efficiency's sake; it's the primary consumer of this, and it made sense, logically, to put it up there.
A: It keeps us separate from the bulk of ZFS, so we could develop without having to inject a lot of new code paths or changes into the ZFS I/O pipeline or other parts of the ZFS architecture. This is a completely separate entity, so it really gives us an easier environment to do our development, because in userland we are not dealing with a new kernel every time we make a modification. We are also able to develop in Rust, which is really nice. So it's helped our development cycle to be much improved.
A: I think I talked about these points, but one of the questions you should probably have in your mind right now is: is it going to perform if it's in user space? And the answer, for us at least right now, is absolutely. Our needs really are only to be able to drive an AWS instance effectively and efficiently.
A: Most AWS instances do less than 20k IOPS to EBS storage, and we can do that now easily using a simple pread/pwrite interface, so it certainly meets our needs. And we believe it will meet any needs in the future, because it's possible to go almost as fast in user space as in the kernel, using userland NVMe drivers, for example, or the io_uring interface. So we don't believe this is going to be a bottleneck to our performance here.
A: The basic components of the ZettaCache are an index, used to find data in the cache; some form of allocator, to store data in the cache; and a checkpoint mechanism, to create a consistent, recoverable state. Now, if you're looking at this picture on the right, you might be thinking to yourself:
A: you know, that looks an awful lot like a diagram of a file system, and file systems are hard. And if you're thinking that, you're right; both those things are true. It does look a lot like a file system, and file systems are definitely hard. I know: as part of the team, we spent 50 engineer-years before ZFS first went to first customer ship. It was a big effort. So we need to simplify the requirements. We aren't actually going to develop a new ZFS here; we're developing a cache.
A: So we should step back and ask: what are the actual requirements we have for this storage? And that is a persistent, searchable index that doesn't need to be, in fact can't be, fully cached in memory, but that can support lookups in at most one read. Remember, our goal here is to emulate ZFS on direct storage.
A: We use checkpoints for that. So let's look at some simplifications. One, we're only going to have one block pointer for each block; we don't need any snapshots or other features, so most of our metadata needs are pretty simple here, and in fact we can store all our metadata in logs, since there are no logical overwrites of data.
A: If you find an entry in the index, then you read that block from the cache. To insert a new block, you get new space on disk for it from the block allocator, write the block out there, and then insert a new entry in the index that says: here's that new block on disk.
A: Okay, I skipped one, sorry. So the index is just a way to map a block ID to a disk location. If you think about ZFS today, for example, in the ARC we have a large hash table that does that for both the ARC and the L2ARC.
A: But that's a little bit of a complicated data structure to have a persistent representation of, so we've chosen a simpler data structure to represent it, which is essentially a log-structured merge array: an immutable, ordered array, or run, of entries that we store persistently, and then we track pending changes in a log and in memory. Whenever we want to transition from one index to the next, we apply our set of pending changes to the old run to produce a new run.
A: So the old run is ordered by block ID, and we apply our pending changes to it to produce a new run on disk.
A: We also keep a summary of the index: an array of mappings, which map the first block ID in each chunk of the index to that chunk's location on disk, and we keep that mapping in memory. That's our first lookup point: when we want to look up a block in the index, we consult our in-core summary.
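A minimal Rust sketch of what that in-core chunk summary lookup could look like; the struct fields and numbers are illustrative assumptions, not the actual ZettaCache code:

```rust
// Hypothetical sketch of the in-core chunk summary: for each on-disk chunk of
// the index, remember the first block id it covers and where the chunk lives.
#[derive(Clone, Copy, Debug)]
struct ChunkSummaryEntry {
    first_block_id: u64, // first block id stored in this chunk
    disk_offset: u64,    // where the chunk of index entries starts on disk
}

// Entries are sorted by first_block_id, so one binary search tells us which
// single chunk (at most one read) could contain a given block id.
fn chunk_for_block(summary: &[ChunkSummaryEntry], block_id: u64) -> Option<ChunkSummaryEntry> {
    match summary.binary_search_by_key(&block_id, |e| e.first_block_id) {
        Ok(i) => Some(summary[i]),      // exact first-id match
        Err(0) => None,                 // before the first chunk
        Err(i) => Some(summary[i - 1]), // falls inside the previous chunk
    }
}

fn main() {
    let summary = vec![
        ChunkSummaryEntry { first_block_id: 0, disk_offset: 0 },
        ChunkSummaryEntry { first_block_id: 1_000, disk_offset: 128 * 1024 },
        ChunkSummaryEntry { first_block_id: 2_000, disk_offset: 256 * 1024 },
    ];
    // Block 1500 lives in the chunk starting at block id 1000.
    println!("{:?}", chunk_for_block(&summary, 1_500));
}
```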
A: All right, now, pending changes are kept in memory in a pending-changes tree; that's for efficient lookups and inserts. But we also need to keep that persistent, because we don't want to lose pending changes when we panic or exit. So we maintain a persistent version of that, which we call the operation log, on disk. Whenever we add entries to the pending-changes tree, we also add entries to the end of our operation log.
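Here is a small, hypothetical Rust sketch of the pending-changes tree paired with an append-only operation log, just to make the relationship concrete; the type and field names are made up for illustration:

```rust
use std::collections::BTreeMap;

// Hypothetical operations that can be pending against the on-disk index.
#[derive(Clone, Debug)]
enum PendingOp {
    Insert { disk_offset: u64, atime: u32 },
    Remove,
    Touch { atime: u32 }, // access-time update
}

// In-memory pending-changes tree (fast lookup/insert by block id) plus an
// append-only operation log that makes the same changes persistent.
#[derive(Default)]
struct PendingChanges {
    tree: BTreeMap<u64, PendingOp>,        // block id -> latest pending op
    operation_log: Vec<(u64, PendingOp)>,  // stand-in for the on-disk log
}

impl PendingChanges {
    fn record(&mut self, block_id: u64, op: PendingOp) {
        // Every change is appended to the log *and* reflected in the tree,
        // so a restart can replay the log to rebuild the tree.
        self.operation_log.push((block_id, op.clone()));
        self.tree.insert(block_id, op);
    }
}

fn main() {
    let mut pending = PendingChanges::default();
    pending.record(42, PendingOp::Insert { disk_offset: 4096, atime: 7 });
    pending.record(42, PendingOp::Touch { atime: 9 });
    println!("{} logged ops, {} distinct blocks", pending.operation_log.len(), pending.tree.len());
}
```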
A: So to move from state to state, we simply create a checkpoint, about every minute, where we persist the pending changes. That means we make sure all the entries in the pending-changes tree have been written out to the operation log, and then we store in that checkpoint the metadata necessary to find that pending-changes list, as well as a pointer to the index, which is already persistent on disk. So if we do have a crash, we lose less than a minute of cache data.
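A hedged sketch of the kind of metadata a checkpoint could carry, based only on the description above; the struct shape is an assumption:

```rust
// Hypothetical shape of a checkpoint: just enough metadata to find the
// persistent pieces of the index again after a crash or restart.
#[derive(Debug)]
struct DiskExtent {
    offset: u64,
    len: u64,
}

#[derive(Debug)]
struct Checkpoint {
    index_run: DiskExtent,     // the current immutable, sorted index run
    chunk_summary: DiskExtent, // per-chunk summary, re-read into memory on open
    operation_log: DiskExtent, // pending changes logged since the last merge
    generation: u64,           // which checkpoint is newest
}

fn main() {
    // Written roughly once a minute; a crash costs at most about a minute of
    // cached (not primary!) data.
    let ckpt = Checkpoint {
        index_run: DiskExtent { offset: 1 << 20, len: 8 << 20 },
        chunk_summary: DiskExtent { offset: 9 << 20, len: 64 << 10 },
        operation_log: DiskExtent { offset: 10 << 20, len: 2 << 20 },
        generation: 17,
    };
    println!("{:?}", ckpt);
}
```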
A: So how does that work? Well, when we reopen the cache at any point, we first read the superblock, at a well-known location, which points to the current checkpoint block. We read that checkpoint block; a checkpoint block has essentially our meta-metadata within it: a pointer to the index run and the summary, as well as the operation log, so all the components that make up our index.
A: So what happens is it simply reads the summary into memory and then ingests the operation log, replaying it to reconstruct the pending-changes tree. Once that's completed, we're able to operate the cache normally. But we can't accumulate these pending changes forever, right? Pending changes represent a set of deltas to the index, and if we just continue to accumulate entries in there, it's going to get too large over time.
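To tie those steps together, here is an illustrative Rust skeleton of the open/recovery path just described; every function is a stub standing in for real I/O:

```rust
// Hypothetical open/recovery sequence, following the steps described above:
// superblock -> checkpoint -> read summary -> replay operation log.

struct Superblock { current_checkpoint_offset: u64 }
struct Checkpoint;          // stand-ins: real versions would carry disk extents
struct ChunkSummary;
struct PendingChangesTree;

fn read_superblock() -> Superblock { Superblock { current_checkpoint_offset: 0 } }
fn read_checkpoint(_off: u64) -> Checkpoint { Checkpoint }
fn read_summary(_c: &Checkpoint) -> ChunkSummary { ChunkSummary }
fn replay_operation_log(_c: &Checkpoint) -> PendingChangesTree { PendingChangesTree }

fn open_zettacache() -> (ChunkSummary, PendingChangesTree) {
    let sb = read_superblock();                        // well-known location
    let ckpt = read_checkpoint(sb.current_checkpoint_offset);
    let summary = read_summary(&ckpt);                 // in-core first lookup point
    let pending = replay_operation_log(&ckpt);         // rebuild the pending-changes tree
    (summary, pending)
}

fn main() { let _ = open_zettacache(); }
```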
A: So we need to merge those pending changes periodically into the current run to produce a new run, as I illustrated on an earlier slide.
A: So we read the old index in order and apply the pending changes to it. When we come across, for example, an insert, we insert a new entry from the pending changes into the index, so the index grows. If we find a remove request, we remove that entry. Or, if we had an access to a block, then we update its atime; I'll talk more about how atimes are used in a little bit.
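A simplified Rust sketch of that merge step, applying inserts, removes, and atime updates from the pending changes to the old run; the real code streams through the run rather than loading it, so treat this purely as an illustration:

```rust
use std::collections::BTreeMap;

// Hypothetical index entry and pending operation, matching the walkthrough above.
#[derive(Clone, Copy, Debug)]
struct Entry { block_id: u64, disk_offset: u64, atime: u32 }

#[derive(Clone, Copy)]
enum Op { Insert(Entry), Remove, Touch { atime: u32 } }

// Merge the old (sorted) run with the pending changes to produce a new run.
fn merge(old_run: &[Entry], pending: &BTreeMap<u64, Op>) -> Vec<Entry> {
    let mut new_run: BTreeMap<u64, Entry> =
        old_run.iter().map(|e| (e.block_id, *e)).collect();
    for (&block_id, op) in pending {
        match *op {
            Op::Insert(e) => { new_run.insert(block_id, e); } // index grows
            Op::Remove => { new_run.remove(&block_id); }      // entry dropped
            Op::Touch { atime } => {                          // record the access
                if let Some(e) = new_run.get_mut(&block_id) { e.atime = atime; }
            }
        }
    }
    new_run.into_values().collect() // still sorted by block id
}

fn main() {
    let old = [Entry { block_id: 1, disk_offset: 0, atime: 3 }];
    let mut pending = BTreeMap::new();
    pending.insert(2, Op::Insert(Entry { block_id: 2, disk_offset: 8192, atime: 7 }));
    pending.insert(1, Op::Touch { atime: 7 });
    println!("{:?}", merge(&old, &pending));
}
```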
A: But the index also cannot itself grow forever. The index is a mapping of where all the blocks are in the cache, and the cache is finite in size. So eventually you have an index that represents a full cache, and we'll need to start evicting data from the cache so it can continue to absorb new entries.
A: So this is where we use our atimes, our access times. Every block has a last access time. The atime counter is incremented every 10 seconds, so it's really a measure of a sort of artificial time since the cache was first created: every 10 seconds it increments, and every block that is inserted into the cache or read from the cache during that 10-second period is marked with that particular access time.
A: This gives us an approximation of an LRU if we create the histogram that you see on the right: every time we ingest a block, we add the size of the block to the histogram at that atime. In the diagram on the right, the numbers represent gigabytes of data. So, for example, at atime one we ingested 20 gigabytes of data, at atime two we ingested 30 gigabytes of data, and so forth.
A: If we access a block, say we are at atime seven and we access a block last touched at atime two, then that read causes us to change the atime of that block. Say it was an 8k block: we would decrement bucket 2 of the histogram by 8k and increment bucket 7 by 8k. That's how this particular histogram grows and shrinks and changes over time.
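The histogram bookkeeping just described could be sketched like this in Rust (illustrative only; bucket layout and names are assumptions):

```rust
// Hypothetical atime histogram: bucket i holds the total bytes of cached
// blocks whose last access fell in atime slot i.
struct AtimeHistogram { buckets: Vec<u64> }

impl AtimeHistogram {
    fn new() -> Self { AtimeHistogram { buckets: Vec::new() } }

    fn bucket(&mut self, atime: usize) -> &mut u64 {
        if self.buckets.len() <= atime { self.buckets.resize(atime + 1, 0); }
        &mut self.buckets[atime]
    }

    // A newly ingested block just adds its size at the current atime.
    fn ingest(&mut self, now: usize, size: u64) { *self.bucket(now) += size; }

    // A read moves the block's bytes from its old atime bucket to the current one,
    // e.g. an 8 KiB block last read at atime 2, read again at atime 7.
    fn access(&mut self, old_atime: usize, now: usize, size: u64) {
        *self.bucket(old_atime) -= size;
        *self.bucket(now) += size;
    }
}

fn main() {
    let mut h = AtimeHistogram::new();
    h.ingest(1, 20 << 30);   // 20 GiB ingested at atime 1
    h.ingest(2, 30 << 30);   // 30 GiB ingested at atime 2
    h.access(2, 7, 8 << 10); // an 8 KiB block moves from bucket 2 to bucket 7
    println!("{:?}", h.buckets);
}
```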
A: Now, when we want to evict, all we need to do is look at this histogram and count up the space used by the blocks with particular atimes. In particular, we look at the current atime, which in this diagram is atime nine, and count back until we reach the cache size that we want to try to maintain. In our example, we're trying to maintain, say, 300 gigabytes of data within our cache.
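A small sketch of computing that eviction cutoff from the atime histogram, with made-up bucket values rather than the slide's numbers:

```rust
// Hypothetical cutoff computation: walk the atime histogram from the newest
// bucket backwards, accumulating bytes, until we reach the target cache size.
// Everything older than the returned atime is eligible for eviction.
fn eviction_cutoff(buckets: &[u64], target_bytes: u64) -> usize {
    let mut kept = 0u64;
    for atime in (0..buckets.len()).rev() {
        kept += buckets[atime];
        if kept >= target_bytes {
            return atime; // keep this bucket and newer; evict older
        }
    }
    0 // everything fits; nothing needs to be evicted
}

fn main() {
    let gib = |x: u64| x << 30;
    // Bytes per atime bucket (index = atime); illustrative values in GiB.
    let buckets: Vec<u64> =
        [0, 20, 30, 40, 10, 60, 70, 40, 40, 90].iter().map(|&g| gib(g)).collect();
    // Trying to keep roughly 300 GiB: counting back from atime 9 we cross
    // 300 GiB at atime 5, so atimes older than 5 are evictable.
    println!("cutoff atime = {}", eviction_cutoff(&buckets, gib(300)));
}
```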
A: So we can evict based off of that: that cutoff, atime five, represents our cutoff for eviction, which we now leverage during merge. Remember, merge is already going through all of the index blocks and all of the pending-changes blocks to create a new index, so it can also examine the atimes while it's doing that job. If it sees a block that has an atime before our cutoff...
A: ...that's an evictable block, and it becomes evicted. You can see in this example we have a set of pending changes, we have an old run, and we're generating a new run, and we see that we have blocks four, five, eight and nine that all have an atime before five.
A: And so when we write our new run, we come along and say: all right, four, that's evictable; five, evict; eight; nine... well, we don't actually evict nine, because we also have a pending change for nine, which has a new atime, so that takes priority. It says: we actually accessed nine since our eviction cutoff, and so nine goes into the new index.
A: Now, it's really kind of cool that, now that we have this mechanism for managing blocks based off of atimes, we can actually leverage that same atime histogram to answer the age-old question of how you should size your cache.
A: How effective is my cache at a particular size? We use that same atime histogram to generate a new histogram, which we call the hits-by-size histogram. The hits-by-size histogram leverages that eviction model: whenever we get a hit in the cache...
A: ...we can say: this block we just hit in the cache, what was its atime? Not the current atime, which we're about to update, of course, because we just did a hit, but the atime it had when we actually found it in the cache. That tells us essentially the latest atime at which it existed in the cache, and now we can count up all the space at that atime and newer and say: all right, a cache of that size would still have captured this hit.
A: On the right you see output from our zcache hits report, and this is a little bit of a toy example; it's a small working set. We're getting about a 98 percent hit rate in this cache, but if you look at the data on the right, you see that 97 percent of those hits are occurring within the first three gigabytes of the cache, according to that graph.
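The per-hit accounting behind such a report could look roughly like this; the helper name and numbers are illustrative:

```rust
// Hypothetical hits-by-size accounting: when a lookup hits, note the block's
// *previous* atime; the bytes cached at that atime and newer tell us the
// smallest cache that would still have produced this hit.
fn cache_size_needed_for_hit(atime_buckets: &[u64], hit_atime: usize) -> u64 {
    atime_buckets[hit_atime..].iter().sum()
}

fn main() {
    let gib = |x: u64| x << 30;
    let buckets: Vec<u64> = [1, 1, 1, 20, 40, 60].iter().map(|&g| gib(g)).collect();
    // A hit on a block last accessed at atime 4 would still have been a hit
    // with roughly 100 GiB of cache; accumulating these per-hit sizes gives
    // the hits-by-size report described above.
    let needed = cache_size_needed_for_hit(&buckets, 4);
    println!("this hit needed ~{} GiB of cache", needed >> 30);
}
```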
A: We can also extend this idea and not just maintain this information for the current cache. If we extend our atime histogram data, we don't throw away the historical data contained within it; the atime histogram records that data for all of our atimes, and we can choose when, if ever, to evict information out of it. We can use that history of sizes to actually say: well, you know, I looked up this block now...
A: We also have to keep a little bit of extra data in our index, of course, which are ghost entries for blocks we've evicted recently; let's say we keep another cache-size worth of them around. That way, when I look up a block, I can say: all right, I looked up that block in the index, and it said no, I don't have that block, but I used to have that block, and when I had it, it had this atime.
A: So I can tell whether this lookup would have been a hit if I were to double the size of my cache, and that can be very valuable. Again, all of this depends on being able to collect this data over a period in which you're actually operating with normal working sets, to understand how your hits are going to behave within your cache.
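One way to picture the ghost-entry idea is a lookup result with three outcomes; this enum is purely illustrative, not the actual interface:

```rust
// Hypothetical lookup result once ghost entries are kept for recently evicted
// blocks: a miss can still tell us the cache size at which it would have hit.
enum LookupResult {
    Hit { disk_offset: u64 },
    GhostHit { evicted_atime: u32 }, // "I used to have this block, at this atime"
    Miss,
}

fn describe(result: &LookupResult) -> String {
    match result {
        LookupResult::Hit { disk_offset } =>
            format!("hit: read block at offset {}", disk_offset),
        LookupResult::GhostHit { evicted_atime } =>
            format!("miss, but a larger cache keeping atime {}+ would have hit", evicted_atime),
        LookupResult::Miss => "miss: never seen (or long since evicted)".to_string(),
    }
}

fn main() {
    println!("{}", describe(&LookupResult::GhostHit { evicted_atime: 3 }));
}
```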
A: At this point, I want to change gears a little bit, and I'm going to turn things over to Serapheim to talk about the block allocator.
B: So, first of all, what does a block allocator do? The block allocator is the subsystem that determines where data should be placed on disk, and when designing one there are multiple things that we should be taking into consideration: things like runtime performance, that is, how quickly we can satisfy allocation requests.
B: Other factors are contiguous allocations, where some devices, like hard drives, get a big performance boost whenever we allocate blocks contiguously on disk; and, finally, the memory consumption of the block allocator's metadata. Basically, the block allocator needs some kind of in-memory structure to keep track of what is allocated and free on disk, so we want to make sure those structures don't take up too much memory and that we play well with the rest of the system.
B: Before I go over what we ended up doing, I'd like to do a small recap and look at what ZFS already has in terms of block allocation. The block allocator in ZFS is part of the storage pool allocator, or what we call the SPA, and what the SPA does is basically divide each vdev into 16 gigabyte regions we call metaslabs, from which we allocate and free blocks of arbitrary sizes.
B: Now, the range trees that track free space in each metaslab can get pretty big, especially for big pools, but even for smaller pools, depending on your workload or the history of the pool. So for that reason we only keep a subset of all these metaslab range trees loaded at a time. Specifically, we only keep loaded the ones that we are allocating from, so we can find where the free space is on disk; but, as I said, we only keep a subset of them loaded because we want to conserve memory.
B: So this is more or less the design that we have today, and it has served us well and continues to serve us well, but that doesn't mean it doesn't come with its own flaws. Specifically, within Delphix, we've seen them on our database workloads, which are mainly characterized by small, compressed, random writes.
B: That's not to say that, because of this, the SPA design is not good. It actually satisfies the requirements of a file system like ZFS pretty well, but for the ZettaCache we're designing a cache, which is different from a file system, so we went back and reconsidered some of these tradeoffs, specifically things like the contiguity of allocations.
B: That is something important for hard drives, and file systems need to support hard drives, but now we're talking about a cache like the ZettaCache, where we are most probably going to be using something like SSDs or NVMe devices, so contiguous allocation on this kind of device is not as important and therefore doesn't need to be a major factor in our design.
B: The second part is disk space utilization. When we're talking about the SPA and the file system, we strive to make the most out of our hardware; our goal is, you know, 100 percent utilization. But for something like a cache, yes, disk utilization is important, but it's not the main factor in the design.
B: Then there's the memory consumption of the allocator's metadata: you want to be able to keep adding more storage to your systems without having to think about the RAM that you're actually using, and this is the whole reason why we do the whole dance with loading and unloading metaslabs. Now, for a cache this is still a very important concern, and it's part of the design because it's a valid concern, but our goal is not as strict; we can be more flexible because we are talking about a cache.
B: So, after considering all of these trade-offs, here's what we ended up doing. Block allocation in the ZettaCache works like this: we take the cache and divide it into 16 megabyte regions we call slabs, and there are three types of slabs. There are the bitmap-based slabs, which are modeled after a classic slab allocator design where all blocks are of the same size; on the bottom left there's an example image of a bitmap-based slab...
B: ...with, say, 5k blocks, where you can see that each slot is of equal size, regardless of whether it's allocated or free. Then we have the extent-based type, which somewhat resembles metaslabs and is mostly used for bigger block sizes; these contain variable-sized blocks, and you can see at the bottom center an extent-based slab picture with a 20k block that's allocated, a 50k block, some free space, and a 31k block. And finally, the last type is the empty slab type.
B: These are the completely unallocated slabs that can be converted to any of the other types on demand, as we see fit. Now, I'd like to talk a little bit more about that: basically, how do we decide what kinds of slab types we want and need, and what exactly is our allocation scheme on top of these structures?
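A rough Rust sketch of the three slab types and of converting an empty slab into a bitmap slab on demand; sizes and names are illustrative assumptions:

```rust
// Hypothetical slab model: the cache is divided into fixed-size slabs, each of
// which is empty, bitmap-based (one block size), or extent-based (variable sizes).
const SLAB_SIZE: u64 = 16 << 20; // 16 MiB

enum Slab {
    Empty,
    Bitmap { block_size: u64, allocated: Vec<bool> }, // one flag per equal-size slot
    Extent { free: Vec<(u64, u64)> },                 // (offset, len) free extents
}

impl Slab {
    // An empty slab can be converted on demand into whichever type the current
    // workload needs, e.g. a 3 KiB bitmap slab for a 3 KiB-heavy workload.
    fn make_bitmap(block_size: u64) -> Slab {
        let slots = (SLAB_SIZE / block_size) as usize;
        Slab::Bitmap { block_size, allocated: vec![false; slots] }
    }

    fn alloc_bitmap_slot(&mut self) -> Option<usize> {
        if let Slab::Bitmap { allocated, .. } = self {
            let slot = allocated.iter().position(|used| !used)?;
            allocated[slot] = true;
            Some(slot)
        } else {
            None
        }
    }
}

fn main() {
    // Converted from Empty into the hypothetical 3 KiB slab group.
    let mut slab = Slab::make_bitmap(3 << 10);
    println!("first allocated slot: {:?}", slab.alloc_bitmap_slot());
}
```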
B: For the bitmap-based slabs we have a slab group for each supported block size: 512 bytes, 1k, 1.5k, 2k, all the way up to 16k. When I say the 2k slab group, I mean the group of bitmap-based slabs that support 2k allocations. Then, for allocations larger than 16k, we have two extent-based slab groups, one for allocations of less than 64k and one for allocations of more than 64k.
B: Now, this distribution of block sizes into groups is not set in stone; it's actually tunable. But regardless of that, each slab itself may change its type and migrate between groups during the lifetime of the cache. Just to give an example, it really depends on the workload. Let's say you just created your pool, you have your data, you put your cache into action, and the cache in the beginning is empty.
B: You start reading, let's say, a dataset that has a lot of 3k blocks, and we start caching those blocks in the ZettaCache. So we go to the block allocator and we keep requesting 3k blocks.
B: What the block allocator will do is convert its empty slabs to 3k bitmap-based slabs, put them into the 3k group, and start allocating and using the space in those. Then later, if for some reason your workload changes to 6k blocks, the cache is basically going to start caching those and evicting the old 3k blocks, slowly freeing over time all those 3k slabs that it made, converting them to empty slabs and then converting them to 6k slabs to satisfy the 6k block allocations.
B: This was more of an example to describe what the runtime behavior of this scheme would look like. Most of the time, in the real world, things are not as black and white. You won't just have 3k incoming blocks or 6k; most probably you're going to have all kinds of block sizes being requested from the block allocator, effectively distributing the slabs into all these different slab groups. Another thing that I simplified a little bit is that you won't as often see slabs getting completely freed up.
B: Before, with the metaslabs, we had block-level fragmentation, because metaslabs hold variable-length block sizes. Now we've traded that problem for fragmentation at the slab level, where we can have underutilized slab groups leading to stranded space.
B: In this example, in the first row, we see that the 3k slab group has a lot of slabs, but it's not actually using the space within them that efficiently, and that's a problem. But for our cache, our block pointer scheme is a lot simpler: unlike the SPA in ZFS, where we have this complicated graph structure of block pointers and indirect blocks...
B: ...which is impractical to move around, our blocks are fully indexed, meaning we have exactly one pointer for each block, and that makes it easy for us to defragment our data and free up slabs. And when I say defragment, I mean literally: locate the underutilized slabs, say these four slabs in this example, and move their data around to free them.
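Because each cached block has exactly one index entry, an evacuation pass can be sketched as "reallocate and update that one pointer"; this toy Rust version is only meant to illustrate the idea:

```rust
use std::collections::BTreeMap;

// Hypothetical defragmentation pass: moving a block is just "copy the data,
// update that one index entry". (In the real cache this would fold into the
// periodic index merge.)
fn evacuate_slab(
    index: &mut BTreeMap<u64, u64>,              // block id -> disk offset
    slab_blocks: &[(u64, u64)],                  // (block id, old offset) in the underused slab
    mut alloc_elsewhere: impl FnMut(u64) -> u64, // returns a new offset for a block id
) {
    for &(block_id, _old_offset) in slab_blocks {
        let new_offset = alloc_elsewhere(block_id); // the data copy happens here in reality
        index.insert(block_id, new_offset);         // the single pointer is updated
    }
    // The slab is now empty and can rejoin the empty-slab pool or change type.
}

fn main() {
    let mut index: BTreeMap<u64, u64> = [(7, 100), (9, 116)].into_iter().collect();
    let mut next = 4096;
    evacuate_slab(&mut index, &[(7, 100), (9, 116)], |_id| { let o = next; next += 16; o });
    println!("{:?}", index);
}
```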
B: And finally, I just wanted to highlight that, overall, our design is a lot simpler. The requirements for a cache are, in comparison, a lot fewer than the list of requirements you need to implement a file system. So it's not that we had some genius idea; it's more that we looked at our requirements and came up with a simpler design, and reasoning about its performance is a lot more straightforward. We hope it's going to be easier to debug in production, too.
B: That's all I wanted to cover today. Mark, would you like to take over?
A: All right, so that's pretty much it for the bulk of the material that we had to present, but I would like to leave you with a couple of thoughts.
A: I believe it is still possible that you could leverage the ZettaCache in a sort of typical block-based storage configuration, by injecting a cache check into the I/O pipeline, so that at some point, once you're past the ARC and L2ARC checks in the pipeline and actually starting to go out to disk, you could upcall to the ZettaCache and ask: does this block exist in some sort of fast storage there? And get a response back.
A: So I think that is possible. Perhaps a more radical and interesting thought experiment is how you could use the ZettaCache as primary block storage, and not just a cache. As I mentioned, and as Serapheim mentioned a few times during his part of the talk, we made a lot of simplifications, and those simplifications depend on the requirements of a cache, which are things like:
A: if we don't ingest that block, that's not a big deal; it's not the end of the world, because we're just a cache, and you'll just get a miss and go to the back-end storage. Obviously, that's not going to be practical for primary storage; you have to ingest all blocks. So there would be a lot of issues to work through, but it is kind of an interesting experiment to think about.
A: I think it's something we do think about occasionally. I think that's pretty much all we have for our slides here. Are there any questions we can answer?
A: So I got a question from Allan, asking whether we could apply the atime histogram and hits-by-size histogram that I talked about to the existing ARC and L2ARC, and do something similar there to be able to generate interesting reports and information. Initially I thought, yeah, that seems plausible, but then Matt pointed out that there is more complexity in the ARC; the ARC isn't just an LRU cache.
A: It's a much more sophisticated balancing cache, between MRU and MFU data, and so it's going to be quite a bit more tricky to do that kind of predictive analysis. I mean, you could do the analysis, I think, fairly easily on either the MFU or the MRU side of the ARC, but the fact that it's constantly shifting back and forth based on workload makes trying to reason about the whole thing much more complicated, I think.
A: So I'm not sure how easy it would actually be to do something like what we've done for the ZettaCache there.
C: The other one was about when we're doing rebalancing: is it better to just evict that stuff?
C: The next question, which Paul asked, was about rebalancing versus evicting.
A: Oh yes. The reason we prefer to rebalance there comes back, again, to being able to reason about the cache. It would be simpler if we just said: you know, we have four blocks in this slab, so let's just chuck it, because it's only four blocks, right, and this is a cache.
A: But suddenly doing something like that begins to introduce more uncertainty about how to reason about your cache, because now you're no longer maintaining a strict LRU policy in terms of evictions. If you're doing a lot of this kind of work, throwing away blocks like that whenever you need space, you really don't know how your cache is behaving; it's not even FIFO, it's just kind of a random eviction pattern, and so it breaks our ability to predict. So it is certainly more efficient, but is it better?
C: Yeah, I think it's something that we could consider; the trade-off would be, you know, obviously you're doing less I/O. I think we haven't yet gotten performance results on how much eviction, sorry, how much rebalancing is really needed under, you know, even hypothetical workloads, and how big of a performance impact it is. Once we see that, maybe we'll have to reevaluate, if we see that it's more impactful than we predict.
A: Agreed.
C: Yeah, so basically Allan is asking Serapheim: could we apply the block allocator that you designed to regular ZFS metaslabs, with strict slabs and all that?
B: I mean, I think Mark covered that a little bit, on potentially using the ZettaCache as a backend for an actual pool. But actually changing the metaslab design right now? I mean, obviously we could try to do that, but things are a lot trickier there.
B: So I can definitely see the benefit, but it depends on a lot of things. Sure, for some workloads it would be great, but the design wouldn't be able to be the same, because, like, what would you even have, slabs within the metaslab?
C: Yeah, and I think that even if, let's say, you work through those problems of fitting thing A into thing B, and the complexity of modifying the current code, that's all theoretically doable. I think one of the big challenges would be rebalancing, right? Because with the cache we can avoid most allocation failures by doing rebalancing before they occur. And then, you know, if we really wanted to avoid allocation failures, you could bump things up to bigger sizes.
C: You could do that, and it might give you some kind of weird results in terms of space usage, but that seems doable. Or you could go the other way, using gang blocks, which the kernel already knows how to do. But the problem is...
C: ...I think the problem would be that you might get into a case where that is common rather than rare, where you might commonly have to be ganging, or commonly have to be bumping up to a much larger allocation size, and to get out of that you'd have to do rebalancing. But rebalancing is extremely non-trivial with ZFS; that's basically the BP rewrite problem.
C: Whereas the rebalancing is almost trivial with the ZettaCache, because we're already rewriting the whole index every once in a while, right? So while you're writing the index, you just look up and see: oh, should I be remapping this? Okay, great, let me store that new location in the new index.