From YouTube: June 2021 OpenZFS Leadership Meeting
Description
At this month's meeting we discussed: ztour; Linux user namespace; ZREPL; ZFS on Object Storage
https://docs.google.com/document/d/1w2jv2XVYFmBVvG1EGf-9A5HBVsjAYoLIFZAnWHhV-BM/edit#
A: All right, let's get started. Welcome to the June 2021 OpenZFS Leadership meeting. It looks like we don't have too many attendees and don't have too many items on the agenda, so this... well, I won't say, because I don't want to jinx it, but let's get started, and we should have time if folks have other things that they want to discuss or questions that are not on the agenda.

A: First was the question from Rich. I don't know if you're on, Rich, but they're asking about ztour: is anyone working on it or an equivalent? I think, if you remember from many years ago, at the developer summit conference there was a demo of ztour, which was like a facility for examining the on-disk format, using something like a debugger — like using mdb, maybe. I think it was using FUSE or something to actually let you browse...
C: As a file, right? Is that the one that Don Brady did?
A
I
don't
see
don
on
here.
As
far
as
I
know,
don
hasn't
done
any
more
work
on
it.
Does
anyone
know
of
other.
D
A
Projects
in
that
space
of,
like
zdb-like
things
for
examining
the
honest
format.
A
All
right:
well,
it's
definitely
an
interesting
idea
and
one
that
we're
thinking
about
a
little
bit
in
the
context
of
the
zfs
object,
storage,
stuff.
So
I'll,
maybe
come
back
to
that
when
I
give
a
little
update
on
that,
but
I
guess
maybe
I'll
give
you
a
preview.
So
the
you
know
there's
an
additional
on
disk
state,
or
maybe
it's
not
disk,
but
you
know
in
in
the
object
storage
state
that
is
kind
of
below
the
existing
zfs
on
this
data
structure.
A
So
we
need
to
come
up
with
some
way
to
examine
it,
either
using
zdb
or
using
some
other
facilities.
So
if
folks
have
ideas
about
like
how
you
would
like
to
see
that
or
do
that,
then
you
know
maybe
one
one
place
to
explore
that
would
be
with
this
object
store
metadata,
since
you
know,
there's
a
lot
less
of
it
than
zfs.
So
if
you
want
to
do
something
new,
you
could
start
mess.
You
know
experimenting
with
how
to
do
it
with
objects
or
metadata.
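For reference, the way people poke at the existing on-disk format from the command line today is zdb; a rough sketch of that kind of invocation, with placeholder pool, dataset, and device names:

```sh
# Dump the vdev labels from a backing device (device path is a placeholder).
zdb -l /dev/sda1

# Print the cached pool configuration ('tank' is a placeholder pool name).
zdb -C tank

# Walk a dataset's dnodes with increasing verbosity (-d repeated);
# 'tank/fs' and the object number are illustrative.
zdb -ddddd tank/fs 1
```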
A: Yeah, no, I don't recall who else worked on it with him, but if you're curious about it, that would probably be a good resource: go check who worked on it during that hackathon, and I think that would have been 2018... sounds right, sounds right.

A: Let's see... 2017 — it was 2017. ztour mixes zdb with FUSE; participants were all from Delphix: Don Brady, Pavel Zakharov, and Prashanth Sreenivasa. Yeah.
A
It's
been
a
while,
if
you,
if
folks,
are
interested
in
this,
I
will
try
to
see
if
don
will
come
to
the
meeting
next
month,
he's
going
to
be
working
on
zfs
more
as
part
of
the
object
storage
project
that
we're
doing
so.
I
think
he
and
a
few
other
folks
would
probably
find
it
useful
to
attend
this
meeting.
E: Is that your item? Yeah — so we've posted the first PR for this, and it works.

E: I'm gonna have to look at the test failures, because it was passing all the tests when I tested it. But basically, if you enter another user namespace — this changes the ZFS-on-Linux code so that, instead of... there's a check used in a bunch of places in ZFS for being in the global zone, and on Linux it was just defined to one, so no matter what, you were always in the global zone.
E
We
changed
that
now
to
actually
look
at
which
username
space
you're
in
and
if
it
isn't
the
root
one
to
to
return
that
you're,
not
in
the
global
zone,
and
so
with
that,
if
you
run,
you
know,
unshared
dash
capital.
U
and
you're.
Now
in
a
different
username
space,
when
you
run
zfs
list,
you
can't
see
any
data
sets,
but
you
can
delegate
data
sets
to
a
namespace.
E
So
if
you
set
the
zone
property
to
on
and
then
do,
zfs
user
ns
add
and
the
namespace
id
to
the
data
set,
then
when
you
run
zfs
list
inside
that
data
set
you'll
be
able
to
see
only
those
similar
to
how
zones
and
jails
work
on
solaris
and
bsd,
and
we
it
works
so
that
you
can
mount
the
data
sets
within
the
namespace.
E
If
you
have
a
mount
namespace
as
well,
so
that
it
can
work
with
something
like
lxc
containers
so
inside
the
container
you
can
as
the
root
that's
not
really
root
can
do
whatever
you
need
to
do.
The
data
sets
and
all
the
same
rules
applies
with
jails
and
zones.
You
know
you
can
change
most
of
the
properties,
but
you
can't
change
the
it's.
The
quota
and
the
limit
on
file
systems
and
snapshots,
and
a
couple
of
things
like
that.
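Roughly, the workflow being described looks like the following sketch; the property and subcommand spellings are taken from the description above and may differ in the final PR, and the dataset name and paths are placeholders:

```sh
# Host shell: mark the dataset as delegatable (property name as described
# above; 'tank/container1' is a placeholder dataset).
zfs set zone=on tank/container1

# Start an unprivileged "root" in new user and mount namespaces.
unshare -U -m --map-root-user /bin/bash

# Host shell again: delegate the dataset to that namespace.  The subcommand
# and argument order here are an assumption based on the PR as described;
# <pid> is the namespaced shell's process ID.
zfs userns add /proc/<pid>/ns/user tank/container1

# Namespaced shell: only the delegated dataset(s) are visible now.
zfs list
```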
A: Cool. Are you looking for code reviewers now?

E: Yes, it's ready for code reviews, but I'm also interested in people with similar use cases. I guess one question is: does zfs list suddenly not showing stuff when you're in a different user namespace break anything for anyone? I don't think so — you know, once you're in a different namespace, you kind of expect not to have access to the stuff from the original global namespace — but it will be a change of behavior, and if it's problematic we might have to protect it behind a module parameter or something.
A
I
mean
I've.
This
is
in
the
area
of
like
platform,
specific
kernel
code,
so
I've
assigned
tony
tony
win
to
be
the
maintainer
for
this.
So
you
can
work
with
him
if
you
need
help
finding
reviewers
or
getting
reviews.
A
A
Cool
questions
about
that.
E: Not currently — currently it is the default. But if it breaks things for people, that might... I don't think anybody assumes that they can access stuff via ZFS when they're... you know, once you're in a different user namespace, you're not root anymore, so unless you've done zfs allow or something, you wouldn't expect to be able to do much in the way of ZFS commands anyway. Yeah, so yeah, this is on by default.

E: But if that change in behavior causes problems for people, we might have to protect it. I don't expect so, because, you know, as soon as you enter that namespace you're not really root anymore, and so you don't expect ZFS to work anyway.
E: ...an AppArmor config file on Ubuntu, but outside of that, no. When you're inside, ZFS just works the same — it's just that you're limited to the datasets you're allowed to see, which are the ones that are delegated to you and have the zone property.

D: Okay, so that would require a change in the graph driver inside Podman, to actually mark the dataset that is inside the container, right?
E
If
you
want
the
users
inside
the
container
to
be
able
to
manage
it
like,
if
you
just
operate.
D
E
From
the
host
to
the
directory
that
it
sees
rooted
in
or
whatever,
then
that
would
still
work
it
just
you
know
it
depends
if
you
want
user
inside
the
container
to
know
that
they're
using
zfs
or
not
or
be
able
to
see
the
data
sets
or
not.
E
Like
I
know,
with
with
jail's
on
bsd
half
the
time,
I
just
create
the
data
set
on
the
host
and
mount
it
to
the
right
spot
relative
to
the
jail's
route
and
the
jail
doesn't
know
that
it's
a
separate
data
set
other
than
you
know.
The
file
system
id
is
separate
or
whatever,
but
it
doesn't
know
the
name
of
the
data
set
or
anything.
It
just
shows
up
as
restricted
or
something
like
that
in
du.
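As a concrete illustration of that host-managed pattern (the dataset name and mount path are hypothetical):

```sh
# Host-side only: create a dataset whose mountpoint lands inside the jail or
# container root, so the guest just sees a populated directory.
zfs create -o mountpoint=/jails/web/var/db tank/jails/web-db

# The guest never sees the dataset name; it only sees the mounted directory.
```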
C: Allan, you mentioned, like, the mount namespace, and having that capability of being able to mount the dataset. So do you have to be in the mount namespace and you have to use a user namespace — or, if you don't set up a user namespace, can you just do the mount namespace and allow a delegated dataset to be mounted?

E: Yeah, so we did the user namespace because that's really where you're deciding: is this the real root or a made-up root user, right? Like, are you real root, or are you some user who ran unshare or whatever and is running an unprivileged container that is just pretending to be root?
E: But in general, the only thing that was really required was setting an extra flag in the VFS bits to say this filesystem is aware of the namespace stuff, so that it handles the user ID mapping or whatever — so that, you know, when you're inside the namespace, you're actually running with... the virtual users inside that namespace have very high UIDs, but those are mapped to regular ones inside the namespace.

E: So ZFS has to know that, you know, UID 6001 actually shows up inside as, you know, user ID 1001 or whatever, so that when you ls the files you create inside, they have the names relative to the namespace. But, you know, on disk those are owned by the user ID that's the real one.
G: There's a bunch of places in the test suite where blocks of code are conditionally marked out, like is-global-zone, which originally was to run with zones. Have you been able to try to activate any of those code paths?

A: Yeah, I think what you're talking about, John, is, like, you could run the test suite from within a local zone and it should more or less work with these. But, like, some tests don't run.
A: Why don't I give an update on where we're at with the ZFS object storage stuff — but if other folks have other questions or topics, why don't we bring them up now, since I...
B: So with zrepl, we have repeated reports of zfs dry-run sends reporting a very large uint64 as the size estimate, and it seems like it's an over... like a negative — it's called underflow? Overflow? If we go negative, I think it's still technically called overflow, but yeah. There appear to be some bugs with that, and we currently have a user who is, like, setting the dataset aside so we can reproduce the case. We just need to hook him up with someone who can tell them what to do.
A
Yeah,
why
don't,
I
would
say,
file
issue
if
there's
one
already
and
then
ping
paul
dagnelly,
he
should
be
on
slack.
We
want
to.
The
first
step
is
to
figure
out
like
there's
so
many
different
code,
pads
like
modes
of
send
and
send
estimation.
A
So
if
you
can
include
like
the
exact
command,
that's
being
run,
then
that'll
help
like
you
know.
Presumably
it's
not
it's
not
what's
it
called
it's
not
redacted
send,
probably
but
like
full
versus
incremental,
and
all
that
and
like
from
bookmark
versus
from
snapshot.
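For reference, the dry-run estimate being discussed comes from zfs send with the -n flag, and the distinct modes mentioned look roughly like this (pool, snapshot, and bookmark names are placeholders):

```sh
# Full send, dry run with a size estimate.
zfs send -nv tank/fs@snap2

# Incremental from an earlier snapshot.
zfs send -nv -i tank/fs@snap1 tank/fs@snap2

# Incremental from a bookmark instead of a snapshot.
zfs send -nv -i tank/fs#mark1 tank/fs@snap2

# Machine-parsable size estimate, if that's what the tooling consumes.
zfs send -nP tank/fs@snap2
```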
A: Yeah, probably not. Okay, but at least that basic info will then tell us what additional metadata we would need to see to figure out why it's going wrong, and then probably we could get that from zdb, and then they could, you know, destroy their stuff.

A: All right then, I'll talk a little bit about the object store stuff, which I think I mentioned two months ago. We're now well on our way to an initial implementation.
A: ...and John Kennedy are working on the performance aspects — so, performance evaluation — which I both look forward to and dread seeing the results of, because we're still very early in the implementation, so there's lots of stuff that doesn't work as we know it needs to, to get good performance. Folks might not remember from two months ago, so let me give you a brief overview of what we're doing.
A: So we're using an object store as the backing storage, you know, probably primarily for cost reasons — although after this initial project we have some more awesome and crazy ideas that will take further advantage of the shared nature of the object store, which I'll talk about later.

A: So the key and tricky thing about our use case is that we want to be able to do this with databases, which are, you know, doing small random writes, with small record size and compression, and everything is small and tiny. So we can't get away with just saying, like, you're using it for backup and you're just going to set, you know, record size to at least one meg, and we'll have one object per ZFS block.
A
That
would
be
like
relatively
straightforward,
but
we
want
to
be
able
to
use
small
continue
using
small
record
size
for
the
databases
that
we're
using
that
we're
storing,
which
means
that
we
need.
We
need
to
combine
a
whole
bunch
of
blocks
like
a
whole
bunch
of
like
three
kilobyte
blocks
into
one.
You
know
one
megabyte
plus
size
object
for
the
object
store,
which
means
that
we
need
to.
We
need
to
worry
about
like
how
do
we
free
these?
A
So
you
know
we
got
to
keep
track
of
like
here's
all
the
blocks
that
need
to
be
freed,
but
haven't
been
freed
yet
and
then
kind
of
consolidate
that
so
that
we
can
like
back.
We
need
to
batch
process
that
so
that
for
every
object
that
we
have
to
read
and
then
rewrite
to
remove
the
free
blocks,
hopefully
we're
able
to
free
a
whole
bunch
of
blocks
within
the
object,
not
just
one
or
two.
A
For
a
lot
of
use
cases,
we
still
want
to
get
good
performance,
so
the
object
stores
you
know
generally
have
really
high
latency
for
get
input,
requests
like
dozens
of
milliseconds,
at
least
and
in
a
lot
of
cases,
if
the
data
so
for
writing.
We
can
kind
of
ignore
the
latencies,
because
we're
batching
everything
up
into
transaction
groups
right
and
we
just
have
to.
Then
we
just
have
to
worry
about
well
what
about
synchronous
operations?
A: ...caching blocks on local disks — but the L2ARC can't get very big in terms of the number of blocks that it can store, because it uses so much memory per block. L2ARC works pretty well if you have large record sizes, but again, having small record sizes causes lots of performance challenges for us. So in particular, you might want to be able to have a cache that's, like, dozens of terabytes at least, which means that even if we were to kind of super-turbocharge the way the L2ARC stores its metadata, you still wouldn't be able to keep that entirely in memory.

A: So we have to deal with an index of the cache that doesn't fit entirely in memory. That's kind of the big challenge of the caching subsystem. So the way that we're implementing all this stuff — the cache, and the talking-to-the-object-store part — is through an agent process in userland.
A
So
like
the
kernel
is
making
read
and
write
requests
that
are
calling
up
to
the
agent,
which
is
then
you
know,
checking
in
the
cache
going
out
to
the
object
store
over
the
network.
Doing
the
background,
processing
of
like
background
freeze,
managing
the
on
disk
index
of
the
cache
and
all
that
stuff
and
the
agent
is
an
agent,
is
all
written
in
rust.
A
So,
let's
see
what
else
do
you?
What
else
should
I
tell
you
guys
we're
we're
pretty
far
along
with
getting
things
to
kind
of
basic
functionality?
A
And
I
I
might
be
able
to
show
you
a
little
demo
if
if
people
are
interested,
but
why
don't
I
pause
and
oh
the
other
thing
that
we
did
is
we
opened
a
issue
that
has
an
outline
of
like
the
project,
the
kind
of
stuff
that
I
just
mentioned,
and
then
the
user
interface
for
how
you
would
create
an
object,
store
based
pool
it's
it's
a
little
different
because
you
need
to
be
able
to
specify
like.
A
A
Don't
think
that's
updated
yet,
but
we
should
ask.
A
Yeah
yeah,
so
we
have
some
enhancements
in
progress
that
will
let
it
like
figure
out
the
credentials
based
on
the
role
that's
assigned
to
the
instance
in
aws
that'll,
make
it
a
little
easier
christian.
I
see
you
have
your
hand
up
questions.
B
Yeah,
so
I
have
thought
about
cfs
and
object:
storage,
one
or
one
and
a
half
years
ago,
and
I
think
one
one
concern
was
cost
analysis
and
cost
prediction.
So
because
you
pay
for
the
individual
requests
as
well.
A: Yeah, I mean, so we're primarily thinking of it for our use case. By comparison, we're comparing it to using EBS — you know, the block storage in Amazon — and the per-byte cost is much lower; it's about a quarter or a fifth or something like that compared to the cheapest EBS storage. But yeah, there are per-operation costs for GET and PUT.

A: We looked at that kind of from a theoretical point of view. The good thing is that, as long as you're using it within AWS — like, you're using S3 from an EC2 instance — they don't charge you for the throughput.
A
They
just
charge.
You
like
a
per
request
fee,
so
those
are
usually
not
too
high
compared
to
everything
else
as
long
as
you're
as
long
as
like
you're
right
as
long
as
your
objects
are
big
right,
because
it's
like
well
yeah,
you
know
I'm
writing
at
you
know
full
network
bandwidth
of
25
gigabits
per
second,
but
it's
with
you
know
one
meg
or
eight
meg
objects.
A
So
the
actual
number
of
requests
per
second
is
like
pretty
manageable.
I
mean
it's
not
nothing.
You
know
it
adds
up
to
like
a
few
dollars
a
month
or
whatever,
but
you
know
you're
paying
25
per
terabyte
per
month
for
the
storage
and
you're
paying.
You
know
a
lot
more
than
that,
probably
for
the
instance
to
run
it
at
25
gigabits
per
second.
So
it's
not
too
bad.
We
did
right.
We
did.
A
I
don't
know
if
paul,
I
think,
paul's
not
on
here,
but
paul
is
working
on
a
facility
like
mmp,
the
multi-mount
or
multi-modifier
protection
for
for
the
object
store
and
that's
gonna,
be
like
mandatory.
Like
you
know,
right
now,
the
mmp
is
like
off
by
default.
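For reference, on block-based pools today MMP is the opt-in multihost pool property; a brief sketch, with 'tank' as a placeholder pool name:

```sh
# Multi-modifier protection is controlled by the 'multihost' pool property,
# which requires a system-unique hostid to be set on each importing host.
zpool set multihost=on tank
zpool get multihost tank
```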
A
You
have
to
like
jump
through
some
poops
to
use
it
or
whatever,
but
for
object
store
it's
just
like
so
dangerous
because,
like
everybody
has
access
to
it,
you
know
very
easily
that
it's
going
to
be
always
used,
and
we
did
have
some
interesting.
A
Designs
around
that,
because
you
know,
if
you're,
if
you're
heavily
using
the
pool,
then
hey
like
paying
a
few
bucks
a
month,
doesn't
really
matter
like
that,
doesn't
matter,
but
if
you're
not
using
the
pool,
then
having
mmp
be,
you
know
putting
down
its
heartbeat
even
just
like
once
a
second
it
does
add
up
to
you
know
like
non-zero
dollars
per
pool,
I
mean,
if
you
just
have
one
pool,
then
it's
no
big
deal
right.
It's
like
three
dollars
a
month
or
something
like
that,
but
all
right.
A
So
now
let
me
get
to
if
you
don't
mind
me,
rambling,
on
a
little
bit
the
the
end
goal,
for
this
is
not
just
you
have
one
pool,
that's
on
object
store.
A
The
end
goal
is
that
you
have
essentially
like
distributed
zfs,
where
you
have
like
this:
a
big
system
of
interconnected
pools
on
different
machines
and
they're,
all
consuming
storage
from
the
same
object
store
and
the
the
real
key
thing
that
we
want
to
be
able
to
do
is
to
have
like
a
snapshot.
A
So
you
might
in
this
new,
like
object,
store
based
world.
Creating
pools,
becomes
almost
as
cheap
and
easy
as
creating
file
systems
is
today,
and
you
can
move
the
pools
between
you
know
between
vms,
so
it
makes
the
overall
system
much
more
flexible
and
in
that
world
you
might
end
up
with
lots
of
storage
pools,
and
so
you
might
have
lots
of
storage
pools
which
are
idle
most
of
the
time
and
so
the
mm,
the
cost
of,
like
writing
out
once
a
second
an
object
for
each
pool.
A
It
might
actually
become.
You
know
noticeable
to
our
customers
wallets,
so
paul
has
done
some
work
to
design
that
so
that
the
basically,
instead
of
having
to
write
out
a
heartbeat
for
each
pool,
we
only
have
to
write
a
heartbeat
for
each
like
for
each
bucket
that
this
agent
is
connected
to.
A
So
the
agent
manages,
like
all
you
know,
all
of
the
pools
are
managed
by
one
agent
process
on
the
machine,
and
so
that
way
like
you,
basically
each
pool
will
have
an
object
associated
with
the
pool
that
says
you
know
I
am
owned
or
was
most
recently
owned
by
this
agent,
and
then
the
agent
will
have
like
a
object
in
the
object
store.
That
says,
like
I'm
still
alive
every
second,
so
that
way,
you
know
you
only
have
to
pay.
You
only
have
to
have
a
heartbeat.
A
Basically,
like
you
know,
one
heartbeat
per
vm,
that's
part
of
the
system
rather
than
one
per
storage
pool.
So
so
that
is
a
point
where
that
did
come
into
come
into
play.
C: Yeah, if I can add something else to Matt — yeah, the other thing, going back to cost: I mean, this is kind of why the ZettaCache is such a big component, being able to, like, cache the majority of the blocks that you might be using for that specific filesystem, so that we don't have to go back to the object store. You know, we get the big blocks going to S3, you're doing big operations there, yeah, and then the majority of your work is kind of happening locally through some EBS volumes. The other thing to note about the ZettaCache, and part of the design, is that unlike L2ARC, which is, you know, kind of dedicated to your pool, this is a global type of cache for all the pools being managed by that agent.
C: Yeah, exactly, yeah. So when we get into this multi-pool architecture, this ZettaCache is kind of like a global entity that these pools will be able to register with and say, "I want to cache some blocks," you know, and we're going to use, you know, whatever disks are, you know, kind of backing the ZettaCache.
A: For reads, you really don't want to be in the case where, like, I'm doing one read and that's causing me to get an object, and then I'm not getting anything else from the object — because then you have huge, you know, read-throughput inflation, because it's like, oh, I'm reading this 4K block and it's causing me to go read this one-megabyte object, and I threw away, you know, 99.9% of the data. You want to either be in the mode of, like, oh, we're scanning — like, we're just bulk-reading this whole thing, so yeah, I read this one block and it caused me to bring in the whole object, but now the next, you know, 300 reads are also going to be to that same object...
A
So
we
actually
used
it,
and
so
it's
fine
or
you
want
to
be
in
the
mode
where
I
may
be
doing
random
reads,
but
they're
all
cached
in
the
cache.
So
basically
you
want
to
have
like
either
either
you
have
like
ninety
percent
plus
cash
hit
rate.
In
this
you
know
l2
like
disk
based
cache
or
you're
like
doing
bulk
reads
or
you're.
A
Not
using
small
record
sizes,
you
know
like
there's,
certainly
lots
of
other
use
cases
for
this-
that
don't
have
all
these
hard
problems
and
you
know
like
you're,
storing
video
files,
they're
all
in
record
size,
16
meg
and
it's
just
one
record
per
object
and
then,
like
everything,
is
much
much
easier
than
the
problem
that
we
have
chosen
to
tackle.
E
Are
you
envisioning
to
use
this
with
the
the
special
vlog
or
something
for
all
the
indirect
blocks
so
that
you
don't
have
the
problem
of
having
to
read
a
one
megabyte
block
to
get
the
indirect
block
to
tell
you
which
one
megabyte
object?
You
need
to
go
fetch
next.
A
For
I
mean
so
for
at
least
for
our
use
cases,
we
would
always
have
some
sort
of
block
based
cache
so
and
you
know
making
sure
that
that's
at
least
one
percent
of
this
of
the
whole
pool
size
is
like
pretty
easy,
so
we
would
always
probably
always
have
all
the
metadata,
like
all
the
indirect
blocks
in
the
cache.
A: Yeah, I mean, within the cache there's definitely a lot of cool stuff that we can do to make it so that it can handle things in different block sizes — maybe I'll get into that in a couple minutes. But in terms of what's in the object store, we aren't doing any... we're kind of letting ZFS's allocation ordering handle a lot of that for us, because I think the way that it works — we can double-check this, though — is that it's basically writing blocks of one layer together. So, like, the ordering, I think, will happen to be, like, you know: first we try to write all the data blocks, then we try to write all the indirect blocks, then we try to write all the L2 indirect blocks — so they'll naturally kind of be clustered together in objects anyways.
A
And
the
big
thing
about
the
I
mean
the
special
devices
has
two
big
benefits.
One
is
like
you
have
a
faster
device
which,
like
you
know,
we
don't
have
faster
and
slower
like
object,
storage.
We
we
have
the
cache
to
solve
that
problem
and
then
the
other
benefit
of
the
of
the
special
devices
is
you
aren't
trying
to
combine
lots
of
big
blocks
and
lots
of
small
blocks
together
into
one
like
metaslab
allocator
scheme,
and
so
like
that's
kind
of
a
self-imposed
problem.
A
You
know,
rather
than
a
fundamental
problem
like
that.
That's
the
problem
of
how
zfs
chooses
to
do
block
allocation
and
we've
kind
of
been
making
some
dents
in
that
over
time,
like
with
the
zill
embedded
slog
metaslab,
basically
like
automatically
carving
out
one
meta
slab
to
use
for
log
allocations.
If
you
don't
have
a
dedicated
log
device,
you
can
imagine
doing
something
similar
to
that.
Like,
oh,
like
I
noticed
that
90
percent
of
the
blocks
here
are
small,
but
you
have
some
big
ones.
A
Let's
carve
out
one
meta
sub
for
blocks
that
are
like
larger
than
32k,
but
we
can
with
object,
storage
like
the
allocation
part
of
it
doesn't
matter
right,
like
you,
can
put
together
big
and
small
blocks
into
one
object
and
that
doesn't
hurt
the
allocator
at
all,
because
it's
just
like
sequentially
spotting
them
down
into
objects.
We
aren't,
we
don't
have
to
go
like
try
to
refill
existing
objects
within
the
cache.
A
Now,
that's
where
it
gets
really
interesting.
So,
within
the
cache
the
undisk
cache
the
so
the
l
torque
the
way
l2
arc
works
it
just
like
starts
the
beginning
of
the
disk,
writes
through
the
whole
disk
and
then
when
it
gets
to
the
end.
It
comes
back
to
the
beginning
and
and
overwrites
what
it
was
there
before.
A: So the allocation — you know, you can mix together large and small blocks and everything is fine — but the eviction policy is horrible, and that has an impact on cache hit rate. And, you know, maybe that's okay a lot of the time with L2ARC, especially since, like, the kind of philosophy of the L2ARC is, yeah, you know, it's like an extension of the ARC, and you're gonna get some hits and some misses, and it's better than nothing. Versus for our use case, it's like, well, in a lot of use cases you want to be getting, like, a 90-plus-percent hit rate in the cache, and if you go from 90 to 80, that's, like, doubling the number of times that we have to go to the object store — misses go from 10% to 20% — which could have a really material impact on your overall performance.
A
So
we
wanted
to
have
a
cache
where
we
control
the
eviction
policy
and
it's
reasonable.
So
we
chose
lru,
which
you
know
it's
not
like.
Amazing.
It's
not
skin
resistant.
The
way
the
adaptive
replacement
cache
is,
but
it's
at
least
reasonable
and
and
easy
to
reason
about,
and
so
that
means
that
we
need
to
be
able
to
choose
what
we
evict
not
just
like,
have
the
have
whatever.
A
Which
means
that
you
know
it
within
the
cache
we're
going
to
be
left
with.
You
know
like
as
you're
ingesting
stuff,
you're,
evicting
old
stuff
and
you're,
leaving
lots
of
little
holes
where
you're
evicting
stuff
and
then
trying
to
fill
those
little
holes
with
the
new
stuff,
which
is
you
know,
basically
the
same
as
the
problem
that
we
have
with
zfs
on
block
storage
and
the
meta
slab.
A: So with the new cache that we're designing, we're gonna be investigating and hopefully implementing some newer, better ways of doing allocation, which may be able to be transferred over to, you know, ZFS proper — block-based ZFS.

A: So what we want to do is divide this space up into slabs, and each slab would have allocations of all the same size.
A: So, like, let's say each, you know, each 16-meg slab has things that are all — you know, maybe they're all 1K, or all one and a half K, or all 3K, or all three and a half K. And so then, when we're doing allocations, you know, we would find, like — I know I'm trying to allocate three and a half kilobytes, so I'm going to go to one of the three-and-a-half-kilobyte slabs and just get one from there.

A: This should sound pretty familiar in terms of, you know, the memory slab allocator — that's exactly the way that works — and the metaslab allocator is kind of named after that, and there were some ideas initially about how we might do that, but those never really came to be. So we're kind of trying to get back to that place, at least for these small blocks.
A
I
think
it'll
be
very
helpful
in
terms
of
memory
usage
and
like
memory
usage
of
the
allocation
state
should
be
much
more
efficient
like
in
the
worst
case.
You
have
one
bit
per
block
to
keep
track
of
of
this
in
memory
I
mean
you
can
do
better
in
some
cases
as
well.
A
Yeah,
okay,
maybe
I
should
pause
here,
pause
my
monologue
here
and
see
if
folks
have
more
questions.
B
I
have
another
question,
so
I'm
not
totally
clear
how
much
like
layering,
like
cross
layer,
hints
you're
doing
there
like
how
many
hints
are
you
giving
from
zfs
into
the
the
user
space
demon
and
and
back?
A
B
I'm
I'm
basically
not
not
clear,
for
example,
how
much
of
the
the
metastab
allocator
still
is
in
place
versus
how
much
you
delegate
to
user
space
and
so
on.
None
okay,.
A: So, you know, you don't really need a metaslab allocator for the cache. You could imagine using the metaslab allocator to allocate the space in there, but we decided to instead kind of do a from-scratch implementation.

A: So the agent that's managing the cache — it's not using the zpool; it's all from scratch. It's just, like, all new code that's consuming the disk and, you know, doing writes to the disk from all-new data structures that we're coming up with.
A
So
from
the
kernel's
point
of
view,
it's
like
there's
some
new
v
dev
type,
that's
a
view
of
object
store
and
we
have
like
special
hooks
in
there
to
say.
Like
oh,
like
if
you're
going
to
object
store,
then
you
know
you
don't
need
to
do
any
allocation
or
whatever,
or
I
mean
the
only
allocation
that
you
do.
A
Is
you
allocate
the
block
id,
which
is
just
like
I
plus
plus
right,
the
blocker
ids
just
move
forward
and
whenever
we
use
them
and
so
that's
very
easy
and
then,
when
you
do
the
write
down
to
the
object
store
v
dev,
it's
just
like
I'm
writing
to
block
id.
You
know
the
next
highest
block
id.
That's
never
been
used
before,
and
then
it
just
sends
that
request
up
to
the
agent
and
then
the
agent
is
like
packing
together
all
these
sequential
block
ids
into
one
object.
A
Yeah,
so
the
agent's,
just
getting
like
this
sequence
of
bytes,
says
just
like
here's
the
size,
pl,
here's
the
block
id,
which
is
you
know
like
the
next
block
id.
Please
write
it
so
we
cert
like
if,
if,
if
we
want
the
agent
to
take
different
action
for
different
types
of
things,
like
metadata
versus
data,
you
know
like
indirect
blocks
versus
user.
A
Hinting
into
it,
but
we
haven't
got
that
sophisticated
yet
and
so
far
we
haven't
needed
to.
But
if
you
have
thoughts
on
like
what
we
should
be
doing,
that
would
definitely
be
I'd
love
to
hear
them.
B
I
think
there's
lots
of
academic
literature
on
like
predicting
predicting
data
block
placement
and
so
on,
and
if
we
already
have
the
information
about
the
object
type,
for
example
like
we,
we
don't
have
to
do
all
the
fancy
prediction
stuff.
We
already
know
it's
metadata
or
its
data
and
so
on
so
yeah.
I
think
we
can
maybe
could
profit
from
that.
But
then
again
it
depends
on
your
use
case,
mostly.
A
B
Yeah,
for
example,
like
you,
could
have
separate
objects
for
data
and
metadata,
because
the
metadata
is
supposedly
changing
much
more
frequently
and
then
maybe
you
have
another
look
at
okay.
So
is
this
an
allocation
for
an
object
set
that
is
like
one
megabyte
block
size
or
just
an
allocation
for
an
object
set?
That
is
eight
kilobyte
block
size,
for
example,
and
then
maybe
based
on
that
guess,
whether
it's
archival
storage
versus
hot
storage
and
yeah,
maybe
tier
that
into
different
objects,
a
little
so
that
you
have
better.
A: Yeah, so in terms of the object store, the biggest problem that we have is going back and freeing stuff, and what you really want is for your frees to be very clumpy — like, you want them to be clustered into objects. So, like, you know, we have a million frees, and, you know, you have a thousand objects, but those million frees are concentrated in just these hundred objects — so we only need to read those hundred objects and then rewrite them to process the frees, rather than having them be...
A
If
they're
evenly
distributed
over
all
of
our
objects,
then
it's
like
well,
I
have
to
read
and
then
rewrite
every
object.
You
know
the
entire.
You
know
petabyte
of
your
storage,
which
you
know
would
really
suck.
I
mean
you
just
can't
do
that
very
often.
So
you
know
you
end
up
with
a
lot
of
outstanding
freeze,
which
you
know.
The
good
thing
is
unlike
block
based
storage,
you
don't
have
a
finite
size,
so
it's
not
like.
Oh,
like
you're,
out
of
space.
Now
your
pool
is
now
like.
A
You
can't
do
anything
because
your
pool
is
full
and
we
got
to
go
to
process
the
background
freeze
instead,
it's
it's,
it's
just
cost
right.
It's
just
that.
You
know
you're
paying
more
for
you're
paying
to
store
these
blocks
that
are
no
longer
going
to
be
used,
and
so
you
know
to
be
nice
to
our
customers
wallets,
you
know,
and
eventually
our
wallets.
If
we,
you
know,
hopefully
end
up
running
this
as
a
service
we
need
to
be,
we
need
to
free
it
reasonably
soon.
A
So
if
we
could
predict
like
what
we
really
want
to
predict
is
when
will
this
block
be
freed
and
which
other
blocks
will
be
freed
around
the
same
time?
A
If
we
could
predict
that,
then
you
know
we
would
say
like
for,
for
a
given
txg,
we
can
kind
of
choose
how
we
map
its
blocks
into
objects.
So
if
we
were
like
for
this
for
this
to
extrude
like
these
are
the
blocks
that
are
going
to
be
long
lived,
and
these
are
the
blocks
that
are
going
to
be
short-lived,
we
could
pack
all
the
short-lived
blocks
together,
so
that
when
we
go
to
free
it,
we
only
have
to
hit
those
blocks.
A: For that — so that's a different problem, right. That's for allocating all the three-and-a-half-K things; here you're talking about the cache, right?

A: The problem of, like, the allocator having all the same problems as the metaslab allocator — yeah, so, I mean, we definitely thought about all these things. Like, maybe the cache should be, you know, allocating one-meg chunks from the disk and packing a bunch of blocks into there — then your writes will go really fast, but then, you know, when you're evicting, it's like, well...
A: Exactly. And it's like, well, look, we already have some... like, the object store thing basically solves the SMR problem. So if you want SMR, like, basically just use the object store stuff — use it as an object store, right? And if each object is one file on XFS and you put the ZFS object store on top of that, it would work great, because at the end of the day, you know, you're writing in these big chunks and then freeing the big chunk and then overwriting it, so the SMR level of remapping is not going to have to do a lot of work. But the advantage is, you get — like, you know, you can be doing random writes, you can be storing a database on it, you can do whatever you want, and then the ZFS object store layer takes care of making sure that everything is actually happening in big chunks to the drive.
A
Well
looks
like
we're
out
of
time.
Thank
you
all
for
coming
to
the
zif,
the
matt
aaron's
zfs
object,
storage,
monologue.
I
hope
you
enjoyed
my
comedy
set.
You
need
to
record
this
for
the
next
conference.
It
is
recorded
so
yeah.
Well,
the
recording
will
be
up
on
youtube
later
this
week
and
yeah
I
mean.
Hopefully,
you
guys
can
tell
like
I'm
pretty
excited
about
this.
It's
a
lot
of
fun
stuff,
we're
learning!
A
You
know
we're
learning
rust
rust
is
a
great
programming
language,
especially
if
you're
coming
from
c
and
yeah
and
and
we're
having
a
lot
of
fun
with
it.
I
recognize
that,
obviously
this
is
a
new
sort,
a
sort
of
new
use
case
for
zfs,
so
it
doesn't.
A
A
But
hopefully
this
is
going
to
you
know,
make
bring
bring,
make
zfs
have
some
like
relevant
capabilities
in
the
new
kind
of
the
way
that
storage
and
computing
is
moving,
namely
like
towards
cloud
and
towards
you
know
more
abstract
data,
storage
and
services
so
making.
A: ...some weird SMR drives on top of an object store. You want to be able to move stuff around from VM to VM — as opposed to, like, if you're the one implementing the stuff underneath that and you're consuming raw drives, then obviously, like, the existing ZFS capabilities are what you need.
A
Cool
thanks
a
lot
we'll
see
you
all
in
four
weeks
at
the
later
time,
so
that'll
be
july,
20th
and
I
I
may
be
on
vacation
events,
so
I'll
have
someone
else
run
the
meeting,
and
hopefully,
you've
got
your
dose
of
matt
aaron's
zfs
monologues
this
week
this
month
and
you'll
be
good
for
these
two
months.