Description
From the 2021 OpenZFS Developer Summit.
Slides: https://docs.google.com/presentation/d/1HCswW3mvc2Nnn0EdkNWAptQ_T1W9kjjBwEDDXwSco7k
Details: https://openzfs.org/wiki/OpenZFS_Developer_Summit_2021
A
Okay, yeah, as Matt said, we wanted to talk about ZFS on object store, but before we do that, I wanted to take us back in time. I even wore the oldest OpenSolaris t-shirt I could find, going back to 2005.

A
Some of you may recall things that went on in 2005, like Hurricane Katrina, the cancellation of the NHL season, and the first flight of the Airbus A380. But there were other things that were more relevant to us: ZFS was introduced into the open source community by the OpenSolaris project, and if anybody ever saw that presentation, it talked about blowing away 20 years of obsolete assumptions.

A
And so, as part of this, I said: hey, it's the 16th birthday of ZFS, let's celebrate that. Let's go get some reaction from some ZFS developers and see if they knew what was so exciting and how they felt about it. And sure enough...

A
We actually had some that were really excited about the project back then. Today things have changed quite a bit, but I wanted to talk a little bit about some things we looked at when ZFS first came out and what they really meant to file systems. In particular, it addressed this brutal-to-manage problem that existed, and still exists today, with different file systems.

A
For those that were doing system administration in those days, before ZFS, you had to deal with partitioning volumes, labeling them, modifying a bunch of /etc files, dealing with inode limits, and deciding how big to make things. It was really problematic, and so it was a wonderful breakthrough when ZFS came out and said: hey, we're going to go tackle this problem and solve it.
A
In addition, that meant breaking some of the rules people knew back then. The way we would make file systems bigger was to create this shim layer with volumes, introducing the abstraction of a virtual disk. So now there was a volume manager involved, another headache we had to deal with, and of course it had its share of problems: we would leave storage stranded.

A
If I had storage in this particular file system that I needed somewhere else, I couldn't move it very easily. It was just a pain. And then, of course, I still had to deal with partitioning and growing and shrinking, all of that done by hand. So when ZFS came out and introduced the pooled storage model, it was great.

A
It was a breakthrough in the way we were going to manage storage. It introduced the malloc/free abstraction, where I just add storage to the pool and all my file systems get to use it, and it's wonderful. I didn't have to do any of these things.

A
All these system administration tasks just went by the wayside, so it really was a game changer. But if we look at 2005, there were a lot of things back then that have really changed. YouTube's first video was published in 2005.
A
Apple was just switching to Intel. Git had just come out as an open source project. Netflix was shipping DVDs. And of course, ZFS was released.

A
Now over 11% of all internet traffic is Netflix traffic, and ZFS has proliferated and is available on Windows, FreeBSD, Linux, and illumos. What's interesting is that one thing you don't see here is what was going on with Amazon. Amazon was selling books and videos; now they're a huge retailer, a cloud platform, and a video streaming service. And it's this cloud that has really changed the way data centers look and how they've evolved. So again, let's look at how things have changed.

A
Back in the day when ZFS was coming out, we were talking about servers directly attached to storage. Then over the course of time we started seeing virtual machines propagated everywhere, with VMware using NAS and SAN as a way to distribute storage to all the virtual machines running on our servers. They introduced the concept of a virtual disk, an abstraction layer where we now had something like a VMDK.

A
That was really block-emulation storage under the covers. And the cloud has proliferated this at a much larger scale. Now we have virtual switches and virtual machines, instance types, storage still abounds, and we have a lot of virtual disk abstraction layers. But it's changed the way we deal with things today, and it's made life significantly easier for most people.

A
We can now do on-demand infrastructure, which is wonderful. As developers or system administrators, or any company that needs compute on demand, you can issue an API call or click a button and all of a sudden you get a machine, an entire network set up, your own virtual private cloud configuration, and you only pay for what you need. So I don't have to worry about going and buying a huge rack of storage.

A
I just create some EBS volumes and I'm good to go, and if there's a service I need, there's a whole catalog of managed services that I can put around it and just leverage. So I get instant scalability and unlimited virtual block devices. But unfortunately it's not all rainbows and unicorns.
A
You still have the possibility of stranded storage: I can't move things out of this EBS layer onto another virtual machine. I've now introduced capacity limits, which are things we were trying to get rid of, and I have a limited number of controllers and drives that I can put on a virtual machine.

A
What's interesting is that cost has been a big factor. The cloud allows me to pay as I go, but it also means I don't want to overspend. So the way I build up my virtual machine is: I assign a small number of gigabytes to it, fill it up, buy some more, fill it up, buy some more, fill it up. That leads to imbalanced space and potentially some performance issues.

A
So as part of this project we wanted to step back and ask: is there a fresh perspective on solving some of these things that we had solved in 2005 and are now having to solve again? And we thought to ourselves, what about object storage? The cloud has proliferated this abstraction of abstracting your disks out to objects, where you just have a key and can get access to any object, and that was kind of interesting.

A
It was totally scalable, with unlimited capacity, all great things which, especially for a file system that was designed to be scalable and have unlimited capacity, seemed to be a perfect marriage. And it was cost effective: you didn't have to pay for the huge infrastructure that EBS volumes provide, because these weren't always NVMe drives sitting behind some storage controller.

A
They might actually be spinning metal, and it was a very simple API to connect to. In a sense, this was another way of looking at pooled storage: what object store has done is pool all the drives together, expose them through a very simple API, and make that accessible to the application.

A
I don't have to go shrink anything. I just remove objects, and I'm only paying for the space that I'm using. So instead of allocating a 100-terabyte drive, putting 20 terabytes of storage in it, and still paying for 100 terabytes: if I put 20 terabytes in S3, that's all I pay for, and I can scale as much as I need.
A
So we wanted to look at that and say: okay, could we actually overcome this? Are there ways to overcome the latency concerns? We could provide a read cache layer; well, ZFS already has an L2ARC component, so great, we could leverage that. We need a way to buffer writes so that we don't have to wait for the long latency when we issue a synchronous write; hey, we have a solution for that too, we can put in a SLOG. And then lastly, we need to talk to S3.

A
But if I'm doing something where I need 4K or 8K blocks, it's going to be expensive. And L2ARC, as good as it is working with SSDs, we needed something that could scale, and it has a pretty high penalty in its memory footprint. FUSE itself has some issues. In addition, one of the things we discovered is that if you try to destroy a pool that's built in this model...

A
And then, when we stepped back and looked at our problem at Delphix: we want to be able to handle extensive random reads and writes. We're dealing with databases, which is one of the no-nos for using object store. So we needed to be able to have small blocks, in our case 8K, which when compressed average around 3K, and we wanted to take advantage of the lower-cost storage: again, pay for only what you allocate.

A
We also have customers that aren't system administrators, so the thing that was so wonderful about object store is that it's got a simple management layer. You don't have to worry about adding storage, removing storage, moving this over here, rebalancing that. Again, a lot of the principles that ZFS was built on, objects kind of mimic. And having that simple management layer also meant we wouldn't have the user errors that can often crop up.
A
So everything in the kernel still talks about ZFS blocks, and it's when we ship these blocks to the object agent that we actually convert them into S3 objects. And then the ZettaCache, which is also a userland component, is also talking blocks. So we have the ability to cache blocks and ship objects, and the rest of ZFS remains more or less untouched.

A
So then you're probably saying, well, why Rust? The big thing here is that Rust is fast. It's got a lot more semantics, it's much richer than the C programming language, and it gives us a lot of additional libraries that we can take advantage of. But the biggest feature is its safety net.

A
Of those 1700 lines we've had to deal with data races, things in the I/O pipeline, stalls, deadlocks, and they've taken us a long time to go solve. We haven't had any of those issues in the userland Rust code, so it's a big benefit to us. But this talk isn't about Rust.

A
The biggest thing is that the agent is independent. I mentioned that we wanted some fault isolation, so having the ability for it to die and restart was key for us, which also means the kernel has to be able to resume, and to detect when the agent has died. So if you're in the middle of a transaction group, you need to be able to say: oh, I noticed the agent didn't finish pushing out all my transactions.

A
I can't assume what it knows, what it went out to do, or how far it got, so I just need to push it all again, and it picks up and remembers where it left off. So in a way there's this concept of out-of-band communication that knows when to send data, what parts of the data to send, where it needs to restart, and so forth.
A
When we allocate a block, we're never going to reuse that block ever again. This let us do some very interesting things and simplify things like sync-to-convergence. We no longer have to worry about multiple passes over the data, waiting to do things like "don't compress during this pass" or "overwrite blocks during this pass". All of that got simplified: we're able to simply say, we're going to allocate a block and ship it to the object agent.

A
It's the one that's responsible for taking offsets, or DVAs, which is what the rest of ZFS uses, and converting them to block IDs, which is what the object agent wants to talk about. And it's in this layer where we do all our resume logic, so if we get a disconnect, we simply pick up and resume from there.

A
As for the ZFS object agent, it will take these block IDs and convert them to object IDs; Matt will talk a little bit more about the details of how it does that. It's also totally responsible for all the object management: it knows which objects are currently cached, which objects it needs to go fetch, how it maps from block ID to object ID, when it has to do consolidation, when it actually has to do frees, and it's responsible for all the S3 communication and authentication. And then the new ZettaCache is a simple implementation of an on-disk LRU cache; this afternoon there'll be a whole talk that describes the details of it.
B
We call it object ID 2, it contains block ID 2, and we put it in the object store using a key that looks like this. Everything goes under zfs/, and then this number is the pool GUID, so all the objects related to this pool will be under there; "data" is for data objects; and then object ID number two. And then the next block that the kernel allocates...
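To make that layout concrete, here is a minimal sketch in Rust (the agent's implementation language) of the naming scheme just described. The struct and function names are illustrative assumptions, not the agent's actual API.

```rust
// Sketch of the key layout described above: everything for a pool lives
// under "zfs/<pool guid>/", and data objects under "data/<object id>".
struct PoolKeys {
    pool_guid: u64,
}

impl PoolKeys {
    // e.g. data_key(2) -> "zfs/<pool guid>/data/2"
    fn data_key(&self, object_id: u64) -> String {
        format!("zfs/{}/data/{}", self.pool_guid, object_id)
    }
}
```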
B
Maybe
it's
bigger.
We
put
it
into
a
bigger
object.
Give
it
black
head
the
next
block
at
e3
in
the
next
object:
id3
put
it
in
there.
So
this
is
really
simple.
As
long
as
your
record
size
is
pretty
big,
then
this
is
going
to
probably
work.
Just
fine
and
you'll
have
performances.
Kind
of
theoretically
should
be
similar
to
s3
backer.
B
B
So
if
all
you
care
about
is
large
blocks,
then
that's
pretty
much
it
like
talks
over.
You
know
I'll
go
home,
but
so
why
why
why
that
caveat
like?
Why
does
this
only
work
for
big
blocks?
Well,
it's
because
we
need
to
use
big
objects
to
get
good
performance,
so
this
graph
is
showing
us
how
throughput
changes
maximum
throughput
changes
as
we
increase
the
object
size,
and
we
want
to
be
up
here
at
you
know
where.
B
In order to get good overall throughput, you need to either increase the object size, which is moving to the right on this graph, or you need a larger queue depth: more PUT-object operations going on concurrently at the same time. That's like switching from this red line, which is 100 concurrent PUTs, to this blue line, which is a thousand concurrent PUTs. So we saw that, for the parameters we cared about, around one megabyte makes sense as a starting point.

B
Maybe two megabytes would be a little bit better depending on the queue depth, but the key thing here is that we want to get good throughput when we're writing without having a ridiculously large queue depth. I mean, 100 is already pretty large.
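As a rough model of the trade-off shown in the graph (an assumption for intuition, not a number from the slides): with Q concurrent PUTs of S-byte objects, each taking latency L, sustained write throughput is about

```latex
T_{\text{write}} \approx \frac{Q \cdot S}{L}
```

So at, say, Q = 100 and S = 1 MiB with an assumed L of 50 ms per PUT, you would expect on the order of 2 GiB/s, and halving the object size halves the throughput unless the queue depth doubles.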
B
You
run
into
limits
like
number
of
file
descriptors,
that
you
can
actually
have
open
in
a
process
and
other
other
things
that
impact
efficiency.
When
going
to
really
really
huge
queued
ups,
all
right,
so
that
works
fine
as
long
as
you
have
big
objects.
But
sorry
as
long
as
you
have
big
blocks,
you
can
just
say
one
block
per
object.
But
what,
if
you
are
storing
databases-
and
your
database
has
an
average
of
three
kilobyte
compressed
block
size?
B
Then
that's
not
going
to
work.
Three
kilobyte
object
sizes
would
have
very,
very
poor
performance.
So
what
we're
going
to
do
is
combine
a
bunch
of
blocks
into
one
big
object.
So
in
this
example,
we
have
like
around
300
blocks
being
combined
into
one
roughly
one
megabyte
object
and
again
we're
going
to
store
it.
B
As we're writing, the kernel is going to be sending us a bunch of blocks, in sequential block ID order: first it gives us 123, then 124, 125, et cetera. The agent is going to batch those up until we have about a megabyte of data, and then do a PUT-object to store it all as one large object.
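A minimal sketch of that batching step, assuming simplified types (the real agent is asynchronous and tracks much more state):

```rust
// Accumulate blocks arriving in increasing block-ID order until ~1 MiB,
// then the caller PUTs one object and starts a new batch.
const TARGET_OBJECT_SIZE: usize = 1 << 20; // ~1 MiB

struct PendingObject {
    blocks: Vec<(u64 /* block id */, Vec<u8> /* data */)>,
    bytes: usize,
}

impl PendingObject {
    fn new() -> Self {
        PendingObject { blocks: Vec::new(), bytes: 0 }
    }

    // Returns true when the batch is full and should be PUT as one object.
    fn add(&mut self, block_id: u64, data: Vec<u8>) -> bool {
        self.bytes += data.len();
        self.blocks.push((block_id, data));
        self.bytes >= TARGET_OBJECT_SIZE
    }
}
```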
B
And the object contents are self-describing. If we look at what's actually in this object in the object store, it has the data and it also has a description of what the data is. It says: I have block ID 123, it's three and a half kilobytes, and it's at this offset within the object. And the block IDs we're talking about here are allocated sequentially and never reused.
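The self-describing layout could look roughly like this; the field names and exact encoding here are assumptions for illustration, not the agent's on-disk format.

```rust
// A directory of (block id, offset, size) entries plus the concatenated
// block data, so any reader can locate a block without outside metadata.
struct BlockEntry {
    block_id: u64, // allocated sequentially, never reused
    offset: u32,   // byte offset of this block within `data`
    size: u32,     // compressed size, e.g. ~3.5 KiB
}

struct DataObject {
    entries: Vec<BlockEntry>, // the "description of what the data is"
    data: Vec<u8>,            // all block contents, back to back
}
```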
B
So
once
we've
written
block
id
124,
the
kernel
is
never
going
to
come
back
and
say:
hey
like
change
the
contents
of
block
of
d124,
it's
just
always
increasing
for
forever.
B
So
hopefully
this
the
right
code
path,
kind
of
makes
sense
here,
you're,
just
batching
it
together,
speeding
them
all
up.
But
what
about
reads
so?
If
you
want
to
do
a
read,
the
kernel
is
going
to
send
over
send
up
a
request,
saying
please
get
the
data
for
this
block
id
and
we
need
to
figure
out
which
object
has
that
data
inside
of
it.
So
the
way
that
we
do,
that
is
by
keeping
a
mapping
in
memory
that
maps
from
the
object
id
to
the
minimum
block
id.
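One way to realize that lookup (a sketch, not the agent's code) is to key an ordered map by each object's minimum block ID. Since blocks are packed into objects in increasing block-ID order, the object holding block B is the one with the greatest minimum block ID that is at most B.

```rust
use std::collections::BTreeMap;

struct BlockMap {
    // min block id of each object -> object id
    by_min_block: BTreeMap<u64, u64>,
}

impl BlockMap {
    // Find the object that contains `block_id`: the entry with the
    // greatest key <= block_id.
    fn object_for_block(&self, block_id: u64) -> Option<u64> {
        self.by_min_block
            .range(..=block_id)
            .next_back()
            .map(|(_, &object_id)| object_id)
    }
}
```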
B
There is obviously some memory cost to this. In the current naive implementation it's about 16 bytes of memory for each one-megabyte object, which comes out to, if you have 100 terabytes of data in your storage pool, about one and a half gigabytes of RAM. Maybe that's acceptable.
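The arithmetic behind that figure:

```latex
\frac{100\,\mathrm{TB}}{1\,\mathrm{MiB}\ \text{per object}} \approx 10^{8}\ \text{objects},
\qquad
10^{8} \times 16\,\mathrm{B} \approx 1.6\,\mathrm{GB\ of\ RAM}.
```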
B
I
mean
100
terabytes
is
you
know
pretty
big,
at
least
for
databases,
but
we
think
we
can
do
a
lot
better
than
that
because,
like
if
you
look
at
this
table,
you
can
see
you
know
the
object
ids.
We
we
out,
we
allocated
eight
bytes
for
that,
but
really
each
one
is
just
sequentially
the
next
number.
So
maybe
we
can
just
get
rid
of
that
entirely.
The
block
ids
they're,
not
sequential,
but
they
can't
differ
by
you-
know
2
to
the
64..
Maybe
they
differ
by.
B
You
know
2
to
the
10
at
most
or
something
so
maybe
we
can
get
away
with
just
like
10
bits
of
information
encoding,
the
delta
between
each
next
entry
here.
So
I
think
all
things
considered,
we
should
be
able
to
get
this
down
to
about
a
quarter
of
that
which
is
even
more
reasonable,
400
megs
of
ram
for
each
hundred
terabytes.
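A sketch of that compression idea (the exact encoding is an assumption; the talk only commits to roughly 10 bits per delta): drop the consecutive object IDs entirely and store each minimum block ID as a small delta from the previous one.

```rust
// Encode: keep the first min-block-id in full, then only the gaps.
// Assumes deltas fit in 16 bits; a real encoder would fall back to a
// wider encoding for large gaps.
fn encode_deltas(min_block_ids: &[u64]) -> (u64, Vec<u16>) {
    let first = min_block_ids[0];
    let deltas = min_block_ids
        .windows(2)
        .map(|w| u16::try_from(w[1] - w[0]).expect("delta fits in 16 bits"))
        .collect();
    (first, deltas) // ~2 bytes per entry instead of 16
}

// Decode: rebuild the full list by accumulating the deltas.
fn decode_deltas(first: u64, deltas: &[u16]) -> Vec<u64> {
    let mut ids = vec![first];
    for &d in deltas {
        let next = ids.last().copied().unwrap() + u64::from(d);
        ids.push(next);
    }
    ids
}
```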
B
I
mean,
of
course,
we
need
to
keep
this
mapping
on
disk
as
well
or
in
the
object
store.
I
should
say
persistently
so
that
if
we
crash
we're
able
to
regenerate
this
this
in
memory
table-
and
by
crash
I
mean
if
the
object
restarts
so
there
might
not
be
a
full
system
crash,
but
it
might
just
be
that
the
object
process
the
agent
process
is
restarted.
B
Then
we
need
to
read
this
back
into
memory.
So
how
do
we
store
the
mapping
on
disk?
B
Basically,
we
store
in
a
log.
The
log
is
just
logically
speaking.
It's
like
an
array
of
entries.
The
entries
are
going
to
tell
us.
What's
the
object
id
and
what's
the
block
id
associated
with
it,
the
minimum
block
id-
and
this
is
going
to
be
split
into
a
bunch
of
objects.
B
B
The
pool
guide
now
we're
in
the
object
block
map
namespace
and
then
this
is
object,
zero
of
that
and
that
that'll
contain
maybe
the
first
ten
thousand
entries
in
the
next
ten
thousand
entries
in
object
id
one.
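A sketch of that chunked log, with assumed names (the ten-thousand-entry chunk size comes from the description above, but the exact "object-block-map" key format is a guess):

```rust
// One log entry per data object: which object, and the lowest block ID
// it contains.
struct MapLogEntry {
    object_id: u64,
    min_block_id: u64,
}

const ENTRIES_PER_CHUNK: usize = 10_000;

// e.g. entry #23_456 lives in chunk 2 of the log:
fn chunk_key(pool_guid: u64, entry_index: usize) -> String {
    let chunk = entry_index / ENTRIES_PER_CHUNK;
    format!("zfs/{}/object-block-map/{}", pool_guid, chunk)
}
```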
B
So
you
know
what
what
comes
down
to
is
basically
every
transaction
group
we're
going
to
be
appending
a
new
object,
or
maybe
a
couple
objects
to
this
log.
That
indicates
what
what
other
objects,
what
data
objects
were
allocated
in
which
blocks
they
contain
and
then
periodically
we're
going
to
have
to
these
objects
might
be
kind
of
small,
so
you
may
end
up
with
a
lot
of
them
and
so
periodically.
We
might
want
to
condense
these
by
rewriting
the
this
log.
B
B
All
right,
so
that
kind
of
covers
like
what
happens
when
we're
writing,
what
happens
when
we're
reading?
There
is
one
other
operation
that
we
might
want
to
care
about,
which
is
freeing.
So
how
do
we
reclaim
space?
That's
no
longer
needed,
so
there's
a
couple
problems
that
come
up
when
we
think
about
this
when
you
have
a
bunch
of
blocks
within
one
object.
B
So
in
this
example,
we
have
one
object,
object
at
e4,
I
mean
it
has
a
bunch
of
bunch
of
blocks
and
the
kernel
says
I'm
going
to
free
block
id
125..
So
what
the
free
means
is.
It
just
means.
I
promise
that
I'm
never
going
to
read
this
block
id
again
and
so
fyi
I
you
can
do
whatever
you
want
with
that,
preferably
maybe
you
use
less
object,
less
less
data
in
the
object
store
so
that
amazon
charges
you
less.
B
B
So
this
works
and
we're
overwriting
the
object
in
place
so
we're
basically
replacing
its
contents
with
the
new
contents
and
because
the
object
contents
are
self-describing
there.
It's
pretty
cool
that
there's
no
race
conditions
here
so
like
if,
if
the
reclaim
is
going
on
in
the
background-
and
we
don't
have
any
locking
and
somebody
else,
some
other
thread
is
trying
to
read
block
id
128.
B
Then
they
can
do
a
get
object
and
if
they
get
the
old
object
contents
or
if
they
get
the
new
object
contents,
it's
going
to
work
either
way.
So
because
the
object
is
self-describing,
whichever
one
they
get
will
tell
them
where
to
find
block
id
128
within
the
object
and
that's
fine,
but
it
does
cost
us
a
lot
of
throughput
so
to
process
this
one
free
of
three
kilobytes.
B
Maybe,
for
example,
we
were
doing
it
over
a
logical
overrate
of
some
random
block
of
our
database
file
and
we
so
we
we
wrote
a
new
three
kilobytes
somewhere
else,
some
new
object
and
we're
freeing
this
old
one
because
there
wasn't
a
snapshot
of
it.
So
we
wrote
three
kills.
We
in
order
to
write
this
three
kilobytes,
we
had
to
write
the
three
kilobytes
of
new
data
into
some
big
new
object,
but
then
we
also
had
to
read
a
whole
megabyte
and
then
write
almost
another
megabyte.
B
So
you
know
we're
talking
about
600x
io
inflation.
With
this
naive
implementation.
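Where the 600x comes from, using the roughly 3.5 KB block from the earlier example:

```latex
\frac{1\,\mathrm{MiB}\ \text{read} \;+\; {\sim}1\,\mathrm{MiB}\ \text{rewritten}}
     {{\sim}3.5\,\mathrm{KB}\ \text{freed}}
\approx 600\times\ \text{I/O inflation}.
```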
B
So
the
solution
to
this
is
batching,
so
when
we
get
when
the
kernel
says,
I
no
longer
need
block
at
a125.
B
We
say
that's
great,
I'm
gonna,
remember
that,
but
not
do
anything
about
it
right
now
and
I'm
gonna
wait
until.
Hopefully
we
get
a
bunch
of
other
frees
of
blocks
that
are
also
in
the
same
object
and
then
I'm
gonna
process,
those
all
at
once.
So
maybe
we
get
you
know
200
out
of
the
300
blocks
in
this
object
are
freed
eventually,
and
we
now
we
can
read
this
megabyte
and
then
write
out
a
smaller.
B
You
know
about
half
a
megabyte
less
than
half
a
megabyte,
so
we've
improved
the
situation
a
lot.
You
know
it's,
it's
170x
better
than
the
naive
implication,
implementation
on
the
previous
slide,
but
there
is
still
some.
I
o
inflation
because
we
do
have
to
read,
read
this
and
then
write
it
out,
even
though
it
processes
a
lot
of
freeze,
there's
still
some
costs
associated
with
it.
B
So
the
key
to
getting
good
performance
out
of
this
reclaim
process
is
to
make
sure
that,
when
we're,
when
we're
rewriting
an
object,
it
has
lots
of
freeze.
So
how
do
we
find
the
objects
that
have
lots
of
freeze?
Well?
B
There's
a
bunch
of
like
tunables
here,
and
we
can
probably
be
smarter
about
this
in
the
future,
but
right
now,
there's
like
a
tunable
that
says
like:
what's
the
how
that
basically
tells
us
like
how
long
do
we
wait
until
we
reclaim
and
that's
just
like
a
percent
of
the
total
pool
size,
so
we
say
like
hey,
you
know,
whatever
your
pool
size
is
having
ten
percent
of
it
free,
but
not
yet.
B
Reclaimed
is
okay,
basically
means,
like
you
know,
you're
going
to
be
tank,
paying
10
more
for
your
storage
costs
than
if
we
were
extremely
aggressive
about
reclaiming
the
free
space.
But
the
trade-off
is
that
the
reclaim
is
much
more
efficient
and
uses
less
throughput.
B
So
how
do
we
keep
track
of?
We
keep
track
of
the
blocks
that
have
been
freed,
but
not
yet
reclaimed
in
this
reclaimed,
log,
which
is
basically
just
an
array
of
like
this,
is
the
block
id,
and
this
is
the
size.
The
sizes
are
used
to
go
in
to
find
the
objects
that
have
the
most
free
space.
When
we
go
to
do
the
reclaim.
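A sketch of how the reclaim log could drive that selection (assumed types; `object_for_block` stands in for the block-to-object lookup from earlier): sum the freed bytes per object, then reclaim the objects with the most freed space first.

```rust
use std::collections::HashMap;

// One entry per free received from the kernel.
struct FreeEntry {
    block_id: u64,
    size: u32,
}

// Aggregate freed bytes per object; callers would then sort descending
// by freed bytes to pick the best reclaim candidates.
fn freed_bytes_per_object(
    log: &[FreeEntry],
    object_for_block: impl Fn(u64) -> u64,
) -> HashMap<u64 /* object id */, u64 /* freed bytes */> {
    let mut freed = HashMap::new();
    for e in log {
        *freed.entry(object_for_block(e.block_id)).or_insert(0u64) += u64::from(e.size);
    }
    freed
}
```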
B
And
so
the
reclaim
log
you
know
as
we're
getting
freeze
we're
appending
to
it.
Then
we
need
to
load
that
reclaim
log
into
memory
in
order
to
find
which
objects
have
the
most
free
space,
so
that
uses
up
some
memory.
Actually,
we
want
to
be
able
to
have
like
as
much
outstanding
freeze.
As
you
know,
you
want
to
set
that
tunable
suit
without
having
a
impact
on
the
memory
requirement.
B
B
So
you
know
this
this
example
here
is:
you
know
to
process
to
load
10
million
freeze,
but
we
can
have
a
bunch
of
we
have
as
many
logs
as
we
need,
so
that
no
log
has
more
than
10
million
freeze,
for
example,
to
limit
the
memory
used
to
co.
You
know:
limited
memory
is
to
a
constant,
regardless
of
the
pool
size
and
the
amount
of
outstanding
reclamable
space.
B
All
right
so
there's
a
couple
more
problems.
First,
off
the
object
block
map
memory
usage
is
going
to
keep
growing
and
growing
over
time.
So,
as
we're
doing,
writes
we're
allocating
new
objects,
we're
appending
new
things
to
the
object
of
block
mapping,
we're
keeping
track
of
those
in
memory
in
the
table.
That
looks
like
this
and
the
issue
is
that,
if
you're,
if
you
so,
if
you're,
just
writing
and
you're,
not
really
freeing
much,
then
it's
not
a
big
deal.
B
But
if
you
have
a
lot
of
churn
like
say
because
you
have
a
database
and
the
database
is
like
randomly
overwriting
lots
of
different
blocks
in
the
in
the
database
files
all
the
time
and
like
writing
its
logs
and
then
overwriting
its
logs
at
the
database
level,
then
you
have
a
lot
of
churn,
meaning
like
a
lot
of
the
data.
That's
present
in
the
pool
now
won't
be
present
in
the
pool
you
know
a
month
from
now,
but
it'll
be
replaced
by
new
data.
B
So,
for
example,
you
know
if
you're
writing
at
100
megabits
per
second
on
average,
then
this
table
might
grow
to
be
like
12
g
need
12
gigabytes
of
memory
for
every
year
that
you've
been
running
this
pool,
and
you
know
we
have
customers
that
that
are
writing
at
this
rate,
at
this
average
rate,
with
our
block
based
solution
and
that
have
pools
that
are
you
know
many
years
old.
B
So
that
seems
like
not
a
great
use
of
memory
and
the
the
other
problem
is
that
the
objects
are
going
to
get
small.
B
So
as
we
as
we
rewrite
each
object,
omitting
the
free
blocks
it's
going
to
get
smaller,
and
then
maybe
we
do
that
again
and
again
until
it's
like
very
small,
and
that
means
that
if,
if
we
have
it,
if
we
need
to
like
do
a
table
scan
or
read
through
a
whole
file
getting
all
of
its
blocks
now
we
have
to
read
a
whole
lot
of
small
objects,
which
has
the
same
problem
as
large
objects
to
a
little
bit
lesser
degree.
B
It's
also
latency
dominated
the
latency,
is
kind
of
at
least
20
milliseconds.
So
that's
that's
less
than
half
of
the
puts,
but
it's
still
pretty
significant
and
again
you
want
to
be
using
large
objects
to
get
very
good
throughput
with
reasonable
q
dots.
B
How
do
we
address
that
problem?
We
address
it
with
object
consolidation,
so
when
we're
processing,
when
we're
doing
reclaim
and
processing,
freeze
we're
going
to
look
at
several
adjacent
objects
in
the
freeze
that
are
associated
with
all
of
them
and
we
we
might
see
okay
in
object,
id
4
we're
freeing
most
of
it.
The
yellow
blocks
are
the
ones
that
are
being
freed
and
the
blue
blocks
are
the
ones
that
we
need
to
retain,
but
maybe
you
know
there's
only
a
few
of
them.
Those
don't
add
up
to
a
megabyte.
B
They
add
up
to
like
some
small
fraction
of
a
megabyte.
So
let's
look
at
the
next
sequential
object.
Id
so
object
id5
and
see
what
needs
to
be
freed
there.
Okay,
a
bunch
of,
is
being
freed.
We
need
to
retain
a
little
bit
here.
Let's
add
those
to
this
object
and
as
long
as
it's
not
a
megabyte.
Yet
we
keep
on
doing
that
accumulating
more
and
more
you're.
Consolidating
the
the
retained
blocks
from
more
and
more
objects
into
this
one
object.
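A sketch of that accumulation loop (assumed types): walk adjacent object IDs, count only their retained bytes, and stop once the survivors add up to roughly a full object.

```rust
const TARGET: u64 = 1 << 20; // ~1 MiB, the size of a "full" object

struct ObjectInfo {
    object_id: u64,
    retained_bytes: u64, // bytes of blocks NOT being freed
}

// Returns the run of adjacent objects whose retained blocks will be
// merged into one new full-size object.
fn pick_consolidation_run(objects: &[ObjectInfo]) -> &[ObjectInfo] {
    let mut total = 0u64;
    let mut end = 0;
    for (i, obj) in objects.iter().enumerate() {
        total += obj.retained_bytes;
        end = i + 1;
        if total >= TARGET {
            break;
        }
    }
    &objects[..end]
}
```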
B
That's
going
to
replace
them
all
so
to
do
this,
you
know
we
can
kind
of
figure
out
all
of
this
based
on
the
in-memory
metadata
without
having
to
read
any
of
the
data
objects,
and
then
we
can
then
we
need
to
read.
You
know
all
three
of
these
data
objects
and
then
write
just
the
one
new
large
one
and
then
eventually
delete
these
object.
Ids
four
and
five
that
are
no
longer
needed
at
all.
B
I'm
sorry
object,
ladies
five
and
six
that
are
no
longer
needed
after
we've
persisted
the
object
to
block
mapping
change
so
in
the
object
of
block
mapping,
we're
going
to
remove
the
entries
for
object,
ids,
five
and
six
in
the
on
this
log,
we're
gonna,
add
a
free
type
entry
saying
that
these
are
no
longer
needed
and
then
we're
going
to
remove
them
from
the
in-memory
table.
B
So
now
you
know
there'll
be
some
object
id
seven.
You
have
an
entry
for
object,
four
and
object
seven,
and
then
you
know
we
know
that
all
of
the
blocks
between
them
are
in
object.
Four,
because
five
and
six
aren't
there.
B
Cool,
so
this
this
works
pretty
well,
there's
a
lot
of
work
that
we
can
still
do
in
the
future
to
improve
this
even
further.
So
like
one
of
the
ideas,
is
right
now
we're
kind
of
batch
processing,
the
the
reclaim.
So
it's
like,
we
start
reclaiming.
We
do
it
as
fast
as
possible.
We
stop
doing
the
reclaim.
We
wait
until
you
hit
the
high
water
mark
and
then
do
it
again.
B
We
could
probably
improve
on
the
behavior
of
that
by
chipping
away
at
it,
like
every
transaction
group,
a
little
bit
little
by
little
and
then
having
like
some
kind
of
feedback
mechanism
that
tells
us
like
how
fast
we
need
to
be
chipping
away.
Based
on
how
close
you
are
to
getting
to
the
high
water
mark,
you
might
also
want
to
have
different
kind
of
controls
over
like
when
this
happens.
B
You
know
at
night
or
on
the
weekends
you
might
want
to
have
some
like
minimum
efficiency
setting
where
you're
like
yeah-
I
I
don't
really
mind
paying
for
that
extra
storage
space
for
a
while,
but
I
really
don't
want
to
impact
my
network
throughput
by
doing
lots
of
guests
and
puts
so
you
know
I
want
to
say,
like
only
do
it
when
you're
able
to
free
like
90
of
an
object
or
something
like
that,
but
you
know
absent
those
future
performance.
How
does
it
actually
work
today?
B
So
this
is
a
graph
showing
the
x-axis.
The
x-axis
is
time.
This
is
about
30
seconds
here
and
the
the
y-axis
is
well.
The
the
green
red
and
blue
are
network
throughput,
and
then
the
the
magenta
here
is
the
amount
of
space
that's
allocated
in
the
storage
pool,
so
what's
happening
is
we're
going
along
cheek.
Studio
sync
is,
is
putting
data
to
the
object
stored
at
about
700
megabits
per
second,
then
we
start
a
reclaim
and
then
we're
going
to
be
reading
and
writing
data.
B
We
can
do
that
very
quickly
and
then
we
go
back
to
that's
done
and
we
reduce
the
amount
of
allocated
space.
So
the
workload
here
is
random,
writes
about
15
000
iops
into
a
one
terabyte
file,
completely
random
locations
with
record
size,
8k
and
the
average
compressed
block
size
is
3k.
So
this
is
what
I
would
consider
a
very
demanding
test.
B
B
So
on
this
on
this,
you
know
very
demanding
workload,
we're
seeing
that
we're
still
able
to
process
freeze
at
about
1400
megabytes
per
second
versus
the
the
writes
are
about
700
megabytes
per
second,
so
it
kind
of
makes
sense
that,
like
half
the
time
we're
reclaiming
and
half
the
time,
we
don't
need
to
be
reclaiming.
B
So
you
know
for
our
workload,
which
is
this-
you
know
pretty
demanding
workload
the
because
we
aren't
reclaiming
all
the
time
we're
able
to
keep
up,
and
we
can
see
that
the
impact
on
the
rate
that
we're
ingesting
data
with
with
txt
sync
writes.
It
goes
down
a
little
bit,
but
not
not
that
substantially,
while
we're
in
the
middle
of
the
reclaim.
B
So
you
know
the
kind
of
conclusion
for
us
was
you
know,
there's
spare,
there's
spare
network
bandwidth
and
we're
able
to
do
this
in
parallel
with
ingesting
the
rights
still
at
a
good
throughput.
So,
even
despite
not
having
some
of
these
some
of
that
future
work,
it
looks
like
it's
good
enough
for
our
workloads,
and
you
know
for
more
typical
workloads
where
you're,
like
the
record
size
is
the
default,
and
most
of
my
data
is
in
large
objects.
B
It
is
in
large
files
and
they
have
the
you
know
record
size
of
128k
or
maybe
even
more
mostly.
What
I'm
doing
is
accessing
files
sequentially
really
this
this
solution
is,
is
vastly
over
engineered
for
that
kind
of
use
case
and
the
really
the
impact
of
reclaim
is
going
to
be
negligible
on
that
kind
of
use
case,
but
we
wanted
to.
B
Obviously
we
wanted
to
make
this
work
for
our
use
case,
and
it's
also
really
satisfying
that
you
know
this
works
for
a
wide
variety
of
use
cases
we
don't
have
to
say,
like
oh,
like
zfs,
on
object
store
as
long
as
you're,
just
using
it
for
archival
or
backup
or
whatever.
This
is
really
it's
usable
for
general
purpose,
workloads,
I'm
even
including
some
of
the
most
demanding
ones.
Obviously,
there's
performance
concerns
there,
but
the
the
ultimate
the
bottom
line
is
that
performance
is
very
good.
C
All right, okay. So, as you've been told, the ZFS object agent is a userland thing, so we'll start by kicking it off, and we'll let it run in that window. So we are invoking the ZFS object agent.

C
As George mentioned, we have a new vdev type, the s3 type, and we have to specify the bucket, the place where all the objects will reside, and in this case I'm going to add in a log device as well. So the pool is going to get created, and that's the pool for you. If you notice, just like the devices are specified, the bucket is specified here, and you can see the NVMe log device there as well.

C
I started off the agent by hand, but you probably want to be using some system management service like systemd to run it. So let me pause the video just a second. Again, as George and Matt mentioned, the agent is designed such that it can be killed and restarted, or it can crash and come back up, and things keep going just fine. So it's resilient.
C
So
that's
exactly
what
I'm
doing
here,
I'm
killing
off
the
agent
the
pool
is
still
online
and
I'm
going
to
start
it
off
as
a
systemd
service,
so
the
systemd
service
is
on
is
is
going
and
let's
take
a
look
at
the
log
on
this
window,
the
debug
logs.
While
we
proceed
with
the
demo.
C
C
So
matt
talked
about
workloads
where
you
have
a
8k
record
size
compression
is
turned
on
and
you
have
a
large
file
and
you
do
random
writes.
Let's
do
just
the
same
thing.
Okay,
let's
create
a
large
file
and
we'll
we,
we
have
a
z,
first
file
system
with
eight
k,
record
size,
compression
turned
on
and
let's
go
do
some
random
bytes,
so
we're
going
to
first
create
a
large
file
and
I'm
going
to
speed
this
up.
C
All
right
and
next,
let's
do
some
random
rights,
and
then
we
are
going
to
use
iostat
to
see
how
it
looks
and
iostat
has
been
improved
or
enhanced
with
the
dash
option
so
that
all
object
stories
things
are
listed.
So
if
you
look
at
the
first
column,
that's
that's
the
traffic
between
the
kernel
and
the
agent.
C
The
second
column
tells
you
about
the
traffic
between
the
agent
and
s3,
the
the
object
store.
The
metadata
traffic
is
captured
and
reclaimed
happens.
Man
went
over
that
quite
a
bit,
so
reclaim
happens
as
well.
So
the
the
thing
to
notice
here
is
that
we
are
doing
about
100
meg
of
throughput
and
the
the
the
throughput
between
the
agent
and
the
s3
is
also
roughly
close,
but
the
number
of
objects.
The
the
operations.
C
There
is
a
significant
difference
about
150
operations
from
the
agent
to
the
s3
bucket,
but
about
36k
of
operations
is
what
the
kernel
is
doing.
C
C
All
right,
let's
kill
that
workload
and
let's
look
at
something
else,
so
import
export
now
export
is,
is
fairly
straightforward.
It's
it's!
You
export
a
pool
and
you're
done
nothing
special
about
this,
but
import
as
you
might
imagine,
likes
equal,
create
it
needs
additional
parameters,
it
needs
to
be
told
they
have
to
go
import
stuff
from
the
the
endpoint,
the
region
and
the
bucket
the
s3
bucket.
So
the
dash
d
option
specifies
the
the
bucket
think
of
it.
C
It
takes
a
little
bit
of
time,
we'll
talk
about
that
in
a
bit
about
five
seconds,
and
we
should
be
done
so
we
are
done.
Let's
take
a
look
at
the
z
pool
list,
so
pool
one
is,
should
be
there
yeah
it's
there
and
it.
We
can
also
do
a
search
import
where
you
tell
zfus
import
to
go,
look
for
tools,
and
so
let's
try
that
on
our
bucket
and
interestingly,
we
have
a
pool
that
we
did
not
create
on
this
vm
cool
2..
So
why
don't
we
just
go?
C
Pull
two:
let's
go,
try
and
import
that
and
we
hit
a
failure.
Specifically,
it
says
that
the
pool
can't
be
imported
because
it's
currently
hosted
on
a
different
host.
It's
imported
on
a
different
host.
What
exactly
happened
here?
Hey
paul!
Can
you
walk
us
through
this.
D
Sure thing. So this is an example of a feature called multi-modifier protection, or MMP. ZFS already has this feature implemented in the kernel; it's useful for situations where you have storage area networks or network-attached storage, where multiple systems could potentially try to import the pool at the same time.

D
We re-implemented this in the agent for a very important reason, and that's that it's extremely simple, in the cloud and object storage use case, for multiple systems to be accessing the same bucket and the pools within that bucket, and with some pretty easy misconfiguration steps it's very simple for multiple systems to try to import the same pool at the same time.

D
So for that reason we decided to re-implement MMP in the agent, protecting both the ZFS data and the agent's own metadata. The checks take slightly longer, as Manoj pointed out, during the import process; it takes anywhere between five seconds and 20 seconds in a contended use case. But it's designed in such a way that it is almost impossible for multiple systems to end up succeeding at an import.
C
All right, thanks Paul. So let's do a little bit more of the MMP stuff. Let's go over to the host where pool2 is online; yeah, it's right there. And instead of cleanly exporting it and importing it on the other host, let's do something more interesting: let's just power off this other VM. We power it off, and then we are going to see if we can import the pool safely, with MMP, on the other VM.

C
All right, it took us about 34 seconds here, but the pool has been safely imported and it's online.
C
So
let
me
pause
the
video
again
to
talk
about
something
else.
The
if
you
noticed
we
had
multiple
pools
on
the
same
bucket
and
that's
that's
that's
part
of
the
design.
That's
that's
very
much.
How
we
wanted
this
to
be
the
bucket
is,
can
be
shared
by
multiple
pools
across
multiple
hosts
and
each
pool
has
its
own
name,
space
into
which
all
the
objects
grow,
so
they
don't
clog
or
each
other
and
you
could
have
a
shared
bucket
or
you
could
have
different
lockets.
C
C
We
have
a
new
object,
endpoint
property,
the
region,
the
credentials
profile,
we'll
talk
about
that
in
just
a
bit,
and
so
those
are
the
new
properties
that
an
object,
store,
based,
vm
z,
pool
next.
Let's
talk
about
zip
will
destroy,
zip
will
destroy
for
object,
store
based,
z,
pools
are
slightly
different
different
because
we
we
or
the
customer
is
paying
for
these
es3
objects
and,
unlike
a
block
based
z,
pool
where
you
destroyed
the
block,
stay
there,
it's
marked
as
destroyed,
but
the
blocks
just
stay
there.
C
We
don't
want
that
behavior
for
object
store
when
we
want
when
we
say
destroy,
we
want
the
objects
to
go
away
so
that
we
are
not
consuming
space
on
s3
right.
So
when
you
destroy
a
pool,
what
happens?
Is
the
agent
kicks
off
a
background
task
that
goes
and
cleans
up?
All
the
objects
destroys
them
and
reclaims
the
space,
so
zee
pool
status.
C
Why
the
tool
is
being
destroyed
will
tell
you
that
it
will
list
the
pools
that
are
destroyed
and
once
the
pool
has
been
completely
destroyed,
you
have
an
additional
flag.
The
dash
dash
list
destroyed
option
that
can
give
you
a
list
of
our
tools
that
have
been
destroyed
so
that
you
can
confirm
that
your
pool
has
indeed
been
destroyed,
and
if
you.
C
Can
use
the
clear
and
it
will
go
away
now.
I
very
cleverly
did
not
deal
with
credentials
so
far,
and
let's
now
talk
about
that.
So
if
you
have
used
the
aws
cli,
then
they
take
as
credential
sources.
They
can
take
credentials
from
a
variety
of
sources,
and
for
this
demo
I
used
the
aws
identity
and
access
management
rules
that
are
associated
with
an
instance
profile,
and
that
means
that
or
what
that
gives
you
is
that
you
don't
have
to
deal
with
credential
rotation
and
things
like
that.
C
It's
done
for
you,
you
could
do
that
and
all
the
various
sources
that
the
aws
cli
takes
as
input.
The
zfs
object
agent
can
as
well.
So
you
can
use
instance
profile
if
you
like,
or
you
can
use
environment
variables,
aws
credentials
and
environment
variables.
If
that's
your
preference
or
you
can
specify
it
in
the
dot,
aws
credentials
files.
C
C
A
So what I wanted to talk about now is a little bit about performance. I mentioned before that we actually started off looking at s3backer and L2ARC and some of the existing components.

A
So it made sense for us to see where those things stand today. We ran some tests and performance numbers just using s3backer, trying to get its raw throughput.

A
You can see that we're right around the 3x to 6x range in the performance numbers, so we're getting a lot more read throughput and a lot more write throughput. That's great because S3 has a lot of throughput to give, and you're going to be limited by how much bandwidth your instance type has.

A
But what we really care about are the IOPS. So again, I wanted to start off with s3backer and what that gives us in the configuration we had, and I started off without an L2ARC; this is just going straight to the S3 device. You can see the random reads are just horrible and the random writes are all right, but nonetheless this would not be something that would satisfy us. And the latency for s3backer is actually really bad...
A
When
we
talk
talk
about
reads
so
we
needed
to
kind
of
look
at
like
you
know,
this
was
kind
of
a
non-starter
for
us
again,
looking
at
the
comparison
with
just
using
zfs
and
our
implementation
of
object
store.
We
see
that,
at
least
in
this
case
we're
able
to
lower
the
latency
of
random
reads
from
346
milliseconds
to
89
milliseconds,
so
still
not
great,
but
a
big
improvement
overall
on
kind
of
where
we're
headed.
So
we
kind
of
knew
we
were
on
the
right
trajectory.
A
But,
as
I
mentioned
in
the
talk
like
we
needed
something
better
and
that's
where
zetta
cash
is
going
to
pick
up.
So
we'll
talk
more
about
that
this
afternoon,
but
is
there
more
to
like
what
we've
solved
here
like,
as
you
start
thinking
about
kind
of
some
of
the
techniques
that
we
use
to
implement
object
store?
Can
we
go
beyond
that?
So
I
wanted
to
kind
of
like
throw
out
some
kind
of
crazy
ideas
of
like.
Where
do
we
go?
You
know.
Can
this
be
leveraged
in
other
ways?
A
You
know,
could
we
define
an
object
to
now
like
define
a
track
on
a
single
media
drive
and
then
use
some
of
the
techniques
of
like
actually
knowing
a
block
to
object,
mapping
to
define
and
move
regions
around
when
you're
actually
going
to
do
writes
you
know,
maybe
this
replaces
the
need
for
bp
rewrite.
We've
talked
about
that
for
many
years.
A
Could
we
abstract
the
block
pointer?
Now
we
have
the
ability
of
having
the
block
pointer
just
be
more,
you
know
virtualized,
and
it
doesn't
matter
where
its
location
is
because
again
using
the
object
map,
could
we
actually
determine
where
its
new
location
should
be?
If
we
need
to
move
things
around
and
there's
many
more
there's
many
things
that
you
could
think
of
the
other
thing
to
think
about
here.
Is
you
know
what
would
you
do
with
zfs
if
you
had
unlimited
storage?
B
All
right
so
there's
a
question
about
scrub
and
then
a
couple
of
kind
of
related
questions
about
scrub
and
how
does
xeeple
status
dash
v
work
at
a
high
level
I
mean
zfs
is
still
maintaining
a
checksum
of
each
data
block
stored
in
the
block
pointer
in
the
indirect
blocks,
and
so
you
can
run
zeufo
scrub.
B
It'll
go
read
every
data
block
off
of
the
object,
store
and
verify
that
the
checksums
match
and
if
they
don't
match,
then
it'll
be
reported
in
the
zepal
status
v,
just
like
normal
right
now.
B
The
performance
of
that
is
not
great,
because
we
didn't,
we
don't
have
the
like,
sequential
scrub,
optimization
hooked
up.
It
needs
to
be
updated
to
know
about
like
block
ids
versus
v
devs
and
offsets,
because
we
kind
of
shoehorn.
The
idea
of
the
block
id
like
into
the
block
pointer,
but
like
a
naive
interpretation
of
the
block
winner
that
doesn't
know
about
that
would
see
like
would
think
that
there's
overlapping
allocations.
A
Yeah,
it
might
be
worth
mentioning
that
too
that,
like
we,
although
we
didn't
talk
about
it
here
with
zetta
cash,
is
we
have
taught
scrub
to
know
about
zettacash,
yeah
and
b
have
the
capability
of
scrubbing
the
blocks
that
are
going
to
be
stored
on
the
cache
you
know
for
those
that
are
kind
of
familiar
with,
like
the
way
the
cache
normally
plums
into
the
rest
of
the
stack.
It's
like
this
is
kind
of
it's
living
below
the
I
o
pipeline.
A
So
it's
in
a
different
location,
which
meant
that
we
had
to
kind
of
treat
it
a
little
bit
differently,
but
it
actually
worked
out
really
nicely
for
us.
B
Yeah,
I
think
that's
a
great
point
of
it
does
integrate
with
the
is
that
a
cache
that
we'll
hear
about
this
afternoon.
B
There
are
a
couple
questions
that
haven't
been
answered
yet
here
so
from
thomas
wagner.
Are
there
any
active
use
cases
for
s3?
Already?
I
assume
you
mean,
like
anybody
using
zfs
on
object,
store
on
s3.
We
we
haven't,
put
this
into
production,
yet
we're
still
working
on
implementing
it.
But
you
know
the
use
case
is
the
one
that
we've
described
of
you
know.
Storing
databases
in
our
dell
fix
your
data
virtualization
product.
A
I
I
found
people
both
on
linux
and
freebsd
that
have
been
using
this
they're
using
it
primarily
for
backups
right
like
because
you
know,
as
I
mentioned
like
any
of
the
literature
you
read
object
store
is
just
not
really
designed,
for
you
know
for
low
latency,
you
know
high
transaction
type
of
applications,
and
so
the
applications
that
are
out
there
are
not
pushing
that
limit.
We're
kind
of
venturing
in
that
and
you
know,
from
what
we've
been
able
to
find
and
the
performance
numbers
we're
getting.
A
B
Cool. The next question, from powwow, is asking: what happens if you allocate an object in S3, but then the system crashes, or maybe the agent crashes? Is the object leaked, or do we somehow figure out how to delete it? Yeah, that's a great question. You can imagine something here where we're writing out these data objects as part of a txg.

B
First, let's cover the case where the system crashes. Let's say we're in the middle of a txg and we've written out object IDs five and six; those are part of the next txg. The whole system crashes, the kernel crashes, we come back up, and we open the storage pool. How do we find objects five and six to delete them?
B
We
do
so.
We
do
find
them
and
delete
them
and
the
way
that
we
do
that
is
so.
I
simplified
the
the
key
here.
A
little
bit,
we've
actually
like
forward
padded
like
padded.
Each
of
these
object
ids
with
like
it's
actually,
you
know:
zero:
zero,
zero,
zero,
zero,
zero,
zero,
zero,
five
and
the
reason
is
that
then
it
lets
us
easily
find.
We
know
what
is
the
last
valid
object.
Id
that's
basically
stored
like
in
the
equivalent
of
the
uber
block
like
in
a
per
txt
data
structure.
B
So
when
we
open
the
pool
we're
like
okay,
the
last
valid
txg
is
192.
in
that
in
that
txg
the
last
valid
object.
Id
is
four,
and
so
we
can
do
a
list,
objects
and
list
list
all
the
objects
that
are
in
this
prefix
after
object
after
the
one.
That's
zero,
zero,
zero,
zero,
four
and
that
list
will
contain
you,
know
five
and
six
and
then
we'll
delete
them.
So
that's
actually
very
quick.
An
efficient
way
to
take
care
of
that.
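A sketch of that recovery scan (assumed helpers; the listing presumably maps onto an S3 list-objects call with a start-after key): zero-padding makes lexicographic key order match numeric object-ID order, so one listing finds everything written past the last committed object.

```rust
// Zero-padded key, so "00000005" sorts after "00000004" lexicographically.
// The padding width here is illustrative.
fn padded_data_key(pool_guid: u64, object_id: u64) -> String {
    format!("zfs/{}/data/{:08}", pool_guid, object_id)
}

// `list_keys_after` stands in for the object store's listing API invoked
// with start-after = the last key the final committed txg knew about.
fn find_uncommitted(
    pool_guid: u64,
    last_valid_object_id: u64,
    list_keys_after: impl Fn(&str) -> Vec<String>,
) -> Vec<String> {
    // Everything listed here was written by an uncommitted txg: delete it
    // (system crash) or reconcile it with replayed writes (agent crash).
    list_keys_after(&padded_data_key(pool_guid, last_valid_object_id))
}
```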
B
Now the agent-crash case: the kernel doesn't know anything about objects at all. It's still running, but it doesn't know about them, and the agent has been restarted, so it doesn't know what was in flight either. In that case we do the same thing: the last txg that was written out has last object ID four, so list what objects come after it. But those objects...

B
...we don't necessarily delete, because we might be in the middle of writing them as part of this txg, and we may have already told the kernel that the writes to those completed, because they were persisted into the object store, and as far as the agent was concerned everything was done, so it told the kernel: great, those writes are done.

B
You don't need to hold on to that memory anymore. And so the kernel forgets that memory. The kernel doesn't have any way to replay those writes, because they aren't in flight anymore from its point of view. But then there might be other outstanding writes, from the kernel's point of view, that do need to be replayed.

B
So basically we have to stitch together what was left in the object store, that is, what objects are there, with the ZIO writes that were outstanding when the agent crashed, which the kernel can replay into the agent, and fill in any missing objects that hadn't been persisted.
E
Perfect. First of all, awesome talk, awesome functionality, which I've been waiting for since the leadership meeting where it was introduced. At that leadership meeting I shared my use case for this, and that would be the ability to import a pool on multiple hosts and mount datasets on those multiple hosts; I have volume migration of containers in mind.
A
So one question: in the case where you're using it, are you envisioning moving the storage pool to another host, or how are...
E
You
no
you're
replicating
the
data
completely.
I
have
in
mind
the
one
storage
pool
imported
on
multiple
hosts
and
data
sets
that
are
dedicated
for
each
host.
B
Yeah, I think that's a neat idea. It doesn't fit exactly into what we've done here, because there's still pool-wide metadata that all of the hosts would need to access and update, like, for example, the last block ID that was allocated. So maybe during the hackathon, or the discussion afterwards, we can kick around...

B
...some ideas of what could be done there, but I don't think it just works, that's for sure.
E
Okay,
second
question:
have
you
in
mind
optimization
for
the
use
case
when
we
have
our
own
implementation
of
s3?
Let's
say
min
io
when
there
is
no
impact
on
the
amount
of
the
api
calls
you
are
allowed
to.
E
D
We
got
a
related
question
on
youtube,
actually,
which
is
do
we
have
any
tunables
around
controlling
api
calls
to
limit
costs
for
the
services
that
provide
that.
B
Yeah,
we
haven't
thought
extensively
about
like
what
exactly
you
would
want
to
optimize
differently.
We
we
do
have
extensive
tunables
to
control
all
the
stuff,
so,
like
average
block
size
average
object
size.
You
know
the
reclaiming
how
all
that
stuff
works.
B
So
I
hopefully
you
know
the
fact
that
s3
is
like
more
restrictive
in
terms
of
like
you
have
to
pay
for
every
little
thing
means
that
it'll
work,
fine,
you
know,
and
we
you
know
kind
of
tried
to
make
it
work
well
in
that
environment
it
should
work
well
in
less
restrictive
environments
as
well.
If
there's
some
other
cloud
provider
that
like
charges
you
for
something
that
amazon
doesn't
charge
you
for,
then
that
would
be
interesting
to
know
about
so
that
we
could
think
about.
D
A
And definitely with MinIO, we've tried that out and it works just fine. But yeah, there might be cases where we could make a different design decision, knowing that we don't have to pay the per-egress/ingress, per-operation type costs. That's definitely something to think about.
D
Yeah
and
then
a
related
question
that
we
also
got
on
youtube,
was
a
number
of
people
were
asking
about
stuff
at
the
v?
Dev
object
store
layer,
you
know:
do
we
support
clouds
other
than
s3?
D
Would
it
be
possible
to
have
multiple
clouds
backing
a
pool
in
like
a
mirroring
configuration,
and
is
it
possible
to
have
disks
combined
with
object
storage
as
the
back
end,
and
I
answered
all
those
there,
but
I
figured
I'd
just
repeat
the
answers
here
right
now.
We
support
any
cloud
that
use
provides
the
s3
object,
storage,
api,
which
is
a
number
of
them,
but
we
do
plan
to
add
support
for
a
few
more
things
in
the
future.
D
Like
azure,
I
think,
has
their
own
api
and
we
would
like
to
add
support
for
that
as
well.
Currently,
there's
no
capability
to
mirror
between
multiple
object
stores
as
the
back
end,
but
again,
this
is
it's
not
precluded
at
all
by
the
design.
It's
just
something
we
would
need
to
actually
like
work
on
and
implement
and
then
the
question
around
having
disks
combined
with
object,
storage
and
like
possibly
as
a
tiering
solution.
D
It's
an
interesting
idea
like
having
disks
as
your
primary
store
and
then,
following
you
know,
migrating
data
back
to
object,
store
to
save
space
and
stuff,
like
that.
We
haven't
done
any
design
work
on
it,
but
it
is
definitely
something
that
would
be
interesting
to
work
on
in
the
future.
I
think.
B
Cache
talk
yeah,
it's
kind
of
related
to
that
yeah.
We
we,
you
know
we
kind
of
need
multiple
tiers,
but
we're
you
know
we
designed
it
as
a
cache,
rather
than
a
tiered
kind
of
thing
where
you
know,
tiering
usually
means
that
the
data
might
live
in
exactly
one
place
and
you
can
like
move
it
from
here
to
there
and
it
doesn't
exist
here
anymore,
which
we
didn't
see
as
a
requirement
for
our
use
cases.
B
B
Rahul
asked
how
how
does
reclaim
happen
if
you're,
using
like
s3,
tiering
or
lifecycle
policies,
so
the
free
space
reclaiming
that
we're
doing
is
doesn't
interact
with
those
s3
level
things
so,
basically
like
we're
kind
of
assuming
that
only
one
copy
of
an
object
is
retained
in
terms
of
life
in
terms
of
like,
if
you're,
using
s3,
like
lifecycle,
like
keeping
you
know,
keeping
old
versions
of
it
versioning,
we
don't
take
advantage
of
that
versioning
and
we
kind
of
assume
that
you
don't
have
it.
B
You
could
probably
use
that
to
like
roll
back.
Your
pool
really
far
or
something
but
we
haven't,
we
haven't
tested
that
out
in
terms
of
tiering
the
tiering.
I
mean
it's
gonna
kind
of
just
work
and
do
what
it
does
like
moving
stuff
from
the
from
the
s3,
like
normal
tier
to
glacier
or
whatever.
B
Of
course,
if
you
know
if
we
go
and
do
reclaim
and
that
like
needs
to
read
some
old
object,
then
that's
going
to
bring
it
back
from
the
glacier
tier
to
the
main
tier,
so
you'd
probably
want
to
configure
like
the
reclaim
and
the
like
movement
policies
such
that,
like
normally
you
wouldn't
like
normally
stuff,
wouldn't
get
moved
into
the
glacier
tier
until
like
the
freeze
had
already
most
of
the
freeze
had
already
been
processed.
B
I
suspect
that
it'll
kind
of
gen
like
if
you
have
those
kind
of
workloads
that
for
which
the
turing
is
useful,
then
it'll
probably
just
work.
Fine
anyways,
because
you're,
probably
using
big
files
and
big
blocks
and
the
reclaim
is
really
not
an
issue
in
those
cases.
In
theory,
I
think
we
probably
could
add
some
smarts
to
say.
Like
you
know,
before
doing
you
know,
you
have
different
reclaimed
policies
for
stuff,
that's
in
the
glacier
tier
versus
the
normal
tier
and
like
we
can
query
and
find
out.
B
F
I suppose I could. I don't really know much about S3, so I don't know if there's a particular limit to it, but in all your examples you had the object IDs counting up; in your example there were four, five, and six. And then you have the object mapping to take the block number and map it to which object ID it goes to. Is there any particular reason you couldn't, instead of using four, five, six, use the lowest block ID, like 123, 345, 346, and so on?
B
I think that might work... until you do object consolidation.

B
Because... let's see.

B
Yeah, since we're always consolidating to the left, it might work. That's interesting.
B
I actually had the design that way in an earlier version, and now I'm trying to remember; I'll have to look through my notes and see why I changed it. It might be that I just changed it for ease of comprehension, because it's a little bit hard to wrap your head around: oh, there's this object whose ID is 346, and that tells me something about its contents, versus this more abstract...

B
...more cleanly abstracted layering. But yeah, we should go look at that again and see if there's a memory savings we could get by doing that. Okay.
D
There
was
one
other
question
from
youtube,
which
was
about
the
mmp
stuff,
which
is,
would
it
be
possible
to
have
one
modifier
and
also
have
readers
operating
in
parallel
with
the
modifier
and
the
answer
that
is
yes,
except
that
once
blocks
start
to
get
reclaimed,
you
can
run
into
some
issues,
so
you
could
potentially
use
something
like
checkpoints
for
that
which
would
prevent
the
reclaims
from
happening
and
should
make
it
possible
to
do
reads
of
things
like
snapshots
safely.
D
Even
if
you
know
the
active
system
could
be
destroying
those
snapshots
or
whatever
and
checkpointing
is
implemented
for
the
object
store
as
well.
We
do
have
that
working.
B
Yeah,
the
behavior
would
be
like
at
one
level
kind
of
similar
to
having
a
block
based
pool
that
has
you
know
one
writer
and
multiple
readers
where
it's
like
yeah,
like
as
long
as
the
reader
starts
from
a
given
snapshot
from
a
given
txg
and
no
and
those
blocks
aren't
freed
or
overwritten.
Then
it'll
continue
to
work,
and
you
know
you
could
use
checkpoints
to
ensure
that
that's
the
case
where
you
say
like
okay,
like
multiple
readers,
open
the
pool
from
the
checkpoint
that
that's
totally
safe.
B
But
if
you
want
a
more
like
arbitrary
thing,
then
you
might
get
check
some
errors
or
on
critical
metadata
and
things
might
blow
up
for
object
store
it's
kind
of
similar,
but
it's
a
little
bit
safer
because
you
know
you
aren't
going
to
get
like
check
some
errors
on
weird
things.
You
do.
You
know
the
only
error.
That's
really
possible
is
like
I
read
I
I'm
doing
a
read.
The
block
should
be
in
this
object,
but
it's
either
not
in
that
object
or
the
object.
Id
doesn't
exist
anymore.
B
So
it
should
be
like
a
little
bit
easier
to
handle
that
error.
Maybe
but
you
would
still
have
that
problem
in
the
kind
of
general
case
if
you
weren't
using
the
checkpoint,
like
pulsing.