From YouTube: Metaslab Allocation Performance by Paul Dagnelie
Description
From the 2019 OpenZFS Developer Summit
slides: https://drive.google.com/open?id=1he9APxNsQutYBzJHCjkbveUCmHPlbDhK
So our next presenter is Paul Dagnelie, who is going to present about metaslab allocation performance. Paul has been making several contributions over the years; you might know him especially for ZFS send and receive and his work on redacted datasets. Over the last year Paul has focused more on performance work, so he will talk about some of his findings in this presentation. So please welcome Paul.
I've been with Delphix for about six and a half years now. To give a little context about Delphix: we use ZFS for database virtualization, and we work with a lot of high-performance systems. We have a lot of long-running systems with lots of filesystems, lots of clones, and lots of snapshots.
So we tend to run into a lot of situations where you encounter some of the extreme performance characteristics of ZFS, and hit the places where it starts to break down. So why am I here? What led to the work that we did, and to me giving this talk? Well, late last year we received a number of customer-originated complaints that got escalated to engineering.
So, synchronous writes versus asynchronous writes. Normal writes, the writes that you think of when you're writing to a filesystem, are asynchronous. The idea there is: you hand the kernel some data to write, a file to write it to, and an offset to write it at, and it gives you back control immediately, but it'll do the write at some later point. In ZFS we batch up a bunch of these writes into things called transaction groups, or TXGs, and asynchronous writes are mostly latency insensitive.
Other writes are called synchronous writes, and they are done a little bit differently. When you tell the kernel "write this to this place," you say "don't come back until you're done," and so the kernel will not return control to you until the write is actually persisted on disk somewhere. The advantage of this is that if there's a power failure, or your program crashes, or your system crashes, you are guaranteed that if that function returned, your write is persisted on disk.
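As a rough illustration of the difference from the application's point of view (a minimal sketch, not from the talk; the function names are illustrative), an asynchronous write just hands the data to the kernel, while a synchronous write also waits for it to reach stable storage, which on ZFS is exactly the path that goes through the intent log described below:

    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Asynchronous write: returns as soon as the kernel has the data; ZFS
     * persists it later as part of a transaction group (TXG). */
    ssize_t
    async_write(int fd, const void *buf, size_t len, off_t off)
    {
        return (pwrite(fd, buf, len, off));
    }

    /* Synchronous write: does not report success until the data is persisted,
     * which on ZFS means the ZIL has committed it to stable storage. */
    ssize_t
    sync_write(int fd, const void *buf, size_t len, off_t off)
    {
        ssize_t n = pwrite(fd, buf, len, off);
        if (n >= 0 && fsync(fd) != 0)
            return (-1);
        return (n);
    }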
So it's good for reliability purposes, and because of this, the filesystem wants to issue these writes as fast as it possibly can, because they're extremely latency sensitive: every millisecond you spend waiting for the write to be issued is a millisecond the program isn't queueing up another write, so your latency and your IOPS are directly tied to each other. ZFS has a special system that it uses to make synchronous writes more efficient, called the ZFS intent log, or the ZIL.
The ZIL is a system that pre-allocates, or specially allocates, a bunch of blocks that your writes are immediately placed into whenever you issue a synchronous write. It bypasses parts of the normal allocation path and uses other parts of it; I'll talk a little bit more about that later. Another important concept to understand is the idea of a separate intent log device, or slog device.
A slog device is a separate disk that you add to your pool and designate specially, and the ZIL will try to do its allocations from that device. This is something you see a lot of confusion and misnomers about in literature online: the difference between the ZIL and the slog. The slog is a kind of device; the ZIL is the ZFS subsystem and the blocks that it manages. So hopefully, when you're reading or writing about those in the future, you can understand the terminology a little more clearly.
So now let's talk a little bit about allocation in ZFS: how it works and what its principles are. An important concept in allocation, in filesystems as in life, is that everything has to go somewhere. All your files have to end up somewhere on disk. All your metadata ends up somewhere on disk: all your indirect blocks and filesystem metadata.
B
Everything
has
to
have
a
spot
on
disk
and
ZFS,
as
probably
most
of
you
know,
is
what's
called
a
copy-on-write
filesystem,
but
for
those
of
you
who
are
not
super
familiar
with
the
concept,
most
file
systems
are
what's
sort
of
referred
to
as
an
update
in
place
file
system.
When
you
modify
a
file,
the
writes
happen
to
the
same
place.
The
data
was
already
living.
Zfs
is
different.
Every
time
you
update
some
data,
it
writes
it
to
a
new
location.
One of the advantages of copy-on-write is that we can get transparent compression. Because you're not always writing to the same spot, you can write the new data at a different size, which means that if the data compresses better you can use a smaller block, or if it doesn't compress as well you can use a bigger block, even if your data isn't all exactly the same compressibility.
compressibility,
so
that
makes
allocation
a
little
trickier
and
ZFS
than
it
would
be
in
other
file
systems
beyond
the
fact
that
it's
a
copy-on-write
file
system
and
an
important
concept
to
understand
related
to
that
is
the
difference
between
a
record
size
and
a
block
size.
Your
file
is
broken
up
into,
has
a
record
size
associated
with,
and
by
default
it's
128
kilobytes
and
the
idea
there
is
just.
This
is
sort
of
the
the
logical
chunk
of
data
for
a
file.
B
Your
big
files
are
like
it's
128,
kilobytes,
yunk
and
then
another
and
then
another
and
another.
The
block
size
is
the
actual
physical
size
of
that
data
on
disk.
So
if
you
have
compression
enabled
or
various
other
features
enabled,
the
actual
block
can
be
much
smaller
than
that.
You
know
your
hundred,
twenty-eight
block
could
be
honored,
kilobytes
or
60
kilobytes
or
three
kilobytes,
depending
on
how
well
it
compresses.
So what is a metaslab? Many of you, if you work in ZFS, have probably heard this term before, but a metaslab really is just a name for a region of a disk. The name sort of implies something about slab allocators; that was historically true, but it is not the case anymore. A metaslab is really just a part of the disk and the data structures associated with it. Metaslabs are about 16 gigabytes in size, and there are usually about 200 of them on a disk, but that varies with the size of the disk.
A metaslab's real job is to track the allocated and free space within that disk region, and this ties into a concept that's really critical to metaslabs, which is the distinction between a loaded metaslab and an unloaded metaslab. When you boot the system, we know how many metaslabs there are and how many disks there are, but we don't know which spots inside of each metaslab you can actually use for allocations, that is, which spots on disk are free.
That information is stored on disk in a data structure called the space map, which I'm not going to go into too much detail about today; I'm just mentioning it for context. And so when it comes time to actually do an allocation, we have to do what's called loading the metaslab. Each metaslab also has a weight, and the idea there is
basically an attempt to distill the overall quality of the metaslab for allocations into a single number, so we can use it to quickly compare which metaslab is the best one, and we always have that weight tracked, whether a metaslab is loaded or not. One thing that, until recently, we only had while a metaslab was loaded, and lost when it was unloaded, was the size of the largest free segment.
So if you have a one-megabyte free segment in your metaslab, we would keep that information around, or if you had a 64-kilobyte free segment, we'd keep that around, and if you then tried to do a 65-kilobyte allocation, we would immediately know that the metaslab wasn't suitable and we wouldn't go looking through its trees for a place to put your data.
There are many other space trees beyond just the allocatable space tree. We're not going to talk about those too much, but they do exist, and I can talk about them if people have questions.
So when it comes time to load a metaslab into memory, the data structure we load it into is called a range tree. For the purposes of this talk, a range tree is used as an in-memory representation of the free space; it can represent any kind of space, but we're going to talk about free-space range trees here. They're built on top of (until recently) binary search trees, and there are actually two trees as part of every range tree: the offset-sorted tree and the size-sorted tree. So I'm going to use this example metaslab to help you understand what's going on.
Consider this example metaslab, with spots 0 through 11. The colors are a bit washed out on the projector, but there are three regions here where you can allocate data, three free spaces: spot 0, spot 3, and spots 5 through 7. All of those are places you could put data if you needed to do an allocation. The offset-sorted tree is the one on the left; as you can see, it's just sorted in the order of where these segments are physically laid out on the metaslab, and this tree is useful for coalescing adjacent regions. So, for example, if I came along and freed spot 4 (I had some data written there and I freed it), you could use the offset-sorted tree to determine that spot 3, the newly freed spot 4, and spots 5 through 7 can all combine into one region, and so you could quickly use that tree to do that coalescing operation.
The size-sorted tree, on the other hand, is useful for finding a place to put your data, which is kind of what this is all about. If you say, "okay, I have a two-wide piece of data that I need to allocate," you'd go into the tree and see that spots 5 to 7 are three wide, and you can use two of those to do your allocation. So it's a very efficient way to find places to allocate data, and also to merge things together when you're doing frees.
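To make the two trees concrete, here is a rough sketch of the idea (simplified, illustrative C; the field and type names are not the actual OpenZFS definitions, which live in range_tree.h and carry more state):

    #include <stdint.h>

    /* One free region on the metaslab: [rs_start, rs_end). */
    typedef struct range_seg {
        uint64_t rs_start;
        uint64_t rs_end;
    } range_seg_t;

    /* Opaque ordered-tree handle; in ZFS this was an AVL tree and is now a B-tree. */
    typedef struct ordered_tree ordered_tree_t;

    /*
     * A range tree indexes every free segment two ways:
     *  - by offset: when a region is freed, look up its neighbors and merge
     *    adjacent segments into one larger segment;
     *  - by size:   when allocating, find a free segment at least as large
     *    as the requested size.
     */
    typedef struct range_tree {
        ordered_tree_t *rt_by_offset;   /* sorted by rs_start */
        ordered_tree_t *rt_by_size;     /* sorted by (length, start) */
    } range_tree_t;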
So now I'm going to talk a little bit about the actual allocation process, exactly what steps we go through. For each allocation, the first thing you do is pick a disk. There are a lot of different mechanisms that go into picking disks, but none of them are super relevant to this talk.
There are thresholds for how much I/O you already have outstanding to one disk, and lots of things like that, but for our purposes you pick a disk; it's basically round-robin. Once you've selected a disk, you need to pick a good-seeming metaslab, and the primary factor that we use there is the weight that I talked about earlier: the metaslab with the highest weight is the one that we think is the best one, and so we try to use that one first.
If the metaslab with the highest weight isn't loaded, then you need to load it: take the space map off of the disk, read it into memory, and build the range tree out of it. Once you've picked a metaslab, and loaded it if necessary, you then need to find a place in that metaslab to do your allocation. There are a lot of different strategies that can be used to do this. The more naive ones are things like first fit, where you just go through the offset tree in order until you find a place your allocation fits and put it there, or best fit, where you search the size-sorted tree for something that's exactly the size of the allocation you want and put it there.
But if you don't find something close, ZFS will now actually go to the size-sorted tree and find a new place to start doing allocations; that was a change that was made somewhat recently. Once you've picked a spot, however you decided to do it, you claim that spot: you remove that space from the allocatable range tree, you modify the on-disk data structure, and you move on with your life.
If there's no spot in the metaslab for your allocation, then you need to go back to the top and pick a new metaslab. You go to the one with the next highest weight and you repeat the process, and you keep going until you find a spot. If you don't find a spot, if you've gone through all the metaslabs on your disk and none of them worked, you pick a new disk and repeat the process, and you keep going until you've gone through all the disks in the system.
If you've tried literally every metaslab on every disk and you can't find a place to put your allocation, you do what's called ganging. Ganging basically means you take your allocation, split it into smaller pieces, and allocate each of them separately. This is really a measure of last resort in ZFS, because it can cause fragmentation to absolutely skyrocket, so we really try to avoid it whenever possible.
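Putting those steps together, the overall flow looks roughly like this (a simplified sketch of the loop described above, not the actual OpenZFS code; every type and helper here, such as pick_next_vdev(), best_metaslab(), and find_free_space(), is a hypothetical stand-in, and the real logic in metaslab.c handles far more detail):

    int
    allocate_block(pool_t *pool, uint64_t size, uint64_t *offset_out)
    {
        for (int tried = 0; tried < pool->p_nvdevs; tried++) {
            vdev_t *vd = pick_next_vdev(pool);          /* roughly round-robin */

            /* Walk this vdev's metaslabs from highest weight downward. */
            for (metaslab_t *ms = best_metaslab(vd); ms != NULL;
                ms = next_best_metaslab(vd, ms)) {
                if (!ms->ms_loaded && metaslab_load(ms) != 0)
                    continue;           /* expensive: reads the space map */
                if (find_free_space(ms, size, offset_out) == 0) {
                    claim_space(ms, *offset_out, size);
                    return (0);         /* done */
                }
            }
        }
        /* Nothing fit anywhere: last resort, gang the allocation. */
        return (gang_allocate(pool, size, offset_out));
    }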
So there's a lot of stuff here, and I made that sound like a lot of different steps; they're all important and there's a lot going on. But in terms of performance, that list should really look something more like this: metaslab loading is far and away the most expensive of these operations. This diagram is not to scale.
Picking a good metaslab can take, you know, 100 microseconds; finding a place in the metaslab, another few hundred microseconds; loading a metaslab can take a second on really big systems, like an actual, real second. It is catastrophically slow in terms of performance. So really, if you're going through this loop more than once, you're having a very bad day; you do not want to be in that situation.
For synchronous writes, things work a little bit differently. Synchronous writes go directly into the ZIL rather than the normal allocation path, and the ZIL uses these log blocks, 128 kilobytes by default. If you come in with, say, an 8-kilobyte write, it'll just drop your data into that log block directly and move on. It still has to do allocations to allocate these log blocks, but it'll try to use the slog devices if you have them available, and those slog devices aren't really used for other allocations, so there's space there that you can use to do these 128-kilobyte allocations.
The actual allocation process for an intent log block is the same as the process for an async block: you go through that same loop, but it'll try to use log devices if it can. Then later on, after your data has been synced to disk during subsequent TXGs, ZFS will actually migrate your data from the ZIL block to a normal resting place on disk. This is why your slog device doesn't just fill up with synchronous writes: the data all gets migrated to the body of your pool; the ZIL blocks are not intended as a permanent resting place. So that's my introduction to how allocation works in ZFS and what the data structures are, and now I'm going to talk about what happened to us.
If you look at this flame graph, which I understand is a little bit hard to read, the key takeaway is that this is a heavy synchronous write workload and we're spending a lot of time in the allocation code. If you add it up, we're spending literally 50 percent of our time in the allocation code, which is not really the place you want to be; you want to be spending your time writing things to disk and compressing things and checksumming things.
We're spending so much time allocating that we're losing significant performance. This is the sort of flame graph we were seeing on our customer systems, and when we thought about it and worked out what the problems were, we realized that we were in the perfect place to run into this problem, because we have relatively small record sizes.
We use an 8-kilobyte record size instead of the default 128 kilobytes, and this is common if you're storing things like VMDKs, or using zvols, or storing database files, because all of these are trying to have small atomic units of data; a VMDK is pretending to be a disk, where sectors are 512 bytes or 4096 bytes. And because we have this small record size and we have compression enabled, you have lots of different small-sized blocks, and you're allocating and freeing them very rapidly.
So we're getting more and more CPU bottlenecked as time goes on. These issues all combined together really started hitting us pretty hard; it was a rough time to be us. So we started looking at the specific problems we were seeing. The first one was that for some allocations we were calling metaslab_load a dozen times: for a single allocation we were loading many metaslabs and, as I said, that can take a long time, so we were waiting something like 10 seconds to do a single allocation.
That's not good. You don't want seconds per I/O, you want I/Os per second; you want the numerator and the denominator in the right places. We were wasting a huge amount of CPU to do this, we were loading all these metaslabs, which takes a lot of memory (I'll talk about that a little later), and we were spending a lot of I/Os to read these space maps from disk. So why were we loading all these metaslabs that we can't use? It's a huge waste of our time. Well, when we have a metaslab loaded, we have the weight that I talked about, that attempt to distill the quality down into a single integer for comparison purposes.
Previously this was based on just how much space was in the metaslab, weighted a little bit based on the fragmentation, but now the actual weight is based on the largest segment size you have available to allocate. If you want a bunch of details about this, you can see Matt's talk at BSDCan in 2016. The way it works now is that you take the largest bucket of free segments.
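As a rough illustration of that idea (a conceptual sketch only; the real segment-based weight in metaslab.c also encodes a count of segments in the bucket and several flag bits):

    #include <stdint.h>

    /*
     * Conceptual sketch: derive a comparable "weight" from a histogram of free
     * segment sizes, where hist[i] counts free segments whose size falls in the
     * bucket [2^i, 2^(i+1)). The metaslab whose largest non-empty bucket is
     * biggest looks best for large allocations.
     */
    static uint64_t
    segment_weight(const uint64_t hist[64])
    {
        for (int i = 63; i >= 0; i--) {
            if (hist[i] != 0)
                return ((uint64_t)i);   /* index of the largest non-empty bucket */
        }
        return (0);                     /* no free space at all */
    }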
So this is a pretty good system, right? If you have a 32-kilobyte allocation and you see this, you think, "oh great, I can definitely satisfy my I/O." If you have a hundred-kilobyte allocation, you think, "okay, if these free segments are evenly distributed throughout this bucket, I can probably do my I/O; it'll probably be fine." But what if it turns out that actually all thousand of those free segments are exactly 64 kilobytes? Then you will load this metaslab for your hundred-kilobyte allocation and find that you can't do anything with it. You loaded this thing, you thought you could use it, and you can't, so you've just wasted that load. And that was happening to us really, really regularly.
So we solved this problem. When a metaslab is loaded, we avoid picking bad loaded metaslabs with this cached maximum free segment size that I talked about earlier. We didn't used to keep that value around when the metaslab was unloaded, because you can still free to an unloaded metaslab: you can only allocate from a loaded metaslab, but you can free to an unloaded one. That's something that has been in ZFS for a long time as a performance boost. But the key insight we had, when we were thinking about it, was:
if we just keep that value around and use it as a rough estimate for whether or not the metaslab can satisfy an allocation, then when you come in with an allocation that's smaller than that value, we know we can satisfy it. We might be able to satisfy larger ones too, but we know we can satisfy at least that much.
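A minimal sketch of that check (illustrative names; in OpenZFS the cached value is the metaslab's ms_max_size, and, as described below, it is eventually aged out rather than trusted forever):

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Does this metaslab look able to hold an allocation of 'asize' bytes,
     * without loading it? ms_max_size is the largest free segment size we
     * remembered (and now keep remembering after unload); frees that happen
     * while the metaslab is unloaded can only make the real value larger.
     */
    static bool
    metaslab_looks_usable(uint64_t ms_max_size, uint64_t asize)
    {
        return (asize <= ms_max_size);
    }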
There was a prototype project to do a more sophisticated approach, where you would calculate the expected value of the largest free segment size, and it would increase as more frees were done to the metaslab while it was unloaded. That project is still highly experimental, but the results were reasonably promising.
If you have questions about it, I can talk to you about it afterwards, or show you the results. But it turns out that this change we made, to just keep this value cached and stop trusting it after an hour, gives a huge I/O improvement: we got a 30 percent boost on illumos, where this change was initially developed, on a heavy read/write mix on a system where you couldn't actually keep all the metaslabs loaded in memory.
So great, we solved one of the problems; it made things a little better. But even with all of our loads actually satisfying at least one allocation, we were still loading a huge number of metaslabs and still spending a huge amount of our time on it, and the problem really was that fragmentation was just really bad. If your workload requires a certain number of ZIL blocks per second, and your best metaslab only has 50 of these 128-kilobyte regions, and the next only has 30,
and you have this nice curve downwards, then even if you were always picking the best metaslab you're still going to need to load multiple metaslabs per second, and when they take a second to load, that's going to put a pretty heavy damper on your workload. We really just didn't have enough buffers in each metaslab, so we needed either to load more metaslabs or to just keep more metaslabs loaded at all times.
Keeping more metaslabs loaded is sort of the simplest approach to this problem, and it turns out it works pretty well, but there's a very real cost, which is that when you have this range tree loaded in memory (I showed you the range tree structure before), each free segment, each range seg as I'll call them, costs 72 bytes of memory. We had some customers where their average free segment size was about three to three and a half kilobytes, and they had 100 terabytes of storage.
And so, if you try to load all of those range segments into memory for a hundred-terabyte pool, it takes 2.3 terabytes of RAM. That is not a good amount of RAM; it's a very bad amount of RAM. You have other things you want to use your RAM for, like user programs and the ARC, and we wouldn't have any left for that.
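The back-of-the-envelope arithmetic behind that figure, using the numbers from the talk (72 bytes per range segment, free segments averaging roughly 3 to 3.5 kilobytes):

    100 TB / ~3.5 KB per free segment   ~= roughly 30 to 35 billion range segments
    ~32 billion segments x 72 bytes     ~= about 2.3 TB of RAM for the range trees alone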
Customers would load so many metaslabs that their system would run out of memory, and then they couldn't unload any metaslabs, because until this point you could only unload during a TXG sync, and the system just hung because it couldn't find a place to do an allocation. The fix for that was slightly different from the stuff we're talking about today, but it illustrates how bad things were getting when trying to load all these metaslabs. So we needed to do something else.
We needed to fix these range trees; they were using too much memory. Until recently, the range trees were built on AVL trees, which are a very nice balanced binary search tree data structure. They're super easy to use and they're pretty performant, but the way they're implemented, they have this node structure that you embed in your data structure, which adds about 24 bytes to it, and since we had two trees, we had two of those for each range segment. So 66 percent of the memory in each range segment was devoted to these AVL nodes. That's a lot of memory; that's a lot of overhead.
In addition to that, every range segment is allocated separately, and to go back to that example (100 terabytes, three to three-and-a-half-kilobyte segments), that is around 35 billion segments. That's a lot of segments; regardless of anything else you have to do, think how much time you're spending in malloc allocating 35 billion segments. This is a flame graph that we had from a test system.
We were trying to load a very large pool; we were just loading all of its metaslabs into memory, and this took, I think, something like 30 or 40 seconds. We had customers where it would take 20 minutes to load all of their metaslabs into memory, between malloc and all the AVL tree operations you had to do. It was not good. So we came up with two fixes.
One of them is that we switched from an AVL tree to a B-tree, and I'll talk next about exactly what that means and how it works; and then we also made some other changes to the range segments to make them more efficient. So first, let's talk about B-trees. AVL trees are a binary search tree: they split twice at every level. B-trees are an n-ary tree: they split n times at every level, and it's actually a variable n.
The data is stored entirely in tree-controlled buffers: rather than having these little range segments that you allocate one by one, you're allocating these four-kilobyte tree nodes and storing the data in an array inside them directly. You can see the basic structure of a B-tree here: you have the root at the top.
The root is an array of elements, which are the squares, and it has a bunch of child pointers, and each element acts as a separator between two child pointers, just like it would in an AVL tree, but scaled way up sideways. It's a very simple concept. The root has children, and the children have children, and at the bottom level you just have leaves; leaves don't have any child pointers.
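Roughly, a node looks like this (an illustrative sketch, not the actual zfs_btree implementation, which keeps variable-size elements in a flat byte buffer and tracks counts and parent pointers as well):

    #include <stdint.h>

    #define BTREE_FANOUT    128     /* illustrative fan-out */

    /*
     * Illustrative B-tree node: a sorted array of elements stored inline in the
     * node's buffer, plus (for core nodes) one more child pointer than elements.
     * Each element separates the two subtrees on either side of it; leaf nodes
     * are the same minus the child pointers.
     */
    typedef struct btree_node {
        struct btree_node *bn_children[BTREE_FANOUT + 1];  /* NULL in leaves */
        uint16_t           bn_nelems;                       /* elements in use */
        uint64_t           bn_elems[BTREE_FANOUT];          /* sorted payload */
    } btree_node_t;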
One thing you could do with the AVL tree, if you were, say, coalescing with an existing segment, was just find that one memory allocation and modify the memory, without having to remove it and reinsert it in the tree, because it wasn't changing its sort order. You can still do that with B-trees: you can still get a pointer to the actual buffer and modify that memory directly.
So we made those changes; we switched away from the AVL tree to the B-tree, and that was a good start. Removing the AVL nodes from the range segment, which was 72 bytes to start with, cut out 48 bytes, so 66 percent of the memory is gone. Great start, way to go. The next thing we thought about was this rs_fill entry, which was added to enable sorted scrubs and resilvers.
So it's super useful for those purposes, but you don't need it for the metaslab free trees, right? They don't use this field; it's always zero. So we realized that if we cut that out, we could save another 8 bytes of memory per segment, and the way we did that was to teach the range trees how to deal with multiple different kinds of range segments.
There are range segments that do have this fill entry and range segments that don't, and the range tree uses the right one dynamically, based on things like whether or not you specify an allowable gap size for the range tree. When you create it, it decides which of these to use, and it chooses the most efficient one.
At this point we're down to 16 bytes: a 64-bit start offset and a 64-bit end offset. But you don't actually need all of that, because you're never allocating a chunk smaller than a disk sector. It turns out that if you just start counting from the start of the metaslab rather than the start of the disk, and you count in sectors instead of bytes, basically every disk in existence can be addressed using a 32-bit integer instead of a 64-bit integer. So you just make that change and you save another 8 bytes. This diagram, unlike the earlier one, is to scale: the left side is what it was before, and the right side is what it is afterwards, so it's a nice shrink.
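In rough terms, the shrink looks like this (a simplified sketch of the before and after layouts; the corresponding OpenZFS types are, I believe, range_seg64_t and range_seg32_t, and the real 32-bit offsets are shifted sector counts relative to the region the tree covers):

    #include <stdint.h>

    /* Before: every free segment carried two full 64-bit byte offsets, a fill
     * word, and two embedded AVL nodes of roughly 24 bytes each (shown here only
     * as comments), about 72 bytes in total. */
    typedef struct old_range_seg {
        /* avl_node_t rs_node;      ~24 bytes, offset-sorted tree */
        /* avl_node_t rs_pp_node;   ~24 bytes, size-sorted tree   */
        uint64_t rs_start;          /* byte offset from the start of the disk */
        uint64_t rs_end;
        uint64_t rs_fill;           /* only needed for sorted scrub/resilver */
    } old_range_seg_t;

    /* After: the tree nodes own the memory, the fill word is dropped for the
     * metaslab free trees, and offsets are stored as 32-bit sector counts
     * relative to the start of the metaslab: 8 bytes per segment. */
    typedef struct new_range_seg {
        uint32_t rs_start;          /* sectors from the start of the metaslab */
        uint32_t rs_end;
    } new_range_seg_t;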
So what was the overall effect of this? Well, we took the segment from 72 bytes to 8 bytes in most cases, for these free trees. But you're storing two trees, because you still have the size-sorted tree and the offset-sorted tree, and they don't get to share any space with each other with B-trees; and, as I said, these nodes are not always full, they're between half full and completely full.
So, already a really good start: a third as much memory is a nice win. But we realized we could do a little bit better. Think back to when I talked about the cursor-based allocation scheme, where you remember where you did the last allocation and then look forward a little way to find your next place, and if you don't find anything nearby, you go to the size-sorted tree and find something there. Well, if you're doing a small allocation,
you would usually just find a place to put it using the first-fit algorithm, and only one percent of the time did you actually go to the size-sorted tree. But on the other hand, if you look at the actual range trees, something like 90 percent of all of the segments are these really, really small regions, less than 16 kilobytes, if you have a badly fragmented system. So we just don't store those really small segments in the size-sorted tree, and it turns out
that for an allocation you'll probably use the rest of that space anyway, and if not, maybe you get a small increase in fragmentation. But the results are pretty nice, because now, instead of 30 percent of the original memory usage, it's 16 percent, so we're using about one-sixth as much memory as before to store basically the exact same information. In addition to that, loading the space map into the range tree takes 60 percent as much CPU, so we cut 40 percent of the CPU off just by switching data structures and not loading all of the small segments into the size-sorted tree.
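A minimal sketch of that filtering idea (illustrative only; in OpenZFS the cutoff is controlled by a tunable along the lines of metaslab_by_size_min_shift, and the small segments of course stay in the offset-sorted tree so frees still coalesce correctly):

    #include <stdbool.h>
    #include <stdint.h>

    #define SIZE_TREE_MIN_SHIFT     14      /* 16 KiB, illustrative cutoff */

    /* Only segments at least this large are worth indexing by size; the
     * offset-sorted tree still tracks every free segment. */
    static bool
    worth_indexing_by_size(uint64_t seg_len)
    {
        return (seg_len >= ((uint64_t)1 << SIZE_TREE_MIN_SHIFT));
    }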
So that's again a really significant improvement. And one improvement that we weren't even thinking about, but got sort of by accident, is metaslab unloading. In the old model, when you unloaded a metaslab and freed all these range segments, you were doing all these tiny frees to lots of different pages in your allocator, and you would very rarely actually empty one of the pages in your allocator, so that memory didn't really come back; now that the segments live in a handful of tree node buffers, unloading a metaslab returns the memory much more effectively.
That was a benefit we didn't even know we were going to get. So we've made huge progress, we've made great strides: we're loading metaslabs better, we're keeping more metaslabs loaded, and we're using less memory to do it. Everything's coming up roses. But it's still not working, and the problem, it turns out, is that there just weren't enough 128-kilobyte blocks.
The system was so low on them that we just ran out, and we had to force TXGs to happen faster and faster so that we could reclaim these blocks and do more allocations. We had lots of places to actually put the data, we had lots of small segments we could have used if we were doing the writes asynchronously, but we didn't have anywhere to put these 128-kilobyte blocks; the fragmentation was just too high.
This is sort of the thing a slog device is kind of intended for, right? It gives you this nice pristine space to do these allocations from. But there are some problems with it. One, it requires some logistical overhead: you have to actually go and manage all the pools, add these disks, and set them up correctly. The other thing is that it creates a problematic performance bottleneck, because a lot of our customers already have
all-SSD pools (it looks like the slide is coming back). If you were to just attach another SSD and use it as your slog device, now all of your synchronous writes are hitting one disk instead of hitting all your different disks, and it becomes this really big bottleneck. And in a lot of cases they didn't have faster disks available, especially if you're in something like the cloud, where you already have all the fastest disks.
The only faster disks you can get are ephemeral, and if your slog device is ephemeral, you start losing data, and that's not good. So we really liked the idea of the slog, but we needed some way to alleviate these issues, and that led us to the embedded slog project. The idea here is pretty straightforward: you pick the best metaslab on each disk, best here meaning the one with the most free space, and then you make that a little mini slog. Each disk has its own little slog.
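Conceptually, the selection is as simple as it sounds; a rough sketch (hypothetical types and helpers, not the actual implementation, which also re-evaluates the choice as the pool changes over time):

    #include <stdint.h>

    /* Hypothetical stand-ins for the real structures. */
    typedef struct metaslab { uint64_t ms_free_space; int ms_embedded_log; } metaslab_t;
    typedef struct vdev { int vd_ms_count; metaslab_t *vd_ms; } vdev_t;

    /*
     * Pick the metaslab with the most free space on this vdev and mark it as the
     * embedded log metaslab: ZIL (log block) allocations will prefer it, and
     * normal allocations will avoid it unless there is no other choice.
     */
    static void
    choose_embedded_log_metaslab(vdev_t *vd)
    {
        metaslab_t *best = &vd->vd_ms[0];
        for (int i = 1; i < vd->vd_ms_count; i++) {
            if (vd->vd_ms[i].ms_free_space > best->ms_free_space)
                best = &vd->vd_ms[i];
        }
        best->ms_embedded_log = 1;
    }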
One of the nice things about this change is that it's both forwards and backwards compatible. If you take a pool that was created with this code running and import it on an old system, there's no on-disk change; it's just a priority preference in the allocation process, so it will import fine on the old system.
If you take a pool created on an old system and import it on a system with this feature, it won't be perfect, because all of your metaslabs are going to be somewhat dirty, but it'll pick the best one, and over time, as frees get issued to that metaslab, there will be more and more of these 128-kilobyte segments that you can use to do these slog allocations. So it just improves naturally over time as your system ages.
We'll still use this metaslab for regular allocations if we have to, but when we need to do ZIL allocations we'll try to use it for those. And it turns out this gives a really, really substantial performance win: we would get a 40 to 50 percent increase in IOPS on a random sync write workload in situations where fragmentation was really high. So it was a really, really nice improvement for a really conceptually simple idea: just having this little slog on every single one of your disks.
So, the current status of these projects: the load and unload changes, the max size change, keeping more metaslabs loaded, and some other changes involving parallel allocation and things like that are all in ZoL master. The B-tree and range tree changes also landed in ZoL master, or OpenZFS master I guess, not that long ago. Embedded slog is not yet upstream.
If you want to look at the code, it is in the Delphix ZFS repository. A related project that I didn't talk about, but which is also interesting in the sync write space, is the log spacemap project; if you want more details about that, you can look at Serapheim's talk from the 2017 dev summit. That's actually in ZoL 0.8, so you're already getting performance wins from it if you're running a recent release. So yeah, we managed to resolve a lot of problems and make a lot of things better.
So, when you're allocating in the slog: ZFS will try to go to the slog first to do these ZIL block allocations, but it uses the same basic allocation method; the algorithm itself is the same. So you'll still run into some of these problems if you have sufficiently high sync write workloads, and some of these changes are still beneficial even on plain slog disks, but it takes a lot more work to hit them.
The question was: should we allocate things differently on slogs, because they have this different workload? It's definitely possible that if we had a separate slog allocation policy, we could get improved performance. I haven't done any research on it, but because you're more confident that the disk is just going to have a lot of space available,
you could be more naive about it, like just doing first fit or just doing best fit, choosing a faster algorithm. But the costs of the actual picking process are not that high; as I explained, the costs are really very heavily in loading metaslabs. So it would maybe give you a little bit of CPU savings, but it probably wouldn't be that significant. George, did you ever...?
Yeah, so the question was: did the change to the new segment-based weighting algorithm exacerbate any of these problems? It's actually the opposite. With the old method, you could have a disk where, to take an extreme example, every free segment was four kilobytes and every allocated segment was also four kilobytes, and it would still look attractive because it had plenty of free space in total.
So in practice we found that this algorithm did improve things substantially. There was another idea we came up with, to do an exponential weighting scheme where it would consider not just the highest bucket but also the next highest bucket at reduced weight, and the next one after that. This idea seemed promising, but we ran into the problem that I talked about, where even if we were picking the perfect metaslab every time, there still weren't enough big segments to keep up.
So, the question was about earlier, when I was saying that it would insert these eight-kilobyte writes into these 128-kilobyte ZIL blocks and then just move on. I misspoke slightly: it does actually keep using the same ZIL block, trying to fill up all that space with more and more writes. Also, synchronous writes aren't the only things that end up in ZIL blocks; some other things end up in there as well. But it will put multiple writes into the same ZIL block to use the space.
You can change the ZIL block sizes; one of the workarounds we had temporarily was to change these from 128 kilobytes to 36 kilobytes, which alleviated the problem but didn't completely solve it, because eventually the customers' free segment sizes just got forced down even farther. So it was really just a stopgap while we figured out some of these better solutions.
For those watching the video later: Matt was basically pointing out that I fibbed slightly when making this presentation. I told it as one single story of one single case, but in practice this is really a spectrum of problems. Some customers were fine after the first fix, and some of them really needed all of these approaches to alleviate their problem, but it flows a little better as a story if you tell it as one single use case.
Yeah, George was saying that, because loading them takes so much less memory and because we keep so many more of them loaded, if you go back to that allocation loop you can see that we can make much better choices throughout the entire process, and it really just makes the whole thing significantly easier to work with. Tom?
In 0.8 there was no cap; it could use as much memory as it wanted, yeah. It would just use as much memory as it needed to load the metaslabs it wanted to load. So we did add that cap. It's a tunable that you can change; ZoL, I think, doesn't have a good way to change it at run time, but on illumos and such you can just change it whenever and the system will behave accordingly; it doesn't have to be done at boot time.
So the question was whether metaslab data is counted as part of the ARC metadata sizing. The answer is that it's not. If you load the space maps from disk into memory, the space maps themselves can be cached in the ARC, and that data counts as metadata in the ARC, but the range trees are not actually part of the ARC; they're allocated out of an entirely separate cache, and so they don't count towards any ARC quotas.
Until we added the memory cap, they were just totally uncapped and totally untracked; they would use as much memory as they wanted. So there is a way in which you can have space maps in the ARC, they will just naturally get cached as you're reading them, and that does count towards ARC metadata, but the range trees don't.
Once ZFS fragmentation starts to spike, there's a performance cliff, as we refer to it, and if you're not really nearing that, you're probably not going to have that much memory used by loaded metaslabs, because the range trees are very efficient if fragmentation isn't very high. That same 72-byte range segment could represent 512 bytes of free space or 5 gigabytes of free space.
If the fragmentation isn't high, they work great; it was really only under these very heavy situations that you started to run into specific memory issues. So it is competing with itself a little bit, but it's not usually a problem. Yeah, the ARC will shrink as it sees that memory pressure, so the ARC is the one that's going to yield; the range tree stuff will basically take priority over it.
So, for the B-tree, the leaf nodes, I believe, are four kilobytes. The core nodes are actually allocated slightly differently, just as a design decision: they're always 128 elements wide, and so their size varies according to that. That was just so that I could have more predictable branching factors when I was doing the math, but in practice I think that also works out to a couple of kilobytes.
Yeah, so actually, in a lightning talk, there's a little bit more about this. The question was: can you achieve some of the same benefits of loading more metaslabs by having larger metaslabs, so there are fewer of them? And the answer is yes; the downside is that the loads become even slower and each metaslab takes up even more memory when loaded. So you do get some of the same benefits, but it does mean that if you only needed to load one metaslab, that load costs you more.
The question was: could we have found this without DTrace? And the answer is maybe. Fundamentally, you need some way to do profiling, and DTrace, in addition to everything else, is a great profiling tool. If we'd had some other way to do profiling, we probably would have found it. If we didn't have any way to do it, we would have had to do a lot of instrumentation, like trying to add counters and do analysis and things like that; it would have been very difficult.
This work was primarily done on illumos; all of it has been tested on Linux. The performance wins are smaller on Linux, and we think the main reason for that is just that Linux has slightly different performance bottlenecks. We did all our analysis on illumos, we did all our testing on illumos, and so this was really optimized towards that. But all of these changes give performance wins on Linux as well, usually slightly smaller numbers, but they're still pretty real.
The question was: if you have systems on Linux and you're trying to figure out whether or not they're running into these issues, where do you look? One thing you can do is look at a flame graph; you can generate those on Linux pretty easily, and if you see a lot of time spent in the allocation path, that's a sign.
But specifically, if you start looking at how long your synchronous I/Os take to complete, and then compare metaslab loads to allocations using something like bpftrace, you can see some of the same indicators we were hitting. And if you look at systems that have some of the same workloads we were running, you're probably going to see some hints of these same problems, though maybe not to the same extent.
The question was: would offloading metadata to special, faster devices have helped? If we had that configured on our customer systems and set up properly, it would have made reading the space maps from disk faster, but the problem is that any time your sync write is waiting for a disk read, even a fast disk read, you've already lost.
You can reduce the extent of the loss, but you're still in a really bad place, and even if the space map is loaded in the ARC, just loading it into the range tree itself was a very time-consuming process, so it wouldn't have helped with that at all. It might have changed the fragmentation behavior a little bit, but not significantly.
Yeah, yeah, you could, yeah. I agree: you could get the flame graphs with perf. If you want to do some of the analysis we did, bpftrace is now reaching the level of capability where it can do almost everything we did with DTrace; there are still some things it can't do, but it's definitely progressing pretty quickly, so that's my current preferred tool of choice. You could also do it with SystemTap and things like that; there would just be a lot more cursing. Yeah, that's a sign from George.
For those who don't know, block pointer rewrite is a project where you could actually take the data on disk in ZFS and modify the block pointers, even the ones in things like snapshots that are supposed to be read-only. That would be necessary because, as you move the data, you need to update the checksums of parent blocks and things like that. I know that a number of people have asked for this over the years; an extremely large number of people.
If somebody was sufficiently motivated, I think that block pointer rewrite is a solvable problem, but in practice it turns out most of the things you want it for, other than defragmentation, can be solved in other ways. For example, device removal was previously considered a BP-rewrite class of problem, but we found a way to solve it that was more efficient and didn't require quite as many of the same constraints. So defrag remains sort of a long-term theoretical goal.
I didn't work on device removal, so I don't know exactly what the status of that is. I know roughly how it works, and I know that there's no theoretical barrier to extending device removal further, but I don't know if anybody's working on it or what the status is; I'm sure somebody here knows. And I think we're probably out of time at this point, so we can wrap up.
Yeah, the comment was that if you start with a really small pool and just double your disk size over and over, you end up with something like a thousand really tiny metaslabs, and that's why we don't recommend that you just increase the disk size over and over; you have to actually add new disks and things like that.
Yeah, Matt's saying that even if you do have that situation, where you have lots and lots of tiny metaslabs, with these changes and with the log spacemap changes it makes everything so much better that you'd have to get into a really extreme performance case to start running into these problems again. You don't have to worry about your metaslab sizes nearly as much now. Yeah, you've got at least five years.
We could consider changing things like the minimum or maximum size of metaslabs. I think at least some of these numbers are largely arbitrary; I don't know how much research and math went into them. So if somebody wants to start a campaign to change these numbers, and actually do the data gathering and figure out what the right numbers should be, go for it, have fun; I look forward to seeing what your results are. They'll be really interesting.
Yeah, so the question was: in a lot of use cases synchronous writes are unintentional; were our writes intentional? And the answer is yes: databases over NFS will try to always do these synchronous writes, because they want to actually know that their data has been persisted. We have a mode that we can turn on where we just ignore the fact that they're synchronous and return immediately.
We call it NPM, for NFS performance mode, but the acronym has another meaning: it's called no-pants mode. So you can do that, and it was considered as a possible alternative, as in, well, if our customers are truly hosed we can do this instead, but it does impact reliability significantly. So we were doing sync writes on purpose. If you're doing sync writes accidentally, you have a different problem, and I think you should be looking at that before you start looking at this stuff.