From YouTube: OpenZFS Developer Summit Part 7
Description
http://www.beginningwithi.com/2013/11/18/openzfs-developer-summit/
Scalability (Kirill Davydychev)
Virtual memory interactions (Brian Behlendorf)
A
There are a couple of talks that a lot of people expressed interest in, and about 45 minutes until dinner arrives. The talks that a lot of people were interested in were scalability issues with Kirill, interactions with the VM subsystem with Brian, multi-tenant ZFS with Rob, and examining the on-disk format with Max.
A
So what do people think: should we give each of these talks 15 minutes and just go a little bit over, or are there some of these talks that are more interesting than others?
A
There you go, all right. So, would you like to go first?
A
Okay, so we'll do them in the order that I mentioned. So Brian, and whoever else is leading the discussion about interaction with the VM subsystems, you'll be next.
D
Right, so I actually don't have a PowerPoint, and unfortunately, or fortunately in the context of time, Adam covered a lot of the stuff I wanted to talk about, because at Nexenta we had a lot of the same issues as they do. So, basically, yeah.
D
We see several scalability things that come up with customers, for a variety of reasons. The, well, not the worst ones, but the most complicated ones are where customers decide to scale up way too high, to the point where it just doesn't make any sense.
D
Memory, number of disks, anything. We have deployments of up to 480 drives in a single system, which may get into a pickle where you end up CPU bound, or you don't even have enough memory for metadata, so you're just stuck on I/O, or you're just not even utilizing your disks to any meaningful level.
D
Yes, yes, we've profiled it, but unfortunately I've been pulled away on some other stuff. I do have an in-house SSD system now where I can play with it. A lot of it is in locks, both the ARC and a few other locks. I'm hopeful that some of the Delphix changes will help that.
D
There's room for improvement for sure, and it's a lot of locking issues, a lot of things that unexpectedly start taking longer, as operations that were slow in the spinning-disk context become really, really fast.
D
Well, so, yes, it's exacerbated essentially by the amount of IOPS that you're pushing, so the smaller the block, the worse it is in terms of the amount of work you have to do per, say, per megabyte.
D
In our testing, on both OpenIndiana and on our current development branch of NexentaStor, which is a few months behind, I was not able to push more than 130,000 write IOPS out of any pool, not even a pool that's supposed to be able to push a million from a raw-disk perspective.
D
So this is where we stand from a write-performance and scalability perspective, and it's especially sad because on pools that should be able to push far more, the bottleneck is the CPU. It's exacerbated by the locking. Which locks? There's a variety of them. I can probably publish the data openly, I'm not really trying to hide it, but I don't remember exactly: there were ARC locks, some ZIO locks, some task queue stuff. Probably a lot of the stuff you've seen already, yeah.
E
Maybe we don't necessarily want to publish the data, but it would be good to start having those conversations and those threads in the OpenZFS community, so we can look at them and try to brainstorm on whether it's a problem that's already been solved or something new that needs investigation.
D
Yes, absolutely. So one thing that's kind of beneficial about our customer and partner model is that a lot of times we have partners that build those in-house systems that they test before shipping to the customer, and they come to us to see how we can tweak or optimize those systems.
D
We are seeing, first of all, that those furious reaps end up deadlocking the system for seconds at a time; deadlocking it so badly that the network drivers cannot allocate memory for packets, which means the box becomes not pingable, which means the cluster heartbeats drop, which doesn't make everybody happy. And those reaps are very much more visible during an abrupt change in workload.
D
So if we were, for example, pushing mostly a small working set size, and we are mostly in the MFU, and suddenly a big backup job runs on some different dataset, and the blocks that the backup job reads are still in the ghost list for the MRU, now we have the MFU shrinking rapidly and the MRU growing rapidly, and as far as I know there's no threshold or limiter that will actually slow down this rapid progression of going from one cache to the other.
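To make the swing being described concrete, here is a minimal C sketch, not the actual OpenZFS code, of ghost-hit-driven adaptation: a hit on the MRU ghost list grows the MRU target at the MFU's expense, and nothing rate-limits how fast that can happen. The struct, the function name, and the step size are all illustrative assumptions.

#include <stddef.h>

/* Illustrative ARC-like state; sizes in bytes, names hypothetical. */
struct arc_like {
    size_t c;                   /* overall cache target                      */
    size_t p;                   /* target size of the MRU side; MFU gets c-p */
    size_t mru_size, mfu_size;
};

/*
 * Called on a hit in one of the ghost lists (headers of blocks we recently
 * evicted).  The side that would have hit gets a bigger share of the cache.
 * There is no limiter here: a burst of MRU-ghost hits, e.g. a backup job
 * re-reading a cold dataset, can move p, and so the caches, very quickly.
 */
static void arc_like_adapt(struct arc_like *a, int hit_mru_ghost)
{
    const size_t step = 128 * 1024;     /* illustrative adjustment step */

    if (hit_mru_ghost)
        a->p = (a->p + step > a->c) ? a->c : a->p + step;
    else
        a->p = (a->p > step) ? a->p - step : 0;
}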
D
The same thing goes for different block-size workloads. When you have a mixed-use system with, let's say, NFS shares at 128K, and you have some zvols at 4K, some zvols at 16K, something like that, then, as the system goes through its daily or weekly or whatever life cycle, you have, let's say, a VDI system plus some file shares plus some databases.
D
So in the morning your users log in and they go to the 8K blocks, and the cache gets populated. In Solaris the ARC memory actually works like this: it goes into buckets, into slabs of a certain size with blocks within them, so each slab is 128K and can contain smaller chunks that are all equal within that slab. So the way the memory scales up and down is that it looks at each size that the system needs.
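As a rough illustration of the bucketing just described, here is a plain C sketch of a kmem-cache style layout: one cache per buffer size, each cache made of fixed 128K slabs carved into equal chunks of that size. All names are made up for the example; this is not the Solaris allocator.

#include <stdlib.h>

#define SLAB_BYTES (128 * 1024)   /* each slab is 128K, per the description */

/* One slab holds chunks of a single size; a cache is a list of such slabs. */
struct slab {
    struct slab *next;
    size_t       chunk_size;      /* all chunks in this slab are this size */
    size_t       chunks_free;
    void        *mem;             /* SLAB_BYTES of backing memory          */
};

/* One cache per buffer size the system uses: 4K, 8K, 16K, 128K, ... */
struct size_cache {
    size_t       chunk_size;
    struct slab *slabs;
};

static struct slab *slab_create(size_t chunk_size)
{
    struct slab *s = malloc(sizeof (*s));

    if (s == NULL)
        return NULL;
    s->next = NULL;
    s->chunk_size  = chunk_size;
    s->chunks_free = SLAB_BYTES / chunk_size;
    s->mem = malloc(SLAB_BYTES);  /* stand-in for a real slab allocation */
    if (s->mem == NULL) {
        free(s);
        return NULL;
    }
    return s;
}

/*
 * Growing one size class means creating more slabs for it; shrinking another
 * class means walking its slabs and freeing the empty ones back.  Dropping
 * hundreds of gigabytes of 4K buffers therefore touches millions of chunks
 * across a huge number of slabs, which is the expensive reap described next.
 */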
D
Let's say we need more 4K, so it tries to reap all the other ones, but as you move between the workloads you can end up again in a situation where you need to drop gigabytes, or tens of gigabytes, or sometimes hundreds of gigabytes of, let's say, 4K buffers, which is a very memory-intensive operation and a very CPU-intensive operation because of cross calls.
D
So again, this can cause a deadlock. Sometimes it actually does cause a complete deadlock. I believe Boris worked on a case like that, where we got it to a point where it's not a complete deadlock, so it slows down, but it doesn't actually freeze the system completely, requiring a kernel panic to get it out.
D
Yes, so there are quite a few, and a lot of those are difficult to troubleshoot, because you don't know if your CPUs are used up by the reaps or not, so you don't know where the root cause of your bottleneck is. For example, you may want to add more CPU because you think you're running out, but adding more CPU of course just means you have more cross calls, so in effect it might slow the system down.
A
So, there's a lot of problems there: where should we, or you, focus your efforts in order to attack the highest-value, lowest-cost things?
D
So right now, write performance and the memory instability are the two key areas where I think we should focus our efforts. Those are the top two things for Nexenta internally right now as well, so we're going to be doing a lot of work there, hopefully.
D
Two years ago, a system with 256 gigs was almost unheard of. Now we have customers deploying 512, and in some cases I think somebody's talking about a terabyte of RAM, yeah.
F
On your stuff, I might just say a couple of things. There's one thing that Kirill mentioned about the ARC and the transition between multiple different workloads. I've looked at the code, and it seems to me that the ARC code there is somewhat simplistic; it chooses a fixed order for going through the lists.
F
And I don't know how much it's going to help, because the other thing is with the multi-modal workloads.
F
My look into the gist of this kind of leads to virtual memory arena fragmentation, and the classic case is when you fill up, or almost fill up, all the arenas with small blocks, then randomly overwrite them, and then start filling up with progressively larger block sizes, which seems like a pathological case, but sometimes it happens: you fill it up with 8K, overwrite randomly, and then the block size starts going up, like 16, 32, 64 and 128, at the same time.
A
I mean, I think creating test cases like that, that can reproduce these performance problems, is going to be super valuable to actually fixing them, because when you only see this on a customer system and it happens once a day, that's really hard to diagnose, or to evaluate a fix for.
D
So yeah, and I've had systems where it happens once a week. That particular one was attributed to a Microsoft SQL backup job, which was the precise change in workload that induced it, without fail, every single weekend, sending the customer box offline for 30 seconds or so.
D
So in my case the biggest one I have is 280 gigs of RAM, and it's sufficient for this sort of stuff. It will not be sufficient once we have eliminated the low-hanging fruit, I'm afraid, but that's something I'm talking about with the guys here, and Alexander, so hopefully we'll be able to scale with our customer base there.
A
Thanks, Kirill. Well, I don't have slides either, so I'll try to keep it quick, but I'm actually encouraged to hear that other people are looking at the memory management stuff too. I had feared for a while that this was something only I was going to get to work on for our port, but knowing that there's work going on on the other platforms is actually a little bit encouraging to me, believe it or not. So, the problem we're suffering with on the Linux port:
A
I'd say probably our biggest stability concern at the moment is the memory management, and it's a problem because we basically took it over and kept it unchanged from the upstream code as best as we could, and the problem with that is that Linux's memory management subsystems are very, very different from FreeBSD's or illumos's. Certain things that you would think would be fine just aren't the way you do it in the Linux kernel. Case in point: large memory allocations.
A
Anything over, I would say, a couple of pages on Linux, you should pretty much think about allocating individual pages for, if you want to do it fast. We have kmalloc and we have vmalloc, but on Linux kmalloc is only fast for a couple of pages, and vmalloc is strongly frowned upon: you should not use it, it is not a good interface, it is not a fast interface, it is not meant to be a fast interface, and it should not be relied upon. So early on in the Linux port, rather than plumbing up all the code to do scatter-gather based page allocations, what we did was put a layer in our SPL to try and make that all work reasonably well on Linux. We implemented our own, basically vmalloc-backed, slab in Linux to try and make that behave pretty well, so we wouldn't have to change any of the ZFS code; it could stay unchanged, and we got it working.
A
It got us a long way, but at the end of the day we just can't really get away with doing large memory allocations in the Linux kernel.
A
So to address that, we've been thinking about how to re-plumb ZFS to use scatter-gather lists for all the large ZIO buffers that currently come off of a slab.
A
If that's workable for everybody, should we just put wrappers around them, so that you guys can continue to use kmem_alloc and kmem_free and your existing slab implementations while we do something more Linux-specific, or is this a bigger problem for everybody else going forward, so that maybe we should all move towards a scatter-gather sort of infrastructure for ZFS in general? I think we know we have to do it for Linux, but I'm curious how much of a problem it really is for the other implementations.
A
So can you kind of describe, maybe I just didn't quite understand, what is the exact problem? Is it that you cannot allocate contiguous virtual address space? No, so there are a couple of problems that are worth explaining, because I think most of us aren't Linux folks here. You can absolutely do it on Linux, but you have to be aware that it's a global address space in the kernel and it's covered by a single lock.
A
So if you do a vmalloc, you take a global spin lock on the system, so everything gets serialized through that lock. It's not that it doesn't work, it's just that it's very, very slow to do, so it's frowned upon. Improvements have gone into the Linux kernel over the last couple of years to speed that up, but fundamentally it's all still serialized on a global lock, so it's expensive. And are you using a slab allocator, so that you aren't going to that global lock for every allocation?
A
Only when you need to get a new slab; that's what we did in the SPL, that's the slab allocator we wrote. We said, okay, well, we know this is what Linux is going to do, so we'll allocate pretty big slabs and then we'll just carve them up internally and avoid taking that global lock. That was the dodge we pulled to get around this being a really bad problem.
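A sketch, in plain C with malloc standing in for vmalloc, of the dodge being described: take the expensive, globally locked allocation only rarely, for a big slab, and satisfy individual object allocations by carving that slab up locally. The names and sizes are illustrative, not the actual SPL code.

#include <stdlib.h>

/* The expensive path (vmalloc on Linux: global lock, serialized).
 * Plain malloc() stands in for it here. */
static void *expensive_big_alloc(size_t bytes)
{
    return malloc(bytes);
}

/* A large slab, obtained once, then carved into many equal objects. */
struct carved_slab {
    char  *base;
    size_t obj_size;
    size_t total, used;
};

static int carved_slab_init(struct carved_slab *cs, size_t obj_size, size_t nobjs)
{
    cs->base = expensive_big_alloc(obj_size * nobjs); /* one global-lock hit */
    if (cs->base == NULL)
        return -1;
    cs->obj_size = obj_size;
    cs->total = nobjs;
    cs->used = 0;
    return 0;
}

/* Object allocations never touch the expensive allocator.  (No free list
 * shown; a real slab would recycle freed objects and grow by adding slabs.) */
static void *carved_slab_alloc(struct carved_slab *cs)
{
    if (cs->used == cs->total)
        return NULL;
    return cs->base + (cs->used++ * cs->obj_size);
}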
A
Basically, there are two types of allocation: there's vmalloc and there's kmalloc. Anything that goes through vmalloc goes through that global lock and will be serialized. Anything that goes through kmalloc is quite fast and is backed by a slab, but there's a size limit on it; you probably shouldn't allocate more than, I would say, two pages, probably.
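For readers who aren't Linux folks, here is a small kernel-module style fragment, not from the ZFS code, illustrating the two interfaces being contrasted: kmalloc for small, physically contiguous allocations served from a slab, and vmalloc, which builds a virtual mapping under a global lock and is therefore avoided on hot paths.

#include <linux/slab.h>      /* kmalloc, kfree */
#include <linux/vmalloc.h>   /* vmalloc, vfree */
#include <linux/errno.h>

static void *small_buf;      /* a couple of pages at most: kmalloc is fast  */
static void *large_buf;      /* big and rarely allocated: vmalloc territory */

static int alloc_example(void)
{
	/* Fast path: physically contiguous, served from a slab. */
	small_buf = kmalloc(8192, GFP_KERNEL);   /* roughly two 4K pages */
	if (!small_buf)
		return -ENOMEM;

	/*
	 * Slow path: builds page-table entries in the kernel's global
	 * vmalloc address space, serialized on a global lock, so it is
	 * frowned upon for anything performance sensitive.
	 */
	large_buf = vmalloc(4 * 1024 * 1024);
	if (!large_buf) {
		kfree(small_buf);
		return -ENOMEM;
	}
	return 0;
}

static void free_example(void)
{
	vfree(large_buf);
	kfree(small_buf);
}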
A
Linux does provide its own slab, but it's got these restrictions, like I think the biggest thing you can actually allocate out of the kmalloc slabs is 128K, and don't count on it being fast, and don't count on it succeeding, because it's totally possible for it to fail and say no, you can't have one; that's a legitimate way for the allocator to behave. So can you describe what you mean by scatter-gather and how that's going to help?
A
Yes, so I was getting there. The solution we're proposing on Linux is basically what file systems on Linux do: they don't do large memory allocations. They allocate individual pages, they get individual pages from user space for the I/O that's going on, and they feed individual pages to the block layer, and they assemble those into scatter-gather lists as the I/O gets passed through. So it's a set of pages, just a list of pointers, and these pages don't need to be physically contiguous in memory.
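A hedged kernel-style sketch of what such a buffer might look like: just an array of struct page pointers, allocated page by page, with no requirement that the pages be contiguous and no kernel virtual mapping created up front. The sg_buf type and function are hypothetical, not an existing API.

#include <linux/gfp.h>       /* alloc_page, GFP_KERNEL  */
#include <linux/mm_types.h>  /* struct page             */
#include <linux/slab.h>      /* kmalloc, kcalloc, kfree */

/* A buffer described only as a list of pages (hypothetical type). */
struct sg_buf {
	int           npages;
	struct page **pages;
};

static struct sg_buf *sg_buf_alloc(int npages)
{
	struct sg_buf *b;
	int i;

	b = kmalloc(sizeof (*b), GFP_KERNEL);
	if (!b)
		return NULL;
	b->npages = npages;
	b->pages = kcalloc(npages, sizeof (struct page *), GFP_KERNEL);
	if (!b->pages) {
		kfree(b);
		return NULL;
	}
	/*
	 * Pages can come from anywhere; nothing here is physically or
	 * virtually contiguous, and nothing is mapped into the kernel
	 * virtual address space yet.  (Per-page error handling elided.)
	 */
	for (i = 0; i < npages; i++)
		b->pages[i] = alloc_page(GFP_KERNEL);
	return b;
}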
A
This is my buffer, and it doesn't need to be physically contiguous, or virtually contiguous for that matter? Right, right, they're not even mapped into the address space, and that's why it's better: on Linux you have this set of pages, and they have physical addresses, but they have no virtual address. But I mean, they need to be mapped in, like, to check something, yes? Is that going to destroy performance?
A
So that's an open question, because normally, to create a mapping you have to go shoot down the TLB on all the other CPUs, and changing the address space is usually not very fast.
C
Right, but it can be done on one CPU at a time, so on i386 you can do a local temporary mapping on a page-by-page basis without shooting down all the other CPUs, and on amd64 there's no address-space pressure, you can use the direct map. I'm guessing Linux has a direct map as well, so it's essentially free, and so there's really almost no cost. Can you pin a thread to that CPU during the calculation? Absolutely.
A
In fact, you have to on Linux. So you take the buffers and you map them to that CPU for the operation you need to perform. There are certain restrictions, like no sleeping: you can take spin locks, but don't ever yield.
C
I'm surprised. I mean, on FreeBSD you just yield, and then when it returns, basically based on priority, you get resumed on the same CPU.
A
That's the way it works on Linux, so that would be, we would be looking at introducing wrappers to do that kind of thing. So instead of doing what we do now, which is allocating the ZIO buffer with kmem_alloc, maybe there can still be a wrapper that looks like that's what happens, except you get this zio_buf back that you can pass around as if it were your buffer. Then on Linux, what you would do instead would be:
A
You would allocate a bunch of pages, not randomly, but you just ask for a bunch of pages and you get them from somewhere in the address space, and then you would have some wrapper functions to copy data into them, or to get data out of them, or to compute a checksum, that kind of thing, but you would never need to map those pages into the virtual address space.
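A hedged sketch of the kind of wrapper being described: walk a list of pages, temporarily map each one onto the current CPU with kmap_atomic (essentially free on 64-bit, where the direct map already covers it), do the work, and unmap. The byte-sum here is a trivial stand-in for a real checksum, and the page-array shape matches the hypothetical sg_buf sketch above.

#include <linux/highmem.h>   /* kmap_atomic, kunmap_atomic */
#include <linux/mm_types.h>  /* struct page                */
#include <linux/types.h>     /* u64                        */

/*
 * Sum the bytes of a buffer described only by an array of pages.  No
 * persistent kernel virtual mapping is ever created; each page is mapped
 * onto this CPU just long enough to read it, and the code must not sleep
 * while the temporary mapping is held.
 */
static u64 pages_simple_sum(struct page **pages, int npages)
{
	u64 sum = 0;
	unsigned long j;
	int i;

	for (i = 0; i < npages; i++) {
		unsigned char *va = kmap_atomic(pages[i]);

		for (j = 0; j < PAGE_SIZE; j++)
			sum += va[j];

		kunmap_atomic(va);
	}
	return sum;
}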
A
Well, I mean, you can, like we were saying, as long as you don't actually need to yield the CPU. Something like compression, you don't need to: you can map it onto that CPU, do the compression, and then you're good. If you need to do something else, like, oh, I don't know, something where you have to take a mutex or something like that... Are you proposing doing this for only user data?
A
You know, there are probably a thousand places in the code where it's, you know, updating an indirect block, or updating a znode, or updating the dnode, or updating a, you know, DSL dataset; all those things are mostly pretty small blocks, but you know...
A
The access to them is much less constrained than user data. Yes, and it's for that exact reason that we went through the hoops we did initially to make our own slab implementation that was faster; we wanted to avoid all of that. But long term, I don't see how we avoid it on the Linux side; we can't really continue to do what we're doing.
A
We need to do something that doesn't require us to map anything into the virtual address space. Can you just, when the system boots up, map all physical pages into the address space and then...
F
It's the same problem, except it's alleviated with quantum caches, right, under the arena levels. They also have a single lock within a vmem arena, and the way they make it work for multiple users is that they have quantum caches for so many of the small sizes; but the other thing is that all the caches are built on top of vmem, and the arenas are sort of hierarchical. So that's how they make it all work well.
A
I think that, you know, I don't have any fundamental problems with doing that. There are a lot of ideas on how maybe the layering could be better, but I think working within what you've got and doing that, you know, that's fine. I don't think it's going to hurt anybody else, I don't think so.
A
I should also mention that, yeah, we kind of brought it up with the upstream maintainers, and this is more a philosophy thing, I would say. On Linux this is not viewed as a bad thing; it's: if you want good performance, you should not go through the virtual address space. It's just kind of baked into the culture, they don't believe they're doing anything wrong, and I tend to agree with them, actually. Allowing this sort of thing is a convenience for the developers, right.
A
A
You
can't
just
do
this
in
the
portability
layers,
so
I
mean
even
the
fact
that
we've
tried
to
hide
it,
but
there's
other
issues
that
come
up
on
linux.
Things
like
you
can't
do
any
kind
of
sleeping
allocation
or
email
in
any
right
path.
Right
that
is
will
deadlock
shall
not
be
done
all
right.
So,
let's
maybe
say
that.
A
Page cache integration: we'd like to bring ZFS into the page cache on Linux, and if we start doing things on a page basis we can actually do that. We don't have this problem of mapping these random buffers that are in a slab somewhere into the page cache; we can now properly map pages in, fault them in, and get rid of the ugly mmap hack, probably, that we have at the moment, where we keep two copies of the data.
A
So I think it's a lot of work, but I think for everybody there would be a lot of benefits that would fall out of it. I don't expect it to be a quick change; it's a big, disruptive change, and I kind of hope to spend the next year poking at it and moving it forward. So I don't want to underestimate the amount of work.
A
Cool, thanks. So let's go ahead and grab dinner and come back here; no beer until you listen to two more topics.