From YouTube: Lustre ZFS & Supercomputers by Brian Behlendorf
All right, hello everybody. My name is Brian Behlendorf; like you heard, I'm a computer scientist at Lawrence Livermore National Laboratory, and I founded the ZFS on Linux project. I wanted to talk to you guys about the use case we're using ZFS for and how it's become instrumental to our machines there. But first I wanted to mention that I really think the BCC stuff is really cool and I'm gonna look into that from the previous talk, because that kind of analysis is exactly what we need to drill down on a problem.
So I'm pretty excited about that. This talk, however, is gonna go in a different direction, right, toward the biggest-scale machines out there, which is something we do at Livermore. So let me talk about that for a minute. Lawrence Livermore National Laboratory: we're located in the Bay Area here; we're actually only about 40 miles east of here.
So I work at Livermore, which is one of the NNSA laboratories. Los Alamos is probably another one you're familiar with, right; that's the home of the Manhattan Project in World War Two, the atom bomb came out of there, and Sandia is the other big laboratory. There's a lot of overlap between the research done at a lot of these facilities, but they're all research and development. So this is Livermore.
It's about a square-mile campus, like I say, just located east of here. It was founded in 1952, actually, by the University of California.
Interestingly enough, Livermore used to be a World War Two naval air base. So what was sited here originally was a radiation laboratory that was managed by the University of California, and that grew effectively into Lawrence Livermore National Laboratory. So the laboratory was originally set up in this capacity in 1952, kind of as a counterweight to Los Alamos. Like I say, the Manhattan Project started at Los Alamos, and they did a lot of the nuclear weapons design work in the country at the time.
In fact, they did all the nuclear weapons design work, and then in 1952 people thought it was a good idea that maybe we should have two laboratories working on this. So one of Livermore's primary missions is as a nuclear weapons design lab. They have other missions, but that's one of them, and like I say, they were set up as a counterweight to Los Alamos, to, you know, act as a sounding board to review designs and whatnot.
Basically, one of the main missions at the laboratory is a program called stockpile stewardship. Stockpile stewardship is basically our laboratory certifying that the nuclear weapons in the US are safe, reliable and effective, that they're gonna work, and this is a pretty hard problem, it turns out, because we used to do this in the US by actually testing the weapons, right. Every year we have to report to Congress that the weapons will work, and this used to be done by testing them. That can't be done anymore, right: in 1996 there was a nuclear test ban treaty signed by the US, as a little bit of background, which basically said no more above-ground nuclear testing. So we don't do that anymore, and the way we still certify the stockpile is with simulations, in fact HPC simulations, high performance computing simulations. That's one way we do that.
So that's part of our mission, but I don't want to let you think that's our only business; we actually are a research and development laboratory. So one of the other big projects operating at Livermore right now is the National Ignition Facility. I don't know if anybody here has heard of the National Ignition Facility, right? This is probably one of the coolest projects at Livermore right now; it's basically a giant laser, right. The National Ignition Facility is a fusion research experiment, and it's designed to simulate
you know, the temperatures and pressures that are required for nuclear fusion. So you're talking about hundreds of millions of degrees, billions of atmospheres of pressure, for a fraction of a second, to get fusion. This might look like a picture of the warp core from the fairly recent Star Trek movie, but in fact this is a picture of NIF, all right.
Yeah, I'd like it; it doesn't look that Star Trekky. Well, anyway, we do a lot of other research at the laboratory, but those are two of our primary missions at the moment. We actually have a long history of scientific computing going back to support those kinds of technical missions, all the way back to the 1960s, actually a little bit before that. So, like I said, we were founded in 1952, and then we brought in the first HPC machine to the laboratory in 1953, so pretty much immediately after the doors were open.
You know, we started deploying hardware there, and that was a UNIVAC 1, so going way, way, way back. They have a nice computer history museum associated with the laboratories where you can see some of these things, and the UNIVAC 1 is a beast of a machine, right. It's, I don't know, seven feet by 14 feet, and the whole thing has, you know, 5,000 vacuum tubes in it, something on that order, and a thousand words of memory. That was the UNIVAC 1. But anyway, historically we've deployed the latest, greatest supercomputers at the laboratory.
So supercomputers these days are ranked, you know, on a list called the Top 500 list. It is exactly what you think it is: it's a list of the 500 fastest supercomputers in the world at any given time. The list is put out twice a year, once at Supercomputing and once at International Supercomputing, so every six months, and the systems are ranked using a benchmark called Linpack, which is kind of interesting and kind of an accident of history, actually, right; it's not that this benchmark was designed for this purpose.
It's just the one that everybody happened to use, and it happens to be a pretty good measure of, like, the computational performance of a system. So you could imagine that one way to measure a computer system, right, would be to multiply out all the processors, right: I've got a million processors and they're all this fast, and that's the theoretical peak performance. Well, Linpack actually gives you a measurement of the actual delivered performance to an application, right. It does this by solving a series of linear equations spread over the memory in the system, right.
So it measures things like the effective memory bandwidth of the system, like how fast the interconnect is, how fast it can pass messages around, what the CPU speeds are. All those things come in to give you a measure of, like, the computational performance of the system. Again, it's an arbitrary benchmark, but it's the one people have settled on for twenty years now.
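To make the peak-versus-delivered distinction concrete, here is a rough back-of-the-envelope sketch in Python; every number in it is a made-up, illustrative value, not a figure from the talk, and real Top 500 entries report the measured Rmax alongside the theoretical Rpeak.

# Illustrative only: theoretical peak vs. Linpack-style delivered performance.
# All figures below are hypothetical round numbers, not measurements.
cores = 1_000_000          # hypothetical core count ("a million processors")
clock_hz = 1.6e9           # hypothetical clock rate
flops_per_cycle = 8        # hypothetical FLOPs per core per cycle

rpeak = cores * clock_hz * flops_per_cycle   # multiply it all out: theoretical peak
rmax = 0.75 * rpeak                          # assume Linpack delivers some fraction of peak

print(f"Rpeak = {rpeak / 1e15:.1f} PFLOP/s")
print(f"Rmax  = {rmax / 1e15:.1f} PFLOP/s ({rmax / rpeak:.0%} of peak)")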
At the moment, Livermore has had two systems in the last ten years on the list; we've had more than that going back, but like I say, Blue Gene and Sequoia here had the number one slot in the last ten years, which is pretty cool. I've got to work on both of these systems and they're pretty neat. DOE, like I say, has a complex of laboratories, and at the moment there are actually quite a few machines spread out over that complex that are in the top ten. Oak Ridge has a machine called Titan.
It's about 17 petaflops. There's the Sequoia machine, which I just mentioned, at Livermore, and LANL and Sandia have a machine that just deployed called Trinity. So, interestingly enough, this list used to be dominated by machines at the Department of Energy, HPC machines, but in recent years that's not been the case so much, all right. Some of the biggest machines we're seeing now are actually coming out of China, with just monster processor counts and performance numbers on them. So we've got a little competition now, which is cool.
We wanted to be running commodity hardware as much as possible, because supercomputers are all big and expensive and they used to be proprietary; there was only limited volume of them, right. I mean, we'd really love to be running commodity chips, whatever they're making millions of, right; we get much better prices on that. So there was kind of a strategic shift in the last 10 or 15 years to Linux clusters. For that reason, we build our systems
on top of Red Hat Enterprise Linux, because, once again, it's an enterprise distribution. We've looked at others; Ubuntu is definitely an option for this, but we happened to settle on Red Hat a while back. And on top of our distribution we add some HPC-specific functionality that you don't find in most Linux distributions, right. So that means things like low-latency interconnects; InfiniBand is the current interconnect of choice these days, mainly because it's got such low message-passing times and great bandwidth.
But one of the key ideas around Linux clusters, also, is doing this with open-source solutions, right. Given the options, we really want to use open source, like ZFS or Lustre or Linux, right. We have a substantial investment in these machines and, you know, we have the staff to maintain and work on them, and we want to be able to work on the guts. So we're a big fan of open-source solutions and of engaging with the communities around them as much as possible.
So scientific simulations like the ones we're talking about, and there are lots of different simulations we do on these systems, can easily generate petabytes of data, right. They just generate enormous data sets and they need to be stored, and there's quite a lot of variety in these data sets, actually. We have data sets that range from millions of small files to, you know, really huge multi-terabyte or even multi-petabyte files, right. We give the users quite a lot of latitude in how they write out their data sets.
You can imagine that if you're doing a big scientific simulation and you're reading a data set back off disk or something to continue calculating, and you read in some wrong values, right, and you continue simulating with that, then suddenly, voila, you've discovered new physics, right. Not so good, right; we don't like that. It's a whole different scale of problem than a couple of pixels being wrong in my Facebook image or whatever, right; it's much more important. So data integrity: right up our alley, a big thing we care about.
On the back end, our file systems also really require high I/O throughput for checkpoints. This might not be obvious initially, but for systems that are the scale of Sequoia or something, some of the biggest systems, right, you can be talking about having a petabyte of memory on the system, and that petabyte of memory needs to be written out at periodic intervals, because, while the systems are reliable for the most part, they're huge and parts do fail on them. So you might lose a node, right, and you don't want to lose the whole calculation, right.
You don't want to lose the whole simulation, so you periodically want to be writing out these snapshots, or checkpoints, so you can restart in the case of a failure. Now, you want that to be fast, because the scientists, today, don't care about writing out data. They want to do the computation; they want to get the answer, right. So they need to strike a balance between how much of their time they're spending calculating and how much of their time they're spending doing I/O.
So we want to make sure we deliver good bandwidth on the back end, so they can minimize those I/O times as much as possible, and you can imagine writing out a petabyte of data takes a little bit of time even on a fast file system. And then, we can't just optimize for I/O throughput, as nice as that would be, all right.
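To put a rough number on that, here is simple arithmetic using the round figures mentioned in the talk (a petabyte of memory, roughly a terabyte a second of file system throughput); it is an illustration, not a measurement.

# Checkpoint-time arithmetic with the talk's round numbers. Illustrative only.
memory_bytes = 1.0e15            # about a petabyte of system memory to write out
bandwidth_bytes_per_s = 1.0e12   # about a terabyte per second to the file system

seconds = memory_bytes / bandwidth_bytes_per_s
print(f"~{seconds:.0f} s (~{seconds / 60:.0f} minutes) of pure I/O per full checkpoint")
# -> roughly 1000 seconds, about 17 minutes, every time a checkpoint is written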
At the end of the day, they're gonna want to visualize this data, so the requirement there is that they actually have decent interactive performance.
Also on the file system, we use tools like VisIt, which are really good at providing visualizations, like the one you can see up there, of the data in the file system for a particular simulation, and that requires decent interactive performance, not something like a batch job. Most of the jobs run on the system are batch jobs, so they don't feel the I/O time for checkpoints and whatnot.
So, as I mentioned, Lustre is our tool of choice to handle these workloads. Lustre is a scalable, distributed, parallel file system; you can think of Lustre as a POSIX file system that's mounted on all the nodes in the cluster and provides coherency across the cluster, which is a surprisingly hard problem. But Lustre is hardware agnostic. Again, we like this because, you know, we like to deploy whatever the current new hardware out there is, or the best solution for it; we have lots of vendors that bid solutions for hardware.
For us, this stuff gets competitively bid, so having them be able to bid whatever they think the right solution is, is really, really flexible and powerful for us. Again, we like open-source software, right; Lustre is all open source. It was originally put under the GPLv2, and this has, you know, the usual advantages of open source: no one company controls Lustre, which again is good for us.
There aren't that many parallel file systems out there, but we really want to be able to work on the guts and, you know, develop this out in the open with other HPC sites, and this helps protect our investment, our substantial investment in storage and whatnot, right. We want to be sure that this isn't going to go away on us one day. Plus, there's a large, active development community around Lustre; it's really gotten a fair bit of traction. It is used probably predominantly on most HPC systems
these days, all right; seven of the ten top supercomputers, for a lot of years, have used Lustre. And, like I said, it's a POSIX-compliant file system, and while this might sound like a small thing these days, all right, there are lots of people moving away from this model, right; in the cloud in particular, providing POSIX semantics for a distributed system really just isn't done, because it's a hard problem. But our use case is a little bit special here. We care a lot about POSIX compliance because, as I mentioned, we're an old laboratory
that's been doing simulations for a long time, and we have a lot of old codes, and we'd really like to keep running those codes, right. There are scientists and code teams that have been working for decades, in some cases, on these codes, and they're really not so interested in rewriting the code or re-implementing it every few years because we found some great new model for how to do this, right. So POSIX compliance is a big deal for us. We'd like to shed it at some point, so I don't want to close the door on that.
Lustre has community organization, or at least Lustre community involvement, I should say. There are two organizations that do Lustre support and help development and organize things for the community. OpenSFS was set up in 2010, kind of as a place for vendors and the community to get together, much like OpenZFS is today. All right, everybody gets together, we talk about technical issues, we discuss the systems we're building, all those things.
EOFS is the European counterpart to this, right, but the same basic idea; it's kind of a smaller community, necessarily, but a very diverse community of people who run supercomputers like this. So, architecturally speaking, there's a lot of data on this slide and we don't really need to talk about most of it, but architecturally speaking, the thing to take away from Lustre is that, fundamentally, it's split into metadata and data, and this is kind of the important bit. So, way back when, when Lustre was created, all right.
We can get much better throughput for the system, which is really, really important to us, but at the same time we do care about it being a POSIX-consistent file system, right, so we have to do that kind of thing. So a balance was struck to keep them separate, and a complicated locking system was developed to get POSIX coherence out of this. The result is a really high-performing file system for data, and, you know, you get good metadata performance out of it too.
I should say that the metadata performance is being improved, and one of the recent features in Lustre is distributed metadata. In the picture here you've just got one MDS, the metadata server, but in newer versions of Lustre you can have more of these, right, and you can scale out the metadata, which is a big deal; like I say, we care about not just throughput but also interactive workloads.
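To make the metadata/data split a little more concrete, here is a toy sketch in Python; the field names, stripe layout and OST names are invented for illustration and are not Lustre's actual structures or API, but they show the division of labor: the metadata server holds names, attributes and the striping layout, while the bytes live as objects spread across the object storage targets.

# Conceptual sketch only; not Lustre code. The MDS answers "what is this file
# and where does it live"; the OSTs hold the actual data objects.
mds = {
    "/scratch/run42/checkpoint.h5": {          # hypothetical path
        "mode": 0o640,
        "stripe_size": 1 * 2**20,              # 1 MiB stripes (hypothetical)
        "osts": ["ost0003", "ost0011", "ost0027", "ost0045"],  # 4-way stripe
    }
}

def ost_for_offset(layout, offset):
    # Which object storage target holds the byte at this file offset?
    stripe = offset // layout["stripe_size"]
    return layout["osts"][stripe % len(layout["osts"])]

layout = mds["/scratch/run42/checkpoint.h5"]
print(ost_for_offset(layout, 5 * 2**20))   # the byte at offset 5 MiB lands on ost0011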
So this worked really well for us for a long time, ever since we deployed Lustre back in 2003, I think; we had a machine come in called MCR, which was one of the first deployments of Lustre, maybe the first deployment of Lustre at scale, and that served us really well up until about 2013, and then we started looking at Lustre and saying: maybe it's not gonna scale where we need it to scale. This is the Sequoia system, a high-level architectural view, and it's a big system, right.
This came up on the books as something we were going to field and deploy, and we were looking at how we were going to develop, or deploy, a 54-petabyte file system, a single file system, not 54 petabytes of small file systems, right; one 54-petabyte file system with almost a terabyte a second of throughput. And while Lustre scales really, really well, it probably wasn't gonna scale well enough for this, so we needed to look at why that would be so.
The reason we were worried in particular is that, even though Lustre handles lots and lots of servers pretty well, we're talking about numbers in the hundreds or maybe low thousands, right; you can deploy a thousand storage targets, right, but beyond that you start running into some limitations. The clients have to individually track each of these storage targets and manage some small amount of state for each one.
So the more storage targets you build up, the more work you're putting on the clients, and that's using up resources that might be used for compute instead of managing storage targets. So what we really wanted to do was build bigger storage targets with Lustre, and we couldn't do that, it turns out, for really, really good historical and technical reasons. As I think everybody here is probably well aware, writing a file system is hard, right.
In particular, writing a file system from scratch is hard, so Lustre focused on the areas where it was good initially, which was handling distributed workloads and parallelism and that kind of thing, and then built on existing technology for actually writing the blocks to disk and actually reading them out, which wasn't the key problem they were trying to solve initially, right. So we built on the legacy ext file system for Linux, and that worked great.
Initially, we extended it and added features that later went into ext4 on Linux; things like the multi-block allocator came out of Lustre, and we exposed some interfaces to get transactional object semantics out of ext, which was great, and this worked for a really, really long time. But the problem is, we also inherited the limits of ext, all right, and one of those limits is maximum file system size.
It used to be originally around two terabytes; that got pushed up to eight terabytes, but fundamentally it still isn't very, very big, particularly when you're talking about deploying a 50-odd-petabyte file system, right. You're talking about 7,000, 8,000 servers, something like that; too many, too many for the clients to manage.
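That server count falls out of simple arithmetic on the numbers just mentioned:

# Back-of-the-envelope check using the figures from the talk.
filesystem_pb = 54    # one ~54 PB file system
max_ost_tb = 8        # ext-era cap on a single storage target

print(f"~{filesystem_pb * 1024 / max_ost_tb:.0f} storage targets")
# -> ~6900 targets, in line with the 7,000-8,000 servers mentioned above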
And from our point of view, this was the first machine coming in at this size, but they were only gonna get bigger, right; it's not like this was the end of the road.
The next one's gonna be twice as big, all right. So, ZFS to the rescue; we finally get to the ZFS bit after the background up here. ZFS is just the perfect fit for Lustre; this is the bottom line, right. This is exactly the thing we needed to solve this problem on our storage back end, right. It has all the things we need, right.
It's scalable; like I just said, that's a big deal for us, because we always want to build them bigger. It's manageable, another big deal for us, because, you know, we're gonna have thousands of these things and we want them to be as easy to manage as possible. Performance: performance actually works really well for us with ZFS, because it's a copy-on-write file system, which isn't something we had before. Like I said, a lot of our workload is writing out checkpoints to disk.
So it's a lot of writing, and it's a lot of writing from a lot of random processes doing random writes, and the fact that ZFS is copy-on-write lets it serialize all that to the disks, which is great for performance, right. Whereas before, with something like ext, we'd be writing blocks all over the place, just based on where the allocator wanted to put them, with ZFS we can stream them. That's great.
It kind of ticked all the boxes for what we wanted for a back-end file system for Lustre. But, like all things, there were a couple of problems, all right. Problem number one: the Lustre server storage layering had to be redesigned a little bit. While the original layering did consider having multiple back ends instead of just ext for Lustre, the reality of the situation is that, by only having one for a long period of time, the layering got broken here and it got broken there, and things ended up in the wrong layers.
So this is work that was actually tackled by the core Lustre developers, and it took a lot of releases to refactor this, because it turned out that a lot of these assumptions went pretty deep in the stack. But eventually, you know, the stack got refactored for Lustre; the core Lustre developers did this work over a bunch of different companies, actually. Originally this work started while Lustre was a project primarily run out of Sun, later acquired by Oracle, and then it moved to another company called Whamcloud.
But this work continued over all of those various companies. Problem number two: we didn't have ZFS on Linux. That was kind of sad, because, you know, Lustre is a Linux file system, and, you know, we wanted to use ZFS because we saw how cool it was. But there's good news, right: we actually didn't need all of ZFS for Lustre. This is maybe something that's not so clear from the higher layers, but Lustre actually ties in at the DMU layer in ZFS; it's a first-class consumer of the DMU.
It doesn't layer on top of the POSIX file system; it doesn't layer on a volume. The interfaces provided by the DMU are pretty much exactly what Lustre needs, so we tied in there, and this is work that was done at Livermore, actually, to bring ZFS to Linux. And, as you all know, we didn't stop there, right; we saw how cool this was and we wanted the rest of it, so we implemented the POSIX layer too, and the volume manager. So, for the Lustre ZFS implementation:
technically speaking, the on-disk format for Lustre is compatible with the POSIX layer. We did this for a couple of reasons. We wanted to be able to debug the file system pretty easily, and it was convenient to be able to mount the same data set that Lustre is using to store its objects as a file system, and rummage around and look for stuff, right. You can use all the normal system utilities on it:
hex editors, dd, whatever you want, because you can easily inspect the file system. And we got to leverage all the features of ZFS from the, you know, DMU down, basically. So it turns out things like compression were actually a really big win for us early on. This surprised us initially, because a lot of the HPC workloads depend on I/O libraries, and the libraries have compression algorithms built into them, but they didn't work as well as you might have expected.
Actually, we got huge improvements in compression on the file systems, at no cost to performance, just by turning on compression, so that was a cool win.
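For anyone who wants to try the same thing, turning compression on for a ZFS dataset is a one-line property change. The sketch below is illustrative only: the dataset name is hypothetical, and lz4 is just a commonly used choice, not necessarily what these systems ran.

# Illustrative: enable compression on a (hypothetical) dataset and check the ratio.
# Requires the zfs CLI and appropriate privileges.
import subprocess

dataset = "tank/lustre-ost0"   # hypothetical dataset name

subprocess.run(["zfs", "set", "compression=lz4", dataset], check=True)
ratio = subprocess.run(
    ["zfs", "get", "-H", "-o", "value", "compressratio", dataset],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(f"{dataset} compressratio = {ratio}")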
Like I said, we're layered on top of the DMU, and this work exposed a lot of assumptions we had on the client side in Lustre. You know, things like the maximum object size happened to be hard-coded to a couple of terabytes; that was an ext limit. The block size we had assumed was the page size over the years, right, because it always had been with ext. It turns out it's not always the page size with ZFS, all right; it can be much, much bigger.
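A concrete way to see that mismatch (again an illustrative sketch; the dataset name is hypothetical) is to compare the kernel page size with a dataset's ZFS recordsize, which defaults to 128 KiB and can be set larger.

# Illustrative: page size vs. ZFS recordsize on a (hypothetical) dataset.
import os
import subprocess

page_size = os.sysconf("SC_PAGE_SIZE")     # typically 4096 bytes on Linux
recordsize = subprocess.run(
    ["zfs", "get", "-H", "-o", "value", "recordsize", "tank/lustre-ost0"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(f"page size = {page_size} bytes, ZFS recordsize = {recordsize}")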
So at this point, we've deployed pretty much all ZFS and Lustre at Livermore in support of our HPC workloads. We've got about 10 file systems deployed, with maybe 100 petabytes in production. There are new file systems being deployed at the moment, actually, but we're pretty much all in on the ZFS and Lustre thing, and it's been working out really, really well for us. But we're not the only ones, while we were the only ones to do this initially.
It turns out that a lot of the other HPC sites have decided that this is really a good idea too and they should be doing the same thing. It took a little while, because it takes a long time to design, build and deploy an HPC system, but Los Alamos now has two 14-petabyte file systems of the same design, the San Diego Supercomputer Center has 7-petabyte file systems, and there are vendors out there now selling these systems, right.
Anybody who wants to can buy them, pretty much, and there are lots of smaller deployments at universities and research labs, which are cool, but, you know, not the big file systems, and it's mainly the big file systems where you need ZFS, right. So, Aurora: we're not done. While Sequoia was, like I say, the first step along this route, Aurora is a big system designed for the 2018 timeframe, right. It is going to be three or four times bigger than Sequoia, depending on how you measure it, right, and it's got 150 petabytes in the file system.
It's going to be a gigantic machine, right, and it turns out that, because of the work that was done with ZFS, Aurora is going to be built on Lustre and ZFS; it's one of the reasons they can build the file system for this machine. So this is driving quite a lot of the current development, at least on the HPC side, for ZFS, right. There's work going on to improve Lustre's integration with ZFS in some areas here, and then there are, you know, features that are being added to the core ZFS code to improve HPC workloads.
You know, things that might not seem so important on smaller systems are really important for big systems. Things like inode quota accounting, right: when you've got a billion or 10 billion files in your file system, being able to get those numbers, and to quickly get that information, is really important; like, which user out there has a billion files?
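Once that accounting is in place, the "which user has a billion files" question becomes a simple query. A sketch of how that might look with the ZFS CLI is below; the dataset name is hypothetical, and per-user object counts only appear on releases that include the object accounting feature mentioned here (older releases report space used per user but not file counts).

# Illustrative: per-user accounting on a (hypothetical) dataset.
import subprocess

out = subprocess.run(
    ["zfs", "userspace", "tank/lustre-mdt0"],
    capture_output=True, text=True, check=True,
).stdout
print(out)   # one row per user; inspect or sort to find the heaviest users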
So I have a short video, if you're interested, of about a minute. Some of this stuff is kind of abstract, so let me show you a concrete example of what Sequoia looks like, or particularly Grove; this is the storage file system we built for Sequoia. We happen to, you know, give each one of these clusters a name, right, and the file system gets a name, and this is a time-lapse of the build-up of the whole file system. So this is the 50-odd-petabyte file system getting built.
We don't actually publish any of that stuff. We'd like to, but it's actually a lot of work to gather that data and number-crunch that data, so no, we don't publish anything. We do track a lot of it internally for drive failures, but it's not available.
Good question. We've used a bunch of things over the years, ranging from 10 GigE and 40 GigE to InfiniBand, all right. Most of our stuff at the moment is InfiniBand attached, because we get really good bandwidth out of it. So it's not uncommon for, like, one of those storage nodes to have the previous-generation QDR InfiniBand links coming out of it, and, you know, well, Grove had 768 of those nodes, right.
Yeah, I'm sorry, the question was what the layout of the drives is like on the systems. So the individual servers for the Sequoia system are attached to NetApp JBODs, actually; well, they aren't quite JBODs: those NetApps are exporting volumes that we're running ZFS on top of. At the moment the NetApp is taking care of the RAID for those, on the Sequoia systems, but the new systems we're bringing in actually take that next step, right, where we're going to a full JBOD system with ZFS, and it's going to handle the RAID-Z there.
So the question was: if it was easy to publish the drive data, would we? I think in general we'd like to be open, so if it was easier, we'd certainly consider it. I don't know that our rules permit us to do that in a lot of cases, certainly not for some systems, which are classified, but for other systems, which are not, yes, maybe there's some way to at least look at that.
Flash: so, for systems like this, we're actually looking at a solution for that. I'm sorry, the question was how do we use flash, and the answer is that at the moment we don't, right. So we don't use flash at the moment in any of the Sequoia systems, but we're looking at it for new HPC systems, in a technology called a burst buffer.
Yeah, so we've actually got a lot of work going on right now, with Intel actually, to get that done and to get fault management fully integrated on Linux and working. So there are pull requests outstanding and development work being done to make that work smoothly, because that is a big concern, and that's absolutely one of the reasons we didn't do full JBOD systems initially: it's a lot of work to get that right.
Yeah, yes, I should have mentioned the name for that. So that's up on GitHub now; it's called the Flux project, or Flux framework, I suppose; you can find it on GitHub. It is certainly a work in progress that's being developed. It's designed for scalability for our largest systems, and the hope is to replace SLURM with it, but it's still something that's being developed. Check out the project, you know, ask questions there; we'd love to get more attention on it.
So I should have mentioned, Aurora is a system that's not being deployed at Livermore; it's actually gonna be sited at Argonne, so just outside Chicago will be the Aurora system. DOE likes to spread the supercomputers around, pretty much; this $200 million procurement is a good one for that site. I don't know offhand what their design is for the storage, but I know there are people here who can probably answer that question.
Yeah, so the intention is to deploy these as five 8-plus-2, or six 8-plus-2, RAID-Z groups. We could go bigger, but 50 drives is about reasonable for us on a server; it's one enclosure's worth of drives. The nodes actually see quite a bit more than that: they also see all the drives from their failover partner, because we do do failover around these systems, and often they'll see paths for multipath devices; we'll have multipath to every one of these devices. So it's 60-odd drives, doubled for the failover partner's drives, times two paths.
So you might see 240 different devices on the system, and, you know, Linux holds up under that, but it does strain a little bit when you've got that many devices attached to it. Certainly Linux has gotten much better about that, I should add, in recent years; it used to be that it would just fall over and die, but not anymore.
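That 240 figure is just the layout from the answer above multiplied out:

# Device-count arithmetic from the numbers in the answer above.
drives = 60      # roughly one enclosure's worth of drives on a server
failover = 2     # each node also sees its failover partner's drives
paths = 2        # multipath: two paths to every device

print(drives * failover * paths)   # -> 240 device nodes visible on one server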
I don't know that there's a public roadmap for when those features will ship or anything like that, but all the development for those features is being done out in the open; well, at least a lot of those features are being done out in the open, in pull requests, and if you want to join the conversation or, you know, help with getting that stuff done, that's a possibility.
GitHub is probably the place to go: either the OpenZFS project on GitHub, things will be posted there, or, for features that are being developed first on Linux, ZFS on Linux. Just look at the open pull requests or the issues, search for what you're interested in, and you'll probably find a ticket that will point you in the right direction.