From YouTube: Sage Weil Presents An Intro to Ceph for HPC
Description
In this video from the Lustre User Group 2013 conference, Sage Weil from Inktank presents: An Intro to Ceph for HPC.
Learn more at:
https://www.gaveledge.com/SFS1301/agenda
and
http://inktank.com
All right. Sage Weil is the creator of the Ceph project. He originally designed Ceph as part of his PhD research in storage systems at the University of California, Santa Cruz. Since graduating he's continued to refine the system with the goal of providing a stable, next-generation, distributed storage file system for Linux. Sage is a co-founder of DreamHost, and as a teenager he created and sold WebRing. I want to make some snarky remark about the fact that you're not a Lustre guy, but welcome.
Thank you. Hi, I just wanted to spend a few minutes talking about Ceph to you guys today, give a little bit of an introduction to how it relates to HPC and, I think, the problems that you all face, and try to frame it in the context of how it is similar to Lustre, how it's different, and why you may or may not be interested. So I guess the question in your mind should be: what is Ceph, at a high level?
C
We
describe
staff
as
a
distributed
storage
system,
I
like
to
contrast
it
with
the
idea
of
a
parallel
file
system.
By
distributed,
we
mean
that
it's
a
system,
that's
built
to
be
reliable,
but
it's
built
out
of
unreliable
components.
So
it's
designed
from
the
ground
up
to
be
fault,
tolerant
with
no
single
points
of
failure.
Part
of
that
means
building
it
out
of
commodity
hardware.
C
So
thank
you
know
regular
rackmount
servers
from
Dell
or
whoever
else
you
can
use
expensive,
arrays
controllers
and
specialized
networks,
but
those
things
aren't
required,
we'll
try
to
make
use
of
them,
but
they're
not
sort
of
an
integral
part
of
the
architecture.
It's
designed
for
very
large
scale,
so
we're
from
tens
of
servers
to
tens
of
thousands
of
nodes
and
at
those
large
scales
systems
by
definition,
or
in
most
cases
at
least
need
to
be
heterogeneous.
C
I'm
usually
buy
you
a
first
pod
bite
of
one
type
of
hardware,
six
months
later
by
another
two
or
three
petabytes,
and
so
you
wanna
be
able
to
buy
the
latest
iteration
as
you
do
that,
so
these
clusters
are
sort
of
inherently
dynamic,
they're
growing
over
time
on
mixed
hardware
and
so
forth.
So
from
the
ground
up,
we
design
stuff
to
be
able
to
have
incremental
expansion
or
contraction,
shahnaz,
II,
sort
of
D,
provisional
breaking
hardware
and
so
forth.
C
So
that's
sort
of
how
how
step
is
built
I
suppose
what
set
provides
is
a
unified
storage
platform
so
from
at
the
lowest
layers.
It
provides
an
object
and
compute
source
platform
based
on
distributed,
replicated
highly
available,
object-based
storage
and
then
on
top
of
that
sort
of
underlying
infrastructure.
That
stuff
provides.
We
provide
a
number
of
different
services,
so
one
of
them
is
a
restful
object,
storage
service
based
on
the
Amazon
s3
and
Swift
api's.
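As a concrete illustration of that RESTful layer, here is a minimal sketch of talking to the Ceph Object Gateway through its S3-compatible API with the boto3 library; the endpoint URL, credentials, and bucket name are placeholders for this example, not anything from the talk.

```python
import boto3

# Point a standard S3 client at a Ceph RADOS Gateway endpoint.
# The endpoint, keys, and bucket name are hypothetical placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:7480",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

s3.create_bucket(Bucket="demo-bucket")
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"hello from ceph")
print(s3.get_object(Bucket="demo-bucket", Key="hello.txt")["Body"].read())
```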
There's a block storage component that gives you a reliable virtual disk, similar to what you'd get out of a SAN, and that's integrated with the Linux kernel and with the KVM hypervisor, so people setting up private clouds use this frequently. And then finally, sort of the most exciting piece, is a distributed file system that's designed to give you POSIX semantics and be highly scalable for HPC workloads.
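For the block piece, a minimal sketch of creating and writing a RADOS Block Device image through the python-rbd bindings might look like the following; the pool and image names are assumptions for illustration, and it presumes the rados and rbd Python packages that ship with Ceph are available.

```python
import rados
import rbd

# Connect to the cluster and open an I/O context on a (hypothetical) pool.
cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbd")

# Create a 1 GiB virtual disk image and write to its first bytes.
rbd.RBD().create(ioctx, "demo-image", 1 << 30)
image = rbd.Image(ioctx, "demo-image")
image.write(b"boot sector goes here", 0)
image.close()

ioctx.close()
cluster.shutdown()
```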
That's actually where Ceph originated: some research money from the Department of Energy in the mid-2000s to look at, at that time, petabyte-scale storage systems. Then, as the system grew, we architected this entire thing, and it came to include, you know, object-based storage and block-based storage and so forth.
Ceph is open source; the server side is based on the LGPL. The kernel-side components for the block device and the file system are, of course, GPL, because they're in the mainline Linux kernel, where they've been for the last couple of years. So, in a nutshell, that's what Ceph is: it's a storage system that presents file access, block access, and object access.
This is an architecture picture that we use frequently. The key idea with Ceph is that the highly scalable, highly available piece is just the RADOS component at the bottom. That gives you reliable and scalable object-based storage, and then on top of that object substrate we provide a number of different services, be they RESTful object storage, these virtual disks, or the Ceph distributed file system, which has its own metadata service and so forth to build out the namespace. But it's all based on RADOS, the reliable object store. Ceph does a number of things differently, though.
C
So,
looking
specifically
in
the
context
of
how
people
typically
set
up
lustre
systems
versus
house
s,
systems
are
built
in
a
sort
of
a
conventional
H,
a
environment.
You
have
some
sort
of
access
network.
You
typically
have
a
redone
heads.
Oss
is
and
sort
of
lesser
case
that
are
then
talking
to
some
back-end
disk
array.
That's
designed
to
be
highly
reliable.
So
at
a
high
level,
your
striping,
across
reliable
things,
reliable
disk,
arrays
Seph,
is
sort
of
entirely
different
from
that.
C
The
assumption
that
we
come
from
is
that
any
component
in
the
system
can
fail
and
we
don't
want
to
have
to
sort
of
deal
with
the
difficulties
of
configuring,
failover
pairs
and
so
forth.
So
the
idea
here
is
that
we're
striping
over
unreliable
things,
but
those
unreliable
things
are
designed
to
be
intelligent
so
that
they're
handling
the
consistency
and
coordination
and
replication
of
data
across
those
different
storage,
nodes
and
stephs
case
there's,
usually
a
front-end
network.
There can also be a back-end network that handles all the replication and data migration traffic, although that's optional. But the key idea is that the servers are coordinating replication and recovery. A typical deployment will look something like this: you'll have a node that has a whole bunch of disks, and on top of each of those local disks you'll have a local file system, because we don't want to reinvent the wheel with block allocation tables and so forth.
C
So
we
like
to
use
butter
FS,
but
people
usually
actually
use
x
fest
for
stability
reasons
you
can
also
use
x
for
ZFS
and
principle
should
work.
Although
we
haven't
tested
it
recently,
but
typically
have
a
whole
bunch
of
these
things
in
a
single
rackmount
server,
you
know
maybe
15
disks
or
something
like
that,
and
then
you
have
a
whole
bunch
of
these
servers.
Making
up
your
storage
cluster
tens,
hundreds,
thousands
one
of
the
key
problems
in
these
systems
is
highly
distributed
at
ax,
so
at
the
object
layer
for
radius,
one
of
the
basic.
The basic idea is that you take all of your objects and you hash them into logical buckets that we call placement groups, and then each of these placement groups is replicated on multiple servers in the cluster using an algorithm called CRUSH that makes sure your replicas are separated across different racks and so forth. When you distribute all of your different placement groups this way, you get a randomized, uniform distribution of data across all of your storage nodes.
C
They
can
make
sure
that
that
placement
group
is
replicated
to
another
node
in
the
cluster,
redistribute
the
data
using
peer-to-peer
type
protocols
all
in
a
fully
consistent
way.
So
that
later
on,
when
the
client
comes
back
and
says,
I
need
to
read.
You
know
this
object.
Foo
it'll
just
recalculate
the
location
that
data
based
on
the
new
state
of
the
cluster
and
I
don't
get
the
correct
answer.
So
this
is
sort
of
the
key
idea
that
makes
the
staff
object,
storage
layer
scale
to
you
know,
tens
of
thousands
of
nodes
was
very
minimal
central
coordination.
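To make that placement idea concrete, here is a deliberately simplified Python sketch of the hash-into-placement-groups step and a stable, pseudo-random mapping of placement groups onto servers. It is not the real CRUSH algorithm, just an illustration of how every client can compute data locations from the cluster state instead of asking a central directory; all names and counts are made up.

```python
import hashlib

def pg_for_object(name: str, pg_count: int) -> int:
    """Hash an object name into one of pg_count placement groups."""
    digest = hashlib.md5(name.encode()).digest()
    return int.from_bytes(digest[:4], "big") % pg_count

def osds_for_pg(pg: int, osds: list, replicas: int = 3) -> list:
    """Toy stand-in for CRUSH: deterministically rank the live OSDs for a PG."""
    ranked = sorted(osds, key=lambda osd: hashlib.md5(f"{pg}:{osd}".encode()).hexdigest())
    return ranked[:replicas]

# Any client with the same OSD list computes the same placement, with no lookup table.
cluster = ["osd.0", "osd.1", "osd.2", "osd.3", "osd.4"]
pg = pg_for_object("foo", pg_count=128)
print(pg, osds_for_pg(pg, cluster))
```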
C
There
isn't
somebody,
that's
saying
you
read
this
data
and
moved
over
there.
Instead,
the
central
coordination
is
simply
saying
this
note
is
up
and
this
node
is
down
and
everybody
is
sort
of
responding
by
moving
moving
the
data
around
so
liberate.
Us
is
the
low-level
library
that
lets
you
sort
of
access
this
this
distributed
storage
layer.
That's
you
know
it's
a
standard,
shared
library
of
bindings
in
every
language
you
can
imagine,
but
in
contrast
to
many
other
systems,
it
gives
you
a
very
rich
object,
API.
C
So
in
most
object
systems
an
object
is
just
a
bunch
of
bytes
and
maybe
some
extended
attributes
and
Seth.
You
can
store
a
lot
more
than
that,
so
you
can
soar
keys
and
values
inside
an
object
in
an
efficient
way,
I'm
think
Berkeley,
DB
tables
or
no
sequel
table.
Something
like
that.
We
reach
each
object,
is
a
log
logical
containers
at
keys
and
valleys.
They
can
store
lots
of
them
and
get
efficient
insertion
solutions.
Range
queries
stuff,
like
that.
It
supports
atomic
single
object
transactions,
so
you
can
do
things
like
atomic
compare-and-swap.
C
You
know
update
the
bytes
and
the
keys
and
values
in
an
atomic
fashion
and
they'll
be
consistently
replicated
and
distributed
across
a
cluster
in
a
safe
way.
There's
all
this
infrastructure
to
support
snapshots
and
that's
used
by
the
block
layer
and
the
filesystem
give
you
know
per
disk
image
and
/
directory
snapshots
in
the
system.
That's
all
supported
at
the
object
layer,
but
one
of
the
more
exciting
features
is
that
seff
allows
you
to
embed
code
into
the
object
storage
demon
to
actually
implement
your
own
functionality.
C
So
you
can
imagine
if
you
building,
you
know
the
next
flicker
or
something
you
might
embed
code
in
your
object,
store,
that'll,
manipulate
images
to
generate
thumbnails
and
so
forth.
So
you
can
send
an
object
method,
call
to
the
object,
store
and
I'll
actually
perform
that
computation
with
the
data
without
having
to
read
that
read
and
write
the
data
across
the
network
and
finally,
there's
some
infrastructure
for
inter
client
communication
and
coordination
for
locking
and
so
forth.
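As a rough sketch of the embedded-code idea from the client side: newer python-rados bindings expose an execute() call for invoking object-class methods on the OSDs. The "imaging" class and "thumbnail" method below are purely hypothetical, and the corresponding server-side object class would have to be written and loaded on the OSDs separately.

```python
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("photos")  # hypothetical pool of image objects

# Ask the OSD holding "cat.jpg" to run a (hypothetical) server-side method,
# so the full-size image never crosses the network.
ret, thumb = ioctx.execute("cat.jpg", "imaging", "thumbnail", b"256x256")
ioctx.write_full("cat_thumb.jpg", thumb)

ioctx.close()
cluster.shutdown()
```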
C
So
you
can
do
a
lot
with
this
dis
object
store
and
we
use
a
lot
of
these
features
when
we're
building
these
higher
level
services.
On
top
of
that,
so
one
of
the
more
contentions
contentious
things
I'd
like
to
say
is
that
as
I
think
as
a
community
as
we
move
toward
exascale,
my
assertion
is
that
successful
sale
architectures
are
going
to
need
to
transcend
or
replace
POSIX
I'm
sort
of
the
old
paradigm
of
having
you
know.
The old paradigm of this weird file and directory structure, with these very strange oddities around the semantics of POSIX, is not really going to scale well when you start talking about exascale, simply because the hierarchical model does not distribute well. But further, I think that successful architectures are going to need to blur the line that we currently have between compute and storage.
A lot of the processes that we have are manipulating data locally, operating on a small piece of data, and part of those distributed processes are taking data from multiple locations, comparing them, and doing some sort of higher-level calculation. Currently, all of our distributed architectures sort of ignore this: they assume that our storage is either always far away or always nearby, and they don't really recognize the distinction between those two kinds of processing.
I think that a successful, scalable architecture needs to recognize that distinction, so that you can ship the operations that operate purely on local data to the data and perform them there, and for the processes that need data from multiple locations you can pull the data from those locations and do the work locally. That's something that I think hasn't really been resolved in this area.
Finally, I think that fault tolerance needs to be considered a first-class property of these architectures. As we push the scale of our existing architectures, when we start building things like burst buffers and so forth so we can deal with these huge checkpoints across millions of cores, it doesn't make a whole lot of sense, in my humble opinion. That being said, POSIX is going to be here for some time; it's not actually going anywhere. So we continue to build systems that will support POSIX.
So we can run all these legacy codes and so forth, and to that end, systems like Lustre and Ceph will continue to be distributed file systems that can actually support those applications. CephFS builds a POSIX namespace on top of RADOS. We have a separate cluster of metadata servers that handle the file system namespace in a distributed fashion. We store all the metadata in objects, so we can leverage the fact that we already have reliable, redundant data storage, and we provide strong consistency and a stateful client protocol.
The high-level architecture looks very similar to what Lustre does: the clients are talking to metadata servers to deal with the file system namespace, and they're talking to the object storage nodes to actually read and write file data. The difference is that you have lots and lots of metadata servers. The challenge there is that you have a single hierarchy, a tree, and it's non-trivial to decide how to distribute those directories across multiple servers.
You can't simply hash directories across many nodes and expect to get good performance. So what Ceph does is dynamically monitor the heat map of the file system hierarchy and determine appropriately sized portions of the file system tree so that it can migrate them to different servers, and it does this dynamically over time by periodically doing a load-balance exchange and so forth.
As your workload shifts over time, as a new batch job starts up, it will identify which parts of the tree are popular, take an appropriately sized piece, and move it to a different metadata server, migrating the cache contents from one MDS over to another MDS and letting the clients continue in a totally transparent fashion.
It has a number of other interesting features that you don't find in most other file systems; because we built the file system namespace from the ground up, we could build these into the infrastructure. One of those features is recursive accounting: the metadata servers keep track of recursive directory stats for every directory in the file system.
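That recursive accounting shows up on clients as virtual extended attributes on directories, with names like ceph.dir.rbytes and ceph.dir.rfiles. Here is a small sketch of reading them from a mounted CephFS tree, where the mount path is just a placeholder.

```python
import os

# Hypothetical directory on a CephFS mount.
path = "/mnt/cephfs/projects"

# Recursive byte and file counts maintained by the metadata servers.
rbytes = int(os.getxattr(path, "ceph.dir.rbytes"))
rfiles = int(os.getxattr(path, "ceph.dir.rfiles"))
print(f"{path}: {rbytes} bytes in {rfiles} files (recursive)")
```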
It also supports snapshots. The motivation is that once you start talking about petabytes and exabytes of data, it doesn't really make sense to have a single snapshot and data-retention policy for the entire system; you need to be able to snapshot different directories and different data sets. So in Ceph you can actually create a snapshot on any directory in the system and it will affect just that subtree, and you can create and remove the snapshots using standard bash-type commands.
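Those standard commands work because CephFS exposes snapshots through a hidden .snap directory inside each directory. A short sketch, assuming a CephFS mount at a placeholder path and snapshots enabled on the cluster:

```python
import os

dataset = "/mnt/cephfs/projects/run-42"   # hypothetical directory on a CephFS mount

# Creating a snapshot is just making a named directory under .snap ...
os.mkdir(os.path.join(dataset, ".snap", "before-cleanup"))

# ... the snapshot is then browsable read-only, and removing it is an rmdir.
print(os.listdir(os.path.join(dataset, ".snap")))
os.rmdir(os.path.join(dataset, ".snap", "before-cleanup"))
```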
All right, so the real difference is that Lustre has been tuned heavily and successfully over the last decade to run on high-end disk arrays and high-performance networks. Ceph has not had the luxury of that tuning; it's really designed to run on smaller nodes with directly attached disks that are less reliable. That said, it's possible to run Ceph on Lustre-style hardware.
Typically in a Ceph environment you would actually buy more nodes with more disks and replicate across those, but it would be overall much less expensive because they're commodity pieces instead of what you get from the high-end array vendors. Ceph can also utilize flash and NVRAM directly, whereas usually those components are buried deep within the disk array where you can't access them directly.
We did some tuning as an experiment on some hardware at Oak Ridge National Lab. We basically took some existing OSTs (OSSes, I guess) backed by a DDN disk array; I actually have no idea what kind, except that roughly 12 gigabytes a second was supposed to be the max we could get from it. When we were initially turned over access to the cluster and just ran our naive installation, we got 100 megabytes per second out of it.
C
We
were
getting
five
point
five
days
per
second,
which
was
actually
11
because
of
the
way
that
stuff
was
journaling,
so
we're
roughly
saturating
the
disk
array,
which
was
kind
of
nice.
But
there
were
a
couple
sort
of
caveats.
One
is
that
the
way
that
Steph
is
writing
data
at
the
disks?
It's
actually
doing
double
rights,
because
it
has
a
right
ahead
journal
and
then
actually
writes
the
data
to
the
file
system.
That's
designed
to
use
to
be
used
with
conjunction
with
flasher
and
veeram.
It works the same way a NetApp disk array would do it, but there that's usually buried inside the array, so we were actually writing twice to the array. The other thing is that we were using IP over InfiniBand, because we don't have native InfiniBand support in Ceph yet. And it was a long series of annoying things that we had to change: configuring the InfiniBand network properly, the RAID LUNs on the DDN, choosing which type of disk the journals and the data went to, reconfiguring the LUNs, tuning the stripe ratios, fixing TCP auto-tuning and readahead, and all sorts of annoying things that Mark Nelson can tell you about in much more detail. So the good news is that once we actually worked through all these annoying issues, we could get respectable performance. The bad news is that you can't simply plug it in and expect to get good numbers; but I guess you're probably used to that same issue with Lustre as well.
C
So
that's
that's
mostly.
What
I
want
to
talk
about
a
little
bit
more
information
if
you're
interested
in
trying
staff
or
think
it
might
be
suitable
for
your
use
cases
or
workloads,
whether
it's
HPC
or
distributed
computation
or
whatever
step
calm,
is
all
sorts
of
resources
about
how
you
can
get
involved
in
the
community?
That's
it
out
and
so
forth.
B
D
C
C
C
It wasn't actually a Lustre test; it was testing Ceph on hardware of the sort you would typically use to run Lustre. So it was an OSS server that was bought to run Lustre, a typical Lustre OSS and a DDN array that was usually used to back Lustre. So this is the type of hardware that you'd buy for a Lustre configuration: an expensive, big, fast, awesome disk array and then a bunch of head nodes, which isn't the usual Ceph configuration.
So I think that an exascale architecture shouldn't be based on POSIX. I think if you were to take a clean slate and ask how we would actually build a machine that's big and efficient, it wouldn't look anything like what we have today; that's sort of my contention. All that being said, in the systems that we actually build, because we're migrating all these legacy codes that are, you know, poorly written and so forth and don't actually need POSIX, we're taking a bit...
Whether Ceph would be jitter-free? Yeah, I'm not saying that Ceph is any different with the burst buffer and so forth. I think that a more interesting exascale architecture would be one that is based on objects, where you're storing computation that is run directly on the objects and computation that is aggregating the results from different objects, and it would be some sort of, you know, more cloudy infrastructure that is actually running this computation on those nodes and aggregating results and writing them to new objects and so forth.