Description
"Scaling Storage with Ceph", Ross Turk, VP of Community, Inktank
Ceph is an open source distributed object store, network block device, and file system designed for reliability, performance, and scalability. It runs on commodity hardware, has no single point of failure, and is supported by the Linux kernel. This talk will describe the Ceph architecture, share its design principles, and discuss how it can be part of a cost-effective, reliable cloud stack.
So Ceph is an open source distributed storage system, and we recently did a rebranding because we launched a new company called Inktank, which is a support and services company around Ceph. Many of you may be familiar with Ceph but not with the visual look we have now. We started the company because Ceph is getting to the point where we need to figure out how to start helping people put it into production. So this is the Ceph stack.

It's got a lot of different parts and a lot of different components, but it's all based on something called RADOS, which is the Reliable Autonomic Distributed Object Store. On top of RADOS we've built a series of applications that allow you to use the distributed cluster in a few different ways, and I'm going to go into more detail on those, but first I think it's probably best to start at the beginning.
At the beginning — and I have this nasty habit when I do presentations of going all the way back to the beginning of, like, humanity and trying to make it mean a lot more than it does, and that's what I'm doing here. At the beginning of information storage, I think the first example of how people stored information was probably cave paintings.

Cave paintings, you know — that's how people recorded history and recorded stories and things like that. We've come a long way since then. At some point we figured out how to write, and writing was kind of like cave painting, but a little bit better: you could put a lot more information inside of a book than you could on a cave painting. So already we've noticed in human history that we've outgrown the cave, and now we're beginning to fill up books — and a book holds, you know, maybe about a thousand cave paintings' worth. There's no definitive number.
I mean, I don't know — it's really difficult to say, but that's about the right ratio, and I'm mentioning it because later on, as I move through the history of storage, we see this sort of ratio happening all the time, where we increase the amount of information we can store by a large factor. So people began writing a lot and capturing a lot of information — so much so. But the problem with writing is that writing is kind of time-consuming. It is what I would call, in modern terms, a really low-bandwidth medium: it takes a lot of time to write, it takes a lot of time to read, and the accuracy is not so great — I mean, I can't read half the words on this page. Some people figured out how to industrialize writing, how they could use machinery and technology to make the written word more effective, more legible and more prolific. But at some point something really magical happened.
This new method of storage was mechanical in nature. It no longer was something you could pick up in your hands and read — just human being to device. You required technology to allow you to interface with the storage that we'd created, and that was kind of the first time we noticed that. We started with a nice, simple interface between human being and rock — you know, I'm being a little bit cute — and then human beings began to interface with ink and paper. But then we saw a huge divide, when human beings began to interface with their information through an intermediary, which is the computer, and that caused a lot of technology to need to be built. In building this technology we get computer programmers, right — computers need people to make them work. For the first time in human history, we require a specific type of person whose entire job is figuring out how to allow humanity to interface with its information.
So we have this thing in the middle now, between the humans and their information, and the other thing that happened at the same time is that we started storing information in zeros and ones — in binary, or in some other non-representational form. So this picture of, I don't know, this frog on the dune buggy or whatever it is, ends up being a series of ones and zeroes which we can't read without the help of computers. About this time is when throughput became really important: when we had computers reading and writing our information for us, suddenly it mattered how quickly that happened. Latency began to matter; the rate of reading and writing began to matter. That sort of thing started to matter, and I don't know if that led to the invention of the hard drive, but I suppose the hard drive must have come around at just about the right time, because we also saw, again, a thousand-to-one jump between magnetic tape and hard drives, and I think that it changed things.
It changed the world, right, because hard drives are a lot better than tape. Today, in modern terms, we'd put the same chart up and say that solid state disks are a lot better than, you know, spinning disks; but at the time, this was just an outrageous invention. Because we invented technology that allowed us to store another level of magnitude of information, we've gone from kilobytes to megabytes to gigabytes.

It became really obvious that we needed some kind of technology that allows us to organize our information, so that's when we saw the emergence of file systems. File systems are a hierarchy of information, with directories and inodes and things like that. They allow you to organize your information and also store a little bit of metadata.
For example, if my frog picture were a file, we'd know a few things about it: it was owned by me, it was created on August 12th, it was last viewed on the 17th, it's about 42 kilobytes, and it's readable by everybody and writable by me — those are the permissions we have on this file, which then gets positioned into a tree, in an absolute way. So this allows us to store even more information than we did before.

Then humanity outgrows the hard drive — we just had too much data. We can't fit everything we need to fit on one hard drive, and I'm not talking about all the information for humanity; I mean my movies won't fit on a single hard drive anymore. There is no hard drive big enough to store all of the data that I personally have collected, so we needed to figure out a way to have storage that was bigger than one hard drive.
So now we're figuring out how to have a computer interface with multiple disks — a human being interfacing with a computer that's interfacing with multiple disks at the same time. But that didn't solve every problem, because it's not just my movies that need to be stored: what if my friends want to watch my movies, or my family wants to watch my movies? We need to figure out how to allow multiple people to interface with multiple disks through a computer.

So computers are getting smarter the whole time; they're getting more intelligent because they're having to deal with more advanced situations. And actually it looks a little bit more like this — it doesn't look like this, it looks a little bit more like this — because you have tons and tons of humans interfacing with tons and tons and tons of disks, and guess what: we've found another bottleneck. This computer in the middle is looking awfully tired at the moment.
And these days you're really looking at powering virtual machines on top of all of this. So, throughout history — sort of concluding my going-back-to-the-beginning-of-time intro that I always do — we started with painting and writing and computers, and we're working all the way forward towards this exponential growth in the amount of data humanity is storing, and we think Ceph is the answer. That's what our perspective is — we think it's one of the answers out there.
So as humans began to interface more and more with these multiple computers accessing multiple disks, it became obvious that this was something that could become a business. What people did is they took these clusters of computers and disks and productized them into appliances, and there are lots of different options for appliances out there. Humans interface with these appliances instead — it's a much simpler way to interface with a complicated network of computers and storage.
Hard drives go down in price and processors go up in price — well, actually they never go up in price, they go down in price too — but there's a certain amount of fluctuation due to the commodity nature of computing, and sometimes you can buy computers cheaper than at other times. But if you buy a hardware appliance — let's say I buy a petabyte of hardware appliance from a proprietary appliance vendor, a really well-known appliance vendor — it's going to cost, I don't even know, like 14 bazillion dollars.
And the truth is that it's not something you can change. There's a bit of a black-box nature to these appliances. There's an advantage to that, in that it's a little bit more convenient, I think — and perhaps back in the day it was a more reliable thing, because you could test that one unit as opposed to testing a thousand different parts — but they're kind of black boxes. So if you, for example, wanted to pull out a compute node and put in a faster, bigger compute node to deal with some sort of hot spot, you could do it, but you couldn't do it with everyday hardware. You couldn't just go to whatever electronics store is around your house and buy a better processor.
It's not that kind of flexibility. And similarly, if you're a human of subtype developer — subclass developer — and you want to interface with this appliance to make it do something that it's not designed to do, that's not good either, and really the only option you have at that point is to go back to the vendor. So the perspective of Ceph, the Ceph project, is that the world needs a storage technology that scales infinitely — absolutely infinitely — because we understand that our requirements for data storage are going to scale infinitely, so we think the technology should scale infinitely as well. We also think that the world needs a storage technology that is software-based, that doesn't require an industrial manufacturing process behind it. We don't think this problem gets solved by putting more technology onto trucks and trains and planes and bolting it into racks — well, we do, but we have a different perspective.
So this is Sage Weil. He's a co-founder of DreamHost, which is a company that has invested in Ceph; he's the inventor of Ceph; and he's the CEO of Inktank, which is the company we spun up to do services and support around Ceph. Sage had a vision, and his vision was that he wanted to build a storage solution. He also believed that we needed to be community focused, and as the community guy I'm very thankful for that — it makes my job kind of nice; I don't have to argue so much with people who don't get community, which is really good. But the real motivation is that all of us are smarter together than we are alone. You've heard that a million monkeys with a million typewriters will eventually produce the complete works of Shakespeare — I think with the internet we're kind of proving that that actually happens.
I think that the more people you get involved in a software project, the better it becomes. And this picture actually is kind of interesting — this is my LinkedIn map, which is a list of all the connections I have on LinkedIn, showing how interconnected we truly are. All these people that I know, who know all of these other people that I know — it's really fascinating.

Sage wanted to make sure that whatever happened with Ceph was done in a way that was community focused, and the other reason it's important to be community focused is that Ceph, the technology, doesn't belong to Sage, it doesn't belong to Inktank, it doesn't belong to anybody — it belongs to everybody. And, kind of like a forest, we all have to take care of it.
We have to pitch in and do our part. So, for example, if Ceph doesn't do something that you want it to do, we encourage people to come and build that thing — it's open source and it belongs to everybody. So that's a really important philosophy as well. On the design side, we wanted it to be scalable: we wanted to make sure that it was appropriate for a world where we have more data than can fit on a cave wall, more data than can fit in a book, on a disk drive, on a single computer — more than will fit in a room. We want to have a technology that can expand beyond that and truly be infinite.
We also felt pretty strongly about having no single point of failure. This is very similar to scalability, but it's a little different: a lot of storage solutions scale, but with a single point of failure. Our philosophy is that if you have a truly infinite storage network of millions of disk drives, something is going to be failing all the time, so you can't afford to have anything that is a single point of failure. And now for a bit of a tangent.
This is kind of a startling picture, I know. This is a banana slug, and the banana slug is the mascot at UC Santa Cruz, which is where Sage got his PhD, studied storage and invented Ceph. The banana slug is also a relative of the cephalopod, which looks like this, and this is the original logo for Ceph that we had before we decided to rebrand. It's a big metaphor, because the octopus has multiple arms — eight arms — so obviously it has replication across its arms. It has two eyes, so you have high availability on the eyes, which is not really in line with our philosophy — you really want everything to be replicated. But we have a big problem with the octopus as a metaphor, because it does have a single point of failure.
That's not why we went away from using the octopus as our logo, but it is important to us that everything we do is completely scalable and has no single point of failure at all. I think a better metaphor for the technology might be a beehive, although there's still a queen bee, so I don't know — perhaps a coral reef or something. I'm still trying to figure out what the good metaphor is, or maybe we don't even need a metaphor.

I think the other thing that was an important design consideration for us was that it's a software-based solution, meaning that if you wanted to change the hardware out — put in faster hardware, slower hardware, solid-state disks or spinning disks, whatever you wanted to do — the technology was interchangeable, because the software is separated from the hardware. It gives people a lot more flexibility, and it also allows people to buy the cheapest hardware available.
You can do that because you can have a heterogeneous technology environment with a software solution, so that's a thumbs up. The other thing that was really important was that the system is self-managing, and this is because hard drives are not going to be the technology for a whole lot longer — spinning drives, I mean, if you look at it in the cave-painting scope of the world. It won't be that much longer, but in the meantime they are basically record players — little tiny record players inside of the computer — and they fail all the time. They will fail; it's guaranteed they'll fail. And if you have a cluster with a million disks, that means a disk could be failing fifty-five times a day. So it's really important that the system is self-healing, so that fifty-five times a day, when something goes wrong, it takes care of itself instead of needing human intervention to move data from one node to another.
So that's where Ceph came from — it came from these ideological principles and these design philosophies. Sage went off to school and built Ceph, came back and decided to continue building it, because he thought he had something. So after the invention of Ceph at UC Santa Cruz, Sage came back to DreamHost, where he was a co-founder — DreamHost is an ISP and hosting company in Los Angeles — and DreamHost decided to continue incubating Ceph, to great results.

The monthly code commits went up very, very noticeably at that point, and Ceph started popping up in other technologies like QEMU and OpenStack, and Ceph is inside the Linux kernel. And if you're on the marketing team at OpenStack, I'm really sorry about what I did to your logo — it's just illustrative; I know that's a bad thing to do. But anyway, the point is, it starts popping up in these places — all these integrations started happening even before there was any kind of commercial effort around Ceph. It's truly community oriented.
So, going back to this architectural diagram, I'm going to talk about each one of these boxes in a little bit of detail, to give everybody an understanding of what Ceph is, how it works, and how the pieces fit together. The first one I'm going to talk about is RADOS, which is what everything else is built on top of. RADOS is fundamentally an object store, and it works kind of like this: let's say I have a node with five disks. I need to put a file system on each of those disks, because Ceph runs on top of a file system. That file system can be btrfs, XFS or ext4. We believe that btrfs is, in the long term, the right file system to run Ceph on top of, but it can be XFS or ext4 as well in the short term, as btrfs continues to increase in stability. Then, on top of each of these file systems, you run a Ceph OSD, which is an object storage daemon.
We suggest one OSD per disk. It could be one per host, or you can have multiple per disk, although I'm not sure why you'd want to do that — we really suggest one per disk — but it's flexible in the way that you deploy it. All of this on one node becomes part of the cluster.
Then there are the monitors. What they do is maintain a map of the cluster: they understand which hosts are in and which hosts are out of the cluster, and which hosts are up or down — which is distinct from in and out, because up or down is a more transient state than in or out. The monitors also provide distributed decision-making, so you need to have an odd number of them, because they talk to each other to figure out who has the correct cluster map; if you have, say, three monitors and one of them disagrees, you need the other two to form a majority. Also, if you have a split-brain situation where the cluster is split in half, with two monitors on one side and one monitor on the other, the side with two monitors is the canonical side of the cluster and will continue to operate as such, because those monitors know they have the majority of monitors. So it's important to have an odd number.
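To make the majority rule concrete, here is a tiny illustrative sketch — just arithmetic, not actual Ceph code — of why an odd number of monitors avoids ties during a split:

    # A side of a network split keeps authority only if it holds a strict
    # majority of the monitors. With 3 monitors a 2/1 split still has a
    # majority on one side; with 4 monitors a 2/2 split has none.
    def has_quorum(monitors_on_this_side, total_monitors):
        return monitors_on_this_side > total_monitors // 2

    print(has_quorum(2, 3))  # True: this side keeps serving the cluster map
    print(has_quorum(2, 4))  # False: neither side of a 2/2 split has quorum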
The monitors do not serve any objects to clients; they don't serve any data to the clients at all. All they do is monitor the cluster. Also, when you mount the file system, you mount it using the hostname of a monitor. So that's what the monitor does. For the OSDs, one per disk is what's recommended, and at least three in a cluster, because otherwise it doesn't really make a whole lot of sense. These actually do the serving of stored objects to clients — these serve the data to whoever needs it.
So, for example, when you put an image into your cluster, an OSD can create a thumbnail automatically. This is a bit of an experimental feature, because we're still figuring out exactly how much computing you can do on each of these OSDs without starting to eat into the processor and memory they need to actually store the data. So it's something that's experimental now, but really, really powerful.
The next component of the architecture I'm going to talk about is librados, and librados is kind of what it sounds like: if you're a UNIX or C person and you're familiar with libraries, it's a library that allows you to build applications that interface with RADOS. So, for example, if I have an application and it's built with librados, I can use that application to store an object into a cluster, and it's going to speak the native protocol, which is a pretty lean protocol — it doesn't have a lot of overhead, it's super fast. So if you're building an application that needs very rapid or very efficient access to a cluster, we suggest you use librados or any of its other language bindings; there are bindings for C and C++, Python, PHP and Java. So that's librados — what it does is really straightforward — and RADOS is the foundation for just about everything else that we've built on top of Ceph.
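As a quick illustration, here's roughly what that looks like with the Python librados binding — a minimal sketch that assumes a reachable cluster, a readable /etc/ceph/ceph.conf, and an existing pool named 'data':

    import rados

    # Connect to the cluster using the local Ceph configuration and keyring.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    # Open an I/O context on an existing pool (assumed to be called 'data').
    ioctx = cluster.open_ioctx('data')

    # Write an object and read it back; the client speaks natively to the OSDs.
    ioctx.write_full('frog.png', b'...image bytes...')
    print(ioctx.read('frog.png'))

    ioctx.close()
    cluster.shutdown()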
The next component is called radosgw, which is the REST gateway for RADOS, and it's compatible with S3 and Swift. So, for example, if I have an application and I want to store an object into a cluster, I can have that application talk to radosgw, which is built on top of librados and which then accesses the cluster. And I can have multiple of these, right, because everything is distributed and there's no single point of failure anywhere. So with the REST gateway you can have multiple of them, you can put them behind a load balancer, you can do your standard web tricks to make sure that radosgw is highly available — the architecture supports multiple of them. They speak the native protocol to the cluster, but they expose a REST-based protocol to the applications that is compatible with S3 and Swift, which is kind of cool. They also support buckets and accounting. So this is the easiest way to get data into and out of the Ceph cluster if you're looking for application-style storage.
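Because the gateway speaks the S3 dialect, an ordinary S3 client library can talk to it. A rough sketch with the Python boto library — the endpoint and credentials below are placeholders for whatever your radosgw user was created with:

    import boto
    import boto.s3.connection

    # Placeholder credentials and endpoint for a radosgw user.
    conn = boto.connect_s3(
        aws_access_key_id='ACCESS_KEY',
        aws_secret_access_key='SECRET_KEY',
        host='radosgw.example.com',
        is_secure=False,
        calling_format=boto.s3.connection.OrdinaryCallingFormat(),
    )

    # Buckets and objects behave much as they would against S3 itself.
    bucket = conn.create_bucket('demo-bucket')
    key = bucket.new_key('hello.txt')
    key.set_contents_from_string('stored through the RADOS gateway')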
The next thing I'm going to talk about is RBD. This is a block device built on top of RADOS. So, for example, if I have a cluster, I can have bits spread throughout my cluster that end up becoming a block device, and I can run a virtualization container that has been built with librbd and librados, and that virtualization container will present that information as a single disk to a virtual machine.

So RBD, the RADOS block device, allows you to essentially stripe a virtual machine image across the entire cluster, and it can be a very, very large image or a very small image — usually very large. The virtualization containers that we support I actually cover in later slides, but a virtualization container can be built with librbd, which allows it to access this block device. It also allows some really nice tricks. For example, if I have one virtualization container that's running a VM off of a block device, I can actually move that virtual machine to another virtualization container live — because we're decoupling the storage from the compute infrastructure, live migration becomes something that's feasible and actually possible today. Also, if you don't want to do this in a virtualization context, you can use KRBD, which is a Linux kernel module, to mount a block device out of Ceph — so it's not built into a virtualization container; in that case it's built into the kernel of a client machine.

So the RADOS block device allows you to store virtual disks inside RADOS. It provides live migration because of the decoupling of virtual machines and storage, and it stripes across the cluster so that you get that sort of distributed redundancy and performance. It has boot support for QEMU/KVM and OpenStack — and, actually, as of just a couple of weeks ago, also CloudStack — and it has support in the Linux kernel. So that's the block device.
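To give a feel for it, here's a hedged sketch of creating and writing an RBD image through the Python rbd binding — it assumes the usual ceph.conf is in place and that a pool named 'rbd' exists; a hypervisor built with librbd could then attach an image like this to a VM:

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')  # assumed pool for block device images

    # Create a 10 GiB image; its blocks will be striped across the cluster.
    rbd.RBD().create(ioctx, 'vm-disk-01', 10 * 1024 ** 3)

    # Open the image and write some data at offset 0.
    image = rbd.Image(ioctx, 'vm-disk-01')
    image.write(b'bootloader would go here', 0)
    image.close()

    ioctx.close()
    cluster.shutdown()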
The final thing that we built on top of RADOS is CephFS, and I'll pause here for a second just to reiterate that all of these things together form a unified storage platform. In the same cluster you can store objects, you can access those objects over REST, you can store block devices, and you can have a POSIX-compliant distributed file system — in a single cluster. The same cluster, not separate clusters. And all of this has been built on top of RADOS, so if you want to build another application that uses Ceph to do something for tomorrow's needs, the architecture is there.

CephFS is, as I said, a distributed file system. So, for example, if I want to mount a file system, the first thing that happens is my client has to retrieve metadata from a metadata server. Something has to collect all the metadata that accompanies a file, and it also has to manage locking and permissions and actually manage the hierarchy itself. So a client will access the metadata server first, to make sure that information is available, and then it will get the data from the OSDs directly. The metadata server manages all the metadata for the POSIX-compliant Ceph file system, which means the directory hierarchy, and it also handles all the file metadata. It stores its metadata inside RADOS — the metadata server doesn't store it locally on its own disk, it stores all of its data inside the cluster — so if a metadata server goes down, all of that metadata server's information is available to the other metadata servers.
I was going to put this somewhere different, but it's a good metaphor for how you store information in distributed storage systems. If I'm an application and I want to store an object inside the cluster, or I want to retrieve an object from the cluster, how do I know where to go to get that object? How do I know which part of the cluster to connect to? There are hundreds, maybe thousands, of machines I could potentially connect to. One architecture is to connect to a single host that directs you: files that start with A through G go here, or directories that start with A through G, or it hashes something, or whatever — but in some way you're breaking up your cluster and saying this stuff goes here, that stuff goes over there, so that when you want to store a file it tells you, oh, it starts with F, let's go put it on that box. You know where it is, and when you want it, you go there and you get it. It's the same thing that happens in a lot of distributed file systems organized this way: the client knows where to get the data because that particular piece of data is always located in the same place — or the same kind of place; maybe it's replicated or whatever, it's a little more complicated, but it's always in the same place.
The other approach is to ask a central directory — hang a left, hang a right, it's the machine at the top right — and it tells you where in the cluster to get the information. That requires multiple round trips, and there are concerns around a centralized metadata server being a single point of failure, but this is also the way a lot of systems do it. This is what I call the "dear diary" approach: dear diary, today I put my keys in the kitchen. It's like writing down where you put your keys every time you walk into the house.
The only problem with this is: how do you find your keys when your house is infinitely big and always changing? Imagine having to figure out where your keys go when your house changes every time you walk into it and it is infinitely large. That's the type of problem Ceph ends up solving, and the answer to that problem, we believe, is called CRUSH. CRUSH is an algorithm that is sort of at the core of how RADOS works.
So with CRUSH, let's say I have a bunch of bits that I want to store into my cluster. The first thing that happens is they get hashed into a certain number of placement groups — that's configurable, but in this example there are ten of them. After it's made its ten placement groups, it runs those placement groups through CRUSH, and what you pass CRUSH is the placement group that you want to place, the state of the cluster, and a set of rules; then CRUSH will tell you, based on that input, where in the cluster that data belongs. So it's a deterministic placement algorithm — it's pseudo-random, but it's repeatable. It will take all of these items and spread them across the cluster in a way that is pseudo-random: it's very distributed, there's no pattern to it, and it gives a pretty even data distribution.
So CRUSH is the algorithm. It's pseudo-random — the hashing it uses for placement ensures that objects are evenly distributed across the cluster. It's repeatable and deterministic, so it will always run the same way given the same input, and it's configured by rules. Instead of telling RADOS to always put, say, these ten different pools — each pool being a different storage pool — onto this or that node, the way you configure Ceph is to tell it: here's my general topology, I have this many rooms, this many rows, this many racks, this many switches; and the placement rules are expressed in terms of that topology.
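To illustrate just the idea of deterministic, lookup-free placement — this is a toy hash-based stand-in, not the real CRUSH algorithm, and it ignores the topology and rules described above — a client can compute placement locally from the object name, the cluster map, and a replica count:

    import hashlib

    def placement_group(object_name, pg_count):
        # Hash the object name into one of pg_count placement groups.
        digest = hashlib.md5(object_name.encode()).hexdigest()
        return int(digest, 16) % pg_count

    def choose_osds(pg_id, cluster_map, replicas=3):
        # Toy stand-in for CRUSH: rank the OSDs that are currently 'in' by a
        # hash of (placement group, osd) and take the first few. Deterministic
        # and repeatable for a given cluster map, like CRUSH, but without the
        # hierarchy (rooms, rows, racks) and rules the real algorithm uses.
        def score(osd):
            return hashlib.md5(('%s/%s' % (pg_id, osd)).encode()).hexdigest()
        candidates = [osd for osd, state in cluster_map.items() if state == 'in']
        return sorted(candidates, key=score)[:replicas]

    cluster_map = {'osd.0': 'in', 'osd.1': 'in', 'osd.2': 'out', 'osd.3': 'in'}
    pg = placement_group('frog.png', pg_count=10)
    print(pg, choose_osds(pg, cluster_map))

Any client with the same cluster map computes the same answer, which is what removes the need for a central directory.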
So, for example, when a client wants to store or retrieve an object from a RADOS cluster, it runs CRUSH on that information, and CRUSH will tell it: your information is on this node and that node, that's where it is — and it always comes out the same. But there's kind of a challenge with this, and the challenge is what happens when you lose a node. Let's say I've lost this node — the green one with the red and the yellow squares. These individual OSDs that make up RADOS are intelligent, and they peer with one another. So when that node goes down, the other nodes find out about it, because they're constantly gossiping with one another, and they realize: uh-oh, the cluster map has been updated, there's a new state in the cluster. Each node then recalculates the CRUSH algorithm for all of the data it's currently holding, and they realize: uh-oh, to make CRUSH work in this new cluster map we need to copy this data from here to there — copy the red square over, copy the yellow square over, like that. So the nodes themselves intelligently reposition the data so that the next time somebody runs the CRUSH algorithm, the data is where it's supposed to be. In this case, the CRUSH algorithm will tell the client that the object is located on the new node, because the old one is down.
The next thing that makes Ceph unique and different from a lot of the rest is the way that it stores its block devices. Let's say that I have this virtualization container running a virtual machine out of a block device that's stored inside Ceph. That's almost never the real case, though — you never just want to run one VM, you want to run dozens of VMs, hundreds of VMs — and so the question is: how do you spin up thousands of VMs instantly and efficiently? That's somewhat difficult, and the answer is that with Ceph you can do an instant copy of one block device to as many other block devices as you want. In this case I've created four copies of this block device, and all of them are taking zero space, because the copy is instant, but it's also thinly provisioned. So if I have a 144-unit block device — it's not really blocks like on a disk, just units — a 144-unit block device, and I copy it four times, it still takes 144 units of storage. That's all it takes. Then, when my client begins to write information to a new copy, it begins to fill in the gaps: if I write four block units to my copy, I end up storing 148 total. And when clients go to read, if the data is inside the copy, it will be read from the copy, and if the data is not inside the copy, the read falls through to the original image. This is what we call copy-on-write cloning with thin provisioning. That helps people because, for example, if I want to spin up a thousand VMs, I can copy my VM image a thousand times — it happens instantly — and then I only start taking incremental space when new data gets stored.
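A rough sketch of what that looks like through the Python rbd binding — assuming a parent image named 'golden-image' already exists in the 'rbd' pool as a format-2 image with layering enabled; each clone is copy-on-write and initially consumes no extra space:

    import rados
    import rbd

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')

    # Snapshot the parent image and protect the snapshot so it can be cloned.
    parent = rbd.Image(ioctx, 'golden-image')
    parent.create_snap('base')
    parent.protect_snap('base')
    parent.close()

    # Each clone is created instantly; reads fall through to the parent until
    # the clone is written to, so space is only consumed for new data.
    for i in range(4):
        rbd.RBD().clone(ioctx, 'golden-image', 'base', ioctx, 'vm-%02d' % i)

    ioctx.close()
    cluster.shutdown()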
Now, back to the metadata servers and single points of failure. The challenge is that file systems require lots and lots of metadata — they require you to keep track of lots of information to assemble the hierarchy — and if you remember the diagram of how the metadata system works in CephFS, you'll notice that there are actually three metadata servers; again, with Ceph, nothing is a single point of failure. So there are three metadata servers, which raises the question: how do you have one tree with three metadata servers? And the answer — well, our answer — is that when the first metadata server comes online, it has control of the entire tree; it has authority for the entire tree. When the second metadata server comes online, it takes a portion of that tree, and since all of the actual metadata itself is stored inside Ceph, there's no data to copy when this happens — the other metadata server just assumes that responsibility. As more metadata servers come online, you'll notice that they take more equitable chunks, and this all happens dynamically. So as the load on the system or the data requirements change, the metadata servers adapt and adjust, and they even let you handle hotspots down to the granularity of a single file: you can have one metadata server that is just responsible for managing locking and permissions on a single file, if that's what the cluster needs. We call this dynamic subtree partitioning.
Almost everything I've described works today — almost. Ceph has been in development for about six years, and having just launched the company, Inktank, we're just starting to do some really involved quality assurance and performance testing. But here's where it stands right now: RADOS is awesome. It works very well, and it seems to be very stable.
The other caveat here is that today this is LAN-scale, and the reason it's LAN-scale is that when you write to a Ceph object, the replication is done synchronously. So if you tell it to store ten replicas, it has to go communicate with ten other nodes before it comes back to the client and says: okay, I wrote the file. That's how we keep the system sane, but it means that it wants LAN-scale speeds — or really, really scary-fast long-distance links. Sometimes we talk to people and they say, oh, it only works across the LAN? They ask why, and we say latency, and they say, well, we have, you know, two milliseconds from here to the moon — and we go, oh well, great, Ceph will work. So it's really more about latency than distance.
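A toy model of the point — the numbers are made up purely for illustration, and the real write path goes through a primary OSD, but the idea is that the acknowledgement can't come back faster than the slowest replica:

    # Illustrative only: a synchronously replicated write is acknowledged no
    # sooner than the slowest replica round trip (ignoring the disk writes).
    round_trip_ms = {
        'same-rack peer': 0.5,
        'same-datacenter peer': 2.0,
        'cross-country peer': 70.0,  # hypothetical long-distance replica
    }

    def ack_latency(replica_peers):
        return max(round_trip_ms[peer] for peer in replica_peers)

    print(ack_latency(['same-rack peer', 'same-datacenter peer']))  # ~2 ms
    print(ack_latency(['same-rack peer', 'cross-country peer']))    # ~70 ms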
So I have just a quick little slide on the current status of Ceph and CloudStack, and I chose this image because we really are at the beginning of the road with Ceph and CloudStack — we're just beginning to integrate — but there has been a good amount of work already. This was just announced a couple of weeks ago: RBD support inside CloudStack, which allows storage of virtual disks using RADOS.
B: Thanks, Ross. A high-level question for you: can you explain the difference between Ceph and Hadoop?
A: I can. Ceph actually came out of the HPC world — when it was invented, it was invented primarily for the HPC use case, and that's where CephFS came from, CephFS being the part of the architecture that's not quite ready yet. People have actually used Ceph as a sort of drop-in replacement for HDFS, which gives a little bit more scalability and gets around some single points of failure, but that's kind of experimental today. So I guess that's the major difference.
B: [inaudible question]

A: I think what you want to look at is this message on the mailing list, where a guy in our community named Wido, who's also in the CloudStack community, announces his integration — but I think it's only KVM right now. I know there's been a lot of work with libvirt, but unfortunately that's the information I've got at the moment.
B: How does Ceph compare to GlusterFS?

A: I don't know a whole lot, architecturally, about Gluster — GlusterFS. I know that Ceph has object, block and file on a single unified platform, and I think Gluster specializes mainly in file, and maybe they have block now; unfortunately, I don't know a whole lot about Gluster. I know that Ceph's architecture, and especially CRUSH, was built to get past some of the architectural limitations that we find in Gluster. But I will also say that Gluster has had a lot more time in the market.