From YouTube: Ceph 101 ☷ Unified Storage #OpenStack
Description
Montréal - March 17, 2014 - David Moreau Simard, IT Architecture Specialist at iWeb, presents the basics of Ceph, "a distributed object store and file system designed to provide excellent performance, reliability and scalability."
* About Ceph: http://ceph.com
* About David: http://twitter.com/dmsimard
* OpenStack Montreal: http://montrealopenstack.org
* About iWeb: http://iweb.com
A video produced by Savoir-faire Linux:
https://www.savoirfairelinux.com
Licence: CC BY-SA
That's not for me! Okay. The interesting thing for you guys is maybe that Ceph was created by Sage Weil as part of his PhD thesis in computer science. Whenever I have a problem, I try to find out whether someone else has already solved it. This guy had the problem and he fixed it himself by creating a whole storage solution, so it's kind of awesome. Sage Weil co-founded DreamHost, a web hosting company a bit like iWeb. He also founded Inktank.
Inktank is the company that is today behind Ceph, maybe a bit like Canonical is for Ubuntu: it's the enterprise behind the product. Inktank is also a mentor in the next, upcoming Google Summer of Code. Google Summer of Code has a wide range of projects for students to work on, so there are going to be projects around Ceph next summer. Before talking about Ceph, I'm going to give a little bit of context around distributed storage; I think Marcus did a nice job with Swift and distributed storage.
So you have a laptop. You're a human, and I hope you have a computer and you have your disk; there are a lot of people here with a laptop. That's local storage, right? You have your laptop, you have a disk. Then you have a small business: maybe you have some money, you purchase a server, and you have a whole lot of disks and a whole lot of employees.
So then you have something like this: a lot of humans, a lot of servers, a lot of disks. The problem with that is, how do you address each of these logical computers? Each of these is a different server, you maybe have different file shares on different servers, and it's a mess to deal with. This is why big corporations have done something they call SANs. A SAN looks something like this, oversimplified: it's a big, expensive appliance.
You don't talk to tens of servers; you talk to one server, or one IP address, or whatever the configuration may be. The idea is that it's just one single device, and often, not just often but basically all the time, there's going to be logic and intelligence in the SAN so that there's replication and you don't lose your data if you lose a server, if you lose a disk, and so on. Now, I don't like SANs, I'm sorry. They are very expensive: the license is expensive, the support is expensive.
Ceph, just like Swift, is meant to run on commodity hardware. That means just about any hardware: the computers you have at home, you could run Ceph on them if you wanted to. Actually, if I didn't have network attached storage at home, I would probably run Ceph. It's free, as in it doesn't cost anything. Inktank does provide enterprise support, a bit like Canonical does, but otherwise it's free and it's open source. The code is on GitHub, you can look it up and send pull requests; the guys over there are pretty reactive, and it's awesome.
Now, the software stack: we're going to delve a bit into the details of what Ceph actually looks like at the software level. You have RADOS, which is an acronym for Reliable Autonomic Distributed Object Store. A bit like Swift, it is an object store, and it takes care of replicating things so you don't lose data whenever you lose a computer or a disk.
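As a rough illustration of talking to RADOS directly, here is what storing and fetching a raw object looks like with the `rados` command-line tool. This is a sketch: the pool name `data`, the object name and the file names are made up, and it assumes a running cluster with `/etc/ceph/ceph.conf` in place.

```bash
# Hypothetical sketch: put a file into RADOS as an object, then read it back.
rados -p data put my-object ./report.pdf   # store ./report.pdf under the name "my-object"
rados -p data ls                           # list the objects in the "data" pool
rados -p data get my-object ./copy.pdf     # fetch the object back into a local file
```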
You have RBD, which stands for RADOS Block Device. It allows you to use Ceph for block devices, so a hard drive, but a hard drive over the network; think of what iSCSI allows you to do. And then there's CephFS, which is a distributed file system. I heard earlier that someone was talking about GlusterFS; it's something a bit similar to that, or to Lustre. We can talk about that later as well.
Can we... okay, I hope it's not too small for people at the back. The Ceph daemon that works with the actual hard drives is called the OSD. It stands for Object Storage Daemon, and the OSD is essentially a piece of software that more or less takes over an actual hard drive. What happens is that you'll have one daemon per disk, and for one server you'll have many disks, so typically you'll have many OSDs on a single server. So if you have one server with eight disks in it,
you're going to have eight OSDs. The cool thing about OSDs is that they don't really care about the hardware they're residing on. If, for instance, a specific line of server hardware is discontinued, you can just purchase another kind of server and it's going to work.
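A quick way to see that one-daemon-per-disk layout on a running cluster is to ask the cluster itself. A sketch; it assumes admin credentials on the node you run it from:

```bash
# Hypothetical sketch: inspect the OSDs of a running cluster.
ceph osd stat      # how many OSDs exist, and how many are up and in
ceph osd tree      # every OSD listed under its host, one OSD per disk
```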
There's no hard limit to scaling the number of OSDs: you can have up to thousands of OSDs without a problem, so it scales just as well as Swift. The important thing is that the OSDs serve data to clients directly. When I say a client, I mean an application, or your code, when it talks to Ceph: it will talk to the monitors, and we'll get to the monitors next.
So I have this unfortunate little schema here. You have, for instance, the OSD daemon; the filesystem underneath it can be anything like btrfs, XFS or ext4. Ceph currently recommends XFS, which is like the best middle ground, because btrfs isn't quite production ready yet. Then you have your disks, and you're going to have plenty of OSDs inside a cluster.
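To make the disk, filesystem, daemon layering concrete, this is roughly what standing up one OSD by hand looked like with the tooling of that era. Everything here is illustrative: the device, the OSD id and the paths are invented, and real deployments would normally use a provisioning tool instead.

```bash
# Hypothetical manual sketch: one disk -> one XFS filesystem -> one OSD daemon.
ceph osd create                              # allocate a new OSD id (say it returns 12)
mkfs.xfs -f /dev/sdb                         # format the raw disk with XFS, the recommended choice
mkdir -p /var/lib/ceph/osd/ceph-12
mount /dev/sdb /var/lib/ceph/osd/ceph-12     # the OSD's data directory lives on that disk
ceph-osd -i 12 --mkfs --mkkey                # initialize the daemon's data store and key
```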
It's like... I don't know what the word is... quorum, that's a good word, yeah, right. The monitors, hang on, this slide, isn't that better? Awesome. The monitors take care of maintaining the cluster state, so they're the ones aware of what's going on in the cluster, pretty much. They're going to know who the cluster members are: who are the monitors, who are the OSDs, and who are the other components in the Ceph cluster. They're aware of the placement groups and the objects.
A
It's
it's
a
terminology
we'll
be
looking
at
a
bit
later,
but
they
also
know
what's
the
overall
health
of
the
cluster
and
they're
going
to
ones
be
telling
you
a
there's
a
problem,
so
this
is
the
role
of
the
motor
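These are the kinds of questions you would ask the monitors on a live cluster; a sketch, assuming admin credentials:

```bash
# Hypothetical sketch: asking the monitors about membership, quorum and health.
ceph mon stat          # which monitors exist and which ones are in quorum
ceph quorum_status     # detailed view of the monitor quorum
ceph health            # overall cluster health (HEALTH_OK, HEALTH_WARN, ...)
ceph -s                # one-page status summary of the whole cluster
```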
The monitor takes care of managing the CRUSH map as well. The CRUSH map is maybe the equivalent of Swift's ring. It's quite different, though, but it's sort of the equivalent; it's CRUSH.
The CRUSH map is the thing that lets you know where the data resides, essentially, and the monitors are also the entry point for Ceph clients. You have your application, and you have a hostname or an IP, perhaps; you're actually going to talk to a monitor and retrieve the CRUSH map, which tells you what the cluster looks like. By talking to the monitor, the clients will afterwards know where to push and pull the data.
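For the curious, you can pull that CRUSH map out of the monitors and read it yourself. A sketch; the file names are arbitrary:

```bash
# Hypothetical sketch: fetch the CRUSH map from the monitors and decompile it.
ceph osd getcrushmap -o crushmap.bin          # retrieve the current (binary) CRUSH map
crushtool -d crushmap.bin -o crushmap.txt     # decompile it into readable text
less crushmap.txt                             # devices, buckets (hosts, racks, ...) and placement rules
```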
Right, then you have the metadata server. It's only used for CephFS, the shared and distributed file system. Unfortunately, it's not really production ready yet, and I really wish it was, because it's really awesome. There are people and organizations that use it in production, I know a couple of companies that do, but Inktank is not really ready yet to officially support that product in their enterprise offering. So it's not quite ready yet; it's maybe a matter of a few months still. Anyway, the metadata server manages the file system metadata.
So, things like timestamps, permissions and ownership of your files and folders, and the actual folder and file hierarchy. It's scalable, which means you can have as many metadata servers as you want, and the interesting thing is that the metadata is stored in RADOS. This means that you can lose your metadata server, but you're not really going to lose your metadata, because it lives in the Ceph cluster. So the metadata server is more or less stateless.
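Just to show what using it looks like, mounting CephFS on a client is roughly this. It's a sketch: the monitor address is a placeholder, and the secret would come from your admin keyring.

```bash
# Hypothetical sketch: mount CephFS on a client machine.
mkdir -p /mnt/cephfs
# kernel client; 10.0.0.1 is a placeholder monitor address, AQD... a placeholder secret
mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs -o name=admin,secret=AQD...
# or, using the FUSE client instead of the kernel driver:
ceph-fuse /mnt/cephfs
```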
CRUSH is an algorithm, and it was the main topic of Sage's PhD thesis. It does pseudo-random placement: it's going to place data apparently randomly, but not really. It's a deterministic algorithm, so for one given operation it will always do the same thing.
If I draw a parallel to Swift: Swift has these databases for metadata and it has the ring, which is able to compute where the data lives. With CRUSH, the clients are able to tell where the data lives without querying a proxy server or a middleware server; the client just knows. CRUSH will distribute the data uniformly and evenly.
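You can actually see that deterministic computation for yourself: the same object name always maps to the same placement group and set of OSDs. A sketch; the pool and object names are made up:

```bash
# Hypothetical sketch: ask where CRUSH places a given object.
# The answer is computed from the cluster map, not looked up in a central index,
# so it comes back the same every time for the same object name.
ceph osd map data my-object    # prints the pool, placement group and acting OSDs for "my-object"
```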
It's also stable, and by stable I don't just mean that it works well.
A
It
means
that
if
you
lose
a
disk,
our
server
and
we'll
move
data
as
little
as
possible
and
the
OS
DS
will
appear
between
themselves.
So
if
you
lose
a
server-
and
you
have
maybe
ten
other
servers,
it's
at
the
end
of
the
world-
you're
not
going
to
be
impacted
that
much
it's
configurable.
That
means
that
it
can
be
infrastructure.
That means it can be infrastructure and topology aware: you can define in Ceph that these servers are living in this rack, and these servers are living in this row, in this room, in this data center. That allows you to do things like replication to different failure domains. You can say, for instance, okay, I'd like three replicas, but one of them in another data center; you can do that with Ceph. You can configure the replication count, so two, three, four or none, and you can weight your different OSDs.
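A rough sketch of what that configuration looks like from the command line; the rack, host, pool and OSD names here are invented:

```bash
# Hypothetical sketch: describe the topology to CRUSH and tune replication per pool.
ceph osd crush add-bucket rack-a rack              # declare a rack bucket
ceph osd crush move storage-node-1 rack=rack-a     # tell Ceph which rack this host lives in
ceph osd pool set mypool size 3                    # keep three replicas of everything in "mypool"
ceph osd crush reweight osd.12 2.0                 # give a larger disk proportionally more data
```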
RBD images are striped across placement groups, which means your block devices are actually replicated and consistent across the cluster, and I talked about replication per pool and custom CRUSH rules. Placement groups: the actual objects, the binary data, your files, are stored in placement groups, and the interesting thing is that it isn't really the objects that are replicated, it's the placement groups that are. There's going to be a little schema later
that will give you a better idea of what it looks like. The rule of thumb is 100 placement groups per OSD. Physically, your objects are going to be split amongst your placement groups, and the more placement groups you have, the more evenly your data will be distributed across your cluster; but the more PGs you have, the more CPU intensive it also becomes, so that's kind of a hard limit.
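As a worked example of that rule of thumb (all numbers invented): with 8 OSDs, a target of 100 PGs per OSD and 3 replicas, you would aim for roughly 8 × 100 / 3 ≈ 267 placement groups, which people usually round to a power of two when creating the pool.

```bash
# Hypothetical sketch: pick a placement group count and create a pool with it.
# target_pgs = (number of OSDs * 100) / replica count, rounded to a power of two.
echo $(( (8 * 100) / 3 ))               # ~266, so round to 256 or 512
ceph osd pool create mypool 512 512     # pg_num and pgp_num for the new pool
ceph osd pool set mypool size 3         # three replicas for that pool
```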
So you have here a little diagram: you have, for instance, your binary data, an image, and it goes through CRUSH.
It gets split into objects, and then you have this pool, the pink rectangle, and then you have your placement groups, and the objects are stored in the placement groups. In this example I have a replication factor of two, so if I look at this little red square, I have this red square here and the second red square here, and it's not going to be anywhere else.
Let's talk money, because money drives a lot of things, right? A real-life scenario. I'm not going to go over the server specifications; you can take a picture if you want. It's something that is typical for a storage server, and something close to what we'd use at iWeb for a big storage solution.
The idea, and what I'm trying to show you, is that in the past, before Ceph was even considered, what we did for high availability, failover and data security was to have two storage servers in DRBD, which is a more or less RAID-over-network technology. It was that good, yeah, right.
With DRBD you trade between performance and data security, and to give you an idea, I put up both RAID 10 and RAID 50 configurations for DRBD. For RAID 10 you get more or less 32 terabytes worth of usable space. With Ceph and three replicas, which is the most secure (three because, I mean, it's pretty safe, right?), you're going to have 43 terabytes.
So that's more than 11 terabytes more worth of data. If we go RAID 50 over DRBD, because we need more space and we don't really care that much about performance, we're going to have 57 terabytes worth of data. If we need more space with Ceph, we can tone down the replication count to two, and then you have 65 terabytes. Also keep in mind that with Ceph both servers are active, and that's worth a lot, because there are other bottlenecks involved, such as network throughput and things like that.
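The arithmetic behind the Ceph figures is simply raw capacity divided by the replica count. A sketch, assuming roughly 130 TB of raw disk across the two servers (that raw figure is inferred from the numbers above, it isn't stated in the talk):

```bash
# Hypothetical sketch: usable capacity = raw capacity / replica count.
raw_tb=130                                        # assumed raw capacity across both servers
echo "3 replicas: $(( raw_tb / 3 )) TB usable"    # ~43 TB, matching the figure above
echo "2 replicas: $(( raw_tb / 2 )) TB usable"    # ~65 TB, matching the figure above
```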
If I want to mount a block device over Ceph, it's really like four lines: I load the kernel module, I map the block device, the Ceph block device, to my server, I format the disk image, and I mount it. Then it's good to go, I can use it. It's over the network, it's replicated, and it's highly available.
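Those four steps look roughly like this; the pool, image and mount point names are made up for the example, and the image creation line is only needed the first time:

```bash
# Hypothetical sketch of the four steps described above.
modprobe rbd                                   # 1. load the RBD kernel module
rbd create my-image --size 10240 --pool rbd    #    (create a 10 GB image if it doesn't exist yet)
rbd map my-image --pool rbd                    # 2. map the Ceph block device, e.g. as /dev/rbd0
mkfs.xfs /dev/rbd0                             # 3. format the device
mkdir -p /mnt/my-image
mount /dev/rbd0 /mnt/my-image                  # 4. mount it: a network-backed, replicated disk
```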
Images: Ceph is integrated with Glance in OpenStack. Glance is the project that allows you to upload your images, such as, I don't know, qcow2 or QEMU images, and Glance is natively able to talk with Ceph and store and retrieve images from the object storage.
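For the Glance integration of that era, the wiring is a few lines of configuration. A sketch; the pool and user names below are the conventional ones, not something the talk specifies, so adjust them to your setup:

```bash
# Hypothetical sketch: point Glance at Ceph RBD in /etc/glance/glance-api.conf.
cat >> /etc/glance/glance-api.conf <<'EOF'
default_store = rbd
rbd_store_ceph_conf = /etc/ceph/ceph.conf
rbd_store_user = glance
rbd_store_pool = images
EOF
```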
Ceph's object gateway is an alternative to Swift in OpenStack. It works, I've tried it, and its API is compatible with Swift and Amazon S3.
Right off the bat, Ceph does synchronous replication. With Swift, Marcus said earlier that it's able to do writes at two places, tell you "okay, I've done the write," and then replicate a third time in the background. Ceph is not going to let you do that: it's going to send you the ACK only once all the replicas are done. They have done something in the latest release of Ceph, I haven't had the time to try it out yet, but they've actually added things like multiple zones and federation, so maybe they've done something there that is able to tackle that problem. Shared and distributed file system: CephFS, which is an alternative to GlusterFS and things like that.