From YouTube: Ceph: object storage, block storage, file system, replication, massive scalability, and then some (linux.conf.au 2013)
Description
https://twitter.com/LinuxConference
A: G'day, linux.conf.au Australia 2013. Well done, people. We have got two speakers: Tim Serong and Florian Haas, and they'll be talking on "Ceph: object storage, block storage, file system, replication, massive scalability, and then some!" As a very quick bio: Florian is a Linux high-availability and storage specialist. He frequently consults and conducts training on both OpenStack and the Ceph stack. Tim is currently employed by SUSE as a senior clustering engineer, working on SUSE Linux Enterprise High Availability Extension and the SUSE Cloud product, which is based on OpenStack. He now has doubts on whether there is such a thing as too many log files. With that introduction, let's give Florian and Tim a warm welcome.
B: Thank you for that. So, for those of you who came in on a kind of tight schedule, or kind of late: if you already have those virtual machine images set up, that is perfectly fine and you can follow along. If you choose not to follow along on your own virtual machines, that is perfectly fine too; you will take just as much out of this tutorial, and you will be able to retrace your steps later. We have actually made a point of making that easy for you.
B: If you do follow along, please make sure that, other than having those virtual machines installed and having run the install.sh script, you have virtualization enabled in your system BIOS. That is a feature that is usually found as Intel VT or AMD SVM. You also want to have your KVM module loaded; that is just a "modprobe kvm", and it will load the appropriate kvm-intel or kvm-amd module for you.
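As a minimal sketch of that pre-flight check (the grep pattern is the usual way to spot VT/SVM support; it is not from the talk itself):

    # check for hardware virtualization support (Intel VT or AMD SVM)
    egrep -c '(vmx|svm)' /proc/cpuinfo   # non-zero means supported
    # load the KVM module; the matching kvm-intel or kvm-amd module
    # is pulled in as described above
    modprobe kvm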
B: So, generally speaking, we do have sort of a theoretical introduction to the Ceph stack, and if you would like, you can keep setting up your machines during that time. However, if you're just starting now, I would actually suggest that you perhaps leave the following-along for later, so you don't miss too much of the talk here. Yes, we have a question.
C: [inaudible question]

B: The question was: are we going to be able to download the image later? No; you are able to download the image now, and you can do so just as well later on. It's on a Linux Australia mirror; I tweeted the info yesterday, and there is also an entry on my blog. So if you go to hastexo.com/blogs/florian, you will find all that information there, and I don't think the organizers are going to take that down any time super soonish. All right.
B: Okay, so we are going to talk about the Ceph stack. The Ceph stack is a storage stack that provides us with object storage, block storage, a file system, replication, awesome scalability and some other goodies, and we're going to walk through those one by one. This is a double slot, so we have a little more time than we usually have in a talk, which is delightful, and this will give us the chance to cover Ceph theoretically in a little more detail than we were able to do in the OpenStack talks.
B: Tim works at SUSE, and his email address is tserong@suse.com. Tim is actually based just outside of Hobart, and in this talk, or in this tutorial, he will be doing the real work, meaning actually working on systems. The handwriting that you see here is his, and he's also responsible for the cartooning that you will see halfway through the talk. With Tim doing real work, handwriting and cartooning, that leaves me with fake work, hand-waving and babbling.
B: So, with that out of the way, let's get started with Ceph. Ceph is really not one thing, but four things, four relatively distinct things all rolled into one, and we're going to go through them step by step. Before we do that, we're going to give you a quick overview of all of this. Ceph, fundamentally, at its core, uses object storage.
B
That
means
that
the
primary
interface
to
interact
with
data
is
not
files
is
not
blocks
is
objects
wine,
because
in
order
to
build
a
massively
scalable
distributed
data
store,
we
can
actually
reduce
that
to
relatively
simple
operations.
What
we
want
to
do
is
we
want
to
be
able
to
write
data.
Read
data
perhaps
delete
some
data,
but
we
really
don't
need
to
worry
about
what
is
our
block
bar
sector
address?
We
don't
want
to
worry
about
what
I
know
it
is
or
whether
we
have
directories
or
files
or
permissions
or
echoes,
or
things
like
that.
B
We
don't
need
any
of
that
to
achieve
massively
distributed
in
extremely
scalable
high
level
storage,
even
Zeph.
The
basic
unit
of
data
is
and
objects
and
objects
as
such
are
being
distributed,
replicated
kept
highly
available
within
the
distributed
cluster,
and
we're
going
to
look
at
in
a
certain
amount
of
detail.
What
that
means,
as
we
get
to
the
to
the
practical
stuff.
B
Sef
also
has
a
block
storage
interface
so,
rather
than
interacting
with
SEF
objects
directly,
we
have
one
abstraction
layer
that
allows
us
to
treat
a
great
number
of
Seth
objects
as
one
block
device,
as
we
would
with
any
other
old
block
device
in
Linux.
So
there
we
have
something
that
appears
as
a
virtual
block
device
and
we
can
write
to
it
at
a
certain
offset
or
read
from
it
as
at
a
certain
off
senate.
We
can
use
it
for
anything
else,
so
we
can
use
a
block
device
for
block
device.
B: Block storage in Ceph is thin provisioned, so it's very space efficient. It supports redirect-on-write snapshots, it supports cloning, and several other interesting things. The third thing that we're going to cover when we're talking about Ceph is RESTful storage. This is something for those of you who are familiar with things like Amazon S3 or OpenStack Swift.
B: This is object storage using RESTful interfaces, and that means we have clients that may be just an HTTP client, using standard web technologies like HTTP, HTTPS and JSON to retrieve data from the object store, and doing so in a very, very efficient and simple way. And then, finally, we have a distributed file system, which is the layer that adds all of those things that are interesting to POSIX, and only to POSIX, to the distributed storage stack. So here is where we get a distinction between files and directories; here is where the namespace actually becomes hierarchical.
B: Here is where we see things like file ownership and permission bits and those things, and, as you will see, that is actually a very, very thin client layer that talks to the object store, which makes the whole thing very, very interesting and elegant. Okay, so those are the four things that we're going to talk about, and we're going to look at them both from a theoretical perspective, what's behind it, and in practice.
B: Let's take a look a little bit at what is so special about this whole native object storage thing that is at the core of Ceph. Ceph is based on a distributed, autonomic (that means self-organizing) and redundant native object store named RADOS, and RADOS stands for Reliable Autonomic Distributed Object Store.
B: RADOS has a flat namespace, so that means we don't have anything like a directory hierarchy or anything of that nature; it's completely flat. This is something that is very common to pretty much all object stores. In RADOS, for each object we have a name that identifies it, we have a payload, or contents, which is pretty much arbitrary in size, and we can also stick onto a RADOS object any number of key-value pairs: attributes. So we can assign an object attributes at will.
B: We can obviously retrieve them as well, and those are relatively independent of the actual payload. Now, objects in RADOS are assigned to something that we conceptually refer to as placement groups, or PGs, and every PG, every placement group, has a list of object storage devices (or, depending on whether you're reading the older or the newer documentation, object storage daemons) where the contents of these PGs, so all the objects in a specific placement group, are stored in a redundant fashion. Now, why is this a list of OSDs?
B: The reason it's a list of OSDs that each PG writes to and reads from, in order, is that RADOS uses a primary-copy mode of replication. So as we're writing an object, it, along with any other object in the same placement group, is first being written to the primary OSD, which is the first entry in this list, and then this primary OSD takes care of where the other replicas go. And how many replicas we have is entirely configurable.
D: [inaudible question]

B: The question was: is the number of replicas configurable per object? No, it is configurable per pool. A pool is an administrative subdivision of the object store, and all of the objects in that pool have a certain number of replicas assigned to them. And the replication to these OSDs is actually synchronous, so we can make sure that when we define that we want, for example, three replicas of every object in a given pool, that holds as the object is being written.
B: The application, the client that is actually doing the write, does not get acknowledgement of that write until it has been completed on all of the replicas. Object placement is completely algorithmic, so there is no central lookup database, and there is also no such thing, really, as a distributed hash table.
B: Now, what is special about that? That concept of data storage is a little abstract, so I tend to like to explain it with a bit of a more concrete example. When we are checking into a hotel, right, that is a data storage problem: I, the traveler, am data, and I wish to be stored in an appropriate location, and preferably such that I actually get my own room and do not intrude on someone else's.
B: If the hotel is a little more modern than this one, then it may be a key card, but that doesn't matter: you get something to get to your room, and you get the information of where your room is. So, in terms of a data lookup, as in "I need to figure out where I, the piece of data, am to be stored", what that in fact is, is essentially a central database lookup, with an optimization, which is that they're actually telling me where my room is and I can memorize it.
B: So we've just cached the lookup, okay, and, as with all caches, the cache typically expires. My reservation, and the room that I get, is only for, I don't know, say four days, and if I want to extend my stay, I need to get back to the front desk and do a fresh lookup, and I might then be reassigned to a different room, or the same room may still be available and my stay is just being extended, but I still need to go through the front desk. So that is the lookup of the data.
B: The lookup, the mapping of the data (me, the traveler) to storage (a room), conventionally, the way we do it in a hotel, is: you go to the front desk, they tell you a room, you get a key, and boom, you've just done a central database lookup with caching; something that we would typically do in a storage solution that operates on the basis of a central metadata service. Now, that works just dandy for a small hotel.
B: But what if it grows? We could do something that's very, very simple: we could just add more front desks and hire more people, right? So, rather than having one front desk, we might have two, or five, or 12 or something. But that really doesn't work too well, because what that does is let us handle more lookups in parallel.
B: So if we have a larger group of travelers, we can now always handle, say, three at a time. What it doesn't solve is the problem of actual data assignment, because what might happen is that Tim and I both, coincidentally, travel and stay in the same hotel, and we both approach the front desk, and I'm being told my room is number 365, and then Tim, strangely, is also being told his room is 365. And then, eventually, what's going to happen to one of these transactions?
B: One of them is inevitably going to fail, right? So either the system is smart enough to detect the conflict and then kick it back, or we just meet at what is ostensibly our door, and we both say: well, we actually don't think our relationship is quite ready for this yet. So one of us returns to the front desk and says: hey, what the hell is up with you? Please give me a room that is not occupied. So that is the classic lock contention problem that we have in these database lookups. If we allow multiple clients to access the same central database at the same time, it might work relatively well normally, but if there is any conflict, then one of the transactions has to be kicked back, and that doesn't scale really well, because the larger the onslaught of travelers becomes, the greater the probability that we're going to get one of these conflicts.
B: So we could have several of these hotels right next to each other, right? Say we have 26 of them, and one says A, one says B, one says C, and so on, and then the assignment is based on the second letter of your first name. So now we have sort of partitioned the problem a little bit, because now we don't all have to access the same front desk. The lookup database can be just for those individual partitions, and then Tim goes to one building and I go to a different building, and we might actually be assigned the same room number in our respective buildings, but now we no longer collide, because the space is actually partitioned, right? So there are many, many, many hotels, and that helps.
B: Now, that's actually a pretty good approach for when we want to grow from, say, one order of magnitude to the next order of magnitude, and then perhaps one more. But what if our hotel was not small, was not large, was not huge; it was absolutely gigantic? A hotel, for example, with a billion rooms. Okay.
B: This creates some interesting challenges. The hotel with the billion rooms (and I've been told I should use the term "the hotel at the end of the universe" for this) creates some really interesting challenges. So, for example, the whole thing with the room number doesn't really work so well anymore, because if I have a room number like that, I might know how to get to the 156th floor of the hotel, but getting to the umpteen-thousandth room on that floor may be quite a walk. And besides, it doesn't really help for me to memorize that number, because it becomes essentially pretty meaningless. So that's one thing that doesn't really work so well; this whole lookup thing: not so cool. We're also having a bit of a statistical problem with the hotel with the billion rooms, because in a hotel with a billion rooms, it's relatively likely that at any given time about ten thousand rooms are probably going to be on fire, give or take.
B: Twenty thousand, thirty, we don't know, but a substantial number. And there will probably be about one, two, three hundred thousand rooms currently under some sort of maintenance, because they got a water leak, or they're having the walls repainted or something, or we might just be building them, or we might be tearing down a building; things like that.
B: All of that is actually relatively probable to be occurring at some place in the hotel, so that doesn't work too well, because we have to remember that, at a certain scale, something is always going to fail, period, end of story. At the same time, it's highly unlikely that everything is going to fail at the same time; that would mean, basically, that the hotel with the billion rooms, with maybe its thousand buildings, all gets knocked out by the same alien spaceship or something.
B: But then we have other problems, like the assignment of travelers probably failing. But it is perfectly safe to assume that, at a certain scale, something is always going to fail, and we have to build and engineer the system for this. So what is it that we really need in order to manage our hotel with a billion rooms? In other words, what is it that we really need in order to manage really, really, really big distributed, replicated storage?
B: Okay. Because we've already established that the thing with the room numbers to identify the room is pretty meaningless, we should really use something that we already know about ourselves to identify where we need to go. For example, that might be a fingerprint, or an iris scan, or something of that nature, and then we use these to identify ourselves. So when I get to the hotel, I put my finger on a fingerprint reader, and I know exactly where I need to go.
B: I know where my room is, and all of that has to be done by the system by itself. What I essentially want is to go somewhere where I have something completely automated that takes this information and automatically guides me to my room, so I no longer need to care where it is, because it could be anywhere, right?
B: So what I really want is one of these neat little automated robotic helicopters that reads my thumbprint, and then I get in and it airlifts me to my room. That's what I want. The airlift thing is just kind of nice because, as we already established, we have this bit of a walking problem on the 156th floor, right? So airlifts would be really nice; and then make that completely automated: put in my thumbprint, and there we are. So now we have sort of solved the allocation problem.
B
We
have
something
that
takes
something
that
we
already
know
about
ourselves
and
automatically
gets
us
to
where
we
want
to
go.
That
is
not
all
the
problem
that
we
need
to
solve.
So,
for
example,
we
also
need
such
that
we
can
still
enter
our
room
when
housekeeping
comes
in
and
does
our
room.
We
want
some
robots
that
automatically
move
all
our
stuff
when
we
can't
enter
a
room.
B
So
if
there's
housekeeping
in
there
or
mints
or
something
we
want
something
like
this
to
move
all
our
stuff,
completely
automated
to
a
completely
different
room
right,
including
the
umbrella
and
the
at
the
end
and
the
bag
and
the
kids
titty
beer
and
the
pillow
and
whatnot,
and
we
want
all
of
this
to
automatically
move
to
a
different
room,
because
the
system
doesn't
really
care
where
my
room
is
I.
Don't
care
where
my
room
is:
we've
already
established
that
the
rooms
are
magic
because
they're
all
completely
identical
and
they
have
the
same
view,
etc,
etc.
B
So
I
don't
need
to
care
where
my
room
is
and
if
there's
housekeeping
or
maintenance
or
the
walls
are
being
repainted
I,
don't
really
want
to
know
about
it.
Instead,
I
just
want
to
be
taken
somewhere
else
by
this
magic
thumb,
print
later
helicopter
system
and
then
I
need
something,
and
that
has
actually
got
my
stuff
there
before
I
arrived.
So
I
need
one
of
these
fancy
little
fast
robots
that
solves
the
maintenance
and
housekeeping
problem.
However,
fibers
are
typically
not
scheduled,
so
the
fire
problem
is
one
that
we
still
need
to
cover.
B: ...and close the door. I want to experience just a little bit of pushback, just a little bit of additional latency, because that's what it takes: that's all the time it takes to scan everything and replicate everything and then load it onto one of these fancy fast robots and move my stuff elsewhere. So that would be cool. So, in summary, what do we need?
B
Something
that
takes
a
piece
of
information
that
we
already
know
about
ourselves
which
automatically
moves
us
to
a
room
as
soon
as
we
read
this
piece
of
information,
something
that
automatically
moves
their
stuff
from
A
to
B
when
my
room
is
not
available
and
we
need
something
that
automatically
replicates
all
my
stuff
when
I
first
get
in
there,
so
I
don't
lose
myself
in
a
fire
right
so
far,
so
good.
The
nice
thing
about
is
as
far
as
data
storage
is
concerned.
B
Ancef
implements
an
algorithm
called
crush,
controlled
replication
under
scalable
hashing,
and
the
interesting
thing
about
crush
is
that
this
algorithm
is
known
to
pretty
much
everything
that
plays
with
the
Seth
cluster.
That
is
the
components
of
the
cluster
itself,
but
also
all
of
the
Seth
clients
are
aware
of
this
algorithm
and
because
this
algorithm
is
generally
available
to
everyone
and
everything
in
the
cluster.
The
only
thing
that
we
actually
need
to
distribute
in
the
cluster
are
the
parameters
to
this
algorithm
and
then
safe
speak.
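Because placement is purely algorithmic, any client holding the current maps can compute where an object lives without asking a lookup service. A quick illustration with the ceph CLI (the "ceph osd map" subcommand; the pool and object names are just examples):

    # ask where object 'hello' in pool 'test' would be placed;
    # prints the placement group and the acting set of OSDs
    ceph osd map test hello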
B
Now
we
already
mentioned
OS
these
object,
storage
demons
and
what's
cool
about
OS
DS.
Is
that
all
of
us?
These,
just
like
everything
else
in
the
cluster,
know
about
the
current
map?
The
current
crush
map
that
describes
object
placement
and
they
can
also
propagate
that
if
they
are
sort
of
endowed
with
the
ability
to
do
so
and
they
get.
B
Don't
necessarily
want
to
treat
you
to
his
whole
thesis
because
that
might
be
a
little
bit
long,
but
but
the
but
the
two
favors
from
two
thousand
six
and
two
thousand
seven
are
really
really
easy
to
read
and
explain
this
very,
very
succinctly
where
you
it's
one
of
those
technical
reading
experiences
where
you
go:
hey,
yeah,
that's
totally
logical!
That's
the
way
that
you
need
to
do
that
and-
and
it's
not
so
much
clever
as
it
is
really
smart,
because,
as
we
all
know,
cleverness
is
the
enemy
of
stability.
B: So the MONs use a distributed consensus protocol and algorithm which is based on Paxos. Paxos is a distributed consensus algorithm that I think was first described in about 1995 or 1996. Ceph is not the only distributed technology that uses Paxos in some way, shape or form.
B: So, for example, if you are familiar with the Pacemaker high-availability stack, there is an add-on to that called booth, which is built for site-to-site clusters, or multi-site clusters, and it arrives at consensus using Paxos. For those of you that are familiar with ZooKeeper (not the conference management software for LCA, but the other ZooKeeper): that uses an algorithm based on Paxos. So this is actually something that is fairly common, reasonably well understood in the literature, and in relatively wide use. Was there a question?
E: [inaudible question]

B: So, really, if you want to read up on those papers, they're really, really cool. Now, what's interesting is that both MONs and OSDs operate entirely in userspace, as do all of the Ceph daemons, really. This is a departure from stuff that we've seen in other distributed storage technologies: for example, the Lustre file system does a lot of its work in kernel, both client and server side.
B: The way to connect to these is shown at the bottom; I hope that is legible. You connect to these boxes with secure shell, as root, to 192.168.122.x: .111 is alice, .114 is daisy, .115 is eric and .116 is frank. The root password for these is "hastexo", all lowercase, and please feel free to put your SSH public key into the /root/.ssh/authorized_keys file. By the way, if you choose (you shouldn't, but if you choose) to take these virtual machines with you and plug them into a network that you own, on a somewhat internet-facing platform, be aware that my public key is in there for root, and Tim's, and potentially your own. So if you actually want to put this out on the internet, prepare for a visit from Tim and me.
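In concrete terms (IP addresses and the password as given in the talk):

    # connect to the lab VMs over SSH
    ssh root@192.168.122.111   # alice (the client node)
    ssh root@192.168.122.114   # daisy
    ssh root@192.168.122.115   # eric
    ssh root@192.168.122.116   # frank
    # root password: hastexo (all lowercase)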
D: [inaudible question]

B: You don't need to make any changes to these. I can tell you I created them on Ubuntu 12.10, and they should work there unmodified. If they're not: hey, it's LCA, you get to hack. All right, we are aware of two things that you need to change. On openSUSE you need to change the emulator line from /usr/bin/kvm to /usr/bin/qemu-kvm, and the same goes for Fedora: on Fedora the emulator is /usr/bin/qemu-kvm as well. And then on openSUSE, rather stupidly...
B: Okay, so, like I said: if you can get your stuff going, great; if you can't, just don't worry about it for now, because you will take more out of this tutorial if you just watch, rather than trying to scramble and get things done now and then catch up.
B: So the first thing that we're going to show you is that we have a running Ceph cluster on these boxes, and we can use a utility, which is creatively named "ceph", to check the current status of the cluster. So we do ceph, dash, lowercase w, and that should give us the current status of the cluster. This may look like Sumerian to you at first sight, but you actually learn to parse it relatively quickly.
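The command in question, for reference:

    # print cluster status, then keep watching for status changes
    ceph -w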
B: The first line is the overall health of the cluster; there might be a HEALTH_WARN or a health-critical state or other things. And then in the next line we get our current mon map, so that is the current monitor servers that we have in the cluster. What we did in the configuration for these is... let me break that here real quick; can you just Ctrl-C that, just so it doesn't roll over? Thank you. What we have here is three MONs: all three of the Ceph cluster nodes are also monitor servers, and that is something that you do for MON high availability.
B
So
what
you
can
do
is
you
can
have
a
single
mon
if
you
lose
that
you're
kind
of
screwed,
because
the
Mons
is
how
clients
actually
connect
to
the
cluster
and
that's
the
only
piece
of
information
that
we
need
from
the
client
to
connect
to
the
cluster
once
it
has
it,
a
single
mom
that
it
talks
to
it,
finds
out
about
all
the
other
bonds
in
the
cluster
we're
all
viewers.
These
are
etc,
etc.
B
But
if
you
have
one
and
you
lose,
that
you're
kind
of
in
trouble-
if
you
have
to
that's,
actually
a
really
bad
idea,
because
it
uses
a
consensus
algorithm
based
on
quorum
and
if
you
have
just
two
nodes
or
two
mons
in
the
cluster,
you
lose
one.
The
other
automatically
loses
quorum
and
is
also
unavailable,
so
minimum
number
of
bonds
that
you
want
to
have
in
SF
cluster.
In
order
to
be
highly
available
rate
and,
generally
speaking,
you
would
use
an
odd
number.
B
Ok,
we
have
an
OSD
map,
so
we
know
about
the
OS
DS
in
the
cluster
object,
storage
demons,
and
currently
we
have
30
s
DS
up
and
also
30
s
DS
in
so
up
means
that
the
OSD
demon
is
actually
running
and
is
responding
on
the
network
and
in
means
the
OSD
is
currently
eligible
and
available
for
data
placement
or
data
storage
and
we're
going
to
see
what
happens.
If
we
do
things
with
these
and
then
we
get
a
little
bit
of
information
about
our
current
placement
groups.
B: So we currently have 840 placement groups in the cluster, and they are all considered active and clean. That is to say, we have no degradation of storage anywhere; everything is wonderful. And something that I failed to mention up to this point: we also have in this cluster two MDSs, metadata servers. Those are only relevant to the Ceph file system, which we're going to get to at the end of the tutorial.
B: So the question was: what do the various e's mean in the output of ceph -w, the e1, e155, e66 and so forth? That's an epoch. All of these maps in the cluster are versioned, and those are just the version numbers, which increase. And it's actually really, really cool how it's done: no OSD and no component in the cluster ever sees any of these version numbers go backwards. That, again, is one of the things that you will totally geek out about in the papers when you read them. Okay.
B: So the next thing is what got us here: we have a single central configuration file, /etc/ceph/ceph.conf. There you go. And this is actually relatively simple and straightforward; there's nothing super-duper fancy in here. Ceph uses an authentication service called cephx, which we can use both for clients to authenticate to the cluster and for individual cluster daemons to authenticate to each other.
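To give a feel for the file, here is a minimal sketch in the style of the ceph.conf of that era (section names follow the Ceph documentation of the time; the host name is one from this cluster, everything else is illustrative, not the talk's actual file):

    [global]
        auth supported = cephx
        log file = /var/log/ceph/$name.log
    [mon.daisy]
        host = daisy
        mon addr = 192.168.122.114:6789
    [osd.0]
        host = daisy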
B: We can define a specific log file if we want to; in this case we're just logging everything to /var/log/ceph, and then cluster ID, cluster name and daemon ID. By default the cluster name is "ceph", and the daemon ID is the stuff that we have after the dot for the various daemons. We have defined three different MONs, on our hosts named daisy, eric and frank.
B: So the OSDs are actually the data storage workhorses in the cluster, and of these we might well have hundreds. It is a relatively standard design practice to have one OSD per physical storage disk that you have in the server, in the individual node, and it is relatively common for a Ceph storage server to have between about 4 and 12 spinners in it.
B: So if you're thinking "oh, I'm going to procure hardware for a Ceph cluster", the kind of box that would work really, really well for the dinosaur technologies of days of yore, with like 48 disks in it and whatnot: that is not your ideal Ceph or OSD box. OSDs are really, really smart in what they're doing, so they do consume a fairly significant amount of processing power and memory, and what you typically want to shoot for...
B
C
B
E
You
like
to
make
me
dance:
okay,
yeah.
B
B
C
C
B
B
B: If you run multiple SSDs, can you increase the density? Yes, you can. You can, of course, run a whole Ceph cluster, with like several petabytes, on all SSDs if you choose to; that will typically be limited by budget constraints. The nice thing that you can do with a Ceph cluster is build something that is really awesomely distributed and very, very fast using, say, 7.2k RPM SATA spinners with two terabytes each, which are really cheap disks, and you just invest a few bucks extra into the SSDs.
G: [inaudible question]

B: The question was: I said that the standard number is about 4 to 12 disks, and what change can we expect if we're using more or fewer than this? You just happen to hit different performance limits. It will still be fine in terms of performance and utilization and whatnot, but, for example, this is a standard thing that we come across relatively often when we do projects like this.
B: Someone buys a server with lots and lots and lots of spinners and one or two SSDs, and then what happens is what I said earlier: it's actually the SSD that becomes the performance bottleneck. So in that case you can still use the box, and what you would do instead is just put the journal on the file store, on the spinners themselves, but you may just have wasted a little bit of cash on a few SSDs. Bruno, you had a question?
D: [inaudible question]

B: So, how does Ceph deal with heterogeneous types of disks and arrays and things? In the CRUSH map you can define weights. So you can say: give preference to this type of disk, or this box has this much more storage than the other box, etc. So it deals with that.
D: [inaudible]

B: Even better; it's wonderful. Okay, alright, more questions? Yes.
C: [inaudible question]

B: The classic issue that you typically have in a standard-issue storage solution, which is that you typically get periods of relatively high locality of I/O, is just not happening here. Instead, in a sufficiently scaled-out cluster, you're hitting sufficiently many nodes that you're distributing the load nicely, so you're actually not getting that kind of locality.
B: As for the advisable SSD-to-spinner ratio: a good rule of thumb is generally about four OSD journals per SSD, give or take a little bit. So if you have, say, eight spinners, two SSDs would be perfect. Okay, all right. So we're going to skip the whole RADOS Gateway stuff, because we're going to get to that in a second. What we're going to do instead is show you, real quick,
B: ...what a Ceph OSD actually looks like. Here be dragons, be very afraid. Can we just see a "mount" real quick? So what we have here is a separate file system for the OSD, because it's on a separate disk; in this case it is an XFS file system that is mounted from /dev/vdb1 to /var/lib/ceph/osd/ceph-0. And we have two theoretically recommended file systems for Ceph OSDs; in practice...
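For reference, the mount line in question looks roughly like this (device and mount point as read out in the demo):

    # peek at the OSD's backing file system
    mount | grep osd
    # /dev/vdb1 on /var/lib/ceph/osd/ceph-0 type xfs (rw)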
B: ...I would actually recommend only one. The two that are theoretically recommended are btrfs and XFS, btrfs sort of being the perfect option, when it's ready. People have been saying nasty things about btrfs, like "it is two years from production, and always will be". I will not go that far, but for right now, arguably, it is still marked as experimental; you probably don't want to entrust your petabytes and exabytes of data to btrfs. Ceph OSDs actually do a few clever things with btrfs.
B: Such as, for example: when using btrfs, you can do the journal write and the file store write in parallel, which can greatly speed up the process, because if the journal write fails, you just roll back to the previous btrfs snapshot, which is kind of cool; you have a consistent state again. And there are various other...
B: ...neat little things that it does with btrfs. For right now, my general recommendation in Ceph projects is to use XFS, which is really, really fine and helpful as far as that is concerned. There is support for ext3 and ext4; those have some performance issues because of the way they handle user extended attributes, but, generally speaking, XFS should be a very safe bet.
B: And do an ls in there; or even, well, actually, go into "current" in here, there we go, and do an ls -lR in here. Yeah: a lot of it looks very much like a normal, regular file system, right?
B: This is very much optimized for Ceph's own purposes, for object storage, and it is one of the things that tends to scare people the most about Ceph: if you're running into a problem where essentially all of your MONs, all of your monitor servers, are completely unavailable and you can't connect to any of them, it's really, really hard to get your data out.
B: That is something where people usually get more of a warm fuzzy feeling when they use GlusterFS, because in GlusterFS all of that is very transparent. If I put data into GlusterFS, then I can look into the brick, the underlying local file system that GlusterFS exports, and in there is exactly the same file name, and, if I'm not using striping, the same file size and the same attributes, etc., and I can easily get my data out of GlusterFS even if I kill the GlusterFS service. With Ceph, not so much.
B: Alice is our client node for everything, but most of the stuff that we're running on alice you could also run on all of the other hosts, okay? And in /root on alice you will find, in total, four directories, and their names are 01-rados, 02-rbd, 03-radosgw and 04-cephfs, and that's exactly the order in which we're going to do them.
B: So we're going to start with RADOS, and the first thing that we can do is just get a list of the RADOS pools in here. And just so we don't waste a lot of time, and create a lot of friction, by us reading out or spelling out commands, we've decided to just put everything in neat little shell scripts that you can then review, to see what they're doing under the covers, and all of these run with "set -x", so you can actually see the commands as they're happening.
B: So the first thing that we're doing is getting a list of all the pools that we have in this Ceph cluster, and three of those are always available by default; they are named "data", "metadata" and "rbd". The data and metadata pools are for the Ceph file system, and rbd is for RADOS block devices. The others are there because we have created a RADOS Gateway in here for you already, and there's one that we're using for testing purposes, and because we're tremendously creative we've named it "test".
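The listing itself is a single command (the rados utility ships with Ceph):

    # list all pools in the cluster; data, metadata and rbd exist by default
    rados lspools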
B: What's the pool that we want to interact with? That's the -p option: test. We define what the client identity is that we want to use, and in this case it's the one that Ceph creates when we first install it, client.admin, and we define a keyring that we use to identify as that identity to the cluster. So in this case: rados, dash p, test, then the client name and the keyring options. By the way, we could leave all of that out except the "-p test", because the others are actually the defaults.
B: Okay, so by default you're connecting to the cluster with the admin identity, using /etc/ceph/keyring as the default keyring location. And then you put an object in there: "rados put", the object name, and then, what gets cut off at the end of the screen here, the actual content that we want to toss in there. So what we're doing is echoing "hello world" into this, and that gets piped in on standard input, and then we have created a Ceph object named "hello" in the test pool. And we can retrieve that again with, that's it, the get-object thingy. There we go again: rados, the usual options, get, the object name, and it just spits out the content of this object on the command line.
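Spelled out, the round trip looks like this (object and pool names as in the demo; a temp file is used here instead of the demo's stdin pipe):

    # store some content as object 'hello' in pool 'test'
    echo "hello world" > /tmp/hello.txt
    rados -p test put hello /tmp/hello.txt
    # fetch it back and print it
    rados -p test get hello /tmp/out.txt && cat /tmp/out.txt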
B: Okay, there are a few other things that we can do with objects. As we mentioned before, objects don't only have a name and some content; they can also have attributes. That's the next thing. So on this object there are two attributes: "foo" with the value of "bar", and "spam" with the logical value of "eggs". And you can do that with as many attributes as you want. You can define new attributes, you can set them, you can retrieve their value, you can update them, etc., etc.
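In command form (the rados xattr subcommands; attribute names and values from the demo):

    # attach two attributes to the object, then read one back
    rados -p test setxattr hello foo bar
    rados -p test setxattr hello spam eggs
    rados -p test getxattr hello foo     # prints: bar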
B: And then, finally, there is not only the possibility to put and retrieve an object; we can actually find out where a specific object is, and that is what we do with this map-object script. This is actually a two-step process; we've rolled it into one script. You retrieve what is called the OSD map, which is essentially the information about where all the OSDs are, and then you use the osdmaptool utility to figure out: okay, where is the object named "hello"? And then it tells us: okay, it's part of that placement group, and it is currently assigned to the OSDs 2 and 1, which happen to be the OSDs on frank and eric. Okay, so...
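The two steps, unrolled (the pool id here is an assumption; "ceph osd dump" shows the real one for the test pool):

    # step 1: grab the current OSD map from the cluster
    ceph osd getmap -o /tmp/osdmap
    # step 2: compute the object's placement from the map alone
    osdmaptool /tmp/osdmap --test-map-object hello --pool 3
    # reports the placement group and the acting OSDs, e.g. [2,1]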
B: So while I click in here and do a "ceph -w"... there we go. Currently, as you can see: 3 OSDs, 3 up, 3 in, everything wonderful and dandy. And then, after some time, we're going to see that one of those OSDs is going to be down. You killed it? Yep, there we go: 3 OSDs, 3 up, 3 in, for the moment.
B: "Up" and "down" just mean: is this thing actually alive, and is it responding on the network? And then we have "in" or "out", which means: is it available for actual data storage? What Tim did is kill one. Normally an OSD goes to the "down" status, and then it waits for about five minutes, and then it's actually "out", and then this thing that we call backfill and recovery happens. So what we can do now is...
B: ...lo and behold, this has been completely reassigned, right? So prior, it was on OSDs 2 and 1; now we killed one, and now it's on 2 and 0. Isn't that wonderful? So what Ceph does is: not only does it keep our data available as nodes become unavailable, but it actually restores the degree of redundancy that we have configured, completely automatically.
B: There are higher-level client layers that RADOS ships with, and that's what we're going to concern ourselves with in the last just-over-30 minutes. One of those high-level client layers is the RADOS Block Device, or RBD. RBD is a thin-provisioned block device that stripes data across multiple RADOS objects. So it's a block interface: everything that we write to this thing actually gets striped across multiple RADOS objects in the cluster, and then all of the distribution and replication and whatnot, all of that, happens at the RADOS level.
B: RBD doesn't have to care about this, which makes it a very, very thin layer on top of RADOS, compared to having to do everything, replication and HA, itself. RBD supports read-only snapshots; these snapshots are redirect-on-write, and they are super cheap, because everything is thin provisioned anyway, so we can use much the same facilities in order to provide snapshots, and that's really, really cool.
B: It supports cloning. Cloning means that we can designate a snapshot as a master copy of other, writable RBD images, which is cool, and with that it of course becomes very suitable for maintaining things like template-based virtual machines, which is why RBD is heavily used in cloud technologies like OpenStack and CloudStack and others; it's just very, very useful for this purpose.
B: Now, it actually comes in two flavors, not one. We have rbd, which is a kernel-level block device driver, and that made it into upstream Linux in 2.6.37. If I use an RBD that way, I just map it (it's called mapping), and then it becomes a virtual block device that pops up on my Linux box; it's /dev/rbd-something, and then I can use it just like any other old block device.
B: So let's take a look at this. Now, here is where the tutorial is a little tighter than it was going to be, because we uncovered a couple of interesting rbd bugs just yesterday, which Sage was nice enough to fix, but we just haven't gotten updated packages yet. So, next: chapter 02-rbd. There we go, and again we have a simple script there which we can use to just create an RBD.
B: There we go, very simple: we do "rbd", and then, and this is kind of nice, pretty much all (not quite all, but pretty much all) of the client userspace management utilities in RADOS support identical command-line options. So the -n for selecting the client identity is always identical, and the -k for selecting the keyring is always identical; the same thing is true for rbd. What we're doing here is creating an RBD image with a size of 512 megabytes, named "test", and then doing an ls of the same thing...
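The two commands at the heart of that script:

    # create a 512 MB, thin-provisioned RBD image named 'test'
    rbd create --size 512 test
    # list the images in the (default) rbd pool
    rbd ls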
B: ...to just show that it is, in fact, there. The block device as such is thin provisioned, which means that the space we're defining here is not taken up immediately; that is only the maximum space that it can eventually take up. So, for example, if I use the rados utility to list the rbd pool, so I'm using sort of a lower-level utility now to look under RBD's covers, what I can see here...
B
Is
that
there's
actually
just
two
radars
objects
that
have
been
created
in
this,
namely
the
RV
directory,
which
has
information
about
all
of
the
rbd
images
in
the
pool
and
a
header
object,
taste
rbd,
and
that's
it
only
as
I
write
to
this
device?
Do
we
actually
get
additional
objects
that
are
being
allocated?
Yes,.
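Looking under the covers, per the demo (format-1 image object naming):

    # only the directory object and the image header exist so far
    rados -p rbd ls
    # rbd_directory
    # test.rbd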
G: [inaudible question]

B: Does thin provisioning have a performance impact? No. Okay, and we can then map this thing; this is what we do with the 02-map.sh script. So here, again, we use the rbd command. In this case we actually have to modprobe the rbd module, and once we do that, it becomes available as a block device, /dev/rbd0, and also under /dev/rbd/...
B: Now, for example, we'll go ahead and make a file system on this thing. That would be mkfs -t xfs, yeah, like that, and then you just give it /dev/rbd/rbd/test, and it should hopefully happily create a file system for us. Did it? Done; the terminal is really slow, but that's okay. So we can use this as we would any other block device. The next step in the demo would have been a fancy snapshot, but that's where the bug is. Yes, we have a question?
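The sequence end to end (device paths as they appeared in the demo):

    # make the kernel driver available, map the image, then format it
    modprobe rbd
    rbd map test                    # appears as /dev/rbd0
    mkfs -t xfs /dev/rbd/rbd/test   # the /dev/rbd/<pool>/<image> symlink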
G: [inaudible question about mapping the same image from multiple clients]

B: Yes, I can; in fact, it is a bit over-permissive in that regard, in that there is actually no locking mechanism akin to "whoa, hang on" built into RBD. If you're putting something with its own lock manager on it, say an OCFS2 file system, you can perfectly well do that. What it doesn't do is something like SCSI SPC-3 persistent reservations; such a thing does not exist, and the question is whether it ever will, because RBD happens to be distributed, whereas all of the SCSI stuff is pretty much centralized. Okay, so...
B: So we have RESTful HTTP and HTTPS access to the object store, and Ceph does that through a FastCGI application named RADOS Gateway. RADOS Gateway itself uses the C++ API for RADOS, called libradospp, just in case you're interested, and RADOS Gateway runs in essentially any web server that supports the FastCGI interface. So you can run this with nginx...
B: ...if you choose to; you can run it with lighttpd; you can, if you're brave slash insane, run it with IIS, which I think supports FastCGI kind-of-sort-of; but the canonical way of doing this is, interestingly, with Apache and mod_fastcgi. For those of you who are Apache geeks: you will probably scream and shout at me that this is not the latest FastCGI implementation that's commonly recommended for use with Apache.
B: Instead, all of the data that it works with, even the stuff that it needs for itself to function, it stores itself in RADOS. So there is no local storage whatsoever, and that means that RADOS Gateway completely natively supports load balancing and scale-out. If we want to scale out over multiple entry points, we just add them: we just add one RADOS Gateway host, and then another, and then another, and then another, and we can use any sort of load-balancing facility that we would like to. One of the more popular ones...
B: ...is just plain round-robin DNS. You could theoretically use an IP load balancer, but then that might become a scalability choke point just the same, because now you're directing all your clients through one gateway that then load-balances across multiple RADOS Gateways; perhaps not the greatest idea. But if you're using DNS load balancing, so round-robin DNS, multiple DNS entries for the same name, then that is just perfectly fine. Now we're going to take a look at what that looks like.
B: So we have a RADOS Gateway set up on the same hosts, the same three hosts, that run the OSDs and the MONs. That's just because it's kind of convenient in this kind of demo setting; most typical setups in production would run RADOS Gateways on separate hosts, as many of them as you like. DreamObjects, which is about three petabytes of storage behind RADOS Gateway, the last I checked used, I think, four RADOS Gateway hosts; it's not like you need a huge number of them.
B: It's all HTTP and REST and whatnot; you can use whatever caching implementation you want to use. So if you want to do the same thing that, for example, Wikimedia does for their Swift stuff, which is that they're just strategically placing Varnish caches across the globe and they can run out of a single Swift repository that way: you could do the exact same thing with RADOS Gateway, because it's all just HTTP, and you can use the proxying facilities that we all know and love from HTTP.
B: So we're going to start out with some... oh: what we're doing to interact with this RADOS Gateway here is just a completely unmodified Amazon S3 client. It's using s3cmd, which is an Amazon S3 client that ships with Debian and Ubuntu, and presumably many other distros, and we can just use that to interact with the object store. So what we're first going to do is see: okay, what kind of buckets do we have on this thing?
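That first step is simply (s3cmd's standard bucket listing):

    # list all buckets visible to the configured credentials
    s3cmd ls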
B: And we deliberately turned on debugging for this thing, just so you can see what is actually coming down the wire. So, in this case, we are connecting to something that actually pretends to be s3.amazonaws.com. We do so with some cute little DNS tricks; no real trickery here, it's just that on alice we have a BIND, a named, that just hosts that zone, and then we go from there. And so we have three buckets here, and they are aptly named foo, bar and baz.
B: So what we're going to do now is check how we actually interact with this thing. There's a utility called radosgw-admin, and that's how we can define our users and permissions and access and things. So, in this case, we created a user with the beautiful name of John Doe; he works for example.com, which I hear is a multi-billion-dollar enterprise in the United States. And what we can define here are just regular old Amazon S3 access keys and secret keys.
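A sketch of that user-creation step (radosgw-admin flags per its man page; the uid is an example, not necessarily the one used in the demo):

    # create an S3-capable user on the gateway
    radosgw-admin user create --uid=johndoe --display-name="John Doe" \
        --email=john@example.com
    # the output includes the generated access key and secret key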
B: So, for all of you familiar with Amazon S3: that is how you interact with the storage space. And something else we've prepared: we have added credentials for John Doe for when he acts as a Swift user. So what we're going to do first is upload and create an object in there; we're uploading something to S3. Okay, so again with debugging and la la la, but what we did is just put something into the "foo" bucket, and it uses, essentially, bucket-based host names.

B: So there we go: we just downloaded our object named "spam". And as you can see, what we originally uploaded was the test.txt thing, and it said "hello world"; we uploaded that using Amazon tools into RADOS Gateway and Ceph, and we then retrieved it using a completely unadulterated Swift binary, doing a download request of that object named "spam" from a container named "foo", and, lo and behold, it has exactly the same content.
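Both halves of that round trip, roughly as run (the auth URL, host name and key are assumptions standing in for the demo's local setup):

    # upload via the S3 API...
    s3cmd put test.txt s3://foo/spam
    # ...and fetch the same object back via the Swift API
    swift -A http://daisy/auth -U johndoe:swift -K <secret-key> \
        download foo spam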
B: And you could do that from the other nodes as well, but that's okay. And then, finally, the stuff that you've all been waiting for, because most people tend to perceive Ceph as a distributed file system, which it is, among many other things. Now, in Sage's presentation yesterday, he listed basically all of the components in Ceph, librados, RADOS Gateway, RBD, as awesome, and the file system as "almost, but not quite yet, awesome". So this is considered experimental, which is kind of interesting, because it is what originally drove the development of Ceph.
B: I think it's fair to say the goal was to build a Lustre without its shortcomings, right? And there we go: a distributed file system on top of RADOS, and it's currently considered experimental, although it's been in the mainline kernel since 2.6.32. That in itself is no surprise, because btrfs has been in the mainline for quite a while, and it's still experimental. So that's okay, and you can use it.
B: That is just fine, and there are people that are running this in production, or at least claim to do so on the relevant IRC channels and mailing lists and whatnot. They are typically to be found in academia, which is no surprise, because they are also the typical Lustre users, and they're also looking for "Lustre without the suck", essentially. And Lustre has a few really painful shortcomings. Number one: it has a central data lookup; it has a metadata server, which is a scalability choke point, and it is a single point of failure.
B: You can do a little bit of high availability around that. It also does a few things in the kernel that Ceph does in user space, and a few other things. When I say "Lustre without the suck", that of course doesn't mean that Lustre completely sucks; quite the contrary, it's a very stable HPC file system, and it's been in use forever. It's just that it has a few of these shortcomings that people would like to address, and some of those people happened to be Sage and his team at UC Santa Cruz.
B: The Ceph file system is seventeen thousand lines of code, which is really, really tiny in comparison, and that is because all the others have to take care of lots and lots and lots of things by themselves that, in the Ceph file system, are just offloaded to RADOS. All the file data and the metadata itself live in RADOS objects, and to manage this metadata we have another type of daemon.
B
The
third
type
of
safe
demons
called
a
metadata
server
for
MDS
and
it
what
it
does
is:
filesystem
clients
and
fascism,
clients
only
no
RVD
clients,
nothing
that
uses
liberate
us
directly.
Nothing
that
uses
the
python
bindings
directly.
Only
the
file
system,
clients
actually
talk
to
this
metadata
server
and
the
metadata
server
also
caches
this
metadata.
For
for
clients
to
improve
performance,
it
runs
entirely
in
user
space.
The
MDS
that
is,
and
only
the
file
system
client,
runs
in
the
colonel.
So
we
only
have
two
components
really
insect
that
run
in
kernel.
B
One
is
the
kernel
rbd
device
and
one
is
the
kernel
set
file
system.
There
actually
is
also
a
fuse
client
force
F,
but
that
is
really
only
recommended
for
use
in
those
situations
where
you
are
on
a
system
that
does
not.
That
is
not
linux
and
therefore
does
not
support
the
file
system.
Client
that
does
support
fuse.
So
that,
for
example,
would
be
your
way
of
talking
to
a
safe
file
system
from
say
freebsd.
B
If
that's
what
you
want
to
do
now,
set
amounts
are
writable
from
any
client,
so
we
can
mount
them
from
as
many
clients
as
we
want,
and
they
are
of
course
readable
and
writable
from
them,
and
they
also
play
nicely
with
by
locking
all
the
file
locking
as
pretty
much
anything
in
Linux.
As
far
as
file
locking
is
concerned
is
advisory.
B
There
is
no
mandatory
locking,
so
applications
actually
have
to
ask
for
a
law
for
locks,
and
if
they
don't
get
it,
they
to
politely
wait,
but
if
they
just
don't
ask
for
locks,
but
just
barging,
then
okay,
that's
it.
But
that
is
how
how
file
locking
works
in
Linux
in
general
and
mandatory
locking
is
available
as
a
mount
option
in
just
about
any
file
system
and
just
about
any
file
system.
It
never
really.
B: And something that's really cool about Ceph, the Ceph file system, is that it supports arbitrary directory-level snapshots, and what that means is something we're going to get to in just a moment; it's a really nice way of doing copy-on-write directory trees. One thing that is currently unsupported is reflink. Reflink is a means of creating a copy-on-write duplicate of a single file. That is something that is supported in OCFS2, and it is supported in btrfs, where it's called cloning, and that is something that we can't do.
B: So we can't do snapshots of individual files in the Ceph file system, but we can do snapshots of directories, and we can also, obviously, do snapshots of the root directory of the Ceph file system, which means we're snapshotting the entire volume. And CephFS has really spiffy accounting and statistics that it exposes through virtual extended attributes. If you consider that you have, say, a three-petabyte cluster with hundreds of nodes, then finding out how much data is in a specific directory by running du is going to be a pretty intensive operation.
B: We are now going to reboot this thing, which, if you please, Tim. There we go; it should come back up, and when it comes back up... we know the drill, because we did this yesterday. So here is why the file system is experimental. We have several issues here: the Ceph file system development happens completely upstream, and it is actually sort of detached that way.
B: That is beautiful. So we should just show you that really, really fast: it has actually mounted, and now we're going to look at what's in there, in the /mnt directory.
B: ...how many files are in this entire volume, okay, all of them; and how many bytes are in there; and what is the most recent ctime in the entire tree; and so forth. So that's really cool, and it beats the hell out of doing du, or find, or whatever, over hundreds of nodes and petabytes of data. So that's a really kind of cool feature.
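Those recursive statistics are exposed as virtual extended attributes (attribute names per the Ceph documentation of the era; a sketch):

    # recursive stats on any CephFS directory, for free
    getfattr -n ceph.dir.rfiles /mnt   # number of files under the tree
    getfattr -n ceph.dir.rbytes /mnt   # total bytes under the tree
    getfattr -n ceph.dir.rctime /mnt   # most recent ctime in the tree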
B: We can also interrogate the file system itself about the underlying RADOS properties of things. In this case we're just doing a show_location on the /mnt directory node, and it tells us exactly what the RADOS object name is, and its size, and which OSDs it's stored on. And we can also ask similar things about the file layout, so it will tell us, okay, how many stripes we have for this particular file, and so forth, and so on.
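That is the cephfs helper utility that shipped with Ceph at the time:

    # RADOS-level placement and striping info for a file or directory
    cephfs /mnt show_location
    cephfs /mnt show_layout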
B: So that's really kind of neat. And what is our final thing? Oh, that's right, yeah, of course: snapshots. So what we also want to do is create a snapshot, and now this is really kind of interesting; hang on a second, sorry. So, what do we do in order to create a snapshot? In every directory...
B: ...we have this fancy little subdirectory called .snap, and if we just do a mkdir in that, we've magically created a snapshot. So what we can do now is remove everything under /mnt, which is going to take a few moments. There we go. Good, remove it. Yes. Hey, and now let's look at that again.
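The snapshot mechanics in full (the directory name is just an example):

    # taking a snapshot is a mkdir inside the magic .snap directory
    mkdir /mnt/.snap/before-cleanup
    rm -rf /mnt/*                    # now delete everything...
    ls /mnt/.snap/before-cleanup     # ...it is all still in the snapshot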
B: Come on, there we go... not quite there yet. Let's see; it should be rolled out of it. That's gone. That's one. That's gone.
B: In the box, Tim; shellinabox, yes, thank you. [inaudible exchange]
B: Come on... well, you know, it's four OSDs, I'm sorry, three OSDs, running on one laptop; it's probably not going to be fast.
D: [inaudible question]

B: No, that's not happening. You can't just magically migrate from RBD to the Ceph file system; it's completely different data, in different pools, and all sorts of things. If you're just using RBD, then you have no need at all for the metadata servers and whatnot. But if you then decide, with this beautiful Ceph cluster that I've built and run RBD and RADOS Gateway on, etc...
B: At any rate, you totally need to be geeked out now, because if you're not, you're positively soulless. Before we get to the questions and things, a bit of a thank-you section here. Thanks goes, obviously, to Sage and crew for Ceph. If you're wondering, because you had a question there: the presentation tools that we've been using are impress.js and shellinabox. Inktank was nice enough to let us use the Ceph logo, and all of the artwork is courtesy of Tim.
B: And the directory is kind of magical, because the .snap directory doesn't show up anywhere in the listing of the directory. And if you then did an "rm -r /whatever": uh-huh, oh, you'd have nuked the snapshot? No: the snapshots are actually read-only, so you can't just nuke them that way. If you want to remove the snapshot, that's an rmdir, and poof, the snapshot is gone.
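And removing a snapshot, correspondingly (same example name as above):

    # snapshots are read-only; deleting one is an rmdir of its .snap entry
    rmdir /mnt/.snap/before-cleanup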
B: Was there a question? Really?