From YouTube: Red Hat / Inktank Ceph Day Sessions, Jeff Darcy, Red Hat
Description
Ceph Day Boston 2014
http://www.inktank.com/cephdays/boston/
Okay, hi everyone. So I'm going to be talking about a kind of insane thing that I did immediately after the Inktank acquisition, which was actually an idea I'd had about two years ago, about a really different way to possibly combine the Gluster and Ceph technologies. And I have a couple of points that I want to make before I really start diving into it.

First off, this is science. This is not engineering; it's science in the sense that it's about discovering what the world is like, what things are possible.
It's not trying to apply that knowledge to serving any particular need, and it's not part of a roadmap. I particularly have to mention that, because there's some buzz going around, some FUD being spread by people who have vested interests in alternative technologies, about how, since the acquisition, we're going to take Gluster and Ceph, hack off bits of both of them, and leave them bleeding in the gutter. That's really not the case.
There's no particular plan, and my boss here can back me up on this, to merge the two technologies. That is kind of what I'm doing with this experiment, but it's just an experiment. What I'm trying to learn from this is a couple of things. I'm trying to learn something about the librados API, because I'm curious, and I'm trying to maybe tease out some information about which components within each stack are contributing to performance, to failure handling, and to other things.
One way to do that is to sort of mix and match the pieces and see what changes relative to their origins, and partly it's just that when you combine things in strange ways, you never know what information is going to fall out. So we actually did find a couple of fairly interesting things that deserve further investigation and which may end up ultimately benefiting both projects.
I did a little project at one point, while I was at Red Hat, just taking every distributed file system I could find and running it through the same set of workloads. Just a little side note: it was pretty sad. About half of them crashed, a few of them couldn't even build, one corrupted data, one hung. Ceph was actually one that managed to make it through the exhaustive test of trying to write ten files simultaneously and then read them back.
So it was pretty pathetic. Gluster was another, and I think XtreemFS was the only other one that actually managed to do that. Now, remember, that was three or four years ago, so they've gained some maturity; maybe they could write twenty files simultaneously now. So yeah, I'm just the kind of guy who's going to take this opportunity to try and mash together two supposedly disparate technologies. In fact, who here is familiar with both Gluster and Ceph to any appreciable degree at all?
So you people probably realize that we have a lot more in common than what separates us. Sure, we took some different implementation approaches, but they're both the same sort of scale-out distributed systems, and we're using some of the same algorithms, or at least algorithms that are cousins to one another. So it's actually not that weird to be combining them. Now, I know this is Ceph Day, not Gluster Day, but I'm going to want to explain what it is that I'm doing here.
I do need to explain a tiny bit about how Gluster works. The core concept in Gluster is the translator. As one of the founders of Gluster likes to point out, probably to his detriment, the idea actually came from the GNU Hurd operating system project. The idea is that a translator is a module that takes in an I/O request and then spits out an I/O request in exactly the same form: a create comes in at one end and a create goes out at the other end, so it's a kind of filter.
A translator can also fan out requests, it can do routing, it can do things like that. But the key thing is that the interface above and the interface below are exactly the same, so you can stack translators in all sorts of orders, move them across the server/client boundary, and so on. This is just how Gluster is implemented, and so most of the functionality that in Ceph you would see as one piece is actually split out into different translators in Gluster, which are loaded separately and in some cases developed separately.
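To make that concrete, here is a minimal sketch, not taken from the talk, of what a do-nothing pass-through translator looks like: the writev it receives from above is wound down to its child in exactly the same form, and the reply is unwound back up. The names follow the public GlusterFS translator headers, but treat the details as illustrative.

    #include "xlator.h"
    #include "defaults.h"

    /* Reply path: hand the child's answer straight back up the stack. */
    int32_t
    passthru_writev_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                         int32_t op_ret, int32_t op_errno,
                         struct iatt *prebuf, struct iatt *postbuf, dict_t *xdata)
    {
        STACK_UNWIND_STRICT (writev, frame, op_ret, op_errno,
                             prebuf, postbuf, xdata);
        return 0;
    }

    /* Request path: same operation in, same operation out, one level down. */
    int32_t
    passthru_writev (call_frame_t *frame, xlator_t *this, fd_t *fd,
                     struct iovec *vector, int32_t count, off_t offset,
                     uint32_t flags, struct iobref *iobref, dict_t *xdata)
    {
        STACK_WIND (frame, passthru_writev_cbk,
                    FIRST_CHILD (this), FIRST_CHILD (this)->fops->writev,
                    fd, vector, count, offset, flags, iobref, xdata);
        return 0;
    }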
So what this diagram shows is that we have some translators on the left, which are the ones that inject requests into the system. Our primary one is FUSE; that's our native protocol. It's talking to the FUSE driver, taking requests, and turning them into translator requests.
NFS is another one that does this. Then we have something called libgfapi, which is a library interface that is also capable of generating all these requests and pushing them through the translator stack. So, for example, the Samba integration is moving towards using libgfapi, and the block device integration is using libgfapi; you can do all sorts of things. It's mostly like libcephfs, though there are some differences.
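For anyone who hasn't seen libgfapi, here is a rough sketch of what a caller looks like; the volume and server names are made up, and the point is only that these calls feed the same translator stack that FUSE does.

    #include <glusterfs/api/glfs.h>
    #include <fcntl.h>
    #include <stdio.h>

    int main (void)
    {
        glfs_t *fs = glfs_new ("testvol");              /* hypothetical volume name */
        glfs_set_volfile_server (fs, "tcp", "server1", 24007);
        if (glfs_init (fs) != 0) {
            perror ("glfs_init");
            return 1;
        }

        /* From here on, requests go through the normal translator stack. */
        glfs_fd_t *fd = glfs_creat (fs, "/hello.txt", O_RDWR, 0644);
        glfs_write (fd, "hello\n", 6, 0);
        glfs_close (fd);

        glfs_fini (fs);
        return 0;
    }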
So those are all the ways of getting stuff into the system. Then you go through a bunch of translators, represented here by the dot-dot-dot, which are things like write-behind, read-ahead, locks, and all sorts of other stuff that's not particularly relevant here. And then the two big boys. DHT is our distributed hash table; it's how we do distribution across many servers, so it's basically performing a routing function: it gets a request in from the user and sends it to one of its children, which is one of the bricks.
One little thing that I'm never going to do again is use "advanced" in the name of any code that I develop, because five years later it's going to look silly; but at the time it was advanced. So this is AFR, our replication module, which takes a request in and sends it out to all of its children. In this case each AFR instance is sending it to two bricks, and the DHT instance is sort of round-robining, or rather randomly hashing, among those AFR instances.
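Just to illustrate the routing idea, here is a standalone toy, not DHT's real code (which assigns hash ranges per directory rather than taking a plain modulo): hash the file name and use the result to pick one of the subvolumes, each of which here would be an AFR pair.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy hash-based routing: map a file name onto one of n_subvols children. */
    static unsigned int
    pick_subvolume (const char *name, unsigned int n_subvols)
    {
        uint32_t hash = 5381;                       /* simple djb2-style hash */
        while (*name)
            hash = hash * 33 + (unsigned char) *name++;
        return hash % n_subvols;                    /* index of the chosen subvolume */
    }

    int main (void)
    {
        printf ("file.txt -> subvolume %u of 4\n", pick_subvolume ("file.txt", 4));
        return 0;
    }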
So this leads to a couple of differences between how GlusterFS works and how Ceph works. In Gluster we have one role in the I/O path: the brick. That's the only thing; there's one process on the servers doing one thing, owning its own data and metadata. In Ceph, in the I/O path, using the file system of course, you actually have two different roles, which may be distributed differently among the nodes in the system.
Then, of course, you have the mons, and on the Gluster side you have glusterd to do the management types of stuff. But it's a fairly fundamental difference, because it affects a lot of the performance characteristics, both good and bad, on both sides. There are some operations where having it all be one role is good: if you're going to do an operation that affects both data and metadata, it's actually really kind of nice.
So what did I decide to do? I decided to combine GlusterFS and RADOS. Not CephFS; that would be a much more challenging kind of thing to try and do. Just to see what happens to our data I/O performance and other behavior when we do that. So data I/O is the only thing that's really getting sort of snipped off from the Gluster world and shunted off into a RADOS cluster.
So what we see here is that we've got the same things on the front end: we've got FUSE, NFS, and so on. They all inject requests into the system, and those go through some of the higher-level translators. I actually had to leave a couple out, because the system wasn't really working correctly with them in place.
Then it comes down to glados, and all that's doing is asking: is this a read or a write of file data? Yes? OK, we're going to shove it off into this RADOS world, and we don't care what happens to it after that. Now, what happens to it after that, as you all know, is that it goes through all of the RADOS distribution, replication, possibly erasure coding, and so on. Anything else, any of your inode operations, your directory entry operations, and so on, is staying in the Gluster world.
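A minimal sketch of that split, with assumed names rather than the actual glados source: data fops get handed to RADOS, while metadata fops are wound down the normal Gluster stack untouched.

    #include "xlator.h"
    #include "defaults.h"

    /* Hypothetical helper standing in for the librados call; it is assumed
     * to unwind the frame itself once the RADOS operation finishes. */
    int32_t glados_write_to_rados (call_frame_t *frame, xlator_t *this, fd_t *fd,
                                   struct iovec *vector, int32_t count, off_t offset);

    /* Data path: a write of file data goes to the RADOS world. */
    int32_t
    glados_writev (call_frame_t *frame, xlator_t *this, fd_t *fd,
                   struct iovec *vector, int32_t count, off_t offset,
                   uint32_t flags, struct iobref *iobref, dict_t *xdata)
    {
        return glados_write_to_rados (frame, this, fd, vector, count, offset);
    }

    /* Metadata path: inode and directory operations stay in the Gluster world. */
    int32_t
    glados_mkdir (call_frame_t *frame, xlator_t *this, loc_t *loc,
                  mode_t mode, mode_t umask, dict_t *xdata)
    {
        STACK_WIND (frame, default_mkdir_cbk,
                    FIRST_CHILD (this), FIRST_CHILD (this)->fops->mkdir,
                    loc, mode, umask, xdata);
        return 0;
    }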
So when we create a file in Gluster, we actually create the Gluster file and we create a corresponding RADOS object, so we can alternate between using those two. And then there's a little bit of strange stuff. For example, if you want to know the file size, well, that's not actually correct over in the Gluster world, so we have to go query the RADOS object. But basically, that's the idea behind it.
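A sketch of that size fix-up, again with assumed helper names: since the bytes live in RADOS, the size the Gluster brick reports is meaningless, so a stat has to ask librados for the object's real size.

    #include <rados/librados.h>
    #include <stdint.h>
    #include <time.h>

    /* Ask RADOS for the object's size; the caller would copy it into the
     * iatt it hands back up the translator stack. */
    static int
    glados_object_size (rados_ioctx_t ioctx, const char *oid, uint64_t *size_out)
    {
        time_t mtime = 0;
        int ret = rados_stat (ioctx, oid, size_out, &mtime);
        return (ret < 0) ? ret : 0;      /* negative errno from librados */
    }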
And, of course, I did it in Python. OK, not really, but I could have, actually, because in the Gluster world we have something called Glupy, which is a way of writing these translators in Python instead of C, and it's actually really nice for prototyping crazy ideas. I didn't actually do that, but it was a fun thought that came up at dinner with some of the Ceph guys. Here's a code sample of our incredibly ugly Gluster C code. This is part of glados, by the way; how many people recognize glados as an acronym?
So this is a pretty typical piece of Gluster code, except for the fact that it is calling out into a completely alien distributed file system. We call rados_read, so this is part of our read path; we check the return value and we do a couple of other things. Yes, we heavily use goto in our code base, I'm sorry. And then the STACK_UNWIND_STRICT is basically how we pass stuff back to the translator that called us. So, I mean, there's nothing terribly interesting about this.
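From that description, the read path on the slide presumably has roughly this shape; this is a reconstruction from what was said, not the verbatim slide, and the structure and variable names are assumptions.

    #include "xlator.h"
    #include <rados/librados.h>
    #include <stdlib.h>

    /* Hypothetical per-translator state; not the real glados structure. */
    struct glados_priv {
        rados_ioctx_t ioctx;
    };

    int32_t
    glados_readv (call_frame_t *frame, xlator_t *this, fd_t *fd,
                  size_t size, off_t offset, uint32_t flags, dict_t *xdata)
    {
        struct glados_priv *priv     = this->private;
        const char         *oid      = "<per-file object name>";  /* assumed */
        int32_t             op_errno = 0;
        struct iovec        iov      = {0};
        struct iatt         stbuf    = {0};

        /* Real code would take this buffer from the Gluster iobuf pool
         * and attach it to an iobref before unwinding. */
        iov.iov_base = calloc (1, size);

        /* Call out into the "completely alien" storage system. */
        int ret = rados_read (priv->ioctx, oid, iov.iov_base, size, offset);
        if (ret < 0) {
            op_errno = -ret;
            goto err;                    /* yes, goto; see above */
        }
        iov.iov_len = ret;

        /* Pass the data back up to the translator that called us. */
        STACK_UNWIND_STRICT (readv, frame, ret, 0, &iov, 1, &stbuf, NULL, NULL);
        return 0;

    err:
        free (iov.iov_base);
        STACK_UNWIND_STRICT (readv, frame, -1, op_errno, NULL, 0, NULL, NULL, NULL);
        return 0;
    }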
There was nothing terribly difficult about it either; I did most of this sitting between sessions at the OpenStack Summit, in fact. But this is kind of typical Gluster code. It's not all that different, and I didn't have to do major surgery on anything. It's just another translator that does something slightly different in one place, and it turns out that the librados interface was not hard to work with.
But of course, I was only doing the easy thing. I was only doing file data; there are all sorts of really hard things that I wasn't doing. I believe somebody wrote a PhD thesis about how to solve some of these problems. So metadata, and especially directories: I don't do anything with that. I just let Gluster handle it as Gluster has always handled it.
If anybody really did want to make this real, which I still think is kind of crazy, they'd have to solve that problem. There's a whole lot of server-side functionality in Gluster that gets basically bypassed when we do this, functionality that would normally be observing the I/O as it goes past; and now, since we've shoved the I/O off into RADOS, it doesn't see it. So that would be a problem. And then, of course, there are performance issues.
The blue line is Ceph, so Ceph won, yay, and the green line is Gluster, so it's trailing behind. And the first thing that leapt out at me is that the glados version, of which I had two variants, is significantly behind at the lower thread counts; it's delivering relatively little I/O, and I don't know why. It does eventually catch up at higher thread counts, but it seems to have a little bit of a start-up latency issue. Well, not start-up exactly, but somehow the single-thread throughput just isn't that great.
So that's something that's worth investigating. Now, the two variants here: the yellow one is using AIO, asynchronous I/O, so I'm issuing all these reads and writes through the asynchronous interface and not sitting around waiting for them. The other one, the orange line, is "multi", where I'm actually using multiple librados contexts, so I'm basically doing a sort of Gatling-gun, round-robin thing between them.
So those are two different ways to get a little bit of parallelism among all the requests, because just using one context with synchronous I/O was, you know, kind of awful, and that wouldn't have been very interesting at all. So this is a fairly interesting result.
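For reference, a sketch of what those two variants look like in librados terms; the structure and names are assumptions, not the measured glados code. The AIO path issues the read through a completion and returns immediately, and the "multi" path round-robins requests across several independently created io contexts.

    #include <rados/librados.h>

    #define N_CTX 4

    struct glados_priv {
        rados_ioctx_t ioctx[N_CTX];    /* "multi": several independent contexts */
        unsigned int  next;
    };

    /* AIO: issue the read and return; the completion callback fires later
     * (and is responsible for unwinding and releasing the completion). */
    static int
    issue_aio_read (struct glados_priv *priv, const char *oid,
                    char *buf, size_t size, uint64_t offset,
                    rados_callback_t on_complete, void *arg)
    {
        rados_completion_t c;
        int ret = rados_aio_create_completion (arg, on_complete, NULL, &c);
        if (ret < 0)
            return ret;

        /* Gatling-gun round-robin across the librados contexts. */
        rados_ioctx_t io = priv->ioctx[priv->next++ % N_CTX];

        ret = rados_aio_read (io, oid, c, buf, size, offset);
        if (ret < 0)
            rados_aio_release (c);
        return ret;
    }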
A much more interesting result was when I started looking at latency. Now, these are 4k synchronous random writes.
What's up with that? I don't know. So that's telling us something: the fact that it's mostly tracking the Ceph numbers means the glados numbers are dominated by the RADOS performance behavior, but there's some little difference in there that's also probably worth investigating. This is exactly the kind of thing that I wanted to tease out of this information: to see where things don't line up in a fairly obvious way. Here we've got this little anomaly where the low-thread-count numbers are just strangely low.
So let's look into that. You know, here we have numbers that are just a tiny bit higher than CephFS; why? This was all on Firefly, by the way. So it's kind of interesting to look at these things, and really, that's about it. It's a fairly young sort of thing that I've been playing with, and if anybody else wants to grab the code, I haven't actually gotten around to pushing it anywhere.