From YouTube: RH InkTank Ceph Day Sessions Matt Benjamin COHORTFS
Ceph Day Boston 2014
http://www.inktank.com/cephdays/boston/
I'm Matt Benjamin, CTO of CohortFS. Most of you probably don't know who we are, but we've been in the Ceph community for a few years. We're an Ann Arbor, Michigan-based startup that's focused on bringing new capabilities to parallel NFS to address new application workloads.

That started as an NSF grant-funded effort, and over time it came to be as well about extending those capabilities within the Ceph storage stack, which has been an adventure. We're doing a number of things, both in the file system layer and in the block storage layers, and in interfacing pNFS and Ceph.

This talk is about a project we did in collaboration with Mellanox, and with funding from Mellanox, to, as the previous talk by Assaf explained, bring the Ceph code base, adapting the Ceph transport layer, to work with Accelio, which encapsulates the capabilities of RDMA, InfiniBand, and other transports that support RDMA. The module that does that is called XioMessenger.
Basically, and I'll explain this, XioMessenger is the transport abstraction that maps Ceph networking, or messaging, onto Accelio, and by that onto InfiniBand. Accelio, in effect, is a high-performance message-passing framework by Mellanox whose key benefit is to be an efficient adapter for the InfiniBand RDMA transport, or the lower-level interfaces it's traditionally used with.
It's a wrapper for that, but in future it will also enable seamless mapping onto other transports, and multi-transport applications with policy-based failover and selection and other cool features.
So the history is that this is work funded by Mellanox in support of customers using Ceph, with two key objectives: one, increased transport flexibility, simply adding to the value that customers get from Ceph; and, most importantly, supporting efforts to increase Ceph I/O performance. Ceph is already tremendously scalable horizontally, perhaps more so than other systems we're familiar with, but one of the next fundamental tiers of Ceph performance is to scale down: to get optimal high-performance I/O at the unit level.
So briefly, what is Accelio? Some of this information overlaps the previous talk; hopefully this will reinforce it. It's a high-performance, asynchronous, reliable messaging library built with hardware acceleration and RDMA in mind. As Assaf explained, it's a framework for building high-performance RPC transports. In particular, it's a messaging-style API framework, which is a very good fit, since that's actually Ceph's internal structure as well.
Currently, as we've stated, it supports RDMA transport, but in future, and that future is coming quite quickly, it's going to encapsulate multiple transports. Some key features of Accelio: a streamlined but also flexible selection of messaging models, so request/response protocols and one-way protocols, with strong delivery semantics supported. Of course, it supports zero copy to the core.
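A toy sketch of the two messaging models just described (Python, purely illustrative; the `Endpoint` class and its method names are invented for this example and are not Accelio's actual C API):

```python
from collections import deque

class Endpoint:
    """Toy endpoint illustrating request/response vs. one-way delivery."""
    def __init__(self):
        self.inbox = deque()   # FIFO queue models ordered delivery
        self.peer = None

    def send_one_way(self, payload):
        # one-way: fire-and-forget, no reply is expected
        self.peer.inbox.append(("one_way", payload, None))

    def send_request(self, payload, on_reply):
        # request/response: the callback fires when the peer answers
        self.peer.inbox.append(("request", payload, on_reply))

    def poll(self):
        # drain the inbox in order, answering any requests
        while self.inbox:
            kind, payload, on_reply = self.inbox.popleft()
            if kind == "request":
                on_reply(payload.upper())   # echo-style reply

# wire two endpoints together
a, b = Endpoint(), Endpoint()
a.peer, b.peer = b, a

replies = []
a.send_one_way("ping")
a.send_request("hello", replies.append)
b.poll()
print(replies)   # -> ['HELLO']
```

The point of the distinction is that a one-way send never allocates reply state, while a request carries its completion context with it.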
It has an essentially nearly lockless internal message path, including all the supporting pieces that are tough to do when you're rolling your own, like lock-free memory allocators and so forth. It's optimized for thread and CPU parallelism with a lot less involvement from the application. Some facts about the Accelio roadmap: we're currently approaching, or just crossing, the Accelio 1.0 release.
It's been in use for over a year in some customer applications and prototypes. The 1.0 release has, in addition to a stream of bug fixes and improvements from all over the Mellanox code base, a number of adjustments and enhancements that are designed to support Ceph specifically. There will also be, sometime in fall, a significant point release, one of whose key features is expected to be TCP, and possibly other, transport support integrated into the stack, and, we're hoping, some additional interface adjustments that allow us to do more advanced features like our own channelization and flow control at the application layer, to take direct advantage of the switching infrastructure.

On such equipment, switching infrastructure and ConnectX-3 HCAs, it's capable of up to three million round-trip message operations per second between two processes on two machines, or even on the same machine. It can essentially saturate a single FDR port, and probably higher; I've seen it do that. Benchmarks that prove it, which you can run to get a sense of what it can deliver for your application on your particular hardware, and to assist with testing your hardware and understanding its throughput, come with Accelio, in a bunch of different configurations for all the API message-passing styles and so on; you plug in specifics about where Accelio is and what you want to do with it. The source code is available there (http…).
There are more details there about how it looks to build applications with it and what features it's going to support. So then, on the Ceph side: XioMessenger is the adapter to Accelio from Messenger, which is the abstract, top-level interface for Ceph's transport layer. In that sense it's a drop-in replacement for, or an alternative to, SimpleMessenger today, which is the current TCP messaging encapsulation, but one whose objective was to take full advantage of Accelio in particular, at this level of prototyping.
It should do full zero copy, and it should get strong parallelism across multiple threads in client and server applications. I'm going to talk a little bit about internals; if you're a Ceph developer interested in Ceph internals, you'll get a little bit of a sense of how things are constructed in there at the messaging layer, and if not, tune it out and just pretend it's cool.

Inside of Ceph, when you look, there's actually a really nice abstraction for transport capability, even though they only had one implementation to start with, which is a good sign in any software project. The Messenger is the abstraction that represents a bi-directional communication endpoint: basically, a client or server has a Messenger, and a subtype defines what kind of transport it's going to use.
Messengers exchange information over Connections, which are active communication channels with some other endpoint; that could be a client talking to a server, or the reverse, and there are Connections that loop back through the message stack without encoding data. So a Connection is an active communication channel. It supports ordered delivery, and in its full form it encapsulates flow control and rate limiting that guarantee a Ceph cluster can remain responsive even under heavy load from an arbitrary number of clients. We don't have full support for that in XioMessenger yet; that's getting there, and building next-generation interfaces for all transports in Ceph is part of the forward process for this effort. Also, Messenger, or SimpleMessenger currently, supports strong endpoint identification and wire encryption using a host credential; that's cephx.
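The shape of the Messenger/Connection relationship can be sketched roughly as follows (Python pseudocode, not Ceph's actual C++ classes; the names `SimpleMessengerLike` and `XioMessengerLike` are invented stand-ins for the real subtypes):

```python
class Connection:
    """Active, ordered communication channel to one remote endpoint."""
    def __init__(self, local, remote):
        self.local, self.remote = local, remote

    def send(self, msg):
        # in-order delivery to the remote Messenger
        self.remote.deliver(msg)

class Messenger:
    """Bi-directional endpoint; one per client or server process."""
    transport = None                       # subtypes name their transport

    def __init__(self, name):
        self.name = name
        self.received = []

    def connect(self, other):
        return Connection(self, other)

    def deliver(self, msg):
        self.received.append(msg)

class SimpleMessengerLike(Messenger):      # stand-in for the TCP messenger
    transport = "tcp"

class XioMessengerLike(Messenger):         # stand-in for the Accelio adapter
    transport = "rdma"

# a client and an OSD each own a Messenger; the subtype picks the transport
client = XioMessengerLike("client")
osd = XioMessengerLike("osd.0")
conn = client.connect(osd)
conn.send("osd_op")
print(osd.received)   # -> ['osd_op']
```

The key design point is that callers see only the Messenger and Connection surface; swapping the subtype swaps the transport underneath without touching the callers.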
We also don't have cephx in XioMessenger yet; we'll probably want to abstract an implementation that allows all messengers to share a common one. A Message, of course, is the actual request, or reply, or other message object being sent over the Connections, and it serializes and deserializes using the common encode and decode mechanisms and other primitives that are in Ceph.
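The serialize/deserialize step can be illustrated with a toy encode/decode pair (Python's `struct` here; Ceph's real machinery is its C++ bufferlist encode/decode code, and this wire layout is invented for the example):

```python
import struct

# invented layout: little-endian 4-byte sequence number, 2-byte payload length
HEADER = struct.Struct("<IH")

def encode(seq: int, payload: bytes) -> bytes:
    # header followed by payload, as one contiguous buffer
    return HEADER.pack(seq, len(payload)) + payload

def decode(buf: bytes):
    seq, length = HEADER.unpack_from(buf, 0)
    payload = buf[HEADER.size:HEADER.size + length]
    return seq, payload

wire = encode(42, b"ping")
assert decode(wire) == (42, b"ping")   # round-trip is lossless
```

Because both sides share the encode/decode code, any messenger, TCP or RDMA, moves the same opaque bytes; only the transport beneath changes.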
So we use all that code, in other words, in XioMessenger. XioMessenger implements this Messenger interface. It encapsulates the Accelio pieces: it deals with Connections and the other Accelio abstractions that sit underneath that, it deals with the event loops and other mechanisms that are part of a Ceph communication channel, and it provides interfaces for selecting those and setting those up for your particular application's requirements.
It has Connections, and there's also a loopback form, which the TCP messenger also has. Just to give you a sense of that, here's how things stack up; this is a representation of how XioMessenger is able to fit into the current architecture. There's still a lot of refactoring of the Ceph code base, shown in orange in the middle, that's actually being upstreamed into code targeting Giant now, under wip-xio I think. So this work is going forward, and it aims to pull out the common pieces of SimpleMessenger more cleanly, so that all messengers and related classes can use them.

There are some abstractions worth noting that I wanted to bring in for informational purposes. One of the points that Assaf mentioned is that Accelio aims to deliver optimal thread parallelism for a particular application partition when it wants to do that; it has memory associated with CPUs and so on. One of the ways that it develops thread channels across a pair of endpoints is through an object called a portal, so we have an abstraction for that.
It's less granular than a connection, and there will probably be more abstractions like this to encapsulate channels or flows over the interface as well. I'll kind of skip through this, but there are various abstractions that we set up. One of the main activities I found myself involved with was mapping the thread context and thread activity of the different clients and servers inside of Ceph, which are heavily threaded, in various different ways differing by daemon and client and so forth, onto Accelio, which has a fairly restricted model for concurrency that's aimed at maximizing throughput.
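A minimal sketch of the portal-style thread-channel idea (Python; the `Portal` and `PortalSet` names are invented for illustration, and Accelio/XioMessenger structure this differently in detail): a fixed set of portal workers each owns an event loop, and connections are pinned onto them, so one portal serves many connections.

```python
import queue
import threading

class Portal:
    """One worker thread with its own event loop."""
    def __init__(self, ident):
        self.ident = ident
        self.events = queue.Queue()
        self.handled = []
        self.thread = threading.Thread(target=self._loop, daemon=True)
        self.thread.start()

    def _loop(self):
        while True:
            msg = self.events.get()
            if msg is None:           # shutdown sentinel
                return
            self.handled.append(msg)  # all work for this portal on one thread

class PortalSet:
    """Fixed pool of portals; connections are hashed onto them."""
    def __init__(self, n):
        self.portals = [Portal(i) for i in range(n)]

    def bind(self, conn_id):
        # a connection is pinned to one portal; a portal serves many
        return self.portals[hash(conn_id) % len(self.portals)]

    def shutdown(self):
        for p in self.portals:
            p.events.put(None)
            p.thread.join()

ps = PortalSet(2)
for cid in ("osd.0", "osd.1", "client.a"):
    ps.bind(cid).events.put(f"msg-from-{cid}")
ps.shutdown()
total = sum(len(p.handled) for p in ps.portals)
print(total)   # -> 3
```

Pinning each connection to one portal keeps its message processing single-threaded (no per-message locking), while the pool as a whole scales across CPUs, which is the concurrency trade-off described above.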
We think we did a pretty good job with that, and I won't go into it, but the construction, the use of various helper types and so forth, is documented here. Basically, we have various bits of tracking information that follow messages in and out as they're encoded and decoded, and that allow us to associate those pieces with the Accelio primitives that are coming in and out of the endpoint.
These are basically Ceph versions of the benchmarks that Accelio provides to benchmark the throughput of a client and server in different modes. These versions are standalone programs that let you select various details about data sizes and other connection information, and run very simple client/server workloads between various endpoints, which show that we get relatively reasonable performance.

These performance results aren't great in absolute terms, because they're from untuned equipment whose maximum available bandwidth was about 3.4 gigabytes per second; within Mellanox they get twice that, on a machine which, with the Accelio benchmark programs, can get about a million round-trip operations per second. Given that, within that frame, we get about half that number of IOPS at the smaller message sizes, so we're within 50 percent of optimal.
As far as we can tell from the available numbers, IOPS are roughly the same, and we're almost saturating the available bandwidth of Accelio at a 64K message size. Within Mellanox, these numbers have hit six gigabytes per second of bandwidth; I haven't seen their IOPS numbers, but I think they're comparable, so I think they're getting one and a half million IOPS with small messages.
So these performance results, in other words, are promising. They're adequate to deliver a lot of value in a data center once the rest of the code is stabilized.
So what is the status? We have a working messenger stack. That was originally done on Emperor, and it's been pulled up to Firefly but hasn't been re-benchmarked there; those were Emperor numbers. It's been integrated with all of the Ceph daemons and clients.
Some of them can be used together. There are a few blockers that we're working out to get a complete RADOS cluster running on Accelio.
It is possible to run RADOS over it from a client and run RADOS workloads. We have some initial numbers for that, which show that for a few workloads, for a few kinds of operations, especially heavy reading with large blocks, we get a lot of bandwidth. There are other bottlenecks left in the Ceph stack, and other things to work on, but we're also working on those, as is Inktank. In tandem with performance work, we're polishing off remaining issues to make a full cluster environment available during the course of the Giant release.
If you want to get the stuff: as I mentioned, the intention is for it all to be in the box eventually, but of course this is development, so at this stage you get it from the internet. From Accelio, if you're actually going to experiment with this, which I think the code's ready for, you want the Accelio branch that was the target of the 1.0 release.
I don't think there's been a full or official release of that yet, so for_next is the branch that has Ceph compatibility. On our side, we have two sets of branches on our external Ceph repositories that integrate all of this; actually, Firefly is what works. They differ not by features but only in how they build. wip-xio, then, of course, is the official upstream branch; the current tip of that, I believe, is merging in interfaces and some refactoring.
Eventually it'll have all the changes here, plus probably more. Just some notes about how to run these: if you're actually interested in running the client/server test programs on InfiniBand-enabled or RDMA-capable hosts, you only need to make small changes to your config file, ceph.conf, to make sure you can identify on each host an appropriately made-accessible endpoint, which uses TCP for identification.
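As a hedged sketch only, the ceph.conf change meant here is on this order; the option names below are illustrative, and the actual keys ship with the wip-xio branches:

```ini
; illustrative sketch only -- consult the wip-xio branch for the real keys
[global]
        ; select the Accelio/RDMA messenger instead of the TCP default
        ms type = xio
        ; a TCP-style address is still used to identify the endpoint on each host
        public addr = 192.168.1.10
```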
So, limitations of this as it is today: in the current design, we don't have cephx. The goal eventually would be to afford that, but it's not there yet; probably for the next little while the main uses won't demand it, but eventually, of course, that's planned.
There is a bunch of logic and much intelligence in Ceph's networking that deals with resilient operation of the network fabric under insane conditions. Ceph pays a large penalty to do that, but it gets a lot out of it; obviously, resilience is a huge value. Some of that stuff is not yet fully exposed in Xio, and so how we'll get that, with the cleanest architecture, with the best layering of the underlying transports and the messenger, is part of the development that's going on right now. And that, as I said, is my last slide on Xio.