Description
High-performance networks that can now reach 100Gb/s, along with advanced protocols like RDMA, are making Ceph a mainstream enterprise storage contender. Ceph has gained major traction for low-end applications, but with a little extra focus on the network it can easily compete with the big enterprise storage players. Based on technologies originally developed for the High Performance Computing (HPC) industry, very fast networks and the Remote Direct Memory Access (RDMA) protocol are now moving to the enterprise.
All right, so I'm going to talk about how to improve the performance of Ceph using high-performance networks and protocols. I'm going to focus on four areas. First of all, just the wires: making the network faster. Secondly, the architecture: architecting the network to be able to utilize those faster wires. Thirdly, flash storage and how to get the most performance out of flash SSDs on a network. And finally, the RDMA protocol.
We're pretty good at this. Last year we had, according to the analysts, more than 85% of the market for Ethernet adapters over 10 gigabits, and that's the part of the market that's projected over the next three years to grow to be even bigger than 10 gig. That's where all the growth is. So we know a little bit about high-performance networking, and I'm going to start off with the wires.
This is some testing we've done, and there's actually a white paper on our website on exactly how we did it and all the configurations. I think they're handing out these white papers, or some flyers for them, at our booth over in the marketplace, if you want to stop by. What we did was test starting with one gig all the way up to forty gig, and you can see, of course, major differences between one and ten gig, but also even going from 10 to 40 gig.
25 gig is kind of just what the doctor ordered. I think there's a saying now: 25 is the new 10. In fact, Cisco's switches now do 10 and 25 gig on the same switch, and they don't charge a premium for it, and adapter prices at 25 gig are very, very close to, if not the same as, 10 gig. So it's not a break-the-bank thing, and it's much easier than putting in two 10 gig links.
So, with these faster wires, we also need to look at the architecture of the network in Ceph. There are two logical networks, which can actually be two physical networks: the public network, where the clients are connected, and the cluster network, which just connects the OSDs. A lot of performance is needed between the OSDs, because there's a lot of traffic there: replication, recovery, and rebalancing, as well as the heartbeat.
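As a concrete illustration, here is a minimal ceph.conf sketch separating the two networks; the subnets are placeholders, not values from the talk:

```ini
[global]
# Client-facing traffic: clients <-> MONs/OSDs
public network  = 192.168.1.0/24
# OSD <-> OSD traffic only: replication, recovery,
# rebalancing, and heartbeats
cluster network = 10.10.10.0/24
```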
There's a lot of traffic on that cluster network, as I mentioned, for replication and erasure coding, and I'm going to show you what I mean by that. A read operation is pretty simple: the client goes to the OSD and reads the data, so there's no extra cluster traffic in that scenario. But look at what happens on a write.
Assuming you're doing 3x replication, two more writes occur on that cluster network; or, if you have it all on one network, for every write you're actually tripling the amount of data that crosses the network. So by segmenting it, you're going to see an improvement on the client side in a loaded system.
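A quick back-of-the-envelope sketch of that amplification (the write size is arbitrary):

```python
# Bytes on the wire for one client write under 3x replication:
# the client sends S over the public network to the primary OSD,
# which forwards two more copies of S over the cluster network.
def wire_traffic(write_bytes, replicas=3):
    public = write_bytes                     # client -> primary OSD
    cluster = (replicas - 1) * write_bytes   # primary -> replica OSDs
    return public, cluster

pub, clu = wire_traffic(4 * 1024**2)         # a 4 MiB write
print(pub, clu)  # 4 MiB public + 8 MiB cluster: 3x total on one combined network
```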
And if you look at recovery under replication, you see a large impact. For example, here are some different network speeds and the time it takes to recover from an OSD loss, at 2 terabyte, 20 terabyte, and 200 terabyte OSD sizes. With 10 Gigabit Ethernet and 2 terabytes, it's going to take you half an hour to recover from a lost OSD, and that's at full bandwidth: the entire 10 gigabits is being used to reach that time.
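The arithmetic behind those times is just capacity over bandwidth; a sketch, assuming recovery can use the whole link as the speaker describes:

```python
def recovery_hours(osd_bytes, link_gbps):
    # Ideal case: the entire link is dedicated to recovery traffic.
    bytes_per_sec = link_gbps * 1e9 / 8
    return osd_bytes / bytes_per_sec / 3600

for tb in (2, 20, 200):
    times = [recovery_hours(tb * 1e12, g) for g in (10, 25, 100)]
    print(f"{tb:>3} TB OSD:", [f"{t:.1f} h" for t in times])
# 2 TB over 10 GbE comes out to ~0.4 h, i.e. the half hour on the slide.
```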
But if you go to 25 gig, just 25 gig, recovering even the largest OSD drops to under a day, and you can get it done in just a few hours with a hundred gig. So you really want to fence off that client network from the cluster network, and that's going to give you a lot of performance and a lot more reliability.
This is a little half-width 1U switch that we sell, but there are multiple vendors of switches at these speeds as well. At 40 gig, this product would cost you just over five thousand dollars for 16 ports. So, five thousand dollars to decrease your recovery time hugely and improve the overall reliability. And if you wanted to put in a hundred gig, you could do it for under ten thousand dollars.
Let's now look at erasure coding. In this case, instead of making copies, we're using a special algorithm that takes the data and breaks it up into small pieces across multiple OSDs. The advantage here is that instead of needing 3x the storage, you only need one and a half times the storage. But there's a price for it, and another reason to have that separate cluster network: there's a lot more traffic that goes on, in small messages.
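The capacity math, as a quick sketch: with k data chunks and m coding chunks, the raw-capacity multiplier is (k + m) / k, so a k=4, m=2 profile gives the 1.5x the talk mentions:

```python
def ec_overhead(k, m):
    # k data chunks + m coding chunks stored for every k chunks of data
    return (k + m) / k

print(ec_overhead(4, 2))  # 1.5x raw capacity vs. 3.0x for 3-way replication
print(ec_overhead(8, 3))  # ~1.38x, but each object now touches 11 OSDs,
                          # i.e. more small messages on the cluster network
```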
The read operation, though, goes the other way, because in a read with plain replication you're only asking for the read from the client and getting back the data, but here you've got to decode data that's spread across the different OSDs. So it's going to create more traffic on the cluster part of the network; a good reason to have it independent.
I don't have a slide for this, because it's rather complicated, but when you have an error and have to recover, think about the recovery mechanism and the traffic. You not only have to decode the data from the OSDs that remain, but then you have to re-encode it across to the replacement OSD, or if there are fewer OSDs, you have to redo the calculation. So that can be very heavy traffic on the cluster network as well.
The other downside to using erasure coding is that it's a very, very heavy load on the CPU, because the CPU has to do that calculation, and it's a very complicated calculation over all the data, in order to put it into the format needed for erasure coding. One way to get around that is NICs that have offload engines for erasure coding. The way this works, at least with our solution, is that the coding is done by the adapter, the Ethernet adapter, as the data is sent to it.
So here's how we're going to implement it. There's a module in Ceph that does the erasure coding, and what we're doing, and it's in development right now, is creating a replacement for that module that simply uses this offload. So it's just a matter of switching that module. By the way, if anybody has questions, don't hesitate to interrupt; I should have said that earlier.
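For context, the coder in Ceph is already selected per erasure-code profile, so swapping in an offload-backed module amounts to choosing a different plugin. A sketch using today's default software coder (an offload plugin would be named by the vendor; none is named in the talk):

```
# Create a k=4, m=2 profile using the default software coder
ceph osd erasure-code-profile set ecdemo k=4 m=2 plugin=jerasure
# Create an erasure-coded pool that uses it
ceph osd pool create ecpool 128 128 erasure ecdemo
```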
So the next area I want to talk about is flash. I think everybody realizes that flash storage and SSDs are becoming super popular across data centers. In fact, many data centers are going all flash because, believe it or not, it solves the problem of reliability: disk drives have all these moving parts, and although flash was initially supposed to wear out and have all these problems, it hasn't turned out that way, because of the software, the load balancing and all the special software on flash that distributes the load across all the NAND chips.
And so many data centers are just saying: hey, I'm going all flash, because the maintenance on my hard drives is very expensive; and secondly, they're doing it because you don't have to worry about matching up the storage to the performance of the applications anymore, because all your storage is fast. So anyway, flash is becoming super popular, but that comes with a price, because it puts a big load on the network, and here's why.
Flash is roughly a hundred times faster than a hard drive, and the persistent memory that's going to come out over the next five years is another hundred times faster, so there's a ten-thousand-times improvement happening in storage in a ten-year period. To understand the magnitude of that, think about how far it is from here to St. Louis: Google says it's about a thousand miles, an 18-hour drive. Now, hopefully a lot of you are from Boston; think about the distance from here to Boston College. It's roughly 10 miles, 15 minutes according to Google.
Fifteen minutes versus 18 hours: that's the magnitude of the difference in performance just between hard drives and SSDs. If you add persistent memory in here, that's like 1,500 feet; so, 15 minutes to Boston College, versus how long it takes you to walk 1,500 feet. This is putting a huge change, or load, on different components in the whole ecosystem, especially the storage side and the networking side, and here's why.
If you look at this chart, it shows how many hard drives it takes to fill a 10 gig link, which is the red line; a 40 gig link, which is the yellow line; and a hundred gig, which is the blue line. You can see it takes almost 25 drives to fill a 10 gig link, and hundreds to fill the other wires.
If I just switch that Serial ATA interface from a hard drive to an SSD, going from spinning media to NAND, now it's just 2 drives to fill a 10 gig link, and 9 will almost fill a 40 gig link. Now, there's a new technology for SSDs called NVMe, which was a redesign of the interface to SSDs, because they initially came out with the legacy hard drive interfaces, which were slowing them down.
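The chart's arithmetic, sketched with assumed per-drive sequential throughputs (~50 MB/s per HDD to match the 25-drives-per-10-gig figure; the SSD numbers are rough assumptions, not from the talk):

```python
import math

DRIVE_MBPS = {"HDD": 50, "SATA SSD": 550, "NVMe SSD": 2500}

def drives_to_fill(link_gbps, drive_mbps):
    link_mbps = link_gbps * 1000 / 8        # link capacity in MB/s
    return math.ceil(link_mbps / drive_mbps)

for name, mbps in DRIVE_MBPS.items():
    counts = [drives_to_fill(g, mbps) for g in (10, 40, 100)]
    print(f"{name:>9}: {counts}")           # drives needed for 10/40/100 GbE
# HDD: [25, 100, 250] -- SATA SSD: [3, 10, 23] -- NVMe SSD: [1, 2, 5]
```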
This is some testing that's been done by many companies. It's a little bit old now, from a couple of years ago. They tested Ceph systems while changing the disks, the SSDs, the networks, the CPUs, and the operating systems, and you can access it on the internet, or you can go to our booth and they can show you how to get to it. What you can see is the difference that just adding some SSDs provides to the performance, when you also improve the performance of the network.
You can see on the far one here, if I can get the pointer to work: in this one you had a mixture of SSDs and NVMe, and then here, taking the hard drives out, you increase the performance just using the NVMe SSDs. And you can see, if you add a lot of flash, the more flash you add, the faster the performance you get; so it's important to increase the performance of your network as well.
The last subject I'm going to talk about for improving performance is a protocol called RDMA. So, we've talked about how to improve performance by having faster wires, by re-architecting the infrastructure of the network, and by taking advantage of SSD performance, but now we're going to talk about a new protocol, or a change of the protocol running on those wires. And this isn't a new technology; it's been in the market for many years, but in the HPC world, the world of supercomputers.
It started out almost 20 years ago now, and it's embedded in InfiniBand technology, the protocol that runs over InfiniBand wires, and it's now a dominant part of the HPC market. In fact, if you look at the top 500 supercomputers in the world, you can see, it's the navy blue line, that it's the dominant interconnect for those supercomputers. Supercomputers these days aren't the big circular Crays they were when we were kids; they're clusters of servers connected by a network.
If we look at the regular storage market and at the three different areas, object, block, and file, as we talked about, you can improve the performance of those, with their common protocols, just by flash or high-bandwidth Ethernet; but you can also use RDMA technology. It's been used in the Ethernet world for many years, over Ethernet, in a technology called RoCE, or RDMA over Converged Ethernet, but mostly in the segments that needed high performance.
So, for example, if you think of block, there's a protocol called iSCSI for networking block storage, and there's a version of it called iSER, which is iSCSI over RDMA. If you think about file, Microsoft has a technology called CIFS, or SMB, which I think is the new name for it, and there's a version of it called SMB Direct, which works over RDMA for higher performance. NFS over RDMA does the same for the NFS file protocol. And then on object:
The technology we're going to talk about is Ceph over RDMA. Now, I said RDMA was a niche for high performance, but that niche is becoming mainstream, because that very high-performance SSD interface called NVMe that I talked about a few minutes ago is now being put into a protocol called NVMe over Fabrics, so that it can be transmitted across a network. That standard came out almost a year ago now, and it includes what's called a binding layer in the standard for transport over RDMA.
And the new persistent memory technologies, which are again a hundred times faster than SSDs, are coming: there's been work underway for the last year and a half on how to put those across a network, and all of it is focused around the RDMA protocol. So because of that, this RDMA technology is going to become much more mainstream in the data center over the next few years. So what is it, and how does it work?
RDMA stands for Remote Direct Memory Access, so it's a remote version of DMA. DMA is the local version: it's how you can move data inside a computer without sitting in a software loop moving a word at a time. Instead, there's a hardware engine: you give it a pointer to memory here, a pointer to memory there, and a count, kick it off, and it moves the data without the CPU being involved. RDMA is the remote version of that.
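A toy model of that descriptor interface, as a conceptual sketch only; real DMA and RDMA live in hardware and verbs libraries, not Python. This just shows what "two pointers and a count" means:

```python
from dataclasses import dataclass

@dataclass
class Descriptor:
    src: int     # pointer to the source buffer
    dst: int     # pointer to the destination buffer
    count: int   # number of bytes to move

def dma_engine(memory: bytearray, desc: Descriptor) -> None:
    # Stand-in for the hardware engine: it moves the bytes itself,
    # so the CPU never sits in a word-at-a-time copy loop.
    memory[desc.dst:desc.dst + desc.count] = memory[desc.src:desc.src + desc.count]

mem = bytearray(b"hello...........")
dma_engine(mem, Descriptor(src=0, dst=8, count=5))
print(mem)  # bytearray(b'hello...hello...')
```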
So what happens is, you tell the RNIC, the adapter that supports RDMA, the local memory location, the remote memory location, and the count, and it moves the data without any interaction from the CPU. That's different from how normal traffic goes through a TCP/IP stack, because there the CPU is involved and is controlling the transport layer, which is the layer that takes care of making sure the data gets there, recovering from errors, those sorts of things. With RDMA, that's all handled in the hardware of the RNIC.
Actually, one more thing on RDMA over Ethernet. Initially, when it was in those niches, the earlier implementations of RDMA over Ethernet required a technology that came from Fibre Channel over Ethernet, called PFC, or Priority Flow Control, to be implemented on the switches. That gave it the flow control it needed to keep it from overloading the network, because it delivers so much data. You can still use that with RDMA over Ethernet, and it does improve the performance, but it takes some configuration.
Okay, now that transport layer has been improved, with no change to the protocol, so it's not a change in the RoCE protocol; it's just a better implementation of the embedded transport layer on the adapters. The need for Priority Flow Control has gone away, and so now you no longer have to have special settings on your network to run RDMA over Ethernet.
So this is using a software-defined storage application like Ceph with RDMA, and that's in production today. If we look at RDMA for Ceph, this has been an ongoing project with multiple companies involved: our company, a company in China called XSky, Samsung, SanDisk, and Red Hat have been contributing to this effort. It was first released in beta in the Hammer release, back in June, almost two years ago.
One interesting thing about RDMA is that you have these memory locations on two different systems across the network. Well, you have to pin down that memory, and you set up pieces of memory for all the different connections you have. What was implemented recently was a way to do that dynamically, so that you don't have to pin all that memory: just the parts that are needed are pinned, and the rest is dynamic. It's basically an on-demand pinning mechanism. Question? Yeah.
Yeah, so that's another good question, a very good question. I didn't go into the history of RoCE, but there are two different versions: RoCE v1, which initially came out many years ago, and RoCE v2, which came out about four or five years ago. RoCE v1 is kind of obsolete now; I mean, it's not used very much, and most everybody uses RoCE v2 now. The difference is that RoCE v1 was a Layer 2 protocol, and RoCE v2 has an IP/UDP layer.
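A rough sketch of the framing difference (drawn from the RoCE specs, not the talk; RoCE v2 rides UDP destination port 4791):

```
RoCE v1: | Ethernet | IB GRH | IB BTH | payload |            Layer 2 only
RoCE v2: | Ethernet | IP | UDP :4791 | IB BTH | payload |    routable across subnets
```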
So it's routable. Thank you for the question; good question. So, you can go to the Mellanox community pages, and they walk you through how to configure Ceph to use RDMA.
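A hedged sketch of what that configuration looks like, based on the messenger options Ceph's RDMA support uses; the device name is a placeholder for your RNIC:

```ini
[global]
# Switch the async messenger to its RDMA transport
ms_type = async+rdma
# Which RDMA device to use, e.g. a Mellanox ConnectX port
ms_async_rdma_device_name = mlx5_0
```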
And here are some of the performance numbers for that. We measured it in two different ways: one, just raw performance on the same setup, and the other to see the CPU savings. We could save multiple cores on both the client and the OSD and still get 44% more performance.
We have customers using this everywhere from financial to cloud to education and more, and if you want to know more about using our products with Ceph, we have a booth over in the marketplace, and I'm also open to answering some questions. To kind of summarize the benefits: to improve Ceph, one thing to make sure you do is use faster networks.
Ten gigabit is not enough, especially if you've got more than 15 hard drives or you're using SSDs. SSDs will give you higher performance, but make sure you improve the network as well. Having a separate cluster network is a way to get better reliability in your system and to improve the performance. And then, if you want to go even further, into the turbo area of performance, you can look at RDMA for Ceph.
If you take our adapter product line, just to give a little background: ConnectX-3 was our 10/40 gig product that came out probably four-ish years ago; ConnectX-4 was the 10/25/40/50/100 gig product that came out more recently; and then ConnectX-5 is the product that just came out.
Each one of those products has a faster ability to recover from problems that it sees in the network. All of them will recover, but more and more of that transport layer has been improved in the hardware of the adapters going forward. So ConnectX-4 can definitely do it in a RoCE environment without using PFC; you need to implement a function called ECN, or Explicit Congestion Notification. And with ConnectX-5 you don't need anything. But there's one thing to keep in mind.
If you're trying to get super high performance like we're talking about here, you can't just run it on a 10 gig network in the middle; you have to either over-provision or implement some type of congestion management in order to get the highest performance levels. Nothing comes for free. But it does work; you just lose performance. Yes?
Over IB? Oh, I see, okay, so you're just running IP over it. You know, there may be some numbers for that, but I don't have them. Let me suggest you stop by the booth over in the marketplace, because there's a marketing guy there named John Kim who has organized a lot of the testing, and he would know if we've tested that. Okay.
It has to be the RC. Now, I'm not a super InfiniBand expert; my focus is on the storage side, but I have done it in the past. It's the dedicated type of connection over InfiniBand, or over RDMA in general, where you know that your message has been sent and is guaranteed to get there; it does that, and it also handles the plain messaging rate, the UDP part of it. So this is new.