From YouTube: Ceph Days NYC 2023: NVMe-over-Fabrics support for Ceph
Description
Presented by: Jonas Pfefferle | IBM Research
NVMe-over-Fabrics (NVMeoF) is a widely adopted, de facto standard in remote block storage access. Ceph clients use the RADOS protocol to access RBD images, but there are good reasons to enable access via NVMeoF: to allow existing NVMeoF storage users to easily migrate to Ceph and to enable the use of NVMeoF offloading hardware. This talk presents our effort to provide native NVMeoF support for Ceph. We discuss some of the challenges, including multi-pathing for fault tolerance and performance.
https://ceph.io/en/community/events/2023/ceph-days-nyc/
A: All right, so I'm going to talk about NVMe-over-Fabrics support for Ceph. Actually, the whole effort started, I think, almost three years ago now. We were basically asked to see if we could support Ceph with DPU hardware, and we actually tried it out. We ran it, and at least back then the typical SmartNICs that you could buy had very low-power Arm cores, so we saw very slow performance on the DPUs, and that's why we decided:
okay, let's introduce NVMe-over-Fabrics to Ceph.

Why introduce it in the first place? I just gave one example, but you already have RBD for remote block storage. It's proven, it's reliable, scale-out object access with replicated or erasure-coded storage in the back. You have all these features, you have fault tolerance and so on. So why do we need to introduce a new one? Well, for one thing, NVMe-over-Fabrics, as just mentioned, is more and more widely used.
We basically looked at what has been done in the past with iSCSI and how to integrate this, and I think we ended up with a basically similar design. So this is the overview: you have the Ceph cluster on one side, which talks RADOS to the Ceph NVMe-over-Fabrics gateway, and the gateway is basically just a bunch of processes — I'm going to talk about some details later. Then on the other side you have NVMe-over-Fabrics in the form of TCP, RDMA or Fibre Channel (not sure if anyone is using Fibre Channel), and you have the clients on that side as initiators. And you can run this gateway service anywhere: you can run it co-located with your OSDs, or you can run it on a standalone machine.
So what we've done is we introduced this — I've split the view here into a data path and a control path. The control path is a little Python process that exports a gRPC service, and we provide a CLI so you can run configuration commands against the control daemon. The control daemon is basically there to set up the whole mapping: you can tell it to do things like "map this RBD image to this subsystem and this namespace ID", or "create this listener" where you want the subsystem to listen so that clients can connect to it — things like that.
That's what the control daemon is responsible for. To make the configuration persistent, we store it in a Ceph object, and we store the configuration itself in the omap of that object. The omap is basically a key-value store that comes with every object in Ceph, and it allows us to shut down the gateway, bring it back up and restore the configuration. But we also use it for something else, which I'm going to show you in a second.
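As a rough illustration of that idea — this is a minimal sketch using the python-rados bindings, not the actual ceph-nvmeof code, and the pool name, object name and key layout are made up — persisting and restoring configuration entries through an object's omap could look like this:

    # Minimal sketch: persist gateway config entries in the omap of a RADOS
    # object and read them back after a restart. Pool/object/key names here
    # are invented for illustration; this is not the ceph-nvmeof implementation.
    import json
    import rados

    POOL = "nvmeof"            # assumed pool
    CONF_OBJ = "gateway.conf"  # assumed config object

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx(POOL)

    def save_entry(key, value):
        """Write one config entry as an omap key/value pair."""
        with rados.WriteOpCtx() as op:
            ioctx.set_omap(op, (key,), (json.dumps(value).encode(),))
            ioctx.operate_write_op(op, CONF_OBJ)

    def load_config():
        """Read back every omap entry, e.g. when the gateway restarts."""
        with rados.ReadOpCtx() as op:
            entries, _ = ioctx.get_omap_vals(op, "", "", 10000)
            ioctx.operate_read_op(op, CONF_OBJ)
            return {k: json.loads(v) for k, v in entries}

    save_entry("namespace_nqn.2016-06.io.spdk:cnode1_1",
               {"pool": "rbd", "image": "image0"})
    print(load_config())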
But first, let's talk about the data path. What do we do for the data path? We looked a bit at how we wanted to support this: the initial ideas were to build something ourselves or to look at what's out there — you could also use the kernel for this — but in the end we decided to go with SPDK.
SPDK is an open-source project — the Storage Performance Development Kit — mainly driven by Intel. The initial version of SPDK was meant for user-space access to local NVMe storage, but it has been extended to be much more flexible, and it has supported NVMe-over-Fabrics for a while now. They have this nice abstraction where you can create a target and map it to a block device abstraction in the back, with multiple implementations, and they already had an RBD block device
abstraction. It was very limited and not very performant, but it worked, so when we started this effort you could already set this up and kind of show that the whole thing works. We expanded on this. How it works is that our Python control process issues RPCs against SPDK to set up this mapping: you issue RPCs against the control daemon, and the control daemon then sets up SPDK accordingly.
Okay, so the next thing. As I said before, NVMe-over-Fabrics is designed as a point-to-point protocol. It doesn't support a distributed storage system in the back: it assumes the path you connect to has all the storage — all the data — behind it, so if you do I/O against it, all the data is behind that path.
That means we don't have erasure coding or anything like that natively supported, so we have to introduce multipathing for fault tolerance or to improve performance. NVMe-over-Fabrics obviously has features for multipathing, which we leverage. So what you see here is that we have a grouping concept: a gateway group is essentially multiple gateways that share the same configuration, meaning they export the same subsystems.
They serve the same namespaces and things like that. The only thing that differs, obviously, is that the listeners on these gateways have different IPs, and they can also have different ports. The way we do this is that we sync on the configuration that we store in Ceph, and we have two ways to do that.
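Before getting to the two sync mechanisms, here is a tiny, purely illustrative data model — the field names are invented for this sketch and are not the actual ceph-nvmeof configuration schema — showing what two gateways in a group share and where they differ:

    # Illustrative only: gateways in a group share subsystems and namespaces,
    # while listeners (IP/port) are per gateway. Not the real config schema.
    from dataclasses import dataclass, field

    @dataclass
    class Subsystem:
        nqn: str
        namespaces: dict = field(default_factory=dict)  # nsid -> "pool/image"

    @dataclass
    class Gateway:
        name: str
        listeners: list = field(default_factory=list)   # (ip, port) pairs

    # Shared across the whole gateway group:
    subsystems = [Subsystem("nqn.2016-06.io.spdk:cnode1", {1: "rbd/image0"})]
    # Per-gateway part -- same subsystems, different addresses:
    group = [
        Gateway("gw-a", listeners=[("10.0.0.1", 4420)]),
        Gateway("gw-b", listeners=[("10.0.0.2", 4420)]),
    ]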
The first is that we use the watch/notify mechanism in Ceph to update each other, and we also have a polling mechanism for the case where one of the gateways crashes while the others do a watch/notify, to keep them consistent. And the cool thing, I think, is that you can issue RPCs to any of the gateways in a gateway group and they will sync with each other to make sure they are in a consistent state, through this omap in the object.
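A minimal sketch of that synchronization, assuming the python-rados watch/notify bindings and an "epoch" omap key that gets bumped on every change — the names (POOL, CONF_OBJ, reload_from_omap) are placeholders, not ceph-nvmeof code, and the watch callback signature follows the python-rados documentation as I recall it:

    # Sketch: keep a gateway's view of the shared config fresh. Watch/notify
    # gives prompt updates; polling an "epoch" omap key is the fallback in
    # case a notification is missed. Names are placeholders for illustration.
    import time
    import rados

    POOL, CONF_OBJ = "nvmeof", "gateway.conf"  # assumed names

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx(POOL)

    def reload_from_omap():
        print("config changed -- re-read omap and re-apply SPDK RPCs")

    def on_notify(notify_id, notifier_id, watch_id, data):
        # Another gateway announced a change via ioctx.notify(CONF_OBJ, ...)
        # after bumping the "epoch" omap key.
        reload_from_omap()

    watch = ioctx.watch(CONF_OBJ, on_notify)   # prompt path; keep handle alive

    last_epoch = None
    while True:                                # fallback polling path
        with rados.ReadOpCtx() as op:
            entries, _ = ioctx.get_omap_vals(op, "", "epoch", 1)
            ioctx.operate_read_op(op, CONF_OBJ)
            epoch = None
            for _key, value in entries:
                epoch = value
        if epoch != last_epoch:
            last_epoch = epoch
            reload_from_omap()
        time.sleep(5)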
Okay, so now you have multiple paths set up. We also support — or are in the process of implementing — ANA; it's natively supported by SPDK, so it's basically just a matter of us adding the RPCs for it, so that you can specify which of the paths are optimized and which are non-optimized. This is basically a hint for the client as to which path it should use to access a particular gateway in this case.
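As a rough sketch of what adding those RPCs amounts to — this reuses the spdk_rpc() helper from the SPDK sketch above, and the nvmf_subsystem_listener_set_ana_state parameter names are from the SPDK RPC docs as I recall them, so treat the details as assumptions:

    # Assumes the spdk_rpc() helper defined in the earlier SPDK sketch.
    # Each call would go to the SPDK instance of the respective gateway.
    NQN = "nqn.2016-06.io.spdk:cnode1"

    def set_ana(traddr, state):
        spdk_rpc("nvmf_subsystem_listener_set_ana_state", {
            "nqn": NQN,
            "listen_address": {"trtype": "tcp", "adrfam": "ipv4",
                               "traddr": traddr, "trsvcid": "4420"},
            "ana_state": state,  # "optimized" / "non_optimized" / "inaccessible"
        })

    set_ana("10.0.0.1", "optimized")      # preferred path for this subsystem
    set_ana("10.0.0.2", "non_optimized")  # fallback path on the other gateway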
B: [inaudible]

A: Okay — of course, you can also use the same feature, as I mentioned before, to do load balancing or scaling, so you can decide. And with ANA, asymmetric namespace access, which I showed you before, you can also steer particular clients of a particular subsystem to one gateway and others to other gateways, so you're pretty flexible in what you can do there. Okay — so most of the features that I talked about, except ANA, are already available.
You can download this today and try it out — fingers crossed it works; you know how it is — so feel free, and let us know. Of course, it's all ongoing work: we know we need a lot more testing and things like that, so we are still at the beginning, but it's there, it's running. So it's not just me talking about it.
Let's talk a bit about performance. For performance, the goal was of course to come as close as possible to non-gateway performance — that is, as if you accessed your RBD image directly from a host talking to the cluster. As I said before, with the gateway we're introducing an additional hop, and even if you co-locate the gateway with an OSD,
that means you're only local for that particular OSD; for all the others you would still be remote, so it will still be two hops. This was run on a little lab setup with a three-node Ceph cluster. The left side is where we started; the right is where we are now. What you see here is, on the y-axis,
thousands of IOPS — it might be a bit small, but I hope you can see it — and on the x-axis you have the number of volumes, increasing. The blue bars are basically native access: we used fio with librbd, so from user space instead of the kernel, to make it fair, because SPDK also uses librbd. And on the right side we have SPDK with an NVMe-oF target talking to librbd in the back.
How this looks is: we run SPDK with the NVMe-oF initiator, connecting to our gateway, which also runs SPDK as the NVMe-oF target, which then translates to RADOS/librbd in the back. This is how we started — we were quite far away from our goal of having almost-native performance — and this is where we are now: I think we're about eight percent away from native performance.
That's why we switched in this graph to a RAM-disk-backed OSD, three-node setup, and we use multiple Ceph client instances in SPDK to improve performance. The y-axis is again thousands of IOPS, and then we have the number of volumes per Ceph client instance; we run a total of 128 volumes. Having multiple Ceph client instances helps because each client instance has its own I/O threads and operation threads in librbd. So what I really want you to focus on —
I know it's a lot of bars — is the one in red: it's basically the best performance you can achieve in this scenario, with eight volumes per Ceph client instance. Again, this is at 16 kilobytes and a total queue depth of 256, and the different bars you see are numbers of cores; at around 16 cores we actually achieve line speed, essentially. This is a hundred-gigabit NIC.
B: [inaudible]

A: And — just going back to this again — we are currently in the phase of actually testing at bigger scales, like 256, 512 volumes and even more, and of course trying to run this on a bigger cluster than our three-node lab setup.
Okay, so let's talk a bit about what we've planned. Of course, if you talk NVMe-over-Fabrics, we want a discovery service. For discovery there is native NVMe-over-Fabrics discovery, and there are currently technical previews to add centralized discovery, which we want to do, but for now we're starting with just having a discovery service in the first place — and Intel is helping us here and trying to introduce it.
Essentially, what we are planning to do is have this as a standalone service, again synced via the config like we have for the gateways, and then these two, in this scenario, advertise to the clients which paths they can use to get to a particular subsystem.
That's all the discovery service does: you can add gateways, remove gateways, and the discovery service will tell all the clients that are connected to it that there were updates, that there are new paths available, and which paths they are supposed to take. The next one is also, of course, a big one, especially if you want to deploy this in an enterprise setting: authentication and encryption support.
For this one, what we had in mind for now is essentially a plug-in architecture where you can add your own authentication or encryption method to the Python daemon, and as an example we wanted to add the following. Each subsystem, as I said before, typically has its own listeners; if you restrict your subsystem to a particular IP/port pair and also add some credentials for IPsec to it,
then basically, if a client connects to it, it's kind of authenticating against the subsystem. Obviously it's not really, because it's IPsec — it's against the IP/port pair — but it comes as close as possible without implementing NVMe in-band authentication, which we're going to have in the future, or at least are planning to have. For now, I don't know of a single implementation that implements NVMe in-band authentication: the kernel doesn't have it, SPDK doesn't have it. So yeah, this is the state we are at.
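As a sketch of what such a plug-in hook in the Python control daemon could look like — this interface is hypothetical, invented for illustration, and is not the actual ceph-nvmeof plug-in API:

    # Hypothetical plug-in interface for the control daemon -- illustration only.
    from abc import ABC, abstractmethod

    class SecurityPlugin(ABC):
        """Hook that can gate a listener before it is handed to SPDK."""

        @abstractmethod
        def configure_listener(self, subsystem_nqn: str, traddr: str, trsvcid: str) -> None:
            ...

    class IpsecPlugin(SecurityPlugin):
        def configure_listener(self, subsystem_nqn, traddr, trsvcid):
            # In a real deployment this would install IPsec policies so that
            # only hosts holding the right credentials can reach this IP/port
            # pair, approximating per-subsystem authentication.
            print(f"restrict {subsystem_nqn} to {traddr}:{trsvcid} via IPsec")

    IpsecPlugin().configure_listener("nqn.2016-06.io.spdk:cnode1", "10.0.0.1", "4420")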
Okay, so where are we now? As I said before, this is ongoing work and you can try it out today — I showed you the link. We plan the initial release for Reef. It will be a scaled-down version of what I showed you: what we plan for now is a single gateway with the gRPC control daemon, and you will have the CLI; we will have save and restore of the configuration, but we will not yet have the multi-gateway stuff.
It just needs a bit more testing to refine. For the future, besides the things I talked about — discovery and so on — we are also working on cephadm integration. I think for iSCSI it's already there: you can basically say "hey, give me an iSCSI gateway on this node", and we want something similar for NVMe-oF. And we are working with Intel on something called ADNN. The idea here is what I said before — maybe let me go back; sorry for the switching, but I think we had this one.
So the idea with ADNN is this: the problem, as I said before, is that if you put a gateway next to an OSD, you still have additional network hops, because your data is typically spread among all the OSDs. The idea of ADNN is to have a mechanism — which they are also trying to get into the NVMe specification — where you run the CRUSH algorithm, not on the target but on the client, to figure out which target you need to go to, i.e. where your data actually is. That means you need to co-locate one target with each OSD, and then you basically do the same thing; and the idea is to update the CRUSH map lazily.
If this gets added to the NVMe spec, that makes it more applicable for systems like this to implement. Let me go back to the last slide. All right — of course, the whole thing wouldn't be possible without tons of people working on this, so this has been a collaborative effort between all the companies and all the people listed. I hope I didn't forget anyone. So with this, thank you very much.
B: [inaudible question]

A: You mean like if you have a disk down in Ceph — a physical disk? That would basically be dealt with on the Ceph side, since we rely on the reliability you get from a normal RBD image. It's the same as before in that sense: you need to replace your disk, you need to make sure your objects are redistributed, these kinds of things — that doesn't change.
C: I had a similar question — you might have covered this, but it's a future question, probably. Is there a thought — and maybe Josh kind of hit on this — of putting NVMe natively in the OSD? Because you're adding a network hop and a translation layer; you're going from one protocol to another, right? Is there a thought to try and move to NVMe-oF natively in the OSDs and do dual protocol, like RBD and NVMe, as a kind of performance improvement?
A: I think — yeah, the problem, as I said before, is that NVMe, from the spec perspective, is a point-to-point protocol. It doesn't deal with talking over multiple paths where one path only has part of the image available and another path has other parts available. In that sense you cannot natively integrate it, other than building something like this, where you basically extend the NVMe spec to allow these kinds of things — to have a client-side decision about where to go.
C: [inaudible follow-up]

A: I mean, it's more that if you are already in that ecosystem — you're maybe already running NVMe-over-Fabrics — at least in my opinion. Otherwise, if you're fine with running krbd, that might obviously give you better performance, and you might not have to deal with setting up the whole gateway infrastructure and things like that.
D: [inaudible]

A: Yeah, yeah — you can just run this gateway on the client and do loopback to it; that works.
The hop — actually, while you mention this, there have been some people talking about this for Kubernetes support, because I think the problem at the moment is that with krbd you're missing a lot of features that librbd has that haven't been pushed into the kernel, and one way to run librbd and still have a kernel block device is to do this — or you do NBD; there are a bunch of options — but this was something that was discussed. Yeah.
E: Hey, a question for you on the performance side — maybe more the efficiency side. Both targetcli and rbd-nbd suffer from a lot of internal memory copies and slow, or actually high, CPU usage due to their internals, and I was wondering if you guys have had to deal with much of that as you've been developing this.
A: Yeah — I don't want to lie: the gateway is obviously heavy in terms of the number of CPUs; you need to push stuff around. If you want to run the whole Ceph protocol and the NVMe-oF target and push 100 gigabits, you need some cores, and you also need memory, as you've seen before.
I think this one — yeah, this one achieved line speed with 16 cores basically running at full blast in SPDK. I'm not saying that this is the final thing; there might be some improvements to it, but yeah, you need to spend something for that performance. Then we looked into memory consumption, so there might be some ways to improve that in SPDK. One other thing is that, by default, SPDK is just polling all the time on all cores.
You can turn this off — they have a new feature where it goes into an "interrupt mode", quote unquote — but yeah, these are definitely concerns that we are looking into and trying to optimize as much as possible to keep it down. Yeah.
F: …would there be any reason to, you know, use NVMe? I would say one potential benefit is on the client side: potentially less CPU is consumed. I mean, it's really moving it to the target, right? But you could also have a hardware NVMe initiator, and then the CPU consumption on the client side would be much lower.
A: Yeah, that's actually a good point. The NVMe client-side protocol is actually pretty slim, so it's much more efficient than running the full-blown librbd or krbd — but obviously, it's clear from what I said before: one is point-to-point, and the other one does the whole thing of figuring out where to go and all these kinds of things.
Yeah — so we talked to one vendor about that; they have one, but I think they are still a long way out from having something that's actually performant and that you can run at 100 gigabit, at least from what I've heard.
G: It looks like you got a lot of the performance through concurrency, just adding more and more volumes. Were there other tunings that you did to get the performance up to that roughly 92 percent?
A: Right — that was mostly stuff we did within SPDK. There were some issues with I/O handling: as I said before, they had this RBD block device, but it was kind of just a showcase thing, it was not actually used, and, for instance, all the I/O was funnelled through a single core, so we had to fix that. Then there was a lot around how the affinity of the librbd threads was handled in SPDK.
We kind of fixed that: what we have now is that when you create a client instance, you can tell it which cores the librbd threads should run on. Before — in the very first version — they would run on the same cores that the target was running on, which is obviously pretty bad, and now we have this option so you can actually move them out.
So with these kinds of improvements — and there were a few more — we got to the eight percent off from native that we have at the moment.