From YouTube: Ceph Month 2021: RBD latency with QD=1 bs=4k
Description
Presented by: Wido den Hollander
Full schedule: https://pad.ceph.com/p/ceph-month-june-2021
A
Yeah, so we're on our second presentation for the day: RBD performance with a queue depth of one and a block size of 4K, and Wido is nice enough to go ahead and give us a nice lightning talk presentation here. So Wido, will you please take it away.
B
Yeah, sure, I'll keep it brief, because this is a lightning talk, talking about RBD performance with a queue depth of one and a block size of 4K. So you might be asking yourself: why QD=1 and a 4K block size? Well, what I would say, from my experience, is that single-threaded I/O latency is very important for many applications.
B
But in the end, if you look at the latency of a single I/O, it can be pretty high, and, for example, a PHP web server, or a MariaDB database, or a Redis cache when it's flushing to its disk, they're all doing single-threaded I/O, and then the latency of that single-threaded I/O starts to matter. That's what you notice; it's how snappy applications feel, and you measure it by using a queue depth of one. So, with all the benchmarking I do with Ceph:
B
Almost always, I start with a queue depth of one and a block size of 4K. That's my starting point, and from there I start increasing the queue depth and increasing the block size, and then we get more information about the performance of the cluster. But it all starts with a queue depth of one and a block size of 4K. So, low-latency Ceph. Well, you should understand that Ceph itself will never provide you the lowest latency possible. That's because Ceph was designed for other things than latency: it was designed for redundancy, scalability, and data safety.
B
If you take a local NVMe and put it in your laptop or your server, it will get way better latency than Ceph will ever provide you, because we need to go over the network, over TCP/IP. Then it goes to the CPU, and the CPU does its thing; then the Ceph code, which runs on that CPU, does its thing, and then it writes to three nodes. So keep in mind we're usually replicating two or three times, and that simply takes time.
B
So writing a block in Ceph will be slower than on other types of storage. However: redundancy, scalability, and data safety. I always say I have never seen Ceph itself lose data; it was always something else that happened, you know, lots of hardware failing, but Ceph itself just cares about your data. Performance is the second or third priority for Ceph; safety is the number one priority. So, but what can we achieve? What can we get out of Ceph in terms of IOPS?
B
I do benchmarking with fio, with a super simple configuration. We take the I/O engine rbd, we use the pool rbd, and I have an image there called fio1. Make sure you run these tests multiple times, because you need to pre-populate the RBD image by running the test a couple of times. Then you simply say: I have an iodepth of one and a block size of 4k, I run the test, and after 60 seconds it tells me how fast, or how low, the latency of the Ceph system is.
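
For reference, a minimal fio job file matching what is described above might look like the sketch below; the job name, write pattern, and time_based setting are assumptions not spelled out in the talk:

    [global]
    # fio's librbd engine, with the pool and image named in the talk
    ioengine=rbd
    pool=rbd
    rbdname=fio1
    # direct I/O, so no flushes are sent
    direct=1
    bs=4k
    iodepth=1
    # report results after 60 seconds
    runtime=60
    time_based=1

    [qd1-randwrite]
    # assumption: a random-write workload
    rw=randwrite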
B
So, some hardware setup: I took some Supermicro systems with an AMD EPYC 16-core CPU, 256 gigabytes of memory, four Samsung PM SSDs, and then 100-gigabit networking with Mellanox. Now, a few things to mention here. The main performance gain you're going to get is from pinning your CPU C-state to one; that's a kernel parameter, and you can look it up on Google.
B
You can find how to tune it. Also set the performance profile of the CPUs to "performance", which means the CPU will run at the maximum clock speed it can; I think in this case that's 2.4 gigahertz. Then you get the lowest latency from the code possible.
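
A common way to apply these two tunings on a Linux host is sketched below; the exact parameter names vary by distribution and CPU, and are an assumption on my part rather than something spelled out in the talk:

    # Limit the deepest allowed C-state via a kernel boot parameter,
    # e.g. added to the kernel command line in the GRUB config:
    #   processor.max_cstate=1

    # Set the CPU frequency governor to "performance":
    cpupower frequency-set -g performance

    # Alternatively, disable deep idle states at runtime
    # (all states with an exit latency above 1 microsecond):
    cpupower idle-set -D 1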
B
The 100-gig networking doesn't matter; I had 100 gig here, so that's what we used, but the amount of bandwidth we're using for this test is a few megabytes per second, not gigabits per second. So 25-gig networking works, and 10 works; 10 is slightly slower, but keep in mind, it's very, very slightly slower.
B
So what can we achieve? 1,364 IOPS is what I was able to get with this hardware. That's a write latency of 0.73 milliseconds for a 4K block being written to three nodes at the same time. So this includes all the replication: the block we just wrote has been written to three different NVMes in three different nodes within one millisecond. That's what we were able to achieve, and, you know, that's fairly good performance.
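
As a quick sanity check on those numbers: at a queue depth of one, IOPS is simply the inverse of the per-I/O latency, and 1 / 0.73 ms ≈ 1370 IOPS, which is consistent with the measured 1,364.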
B
The Ceph Crimson project, which is redesigning the OSD, should provide us better latency. Although Crimson itself is not focused on providing lower latency at the moment, they're just revisiting the code, in the future it should provide lower latency. And then we have the RBD persistent write-back cache, which uses a local NVMe inside the hypervisor to cache I/Os. I tested this with 16.2.4, but it was not stable enough to provide real results which I could present here during this lightning talk.
B
So if you want to get to this: faster CPUs. Higher-clocked CPUs gain you more benefit in terms of latency than more cores, so if you need to invest, go for higher-clocked CPUs with fewer cores. If I go back to the hardware, the reason I chose a 16-core CPU is that this specific Supermicro system can hold 10 NVMes, so we have 16 cores for 10 NVMes. You could also say: if only there were a CPU with 10 cores, but there's none, nor with eight cores. More cores would give you more total I/O for the whole cluster, because that still relies on the number of cores. So it's a balance: if you're looking for lower latency, you need faster CPUs; if you need a higher total amount of I/O for the system, then you need more CPU cores in the whole cluster. And that was my lightning talk about Ceph with low-latency RBD.
B
And if you have any questions, this is where you can find me, or ask them on the users or dev mailing lists, because that's where I hang out as well.
B
In this test, no, the RBD cache was not enabled, because if you look at the RBD cache code, it's write-through until flush. So only if a client sends a flush towards librbd does it enable the cache, and if we go back to the fio configuration, it says direct=1. That means fio is not sending any flushes, so all the I/O being sent by fio is synchronous. So no, the RBD cache is not enabled. I did turn this setting, though.
B
It's called rbd_cache_writethrough_until_flush, and if you set that to false, then it always goes into the RBD cache with caching turned on. Then I think I saw about 10 to 20 thousand IOPS, but yeah, then we're just writing stuff into the memory of the RBD cache.
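
For reference, this option can be set in ceph.conf on the client side; the section placement here is an assumption:

    [client]
    # Default is true: librbd keeps the cache in write-through mode
    # until the client sends its first flush. Setting it to false
    # enables write-back caching immediately.
    rbd_cache_writethrough_until_flush = false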
C
I have a question, this is Dan. Well, it might be for Ilya, if Ilya is still online, but yesterday when Ilya presented, he mentioned some IOPS improvements in librbd, maybe coming in Quincy, I'm not sure exactly when they're coming. But yeah, are they going to also improve queue depth 1 performance? Or maybe, Wido, you know.
B
Well, I doubt it, because I also did benchmarking with the RADOS client, so rados bench, and if you set -t, the number of threads, to one, then you can write blocks of 4K directly to RADOS, and the latency I see with QD=1, so with a single thread on RADOS, is about the same as I see with RBD.
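
A sketch of the rados bench invocation being described; the pool name and duration are assumptions:

    # 60-second write benchmark with one concurrent op (-t 1)
    # and a 4 KiB object size (-b 4096)
    rados bench -p rbd 60 write -t 1 -b 4096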
D
Yeah, I agree, probably not for this test, because those improvements mostly, you know, cut the fat that we had between libraries.
D
Some of it was in librados, but mostly in librbd, because the fat that is in librados is kind of, sort of, still there. What those improvements were targeted at is really all-NVMe clusters and, you know, fairly high queue depths, so probably, probably not for this test.
D
One thing I want to note, as far as fio and flushing: recent versions of fio have actually been modified. The RBD engine within fio will now issue a flush at the beginning of any test, just to deal with that setting.
D
So in the future, you know, if you want to test without the RBD cache enabled, you would need to turn it off in ceph.conf or elsewhere, because fio will issue that single flush at the beginning of the test, just to move librbd into the state where it thinks the client sends flushes, so that it will do caching by default.
B
Okay, that's good feedback indeed, because that will, you know, give different results, and people might get the idea that they're able to do 20 to 30 thousand IOPS with their fio, but actually it's all the cache of librbd. So are you sure that with direct=1 it will still send the flushes?
D
I think so. We did this to address the discrepancy between fio and rbd bench, because rbd bench has behaved this way, where it would possibly send a flush at the beginning of any benchmark, for many years now. fio wasn't doing this, and we were getting complaints that, you know, "here are my rbd bench results, and here are my fio results, and they're vastly different." So at least the intent of the change was to do it even with direct, but I wasn't involved, so I'm not sure; it's just something I wanted to bring up.