Description
Optimize librbd for Lower CPU Cost and Higher Scalability - Li Wang, DiDi
This talk will introduce our work to reduce the CPU overhead of the qemu+librbd stack, which includes using rbd_aio_writev instead of rbd_aio_write in the qemu rbd driver, and further optimizing rbd_aio_writev to send data with zero copy. These optimizations lead to 48% less CPU cost, 46% less latency, and 85% higher IOPS for 1M sequential writes. In addition, we improve the scalability of librbd by using multiple writeback threads and by reducing the granularity of the rbd cache lock.
About Li Wang
Senior Technical Expert, DiDi
Li Wang is a senior technical expert at DiDi.
So, first, let's give a brief review of what librbd is and what it does. This slide shows the librbd solution for the virtualization scenario. For the data flow, as we can see in the picture, the disk access requests issued by the virtual machine will be passed to QEMU; librbd backs the block layer of QEMU, and librbd then sends the requests over the network to Ceph, bypassing the host kernel, or at least its I/O stack, as the picture in the slide shows.
Next we will review the krbd solution, which is often used for containers. This picture shows the data flow: the disk access requests issued by the container will first be passed to krbd, which is a kernel module running inside the host kernel, and krbd intercepts the requests there.
OK, so next I will introduce our first work: the RBD IOPS optimization. Here comes the motivation. With an FIO 4K random write test, surprisingly, we found that the RBD IOPS did not scale with the OSD number of the cluster, I mean an all-flash SSD cluster. The IOPS is bounded at around 22K, no matter how many OSDs are involved in the cluster. So we looked into it, and the root cause we found is related to how the RBD cache works when it is turned on.
When the RBD cache is turned on, the data will first be written into the RBD cache by the writer thread, and then asynchronously flushed by the flusher thread. So there are the writer thread and the flusher thread, and the worst thing is that the writing and the flushing jobs cannot be parallelized: the writer thread and the flusher thread are mutually exclusive on, well, a big lock, as we can see in the code shown in the two figures at the bottom left.
The code shows the work of the writer thread: before it starts writing into the cache, it will grab the lock. And, as the figure on the bottom right shows, the flusher thread, before it starts its job, will also grab the same lock. So the two threads just cannot run in parallel, and the lock is held during the whole flushing. According to the CPU flame graph, the flushing job is time-consuming.
It consumes much CPU time, and you can imagine that during this time the application writes cannot proceed: they are blocked, since they cannot get the lock, because the lock is held by the flusher thread. So, as you can imagine, the IOPS will not scale. How to solve this? Our solution is, first, to enable the flushing work to run in parallel with the writing: we move the major flushing job outside of the lock. The second part of the work is to increase the flushing throughput by using multiple writeback threads.
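The change described here can be sketched roughly as follows. This is a minimal Python toy model, not the actual librbd C++ code, and the class and method names are made up for illustration: the flusher detaches the dirty entries under the lock, then performs the slow flush I/O outside of it, so writers only contend for the short detach step.

```python
import threading

class WritebackCache:
    """Toy model of a writeback cache guarded by a single big lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.dirty = []       # pending dirty entries
        self.flushed = []     # entries already flushed out

    def write(self, data):
        # Writer thread: only needs the lock for the in-memory insert.
        with self.lock:
            self.dirty.append(data)

    def flush_old(self, do_io):
        # Old behavior: the slow I/O happens while holding the lock,
        # blocking all writers for the whole duration of the flush.
        with self.lock:
            for entry in self.dirty:
                do_io(entry)
            self.flushed.extend(self.dirty)
            self.dirty.clear()

    def flush_new(self, do_io):
        # New behavior: detach the dirty list under the lock, then do the
        # slow I/O outside of it so writers can proceed in parallel.
        with self.lock:
            batch, self.dirty = self.dirty, []
        for entry in batch:
            do_io(entry)
        with self.lock:
            self.flushed.extend(batch)
```

With `flush_new`, the lock hold time no longer depends on how slow the I/O is, which is what lets application writes proceed while flushing is in progress; running several flusher threads over disjoint batches then raises the flush throughput.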
So if you are interested, you can follow the PR, as shown here. OK, this is our first work.
The second work is a client-side image cache. The motivation is that we want to take advantage of RBD clone to save space. In the virtualization scenario, many virtual machines share the same operating system and applications.
So we want to use the RBD snapshot-and-clone technology to back the virtual machines' system disks, so that we keep only one copy of the read-only operating system data. That way we can save storage space: all the virtual machines that share the same operating system will use only one single base image, while each writes its own data into its own clone image.
In that way we can save storage space, but at the same time it introduces another issue: the boot storm. If hundreds of virtual machines with the same operating system boot at the same time, which can happen because the virtual machines may belong to the same tenant and start simultaneously, very high pressure will appear on the one single base image.
The virtual machines will then experience a very slow boot, which makes for a bad user experience. So how to solve this? Our idea is to dump the operating system template into a qcow2 format file, and we place that file at the computing node. Then we revise the librbd code to enable it to parse the qcow2 file.
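As a rough illustration of what "parsing a qcow2 file" involves, here is a minimal Python sketch that decodes just the fixed fields at the start of a qcow2 header (magic, version, cluster size, virtual size), following the published qcow2 layout; real code must of course also handle the L1/L2 cluster mapping tables to locate guest data.

```python
import struct

QCOW2_MAGIC = 0x514649FB  # the bytes 'Q', 'F', 'I', 0xFB

def parse_qcow2_header(blob):
    """Decode the first 32 bytes of a qcow2 header (all fields big-endian)."""
    (magic, version, backing_file_offset,
     backing_file_size, cluster_bits, size) = struct.unpack(">IIQIIQ", blob[:32])
    if magic != QCOW2_MAGIC:
        raise ValueError("not a qcow2 file")
    return {
        "version": version,
        "cluster_size": 1 << cluster_bits,  # e.g. cluster_bits=16 -> 64 KiB
        "virtual_size": size,               # guest-visible disk size in bytes
    }
```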
OK, at that point, when the virtual machine accesses the base image via librbd, librbd can access it locally on the computing node: it reads directly from the qcow2 file placed at the computing node. So in this way, the benefit is that the virtual machine can now boot almost entirely locally.
The VMs just read the data from the local qcow2 file on the computing node; only a miss will fall back to the Ceph cluster. So the pressure on the Ceph cluster will be greatly reduced, and the boot speed of the virtual machines will be greatly improved. Also, this approach is relatively simple, straightforward, and easy to implement, I mean compared to some other community work, like the client-side cache project; it is a more conservative way.
OK, next let's look at the CPU cost of the qemu+librbd write path. The CPU usage reaches up to 124 percent. From the CPU flame graph shown at the bottom right, you can see where the hot spots are: the two hot spots are actually two memcpys.
Now, let's look at the two memcpys carefully. The first memcpy is issued by QEMU. Why does QEMU memcpy? QEMU calls the rbd_aio_write API of librbd to pass the data from QEMU to librbd.
However, unfortunately, the librbd API rbd_aio_write can only accept one single buffer, while QEMU holds multiple buffers, a vector of buffers. So QEMU has to perform a memory copy to collect the data from its multiple buffers into one single buffer, to match the requirement of the librbd API. This is how the first memcpy comes about.
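In Python terms, the gather copy QEMU has to do looks like this (a sketch for illustration, not QEMU's actual C code): a single-buffer API forces the caller to flatten its I/O vector first, while a vectored API in the style of rbd_aio_writev can take the list of buffers as-is.

```python
def gather_iov(iov):
    """Flatten a list of buffers into one contiguous buffer.

    This is the extra memcpy a single-buffer write API (like rbd_aio_write)
    forces on a caller that holds its data as an I/O vector.
    """
    total = sum(len(b) for b in iov)
    out = bytearray(total)
    off = 0
    for b in iov:
        out[off:off + len(b)] = b   # the copy we would like to avoid
        off += len(b)
    return bytes(out)

def write_single_buffer(write_fn, iov, offset):
    # Single-buffer API path: one big gather copy, then one call.
    write_fn(gather_iov(iov), offset)

def write_vectored(writev_fn, iov, offset):
    # Vectored API path (rbd_aio_writev-style): pass the buffers directly.
    writev_fn(iov, offset)
```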
How to remove this memcpy? We found that the librbd library installed by default on CentOS is actually relatively old.
The recent librbd already provides another API called rbd_aio_writev, which accepts multiple buffers. So if we upgrade librbd, and with a little change to the QEMU code enable it to call the new librbd API, rbd_aio_writev, then QEMU can directly pass the buffers into librbd without that memcpy. This is the aio_writev optimization.
So why is there a second memcpy? The data of a single I/O is copied again when it is sent to Ceph by the librbd messenger module. So we were thinking: is it possible to eliminate this memcpy? Let the messenger just refer to those buffers directly. Why not do zero copy?
The key is: if we do the zero-copy optimization, we have to make sure the buffers will no longer be referred to by librbd after the write completes, since the buffers are owned by QEMU.
On the other hand, the network and the cluster do not always behave; there can be resends. For example, as we know, if an OSD, say OSD 0, goes down, the Ceph monitor will notice that the OSD is down.
A
new
OSD
map
to
exclude
his
crew
excluded
an
OSD
0,
and
it
will
issue
the
new
map
to
all
the
members
to
all
the
class
of
members.
So
the
V
Bobby
D
at
circle
and
the
side
will
be
aware
of
this.
A
So
the
Bobby
D
will
reconstruct
I'll
request
Y
to
resend
the
data
to
another
OSD
according
to
a
question
because
the
crash
map
changes,
so
it
will
resend
the
data
to
like
supposed
to
SD
1
and
the
y
completes
the
park
the
buffers
since
e.
It
means
the
right
has
been
completed,
so
the
the
buffers
will
be
freed
by
q
mu
after
that,
the
later
workers
recoverers
x
completes.
But the messenger is not aware of all this: it may still refer to that buffer, even though the buffer has already been freed by QEMU.
So how do we optimize away the second memcpy? We want to do the zero-copy optimization, so we implement a cancel-request logic for the messenger module. When librbd reconstructs a request to send it to the new OSD target, it first calls the messenger's cancel API to cancel the pending requests, to make sure those requests' buffers will not be referred to afterwards.
A
Those
the
pending
requests
will
not
be
referred
later.
Okay,
actually
recently,
we
have
made
some
revisions
to
to
may
have
made
some
revision
to
implementation.
So
the
idea,
the
first
is
to
use
the
council
requests
but
allowed
we.
We
have
made
some
revision
to
the
implementation.
According
to
the
reviewers
comments,
instead
of
counseling
logic,
the
requests
we
introduced
a
callback
mechanism
for
for
Alibaba
D,
but
the
idea
the
idea
remains
the
same
to
the
zero
copy.
So
if
you
you
have
you're
interested,
you
can
follow
our
PR.
However, this approach has a limitation: it cannot be used when the RBD cache is turned on in writeback mode, since in that situation the data is first written into the cache, the write completes, and QEMU thinks the write has been completed, so the buffers could be freed before the data has actually been flushed to the cluster.
Yes: the latency is reduced by 46%, the IOPS is increased by 85%, and the CPU usage is reduced by 48%. As we can see from the three pictures from left to right: the first one is with the two memory copies present; the middle one is with the first memcpy removed; and on the right side we have removed both memcpys, so, as we can see, no obvious hot spots are left.
OK, in the last part I want to share some of our experience with the community. Well, actually, there are some pain points from our side. The first is the latency issue. According to our experience, the latency of Ceph is relatively high, especially for BlueStore on SSDs. For an FIO 4K random write with iodepth 1, the average latency reaches around two milliseconds. As we know, for the raw SSD, the hardware latency is far lower than that.
A
So
so
for
for
for
for
for
this
issue,
we
do
wanna
collaborate
with
the
community
to
to
improve
it
and,
as
a
second
is
a
hardware
utilization.
So
it's
like
Fausto's.
Actually,
it
is
more
exactly
an
ESS,
a
mix
of
SSD
and
the
activities
since
we
place
a
journal.
We
place
the
journal
of
fast
or
SSDs,
so
we,
according
to
our
observation,
when
the
average
it
is
a
load
here,
reach
more
than
sixty
percent.
A
Unless
this,
the
the
CPU
usage
is
high
and
the
one
SSD
could
only
contributes
like
11k
LPS
as
compared
to
the
role
to
the
raw
SSD
LPS,
the
power,
its
will
be
like
a
60,
60
K,
a
yes,
but
under
the
safe
it's
only
contributes
11
K
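The gap can be quantified with a one-line calculation: the fraction of the raw device capability that actually reaches the user through the storage stack.

```python
def device_efficiency(iops_through_stack, raw_device_iops):
    """Fraction of a device's raw IOPS capability delivered through the stack."""
    return iops_through_stack / raw_device_iops

# ~11K IOPS per SSD under Ceph vs ~60K IOPS for the raw SSD:
# less than a fifth of the device's capability is delivered.
eff = device_efficiency(11_000, 60_000)
```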
And we found that the simple reason is that the CPU is the bottleneck. So we hope the community could do more work to improve this. OK, that is the talk from me. Any questions? Thanks.
Well, we dump all the base image data, all the original data that was in the base image. Right, normally you would use base-plus-clone to provide the system disk data, but instead we don't use a base image in the cluster. We dump all the data of the base image into a qcow2 file, and we place the qcow2 file at the computing node. The computing node means where the virtual machine is running, yeah. And we revise librbd.
We revise librbd, and we use a sort of third-party library to parse the qcow2 file. The qcow2 file, as you know, is a virtual disk file format; it is a format from QEMU. RBD itself uses the raw format, and qcow2 is a different file format. So we enable librbd to parse the qcow2 format file.